In this post we will look at how the Neural Tangent Kernel (NTK) can be used to derive training guarantees for sufficiently overparameterized neural networks.
Introduction and problem setting
The purpose of this post is to show, from first principles, how the Neural Tangent Kernel (NTK) can be used to derive training guarantees for sufficiently overparameterized neural networks. As our goal is to highlight the key ideas, we work in the simplest setting possible, namely a differentiable neural network with a single hidden layer. More complex and general forms of the argument presented here are widely available in the literature, starting with, for instance, (Du et al. 2019) and (Jacot, Gabriel, and Hongler 2020).
For now let \(f: \mathbb{R}^p \times \mathbb{R}^d \rightarrow \mathbb{R}\) be a parameterized function with parameters \(\theta \in \mathbb{R}^p\) mapping vectors in \(\mathbb{R}^d\) to a real scalar value. Consider an arbitrary training sample consisting of \(n\) pairs of points and their corresponding targets \(({\mathbf{x}}_i, y_i)_{i=1}^n \in (\mathbb{R}^d \times \mathbb{R})^n\), and recall the least squares loss defined as \[\begin{align*} L(\theta) &= \frac{1}{2}\sum_{i=1}^n (f(\theta, {\mathbf{x}}_i) - y_i)^2. \end{align*}\] To solve the least squares problem we study the trajectory through parameter space under gradient flow, a continuous time simplification of gradient descent: simplifying our notation by using \(L(t)\) instead of \(L(\theta(t))\), then for \(t\geq 0\) \[\begin{equation} \label{opt2-eq:grad-flow} \frac{d \theta(t)}{dt} = - \nabla_{\theta} L(t). \end{equation}\] For now we assume that \(f\) is differentiable with respect to its parameters. With \({\mathbf{u}}(t) = [f(\theta(t), {\mathbf{x}}_1), f(\theta(t), {\mathbf{x}}_2)... f(\theta(t), {\mathbf{x}}_n)] \in \mathbb{R}^n\) denoting the vector of predictions and \({\mathbf{r}}(t) = {\mathbf{u}}(t) - {\mathbf{y}}\) the vector of residuals at time \(t \geq 0\), we denote the Jacobian as \[\begin{align*} {\mathbf{J}}(t) = \nabla_\theta {\mathbf{r}}(t)= \nabla_\theta {\mathbf{u}}(t) = \begin{bmatrix} \frac{\partial f(\theta(t), {\mathbf{x}}_1) }{\partial \theta_1} & \frac{\partial f(\theta(t), {\mathbf{x}}_2) }{\partial \theta_1} & ... & \frac{\partial f(\theta(t), {\mathbf{x}}_n) }{\partial \theta_1} \\ \frac{\partial f(\theta(t), {\mathbf{x}}_1) }{\partial \theta_2} & \frac{\partial f(\theta(t), {\mathbf{x}}_2) }{\partial \theta_2} & ... & \frac{\partial f(\theta(t), {\mathbf{x}}_n) }{\partial \theta_2} \\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial f(\theta(t), {\mathbf{x}}_1) }{\partial \theta_p} & \frac{\partial f(\theta(t), {\mathbf{x}}_2) }{\partial \theta_p} & ... & \frac{\partial f(\theta(t), {\mathbf{x}}_n) }{\partial \theta_p} \end{bmatrix} \in \mathbb{R}^{p \times n}. \end{align*}\]
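To make these objects concrete, below is a minimal numerical sketch, assuming only NumPy and instantiating \(f\) as a small two layer \(\tanh\) network of the kind studied later in this post; the sizes, random seed and helper names are illustrative. It assembles the prediction vector \({\mathbf{u}}\), the residual \({\mathbf{r}}\), a finite difference estimate of the Jacobian \({\mathbf{J}} \in \mathbb{R}^{p \times n}\), the loss \(L\), and the matrix \({\mathbf{J}}^T{\mathbf{J}}\) which plays a central role in the next section.

```python
# A minimal numerical sketch of the objects defined above, assuming NumPy only.
# The model f is instantiated as a small two layer tanh network of the kind
# studied later in the post; sizes and the random seed are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 5, 3, 50                       # samples, input dimension, width
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # normalize inputs to unit norm
y = rng.normal(size=n)

a = rng.choice([-1.0, 1.0], size=m)      # outer weights
theta = rng.normal(size=m * d)           # parameter vector, p = m * d

def f(theta, x):
    W = theta.reshape(m, d)
    return a @ np.tanh(W @ x) / np.sqrt(m)

def predictions(theta):
    return np.array([f(theta, x) for x in X])        # u(t) in R^n

def loss(theta):
    return 0.5 * np.sum((predictions(theta) - y) ** 2)

def jacobian(theta, eps=1e-6):
    # J in R^{p x n}: column i holds the gradient of f(theta, x_i) w.r.t. theta,
    # estimated here by central finite differences.
    p = theta.size
    J = np.zeros((p, n))
    for k in range(p):
        e = np.zeros(p); e[k] = eps
        J[k] = (predictions(theta + e) - predictions(theta - e)) / (2 * eps)
    return J

r = predictions(theta) - y               # residual vector r(0)
J = jacobian(theta)
print("L(0) =", loss(theta))
print("J^T J has shape", (J.T @ J).shape)   # the n x n matrix studied next
```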
A recipe for deriving guarantees based on analyzing the smallest eigenvalue of the NTK
The Jacobian and its Gram matrix, \({\mathbf{H}}(t) = {\mathbf{J}}(t)^T {\mathbf{J}}(t) \in \mathbb{R}^{n \times n}\), which is also referred to as the Neural Tangent Kernel (NTK) Gram matrix, will play a critical role in what follows, and also when it comes to studying linearized neural networks. Note that calling \({\mathbf{H}}(t)\) the NTK or NTK Gram matrix is somewhat misleading, as it can be studied for any sufficiently smooth model, not just neural networks! However, as it is now accepted terminology we will stick with it. The following proposition illustrates the significance of this matrix for training.
Proposition 1 Assume \(f\) is differentiable with respect to its parameters \(\theta \in \mathbb{R}^p\) at all points along the trajectory of gradient flow. Then \[\begin{align*} \frac{d {\mathbf{r}}(t)}{dt} = - {\mathbf{H}}(t) {\mathbf{r}}(t). \end{align*}\]
Proof. Observe that as \[\begin{align*} \frac{\partial L(\theta)}{\partial \theta_k} = \sum_{i=1}^n \frac{\partial f(\theta, {\mathbf{x}}_i)}{ \partial \theta_k} (f(\theta, {\mathbf{x}}_i) - y_i) \end{align*}\] then collecting terms we have \[ \nabla_{\theta} L(\theta(t)) = {\mathbf{J}}(t){\mathbf{r}}(t). \] Noting that \(f\) is a function of \(p\) parameters which each depend on \(t\), then from the chain rule it follows that \[\begin{align*} \frac{d u_i(t)}{dt } = \frac{d f(\theta(t), {\mathbf{x}}_i)}{dt} = \sum_{k = 1}^p \frac{d \theta_k}{dt} \frac{\partial f(\theta(t), {\mathbf{x}}_i)}{\partial \theta_k} = \nabla_{\theta} f(\theta(t), {\mathbf{x}}_i)^T \left(\frac{d \theta(t)}{dt}\right). \end{align*}\] Again collecting terms and substituting the expression for gradient flow we have \[\begin{align*} \frac{d {\mathbf{u}}(t)}{dt} = {\mathbf{J}}(t)^T \frac{d \theta(t)}{dt} = - {\mathbf{J}}(t)^T \nabla_{\theta}L(t) = - {\mathbf{J}}(t)^T{\mathbf{J}}(t){\mathbf{r}}(t) = - {\mathbf{H}}(t) {\mathbf{r}}(t). \end{align*}\] To finish observe \(\frac{d {\mathbf{u}}(t)}{dt} = \frac{d {\mathbf{r}}(t)}{dt}\).
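As a quick numerical sanity check of Proposition 1, the self-contained sketch below (assuming NumPy and \(\phi = \tanh\); the step size \(\eta\) and dimensions are arbitrary) takes a single small gradient descent step, an Euler discretization of gradient flow, and compares the resulting change in the residual with \(-\eta\, {\mathbf{H}}(t){\mathbf{r}}(t)\).

```python
# Numerical sanity check of Proposition 1: after one small gradient descent step
# (a discretization of gradient flow with step size eta) the change in the residual
# vector is approximately -eta * H(t) * r(t), where H = J^T J. Assumes NumPy, phi = tanh.
import numpy as np

rng = np.random.default_rng(1)
n, d, m, eta = 4, 3, 200, 1e-3
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.normal(size=n)
a = rng.choice([-1.0, 1.0], size=m)
W = rng.normal(size=(m, d))
dphi = lambda z: 1.0 - np.tanh(z) ** 2            # phi'(z) for phi = tanh

def residual(W):
    return np.tanh(X @ W.T) @ a / np.sqrt(m) - y  # r(t) = u(t) - y

def jacobian(W):
    # J in R^{p x n}: column i is the gradient of f(theta, x_i) w.r.t. vec(W)
    G = (dphi(X @ W.T) * a)[:, :, None] * X[:, None, :] / np.sqrt(m)   # (n, m, d)
    return G.reshape(n, -1).T

r0, J = residual(W), jacobian(W)
H = J.T @ J                                       # NTK gram matrix
grad_L = (J @ r0).reshape(m, d)                   # nabla_theta L = J r
r1 = residual(W - eta * grad_L)                   # one Euler step of gradient flow
print("actual change   :", r1 - r0)
print("predicted change:", -eta * H @ r0)
```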
The following lemma hints at how we might be able to use Proposition 1 to derive guarantees for training. Note that here we use \(\lambda_i({\mathbf{A}})\) to denote the \(i\)th eigenvalue of a matrix \({\mathbf{A}}\in \mathbb{C}^{n \times n}\), where \(\lambda_1({\mathbf{A}}) \geq \lambda_2({\mathbf{A}}) \geq ... \geq \lambda_n({\mathbf{A}})\).
Lemma 1 For some \(T\in \mathbb{R}_{\geq 0}\), suppose there exists a constant \(\kappa \geq 0\) such that \(\lambda_n({\mathbf{H}}(t)) \geq \frac{\kappa}{2}\) for all \(t \in [0,T]\). Suppose further that \(f\) is differentiable along the trajectory of the gradient flow. Then for all \(t \in [0, T]\) \[\begin{align*} L(t) \leq \exp(- \kappa t) L(0). \end{align*}\]
Indeed, Lemma 1 suggests that arbitrarily small training error can be guaranteed so long as we can bound the smallest eigenvalue of \({\mathbf{H}}(t)\) away from zero for a sufficiently long time.
Proof. Observe by definition that \(L(t) = \frac{1}{2}||{\mathbf{r}}(t) ||_2^2\). Using Proposition 1 it follows that \[\begin{align*} \frac{d}{dt}||{\mathbf{r}}(t)||^2 &=2{\mathbf{r}}(t)^T\frac{d {\mathbf{r}}(t)}{dt} = - 2{\mathbf{r}}(t)^T {\mathbf{H}}(t) {\mathbf{r}}(t) = -2 ||{\mathbf{J}}(t) {\mathbf{r}}(t)||^2 \leq - 2\lambda_n({\mathbf{H}}(t)) ||{\mathbf{r}}(t)||^2 \leq -\kappa||{\mathbf{r}}(t)||^2. \end{align*}\] As \(||{\mathbf{r}}(s)||^2\) and \(-\kappa\) are real and continuous functions of \(s\) for \(s \in [0,t]\), then from Gronwall’s inequality \[\begin{align*} ||{\mathbf{r}}(t)||^2 \leq \exp\left( - \int_{0}^t \kappa ds \right) ||{\mathbf{r}}(0)||^2 = \exp(-\kappa t) ||{\mathbf{r}}(0)||^2, \end{align*}\] and multiplying both sides by \(\frac{1}{2}\) gives the result claimed.
As a result, to prove convergence it suffices to uniformly lower bound the smallest eigenvalue of \({\mathbf{H}}(t)\) for all \(t\geq0\). Note that as \({\mathbf{H}}(t)\) is a Gram matrix its eigenvalues are real and non-negative. This observation leads to the following somewhat trivial corollary.
Corollary 1 Under the same conditions as Lemma 1 we have \(L(t) \leq L(0)\).
Before proceeding, a small side point to make is that we actually only need to lower bound the eigenvalues of \({\mathbf{H}}(t)\) on the eigenspaces in which the residual lies. To be clear, suppose \({\mathbf{r}}(t)\) lies in the span of the top \(k(t)\) eigenvectors of \({\mathbf{H}}(t)\), then it would suffice to lower bound instead \(\lambda_{k(t)}({\mathbf{H}}(t))\). However, for neural networks, and indeed many other models, analyzing the spectrum of \({\mathbf{H}}(t)\) directly is difficult. In particular, before one even considers the dynamics, observe that, due to the random initialization of the network parameters, \({\mathbf{H}}(0)\) is a random matrix whose distribution is typically not easily analyzed. The approach we will instead pursue is as follows: i) substitute the analysis of the eigenvalues of \({\mathbf{H}}(0)\) with that of a simpler ‘proxy’ matrix \({\mathbf{H}}_{\infty}\) (the choice of notation here will soon become clear!), which we assume for now is positive semi-definite, then ii) derive conditions to ensure that \(\lambda_n({\mathbf{H}}(t))\) remains close to \(\lambda_n({\mathbf{H}}_{\infty})\) for all \(t \in [0,T]\) where \(T\in \mathbb{R}_{>0}\) is arbitrary.
Lemma 2 Let \({\mathbf{H}}_{\infty} \in \mathbb{R}^{n \times n}\) be positive semi-definite. Given a \(T \in \mathbb{R}_{>0}\), if \(||{\mathbf{H}}(t) - {\mathbf{H}}(0)||, ||{\mathbf{H}}(0) - {\mathbf{H}}_{\infty}|| \leq \lambda_n({\mathbf{H}}_{\infty})/4\) for all \(t \in [0, T]\), then \(\lambda_n({\mathbf{H}}(t)) \geq \frac{\lambda_{n}({\mathbf{H}}_{\infty})}{2}\) for all \(t \in [0, T]\).
Proof. As \({\mathbf{H}}(t)\) and \({\mathbf{H}}_{\infty}\) are Hermitian, by construction and assumption respectively, Weyl's perturbation inequality gives \[ |\lambda_n({\mathbf{H}}(t)) - \lambda_n({\mathbf{H}}_{\infty})| \leq ||{\mathbf{H}}(t) - {\mathbf{H}}_{\infty}||. \] Therefore, from the assumptions of the lemma and using the triangle inequality, it follows that \[ |\lambda_n({\mathbf{H}}(t)) - \lambda_n({\mathbf{H}}_{\infty})| \leq ||{\mathbf{H}}(t) - {\mathbf{H}}_{\infty}|| \leq ||{\mathbf{H}}(t) - {\mathbf{H}}(0)|| + ||{\mathbf{H}}(0) - {\mathbf{H}}_{\infty}|| \leq \frac{\lambda_n({\mathbf{H}}_{\infty})}{2}. \] Trivially the result of the lemma holds if \(\lambda_n({\mathbf{H}}(t)) \geq \lambda_{n}({\mathbf{H}}_{\infty})\), therefore assume \(\lambda_n({\mathbf{H}}(t)) < \lambda_{n}({\mathbf{H}}_{\infty})\): in this case rearranging the inequality derived above it follows that \(\lambda_n({\mathbf{H}}(t)) \geq \frac{\lambda_n({\mathbf{H}}_{\infty})}{2}\).
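As a small numerical illustration of the perturbation bound used in the proof (a sketch assuming NumPy; the matrices are arbitrary symmetric positive semi-definite matrices rather than NTKs), one can verify \(|\lambda_n({\mathbf{A}}) - \lambda_n({\mathbf{B}})| \leq ||{\mathbf{A}} - {\mathbf{B}}||\) on random instances.

```python
# Numerical illustration of Weyl's perturbation bound for symmetric matrices:
# |lambda_min(A) - lambda_min(B)| <= ||A - B||_2. Assumes NumPy only.
import numpy as np

rng = np.random.default_rng(2)
n = 6
M = rng.normal(size=(n, n))
A = M @ M.T                                  # symmetric positive semi-definite
E = rng.normal(size=(n, n)) * 0.05
B = A + (E + E.T) / 2                        # symmetric perturbation of A

lhs = abs(np.linalg.eigvalsh(A)[0] - np.linalg.eigvalsh(B)[0])
rhs = np.linalg.norm(A - B, ord=2)
print(f"|lambda_min(A) - lambda_min(B)| = {lhs:.4f} <= ||A - B||_2 = {rhs:.4f}")
```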
Based on Lemma 1 and Lemma 2 we can therefore use the following approach for deriving training guarantees when confronted with a non-linear least squares problem.
- Identify a suitable ‘proxy’ matrix \({\mathbf{H}}_{\infty}\) which is positive semi-definite and close to \({\mathbf{H}}(0)\), in particular \(||{\mathbf{H}}(0) - {\mathbf{H}}_{\infty}|| \leq \lambda_n({\mathbf{H}}_{\infty})/4\).
- Identify a parameter regime which ensures the NTK remains close to its initialization, \(||{\mathbf{H}}(t) - {\mathbf{H}}(0)|| \leq \lambda_n({\mathbf{H}}_{\infty})/4\).
For neural networks we will see that a good candidate for \({\mathbf{H}}_{\infty}\) is the expected value of \({\mathbf{H}}(0)\), which coincides with the infinite width limit of the network. By making the width of the network sufficiently large we will prove that the above conditions are satisfied. Finally, note that for the bound on the loss to be useful we require \(\lambda_{n}({\mathbf{H}}_{\infty})>0\)!
Case-study: a differentiable, shallow neural network
To illustrate how the approach for proving convergence guarantees can be applied to neural networks we study potentially the simplest setting possible. Consider a two layer network \[ f(\theta, {\mathbf{x}}) = \frac{1}{\sqrt{m}}\sum_{j=1}^m a_j \phi({\mathbf{w}}_j^T{\mathbf{x}}). \] Here \({\mathbf{x}}\in \mathbb{R}^d\) denotes an input vector, \({\mathbf{w}}_j \in \mathbb{R}^d\) the weights of the \(j\)th neuron, \({\mathbf{W}}\in \mathbb{R}^{m \times d}\) the matrix of neuron weights stored row-wise, and \({\mathbf{a}}\in \mathbb{R}^m\) the vector of output weights. The reason for the explicit scaling of \(1/\sqrt{m}\) will become apparent soon. With regards to the activation function, for now we only assume \(\phi:\mathbb{R}\rightarrow \mathbb{R}\) is differentiable; as a result both \(f\) and in turn \(L\) are differentiable with respect to the network parameters. We further assume the network parameters are initialized mutually independently of one another, with inner weights \(w_{jk}(0) \sim N(0,1)\) and outer weights \(a_j(0) \sim U(\{-1,+1\})\) for all \(j \in [m]\), \(k\in [d]\). We assume for simplicity that the outer weights \({\mathbf{a}}\) are frozen after initialization while the inner weights \({\mathbf{W}}(t)\) evolve according to gradient flow. We therefore define the trainable network parameter vector \(\theta = [{\mathbf{w}}_1^T, {\mathbf{w}}_2^T... {\mathbf{w}}_m^T] \in \mathbb{R}^{p}\) where \(p = dm\). Finally, for typographical clarity we assume that the data is normalized to lie on the unit sphere, i.e., \(||{\mathbf{x}}_i ||_2 = 1\) for all \(i\in [n]\).
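The sketch below, assuming NumPy and \(\phi = \tanh\) (a differentiable activation with bounded, Lipschitz derivative), sets up the network and initialization just described and illustrates one consequence of the \(1/\sqrt{m}\) scaling: the second moment of the output at initialization remains \(O(1)\) as the width grows.

```python
# Two layer network f(theta, x) = (1/sqrt(m)) sum_j a_j phi(w_j^T x) with
# w_j ~ N(0, I_d), a_j ~ Uniform{-1, +1}, phi = tanh. The 1/sqrt(m) scaling keeps
# the output at initialization O(1) regardless of the width m. Assumes NumPy only.
import numpy as np

def init_network(m, d, rng):
    W = rng.normal(size=(m, d))                 # inner weights, trainable
    a = rng.choice([-1.0, 1.0], size=m)         # outer weights, frozen
    return W, a

def f(W, a, x, phi=np.tanh):
    m = W.shape[0]
    return a @ phi(W @ x) / np.sqrt(m)

rng = np.random.default_rng(3)
d = 10
x = rng.normal(size=d); x /= np.linalg.norm(x)  # unit norm input
for m in [10, 100, 1000, 10000]:
    outs = [f(*init_network(m, d, rng), x) for _ in range(200)]
    print(f"m = {m:6d}: E[f^2] at init ~ {np.mean(np.square(outs)):.3f}")
```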
First we characterize the entries of \({\mathbf{H}}(t)\). As \[ \frac{\partial f(\theta(t), {\mathbf{x}})}{\partial w_{rc}} = \frac{1}{\sqrt{m}} a_r \phi'({\mathbf{w}}_r^T{\mathbf{x}})x_c \] then \[\begin{align*} H_{il}(t) &= \nabla_{\theta}f(\theta(t), {\mathbf{x}}_i)^T\nabla_{\theta}f(\theta(t), {\mathbf{x}}_l)\\ & = \frac{1}{m} \left(\sum_{k \in [d]} x_{ki} x_{kl} \right) \left(\sum_{r \in [m]} a_r^2 \phi'({\mathbf{w}}_r(t)^T {\mathbf{x}}_i) \phi'({\mathbf{w}}_r(t)^T {\mathbf{x}}_l)\right) \\ &= \frac{{\mathbf{x}}_i^T {\mathbf{x}}_l}{m} \sum_{r \in [m]} \phi'({\mathbf{w}}_r(t)^T {\mathbf{x}}_i) \phi'({\mathbf{w}}_r(t)^T {\mathbf{x}}_l), \end{align*}\] where in the final step we used \(a_r^2 = 1\) and \(\sum_{k \in [d]} x_{ki} x_{kl} = {\mathbf{x}}_i^T {\mathbf{x}}_l\).
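As a check of this closed form (a sketch assuming NumPy and \(\phi = \tanh\), for which \(\phi'(z) = 1 - \tanh^2(z)\)), one can compare the expression above entrywise against \({\mathbf{H}}(t) = {\mathbf{J}}(t)^T{\mathbf{J}}(t)\) with the Jacobian assembled from the per-parameter derivatives.

```python
# Check that the closed-form NTK entries
#   H_il = (x_i^T x_l / m) * sum_r phi'(w_r^T x_i) phi'(w_r^T x_l)
# agree with H = J^T J assembled from the per-parameter derivatives
#   df/dw_rc = (1/sqrt(m)) a_r phi'(w_r^T x) x_c.   Assumes NumPy, phi = tanh.
import numpy as np

rng = np.random.default_rng(4)
n, d, m = 6, 4, 300
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
W = rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m)

dphi = lambda z: 1.0 - np.tanh(z) ** 2
S = dphi(X @ W.T)                                   # S[i, r] = phi'(w_r^T x_i)

# Closed form.
H_closed = (X @ X.T) * (S @ S.T) / m

# Via the Jacobian: column i of J is the gradient of f(theta, x_i) w.r.t. vec(W).
G = (S * a)[:, :, None] * X[:, None, :] / np.sqrt(m)   # (n, m, d)
J = G.reshape(n, -1).T                                  # (p, n) with p = m*d
H_jac = J.T @ J

print("max abs difference:", np.abs(H_closed - H_jac).max())   # ~ machine precision
```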
Assume now that \(|\phi'(z)|\leq C\) for all \(z \in \mathbb{R}\) and let \({\mathbf{w}}\sim N(\textbf{0}, \textbf{I}_d)\). As \(({\mathbf{w}}_r)_{r \in [m]}\) are mutually independent and identically distributed then \[ \mathbb{E}[H_{il}(0)] = {\mathbf{x}}_i^T {\mathbf{x}}_l \,\mathbb{E}[\phi'({\mathbf{w}}^T {\mathbf{x}}_i) \phi'({\mathbf{w}}^T {\mathbf{x}}_l)], \] which satisfies \(|\mathbb{E}[H_{il}(0)]| \leq C^2 < \infty\) as \(|{\mathbf{x}}_i^T {\mathbf{x}}_l| \leq 1\). Furthermore, by the law of large numbers, almost surely \[ \lim_{m \rightarrow \infty} H_{il}(0) = \mathbb{E}[ H_{il}(0)]. \] Now let \({\mathbf{H}}_{\infty}= \mathbb{E}[{\mathbf{H}}(0)]\). Recall \({\mathbf{H}}(t) = {\mathbf{J}}(t)^T {\mathbf{J}}(t)\) is a Gram matrix and therefore positive semi-definite, and thus \({\mathbf{H}}_{\infty}\) is also positive semi-definite. We now show via a concentration argument that if \(m\) is large then \(H_{il}(0) \approx \mathbb{E}[H_{il}(0)]\). This means given sufficient width we can bound the 2-norm distance between \({\mathbf{H}}(0)\) and \({\mathbf{H}}_{\infty}\) using the Frobenius norm of their difference.
Lemma 3 Assume \(|\phi'(z)|\leq C\) for all \(z \in \mathbb{R}\). For arbitrary \(\delta \in (0,1]\) and \(\varepsilon>0\), if \(m \geq \frac{2C^4 n^2}{\varepsilon^2} \ln \left( \frac{2n^2}{\delta} \right)\) then with probability at least \(1-\delta\) \[ || {\mathbf{H}}(0) - {\mathbf{H}}_{\infty}|| < \varepsilon. \]
Proof. Let \(i,l \in [n]\) be arbitrary and for typographical clarity let \(Z_r = {\mathbf{x}}_i^T {\mathbf{x}}_l\, \phi'({\mathbf{w}}_r^T {\mathbf{x}}_i) \phi'({\mathbf{w}}_r^T {\mathbf{x}}_l)\), so that \(H_{il}(0) = \frac{1}{m}\sum_{r=1}^m Z_r\) is the average of \(m\) mutually independent and identically distributed random variables. By the bounded derivative assumption on \(\phi\) and as \(|{\mathbf{x}}_i^T {\mathbf{x}}_l| \leq 1\), these random variables are almost surely supported on \([-C^2, C^2]\). Applying Hoeffding’s inequality \[\begin{align*}
\mathbb{P}\left( \left|\frac{1}{m} \sum_{r=1}^m (Z_r - \mathbb{E}[Z_r]) \right| \geq \epsilon \right) \leq 2\exp\left( -\frac{2m\epsilon^2}{4C^4} \right),
\end{align*}\] and therefore \[\begin{align*}
\mathbb{P}\left( | H_{il}(0) - \mathbb{E}[H_{il}(0)] | \geq \epsilon \right) \leq 2\exp\left( -\frac{m\epsilon^2}{2C^4} \right).
\end{align*}\] Applying the union bound \[\begin{align*}
\mathbb{P}\left( \bigcap_{i,l=1}^n \{| H_{il}(0) - \mathbb{E}[H_{il}(0)] | < \epsilon \} \right) & = 1 - \mathbb{P}\left( \bigcup_{i,l=1}^n \{| H_{il}(0) - \mathbb{E}[H_{il}(0)] | \geq \epsilon \} \right)\\
&\geq 1- \sum_{i,l=1}^n \mathbb{P}\left(| H_{il}(0) - \mathbb{E}[H_{il}(0)] | \geq \epsilon \right)\\
& \geq 1- 2n^2 \exp\left( -\frac{m\epsilon^2}{2C^4} \right)
\end{align*}\]
Let \(\delta \in (0,1]\) denote the failure probability, then \[\begin{align*}
\delta &\geq 2n^2\exp\left( -\frac{m\epsilon^2}{2C^4} \right) \Leftrightarrow
\ln \left( \frac{2n^2}{\delta} \right) \leq \frac{m\epsilon^2}{2C^4} \Leftrightarrow
m \geq \frac{2C^4}{\epsilon^2} \ln \left( \frac{2n^2}{\delta} \right).
\end{align*}\] Setting \(\epsilon = \varepsilon/n\), if \(m \geq \frac{2C^4n^2}{\varepsilon^2} \ln \left( \frac{2n^2}{\delta} \right)\) then with probability at least \(1-\delta\) \[\begin{align*}
|| {\mathbf{H}}(0) - {\mathbf{H}}_{\infty}||^2 \leq || {\mathbf{H}}(0) - {\mathbf{H}}_{\infty}||_F^2
=\sum_{i,l \in [n]} | H_{il}(0) - \mathbb{E}[H_{il}(0)] |^2
< n^2 \epsilon^2 = \varepsilon^2.
\end{align*}\]
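The sketch below, assuming NumPy and \(\phi = \tanh\), illustrates Lemma 3 empirically: \({\mathbf{H}}_{\infty}\) is approximated entrywise by a large Monte Carlo average over \({\mathbf{w}}\sim N(\mathbf{0}, \mathbf{I}_d)\) (so it is itself only an estimate), and the spectral distance \(||{\mathbf{H}}(0) - {\mathbf{H}}_{\infty}||\) shrinks as the width \(m\) grows.

```python
# Empirical illustration of Lemma 3: the NTK gram matrix at initialization
# concentrates around H_infty = E[H(0)] as the width m grows.
# H_infty is approximated by a Monte Carlo average; assumes NumPy, phi = tanh.
import numpy as np

rng = np.random.default_rng(5)
n, d = 8, 5
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
dphi = lambda z: 1.0 - np.tanh(z) ** 2

def H0(m):
    W = rng.normal(size=(m, d))
    S = dphi(X @ W.T)                        # S[i, r] = phi'(w_r^T x_i)
    return (X @ X.T) * (S @ S.T) / m

H_inf = H0(200000)                           # Monte Carlo proxy for E[H(0)]
print("lambda_n(H_inf) ~", np.linalg.eigvalsh(H_inf)[0])
for m in [50, 500, 5000, 50000]:
    dist = np.linalg.norm(H0(m) - H_inf, ord=2)
    print(f"m = {m:6d}: ||H(0) - H_inf||_2 ~ {dist:.4f}")
```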
To ensure closeness of the finite width NTK to its idealized infinite width limit at initialization we therefore require a significant level of overparameterization! It is worth remarking that the tools used here are quite crude and this dependency can indeed be relaxed, see for instance (Banerjee et al. 2023). Before we proceed to bound the dynamics we first bound \(L(0)\); in particular, it will become apparent later that we need a bound on \(L(0)\) which does not grow with \(m\).
Lemma 4 Suppose there exists a \(C \in \mathbb{R}_{>0}\) such that \(\mathbb{E}[\phi^2(Z)] \leq C\) where \(Z \sim N(0,1)\), and \(y_i^2 \leq C\) for all \(i \in [n]\). For \(\delta \in (0,1)\), if \(n \geq \delta^{-1}\) then with probability at least \(1-\delta\) we have \(L(0) \leq 2Cn^2\).
Proof. A naive approach to bounding \(L(0)\) would be to uniformly bound \(\phi({\mathbf{w}}_j^T{\mathbf{x}})\) for all \(j\in [m]\) and any unit norm input \({\mathbf{x}}\); this approach however results in a bound which scales with \(m\). Due to the random initialization \(L(0)\) is itself random, so our approach will instead be to show that the expectation of \(L(0)\) does not scale with \(m\) and then use a concentration bound. Note, having random output weights actually makes our life easier compared with say fixing the output weights according to some pre-determined pattern, e.g., half negative and half positive! First observe \[\begin{align*} L(0) = \frac{1}{2}\sum_{i=1}^n(f(\theta(0), {\mathbf{x}}_i)- y_i)^2 = \frac{1}{2}\sum_{i=1}^n\left(f^2(\theta(0), {\mathbf{x}}_i)- 2y_if(\theta(0), {\mathbf{x}}_i) \right) + \frac{1}{2}||{\mathbf{y}}||^2 . \end{align*}\] For typographical ease, letting \({\mathbf{w}}_j = {\mathbf{w}}_j(0)\) for all \(j \in [m]\), then analyzing the quadratic term we have \[\begin{align*} f^2(\theta(0), {\mathbf{x}}_i) = \frac{1}{m} \sum_{j,k=1}^m a_j a_k \phi({\mathbf{w}}_j^T {\mathbf{x}}_i)\phi({\mathbf{w}}_k^T {\mathbf{x}}_i), \end{align*}\] which contains \(m(m+1)/2\) distinct random variables which by inspection are not independent. This rules out using say Hoeffding’s bound, so instead we analyze the expectation and apply Markov’s inequality. First, as \({\mathbf{w}}_j \sim N(\textbf{0}, \textbf{I}_d)\) and \(||{\mathbf{x}}_i||=1\) then \({\mathbf{w}}_j^T{\mathbf{x}}_i = \sum_{l=1}^d {\textnormal{w}}_{jl} x_{li} \sim N(0, 1)\). Therefore, noting the assumption on \(\phi\) that \(\mathbb{E}[\phi^2(Z)] \leq C < \infty\) and that \(\mathbb{E}[a_j] = 0\) for all \(j \in [m]\), by independence \[ \mathbb{E}[f(\theta(0), {\mathbf{x}}_i)] = \frac{1}{\sqrt{m}} \sum_{j=1}^m \mathbb{E}[a_j]\mathbb{E}[\phi({\mathbf{w}}_j^T {\mathbf{x}}_i)] = 0. \] Furthermore \[\begin{align*} \mathbb{E}[f^2(\theta(0), {\mathbf{x}}_i)] &= \frac{1}{m}\sum_{j=1}^m \mathbb{E}[a_j^2]\mathbb{E}[\phi^2({\mathbf{w}}_j^T {\mathbf{x}}_i)] + \frac{1}{m}\sum_{j\neq k} \mathbb{E}[a_j]\mathbb{E}[a_k]\mathbb{E}[\phi({\mathbf{w}}_j^T {\mathbf{x}}_i)]\mathbb{E}[\phi({\mathbf{w}}_k^T {\mathbf{x}}_i)]\\ &=\frac{1}{m}\sum_{j=1}^m \mathbb{E}[a_j^2]\mathbb{E}[\phi^2({\mathbf{w}}_j^T {\mathbf{x}}_i)]\\ &\leq C. \end{align*}\] Therefore, as \(||{\mathbf{y}}||^2 \leq nC\) by the assumptions of the lemma, \[\begin{align*} \mathbb{E}[L(0)] &= \frac{1}{2}\sum_{i=1}^n\left(\mathbb{E}[f^2(\theta(0), {\mathbf{x}}_i)]- 2y_i\mathbb{E}[f(\theta(0), {\mathbf{x}}_i)] \right) + \frac{1}{2}||{\mathbf{y}}||^2 \leq nC. \end{align*}\] As \(L(0)\) is a non-negative random variable, Markov’s inequality gives \[ \mathbb{P}(L(0) \geq 2Cn^2) \leq \frac{\mathbb{E}[L(0)]}{2Cn^2} \leq \frac{1}{2n} \leq \frac{1}{n}. \] Therefore, for any failure probability \(\delta \in (0,1)\), if \(n\geq \delta^{-1}\) then with probability at least \(1-\delta\) we have \(L(0) < 2Cn^2\).
Before proceeding it is worth remarking that this bound is certainly not tight due to the use of Markov’s inequality. One alternative would be to use, for example, Chebyshev’s inequality, or alternatively to change the initialization. In particular, an antisymmetrically initialized network, in which a positive and a negative equally weighted copy of each neuron is present, is the zero function at initialization. In this setting \(L(0) = \frac{1}{2}||{\mathbf{y}}||^2\), which scales only with \(n\), not \(m\).
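Here is a small sketch of the antisymmetric initialization mentioned above, assuming NumPy and \(\phi = \tanh\): each neuron is duplicated and the output weight of the copy negated, so the network is identically zero at initialization and \(L(0) = \frac{1}{2}||{\mathbf{y}}||^2\) regardless of the width.

```python
# Antisymmetric initialization: duplicate each neuron and negate the output weight
# of the copy, so that the network is the zero function at initialization and
# L(0) = 0.5 * ||y||^2 independently of the width. Assumes NumPy, phi = tanh.
import numpy as np

rng = np.random.default_rng(6)
n, d, m_half = 5, 4, 100
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.normal(size=n)

W_half = rng.normal(size=(m_half, d))
a_half = rng.choice([-1.0, 1.0], size=m_half)
W = np.vstack([W_half, W_half])              # duplicated inner weights
a = np.concatenate([a_half, -a_half])        # negated outer weights for the copies
m = 2 * m_half

u0 = np.tanh(X @ W.T) @ a / np.sqrt(m)       # predictions at initialization
L0 = 0.5 * np.sum((u0 - y) ** 2)
print("max |f(theta(0), x_i)| :", np.abs(u0).max())     # ~0
print("L(0) :", L0, " vs 0.5*||y||^2 :", 0.5 * y @ y)
```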
So far we have achieved our first goal. A sufficient condition for the second, namely controlling the dynamics to ensure that \({\mathbf{H}}(t)\) remains close to \({\mathbf{H}}(0)\), can be derived by guaranteeing that the parameters of the network never move far from their initialization.
Lemma 5 Assume \(\phi\) is differentiable and that there exists a \(C\in \mathbb{R}_{> 0}\) such that \(|\phi'(z)|\leq C\) for all \(z \in \mathbb{R}\) and \(\phi'\) is \(C\)-Lipschitz. For some \(t \in \mathbb{R}_{\geq 0}\) suppose \(|| {\mathbf{w}}_r(t) - {\mathbf{w}}_r(0)|| \leq \frac{\lambda_n({\mathbf{H}}_{\infty})}{8C^2 n } =: R\) for all \(r \in [m]\). Then \[ || {\mathbf{H}}(t) - {\mathbf{H}}(0)|| \leq \frac{\lambda_n({\mathbf{H}}_{\infty})}{4}. \]
Proof. First, for arbitrary \(i,l \in [n]\) observe that \[\begin{align*} | H_{il}(t) - H_{il}(0)| &= \frac{|{\mathbf{x}}_i^T {\mathbf{x}}_l|}{m} \left| \sum_{r=1}^m \phi'({\mathbf{w}}_r(t)^T {\mathbf{x}}_i)\phi'({\mathbf{w}}_r(t)^T {\mathbf{x}}_l) - \phi'({\mathbf{w}}_r(0)^T {\mathbf{x}}_i)\phi'({\mathbf{w}}_r(0)^T {\mathbf{x}}_l) \right|\\ &\leq \frac{1}{m} \sum_{r=1}^m \left|\phi'({\mathbf{w}}_r(t)^T {\mathbf{x}}_i)\phi'({\mathbf{w}}_r(t)^T {\mathbf{x}}_l) - \phi'({\mathbf{w}}_r(0)^T {\mathbf{x}}_i)\phi'({\mathbf{w}}_r(0)^T {\mathbf{x}}_l) \right|, \end{align*}\] where we used \(|{\mathbf{x}}_i^T {\mathbf{x}}_l| \leq 1\) and the triangle inequality. For typographical ease let \(g({\mathbf{a}}, {\mathbf{b}}) = \left|\phi'({\mathbf{a}}^T {\mathbf{x}}_i)\phi'({\mathbf{a}}^T {\mathbf{x}}_l) - \phi'({\mathbf{b}}^T {\mathbf{x}}_i)\phi'({\mathbf{b}}^T {\mathbf{x}}_l) \right|\), then using the triangle inequality and the fact that \(|\phi'(z)| \leq C\) \[\begin{align*} g({\mathbf{a}}, {\mathbf{b}}) &\leq \left|\phi'({\mathbf{a}}^T {\mathbf{x}}_i)\phi'({\mathbf{a}}^T {\mathbf{x}}_l) - \phi'({\mathbf{a}}^T {\mathbf{x}}_l)\phi'({\mathbf{b}}^T {\mathbf{x}}_i) \right|\\ &+\left|\phi'({\mathbf{a}}^T {\mathbf{x}}_l)\phi'({\mathbf{b}}^T {\mathbf{x}}_i) - \phi'({\mathbf{b}}^T {\mathbf{x}}_i)\phi'({\mathbf{b}}^T {\mathbf{x}}_l) \right|\\ & \leq C \left(|\phi'({\mathbf{a}}^T {\mathbf{x}}_i) - \phi'({\mathbf{b}}^T {\mathbf{x}}_i) | + |\phi'({\mathbf{a}}^T {\mathbf{x}}_l) - \phi'({\mathbf{b}}^T {\mathbf{x}}_l) | \right). \end{align*}\] As \(\phi'\) is \(C\)-Lipschitz continuous then \(|\phi'({\mathbf{a}}^T {\mathbf{x}}) - \phi'({\mathbf{b}}^T {\mathbf{x}}) | \leq C |{\mathbf{a}}^T {\mathbf{x}}- {\mathbf{b}}^T {\mathbf{x}}|\), therefore \[\begin{align*} g({\mathbf{a}}, {\mathbf{b}}) &\leq C^2 \left( |{\mathbf{a}}^T {\mathbf{x}}_i - {\mathbf{b}}^T {\mathbf{x}}_i| + |{\mathbf{a}}^T {\mathbf{x}}_l - {\mathbf{b}}^T {\mathbf{x}}_l |\right)\\ & \leq C^2(|| {\mathbf{x}}_i || + || {\mathbf{x}}_l || ) || {\mathbf{a}}- {\mathbf{b}}|| \\ & = 2C^2 || {\mathbf{a}}- {\mathbf{b}}||. \end{align*}\] Therefore \[\begin{align*} | H_{il}(t) - H_{il}(0)| \leq \frac{1}{m} \sum_{r=1}^m g({\mathbf{w}}_r(t), {\mathbf{w}}_r(0)) \leq 2C^2 \frac{\lambda_n({\mathbf{H}}_{\infty})}{8C^2 n } = \frac{\lambda_n({\mathbf{H}}_{\infty})}{4n}. \end{align*}\] As a result \[\begin{align*} || {\mathbf{H}}(t) - {\mathbf{H}}(0) ||^2 \leq || {\mathbf{H}}(t) - {\mathbf{H}}(0) ||_F^2 = \sum_{i,l=1}^n | H_{il}(t) - H_{il}(0)|^2 \leq \frac{\lambda_n^2({\mathbf{H}}_{\infty})}{16}. \end{align*}\]
Lemma 5 reduces the problem to controlling the movement of the neuron weights. The following lemma bounds the 2-norm distance between the weights of each neuron at time \(t\) and at initialization.
Lemma 6 Assume \(\phi\) is continuously differentiable and also that there exists a \(C\in \mathbb{R}_{\geq 0}\) such that \(|\phi'(z)|\leq C\) for all \(z \in \mathbb{R}\). For a given \(T \in \mathbb{R}_{\geq 0}\) and any \(r \in [m]\), \(t \in [0,T]\) it holds that \[\begin{align*} || {\mathbf{w}}_r(t) - {\mathbf{w}}_r(0) || \leq C\sqrt{\frac{n}{m}}\int_{0}^t ||{\mathbf{r}}(\tau)|| d\tau. \end{align*}\]
Proof. As \(\phi\) is continuously differentiable then by construction so is \(f\). For arbitrary \(T \in \mathbb{R}_{>0}\), by inspection each entry of \(\nabla_{\theta}L(t)\), \[\begin{align*} \frac{\partial L(t)}{\partial \theta_k} = \sum_{i=1}^n \frac{\partial f(\theta(t), {\mathbf{x}}_i)}{ \partial \theta_k} (f(\theta(t), {\mathbf{x}}_i) - y_i), \end{align*}\] is continuous on \([0,T]\). In addition, due to the boundedness of the derivative of \(\phi\) and the continuity of \(f\), each entry of \(\nabla_{\theta}L(t)\) is also bounded on \([0,T]\). Therefore, as we are using gradient flow, each entry of \(\frac{d\theta(t)}{dt}\) is a real valued, continuous and Riemann integrable function on \([0,T]\). As \(\theta(t)\) is the antiderivative of \(\frac{d\theta(t)}{dt}\), then by the Fundamental Theorem of Calculus for any \(t \in [0,T]\) \[ \theta(t) = \theta(0) + \int_{0}^t \frac{d\theta(\tau)}{d\tau} d\tau \] and therefore \[ {\mathbf{w}}_r(t) = {\mathbf{w}}_r(0) + \int_{0}^t \frac{d{\mathbf{w}}_r(\tau)}{d\tau} d\tau. \] Rearranging and taking the norm gives \[ || {\mathbf{w}}_r(t) - {\mathbf{w}}_r(0) || = \left|\left| \int_{0}^t \frac{d{\mathbf{w}}_r(\tau)}{d\tau} d\tau\right|\right| \leq \int_{0}^t \left| \left| \frac{d{\mathbf{w}}_r(\tau)}{d\tau} \right| \right| d\tau. \] We proceed to upper bound the integrand. Under gradient flow \[ \frac{d{\mathbf{w}}_r(t)}{dt} = -\nabla_{{\mathbf{w}}_r} L(t) = -\sum_{i=1}^n \frac{1}{\sqrt{m}} a_r \phi'({\mathbf{w}}_r(t)^T{\mathbf{x}}_i)\, r_i(t)\, {\mathbf{x}}_i, \] and therefore, as \(|a_r| = 1\), \(|\phi'(z)| \leq C\) for all \(z \in \mathbb{R}\) and \(||{\mathbf{x}}_i|| = 1\), by the triangle inequality and Cauchy-Schwarz \[\begin{align*} \left| \left| \frac{d{\mathbf{w}}_r(t)}{d t} \right| \right| &\leq \frac{1}{\sqrt{m}}\sum_{i=1}^n |a_r|\, |\phi'({\mathbf{w}}_r(t)^T{\mathbf{x}}_i)|\, |r_i(t)| \,||{\mathbf{x}}_i||\\ & \leq \frac{C}{\sqrt{m}} \sum_{i=1}^n |r_i(t)| \\ & \leq C\sqrt{\frac{n}{m}}\, ||{\mathbf{r}}(t)||. \end{align*}\] Therefore, for any \(r \in [m]\) \[ || {\mathbf{w}}_r(t) - {\mathbf{w}}_r(0) || \leq C\sqrt{\frac{n}{m}}\int_{0}^t ||{\mathbf{r}}(\tau)|| d\tau \] as claimed.
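Lemma 6 suggests that, for a fixed dataset, the distance travelled by any individual neuron should shrink as the width grows. The sketch below illustrates this, assuming NumPy, \(\phi = \tanh\), and plain gradient descent with a small step size as a discrete stand-in for gradient flow; the dimensions, step size and number of steps are arbitrary.

```python
# Illustration of Lemma 6: as the width m grows, the maximum distance travelled by
# any single neuron, max_r ||w_r(t) - w_r(0)||, shrinks. Gradient descent with a
# small step size is used as a discrete stand-in for gradient flow. Assumes NumPy.
import numpy as np

rng = np.random.default_rng(7)
n, d, steps, eta = 10, 5, 2000, 0.05
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.normal(size=n)

def train(m):
    W0 = rng.normal(size=(m, d))
    a = rng.choice([-1.0, 1.0], size=m)
    W = W0.copy()
    for _ in range(steps):
        Z = X @ W.T                                    # pre-activations, (n, m)
        r = np.tanh(Z) @ a / np.sqrt(m) - y            # residual
        grad = (((1 - np.tanh(Z) ** 2) * a) * r[:, None]).T @ X / np.sqrt(m)
        W -= eta * grad
    final_loss = 0.5 * np.sum((np.tanh(X @ W.T) @ a / np.sqrt(m) - y) ** 2)
    return final_loss, np.linalg.norm(W - W0, axis=1).max()

for m in [20, 200, 2000]:
    final_loss, move = train(m)
    print(f"m = {m:5d}: final loss ~ {final_loss:.2e}, "
          f"max_r ||w_r(T) - w_r(0)|| ~ {move:.3f}")
```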
Lemma 6 raises a tricky issue; in particular, it seems we have arrived at the following circular argument.
- To bound \(L(t)\) it suffices to bound \(|| {\mathbf{H}}(t) - {\mathbf{H}}(0) ||\).
- To bound \(|| {\mathbf{H}}(t) - {\mathbf{H}}(0) ||\) it suffices to bound \(|| {\mathbf{w}}_r(t) - {\mathbf{w}}_r(0) ||\) for all \(r \in [m]\).
- To bound \(|| {\mathbf{w}}_r(t) - {\mathbf{w}}_r(0) ||\) for all \(r \in [m]\) it suffices to bound \(||{\mathbf{r}}(t)||\), or equivalently \(L(t)\).
However, we can circumvent this issue using a real induction argument! We now present the main result. Note we present the result in terms of \(\lambda_n({\mathbf{H}}_{\infty})\) and assume \(\lambda_n({\mathbf{H}}_{\infty})>0\). This is a reasonable assumption, as for generic data (where the data matrix is drawn from a set of full Lebesgue measure) the smallest eigenvalue is typically strictly positive. Quantitative lower bounds can be derived by placing, for instance, a separation condition on the data; as an example see Corollary I.2 in (Oymak and Soltanolkotabi 2019).
Theorem 1 Assume \(\lambda_n({\mathbf{H}}_{\infty})>0\), \(\phi\) is continuously differentiable and that there exists a \(C\in \mathbb{R}_{> 0}\) such that \(|\phi'(z)|\leq C\) for all \(z \in \mathbb{R}\), \(\phi'\) is \(C\)-Lipschitz, \(\mathbb{E}[\phi^2(Z)]\leq C\) with \(Z \sim N(0,1)\), and \(y_i^2 \leq C\) for all \(i \in [n]\). For \(\delta \in (0,1)\), let \(n \geq 2\delta^{-1}\) and assume \(m \geq \max \left\{\frac{32C^4n^2}{\lambda_n({\mathbf{H}}_{\infty})^2} \ln \left( \frac{4n^2}{\delta} \right), \frac{1024 C^7 n^5 }{\lambda_n({\mathbf{H}}_{\infty})^4}\right\}\). Then with probability at least \(1-\delta\) we have for all \(t \geq 0\) \[\begin{align*} L(t) \leq \exp(- \lambda_n({\mathbf{H}}_{\infty}) t) L(0). \end{align*}\]
Proof. To prove the result claimed, recall from Lemma 1 that it suffices to show that \(\lambda_n({\mathbf{H}}(t)) \geq \lambda_n({\mathbf{H}}_{\infty})/2\) for all \(t \geq 0\). From Lemma 2, for this condition to be true it suffices that \(||{\mathbf{H}}(t) - {\mathbf{H}}(0)||, ||{\mathbf{H}}(0) - {\mathbf{H}}_{\infty}|| \leq \lambda_n({\mathbf{H}}_{\infty})/4\) for all \(t \geq 0\). By Lemma 3, as \[m \geq \frac{32C^4n^2}{\lambda_n({\mathbf{H}}_{\infty})^2} \ln \left( \frac{4n^2}{\delta} \right), \] with probability at most \(\delta/2\) it holds that \[ || {\mathbf{H}}(0) - {\mathbf{H}}_{\infty}|| > \lambda_n({\mathbf{H}}_{\infty})/4. \] In addition, by Lemma 4, as \(n\geq 2\delta^{-1}\) then with probability at most \(\delta/2\) we have \(L(0) \geq 2Cn^2\). Therefore, under the assumptions of the theorem, using a union bound it follows that with probability at least \(1-\delta\) both \(|| {\mathbf{H}}(0) - {\mathbf{H}}_{\infty}|| \leq \lambda_n({\mathbf{H}}_{\infty})/4\) and \(L(0) < 2Cn^2\) hold; for the remainder of the proof we condition on this event.
We now proceed by real induction. As per the definition of an inductive set provided in Clark (2012), define for an arbitrary \(T >0\) \[ S = \{t \in [0,T]: \lambda_n({\mathbf{H}}(t)) > \lambda_n({\mathbf{H}}_{\infty})/2 \}. \] Our goal is to show \(S = [0,T]\). To this end it suffices to prove that the following three statements are true: i) \(0 \in S\); ii) for any \(t \in [0, T)\) such that \(t \in S\), there exists a \(t'>t\) such that \([t,t']\subset S\); iii) if \(t \in(0,T]\) and \([0,t) \subset S\) then \(t \in S\). For Statement i), recall for \(t=0\) that \[ |\lambda_n({\mathbf{H}}(0)) - \lambda_n({\mathbf{H}}_{\infty})| \leq || {\mathbf{H}}(0) - {\mathbf{H}}_{\infty}|| \leq \lambda_n({\mathbf{H}}_{\infty})/4, \] and therefore \(\lambda_n({\mathbf{H}}(0)) \geq 3\lambda_n({\mathbf{H}}_{\infty})/4 > \lambda_n({\mathbf{H}}_{\infty})/2\), implying \(0 \in S\).
For Statement ii), observe if \(t \in S\) then there exists an \(\epsilon >0\) such that \(\lambda_n({\mathbf{H}}(t))- \epsilon > \lambda_n({\mathbf{H}}_{\infty})/2\). To proceed we claim that \(\lambda_n({\mathbf{H}}(t))\) is a continuous function of \(t\). Indeed, as \(\phi'\) is continuous then each entry \(H_{il}(t)\) is continuous. Therefore \({\mathbf{H}}(t)\) is continuous with respect to the Frobenius norm, and indeed any other norm, as all norms on a finite dimensional space are equivalent. Therefore, by Theorem VI.1.4 of Bhatia (1997), the eigenvalues, and in particular \(\lambda_n({\mathbf{H}}(t))\), are continuous functions of \(t\). By continuity it follows that there exists a \(\delta'>0\) such that \(\lambda_n({\mathbf{H}}(\tau)) \in [\lambda_n({\mathbf{H}}(t)) - \epsilon, \lambda_n({\mathbf{H}}(t)) + \epsilon]\) for all \(\tau \in [t, t+\delta']\). This in turn implies \(\lambda_n({\mathbf{H}}(\tau)) > \lambda_n({\mathbf{H}}_{\infty})/2\) for all \(\tau \in [t, t+\delta']\), and hence \([t, t+\delta'] \subset S\).
For Statement iii), if \([0,t) \subset S\) then for any \(\delta' \in (0,t)\) it follows that \([0, t - \delta'] \subset S\), and so by Lemma 1 (with \(\kappa = \lambda_n({\mathbf{H}}_{\infty})\)) \(L(\tau) \leq \exp(-\lambda_n({\mathbf{H}}_{\infty})\tau)L(0)\), equivalently \(||{\mathbf{r}}(\tau)|| \leq \exp(-\lambda_n({\mathbf{H}}_{\infty})\tau/2)||{\mathbf{r}}(0)||\), for all \(\tau \in [0, t-\delta']\). Furthermore, \(||{\mathbf{r}}(\tau)|| \leq ||{\mathbf{r}}(0)||\) for all \(\tau \in [0,t]\), since \(\frac{d}{d\tau}||{\mathbf{r}}(\tau)||^2 = -2||{\mathbf{J}}(\tau){\mathbf{r}}(\tau)||^2 \leq 0\). Using Lemma 6 \[\begin{align*} ||{\mathbf{w}}_r(t) - {\mathbf{w}}_r(0) || &\leq C\sqrt{\frac{n}{m}}\int_{0}^t ||{\mathbf{r}}(\tau)|| d\tau\\ &= C\sqrt{\frac{n}{m}}\left(\int_{0}^{t-\delta'} ||{\mathbf{r}}(\tau)|| d\tau + \int_{t-\delta'}^{t} ||{\mathbf{r}}(\tau)|| d\tau\right)\\ & \leq C\sqrt{\frac{n}{m}}\, ||{\mathbf{r}}(0)|| \left( \int_{0}^{t-\delta'} \exp \left(- \frac{\lambda_n({\mathbf{H}}_{\infty})\tau}{2} \right) d\tau + \delta' \right)\\ & \leq C\sqrt{\frac{n}{m}}\, ||{\mathbf{r}}(0)|| \left( \frac{2}{\lambda_n({\mathbf{H}}_{\infty})} + \delta' \right). \end{align*}\] As this bound holds for every \(\delta' \in (0,t)\), letting \(\delta' \rightarrow 0\), and using \(||{\mathbf{r}}(0)|| = \sqrt{2L(0)} < 2\sqrt{C}n\) together with \(m\geq \frac{1024 C^7 n^5 }{\lambda_n({\mathbf{H}}_{\infty})^4}\), it follows that \[ ||{\mathbf{w}}_r(t) - {\mathbf{w}}_r(0) || \leq \frac{2C\sqrt{n/m}\,||{\mathbf{r}}(0)||}{\lambda_n({\mathbf{H}}_{\infty})} < \frac{4C^{3/2}n^{3/2}}{\lambda_n({\mathbf{H}}_{\infty})\sqrt{m}} \leq \frac{\lambda_n({\mathbf{H}}_{\infty})}{8C^2n} = R \] for every \(r \in [m]\). As this inequality is strict, retracing the proof of Lemma 5 gives \(||{\mathbf{H}}(t) - {\mathbf{H}}(0) || < \lambda_n({\mathbf{H}}_{\infty})/4\); combining this with \(||{\mathbf{H}}(0) - {\mathbf{H}}_{\infty} || \leq \lambda_n({\mathbf{H}}_{\infty})/4\) as in the proof of Lemma 2 yields \(\lambda_n({\mathbf{H}}(t))> \lambda_n({\mathbf{H}}_{\infty})/2\), and thus \(t \in S\). This concludes the proof by real induction. As \(T\) can be arbitrarily large, we conclude that \(\lambda_n({\mathbf{H}}(t))\geq \lambda_n({\mathbf{H}}_{\infty})/2\) for all \(t \geq 0\), thereby establishing a uniform lower bound on the smallest eigenvalue of \({\mathbf{H}}(t)\), and the result follows from Lemma 1.
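To see Theorem 1 in action, the following sketch, assuming NumPy and \(\phi = \tanh\), trains a wide network with gradient descent (step size \(\eta\), so that step \(k\) corresponds roughly to continuous time \(t = k\eta\)) and compares the observed loss with \(\exp(-\lambda_n({\mathbf{H}}_{\infty}) t) L(0)\), where \(\lambda_n({\mathbf{H}}_{\infty})\) is replaced by a Monte Carlo estimate; the specific sizes and step counts are illustrative.

```python
# Training a wide two layer network with gradient descent (step size eta, so step k
# corresponds roughly to continuous time t = k * eta) and comparing the observed
# loss with the rate exp(-lambda_n(H_infty) * t) * L(0) suggested by Theorem 1.
# H_infty is replaced by a Monte Carlo estimate. Assumes NumPy, phi = tanh.
import numpy as np

rng = np.random.default_rng(8)
n, d, m, eta, steps = 8, 10, 5000, 0.1, 500
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.normal(size=n)
dphi = lambda z: 1.0 - np.tanh(z) ** 2

# Monte Carlo estimate of lambda_n(H_infty).
W_mc = rng.normal(size=(200000, d))
S_mc = dphi(X @ W_mc.T)
lam = np.linalg.eigvalsh((X @ X.T) * (S_mc @ S_mc.T) / W_mc.shape[0])[0]

a = rng.choice([-1.0, 1.0], size=m)
W = rng.normal(size=(m, d))
loss = lambda W: 0.5 * np.sum((np.tanh(X @ W.T) @ a / np.sqrt(m) - y) ** 2)
L0 = loss(W)
for k in range(1, steps + 1):
    Z = X @ W.T
    r = np.tanh(Z) @ a / np.sqrt(m) - y
    W -= eta * ((dphi(Z) * a) * r[:, None]).T @ X / np.sqrt(m)
    if k % 100 == 0:
        t = k * eta
        print(f"t = {t:6.1f}: L(t) = {loss(W):.3e}, "
              f"exp(-lam*t)*L(0) = {np.exp(-lam * t) * L0:.3e}")
```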
Conclusion
Theorem 1 is interesting: despite the least squares objective being non-convex in the model parameters, it says that for sufficiently overparameterized networks, and assuming \(\lambda_n({\mathbf{H}}_{\infty}) > 0\), given a sufficiently long training time one can achieve arbitrarily small training loss using gradient flow! There are some notable limitations however; in particular, the level of overparameterization is exceedingly severe and not representative of networks used in practice. Moreover, in order to derive our results we had to limit the movement of each neuron throughout training! This condition, requiring the NTK to remain close to its initialization, or equivalently each neuron to remain close to its initialization, is limiting as it rules out the rich feature learning regime in which the kernel (or equivalently the weights) adapts to the data. In this post we looked at deriving, from first principles, guarantees in the very simple setting of a differentiable network with a single hidden layer. These ideas can be generalized to encompass more general networks and problems in terms of depth, architecture and loss function. A nice summary and introduction to these topics can be found here: https://www.benjamin-bowman.com/assets/Intro_to_NTK.pdf.