In this post we will look at how the proof of convergence for the Perceptron algorithm can be used to derive strong training guarantees for a shallow neural network assuming linearly separable data.
Introduction
For neural networks, deriving training guarantees for reaching global optima is challenging. In another blog post we showed how such guarantees can be achieved by ensuring the network in question is sufficiently wide (this is the NTK approach). However, the typical width requirements are often unrealistic and, moreover, are designed to keep the parameters within some ball of their initialization throughout training, and therefore cannot really explain the rich feature learning observed in practice. In this post we remove any width requirements at the price of having to impose significant structure on the target function and data. In particular, we consider the problem of learning a binary classifier on linearly separable data. To this end let $(x_i, y_i)_{i=1}^n$ denote the training data, where $x_i \in \mathbb{R}^d$ are the input features and $y_i \in \{-1, +1\}$ the output labels. If the problem is linearly separable then there exists a $w^* \in \mathbb{R}^d$ such that $y_i \langle w^*, x_i \rangle > 0$, or equivalently $\operatorname{sign}(\langle w^*, x_i \rangle) = y_i$, for all $i \in [n]$. We assume the input data is bounded, i.e., $\|x_i\|_2 \leq R$ for all $i \in [n]$. Letting $f(\cdot\,; \theta)$ denote the map of a shallow neural network, our goal is to understand if gradient descent can find parameters $\theta$ such that $\operatorname{sign}(f(x_i; \theta)) = y_i$ for all $i \in [n]$. To achieve this we deploy a neat trick based on the Perceptron algorithm which can be seen in (Brutzkus et al. 2018). Our presentation will most closely follow that given in (Karhadkar et al. 2024).
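As a concrete running example for the rest of the post (my own illustrative setup, not taken from the original), the following numpy sketch samples a linearly separable dataset: points in the unit ball labelled by a random unit vector `w_star`, with any point closer than `margin` to the decision boundary rejected.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_separable_data(n=200, d=10, margin=0.1):
    """Sample n points with ||x||_2 <= 1 that are linearly separable with
    margin at least `margin` with respect to a random unit vector w_star."""
    w_star = rng.normal(size=d)
    w_star /= np.linalg.norm(w_star)
    X, y = [], []
    while len(X) < n:
        x = rng.normal(size=d)
        x /= max(1.0, np.linalg.norm(x))   # enforce ||x||_2 <= 1 (so R = 1 here)
        score = x @ w_star
        if abs(score) >= margin:           # reject points too close to the boundary
            X.append(x)
            y.append(np.sign(score))       # labels y_i = sign(<w_star, x_i>)
    return np.array(X), np.array(y), w_star

X, y, w_star = make_separable_data()
```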
Convergence of the perceptron algorithm for linearly separable data
Before we consider a shallow neural network, let’s first recall the Perceptron algorithm.
- Inputs: training data $(x_i, y_i)_{i=1}^n$
- Algorithm:
  - Initialize $w_0 = 0$, $t = 0$
  - While there exists an $i \in [n]$ s.t. $y_i \langle w_t, x_i \rangle \leq 0$ do $w_{t+1} = w_t + y_{i(t)} x_{i(t)}$, $t \leftarrow t + 1$
  - return $w_t$
In the above the map $i : \mathbb{N} \rightarrow [n]$ selects a point to update at each iteration. Furthermore, observe that at each iteration the Perceptron algorithm improves its performance on at least one feature-label pair in the training set as $y_{i(t)} \langle w_{t+1}, x_{i(t)} \rangle = y_{i(t)} \langle w_t, x_{i(t)} \rangle + \|x_{i(t)}\|_2^2 \geq y_{i(t)} \langle w_t, x_{i(t)} \rangle$.
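The pseudo-code above translates directly into a few lines of numpy. This is a minimal sketch of the standard Perceptron (the selection rule $i(t)$ here simply picks the first misclassified point), reusing the `make_separable_data` helper introduced earlier.

```python
def perceptron(X, y, max_iters=10_000):
    """Run the Perceptron algorithm until every point satisfies y_i <w, x_i> > 0.
    Returns the final weight vector and the number of (non-zero) updates made."""
    n, d = X.shape
    w = np.zeros(d)                        # initialize w_0 = 0
    for t in range(max_iters):
        mistakes = np.where(y * (X @ w) <= 0)[0]
        if len(mistakes) == 0:             # no misclassified points left: terminate
            return w, t
        i = mistakes[0]                    # i(t): pick a misclassified point
        w = w + y[i] * X[i]                # Perceptron update
    raise RuntimeError("did not terminate within max_iters")

w_hat, num_updates = perceptron(X, y)
print(num_updates, bool(np.all(y * (X @ w_hat) > 0)))
```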
For linearly separable data it is possible to show the Perceptron algorithm converges to a global minimizer. Recall that linearly separable means there exists a $w \in \mathbb{R}^d$, $\|w\|_2 = 1$, such that $y_i \langle w, x_i \rangle > 0$ for all $i \in [n]$ (the requirement that $w$ is unit norm is made for convenience as we can always re-scale by dividing by $\|w\|_2$). As the decision boundary of the linear classifier $x \mapsto \operatorname{sign}(\langle w, x \rangle)$ is the vector subspace orthogonal to $w$, the shortest distance from $x_i$ to this decision boundary is simply $|\langle w, x_i \rangle|$. Let $\gamma(w) := \min_{i \in [n]} y_i \langle w, x_i \rangle$ denote the margin of the linear classifier $w$. Without loss of generality we may as well consider the linear classifier with the maximum margin: to this end, moving forward we let $\gamma := \max_{\|w\|_2 = 1} \gamma(w)$ denote the max margin and $w^*$ the max margin classifier. For linearly separable data it is possible to show that the Perceptron algorithm terminates after a finite number of steps and successfully classifies all points in the training sample.
Lemma 1 If $(x_i, y_i)_{i=1}^n$ is linearly separable with max margin $\gamma > 0$ and $\|x_i\|_2 \leq R$ for all $i \in [n]$, then the Perceptron algorithm terminates after at most $R^2 / \gamma^2$ iterations and at termination $y_i \langle w_T, x_i \rangle > 0$ for all $i \in [n]$.
Proof. The key idea of the proof is to show that $\langle w_t, w^* \rangle$ grows faster than $\|w_t\|_2$. By the Cauchy-Schwarz (CS) inequality, for any iteration $t$ it must hold that
$$ \langle w_t, w^* \rangle \leq \|w_t\|_2 \|w^*\|_2 = \|w_t\|_2. $$
Therefore, there must exist an iteration after which no further updates take place: if this were not true, then for some sufficiently large iteration the CS inequality would be violated! To lower bound $\langle w_t, w^* \rangle$ observe
$$ \langle w_{t+1}, w^* \rangle = \langle w_t, w^* \rangle + y_{i(t)} \langle x_{i(t)}, w^* \rangle \geq \langle w_t, w^* \rangle + \gamma, $$
so, as $w_0 = 0$, we have $\langle w_t, w^* \rangle \geq t \gamma$. To upper bound $\|w_t\|_2$ note
$$ \|w_{t+1}\|_2^2 = \|w_t\|_2^2 + 2 y_{i(t)} \langle w_t, x_{i(t)} \rangle + \|x_{i(t)}\|_2^2 \leq \|w_t\|_2^2 + R^2, $$
where the inequality follows as an update is only performed when $y_{i(t)} \langle w_t, x_{i(t)} \rangle \leq 0$; therefore $\|w_t\|_2^2 \leq t R^2$. Let $T \in \mathbb{N} \cup \{\infty\}$ denote the iteration at which the Perceptron algorithm terminates, where $T = \infty$ indicates the Perceptron algorithm never terminates. Then by the CS inequality $T \gamma \leq \langle w_T, w^* \rangle \leq \|w_T\|_2 \leq \sqrt{T} R$, which forces $T$ to be finite! Rearranging then $T \leq R^2 / \gamma^2$ as claimed.
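As a quick numerical sanity check on Lemma 1 (again a sketch reusing the helpers above), we can compare the number of updates made by the Perceptron against the bound $R^2/\gamma^2$. Since the true max margin classifier is not known exactly, the margin of the generating direction `w_star` is used below; this only makes the bound larger, so the comparison remains valid.

```python
R = np.max(np.linalg.norm(X, axis=1))     # data radius, here R <= 1 by construction
gamma = np.min(y * (X @ w_star))          # margin of w_star, a lower bound on the max margin
bound = (R / gamma) ** 2

print(f"updates: {num_updates}, bound R^2/gamma^2: {bound:.1f}")
assert num_updates <= bound               # guaranteed by Lemma 1
```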
From the perceptron to shallow neural networks
Consider now a shallow neural network
$$ f(x; W, a) = \sum_{j=1}^m a_j \sigma(\langle w_j, x \rangle), $$
where $W \in \mathbb{R}^{m \times d}$ is the weight matrix of the first layer, $w_j$ is the $j$th row of $W$, $a \in \mathbb{R}^m$ is the vector of output weights and $a_j$ is the $j$th entry of $a$. We focus on the Leaky-ReLU non-linearity $\sigma(z) = \max\{\alpha z, z\}$ for some $\alpha \in (0, 1)$. Consider learning the parameters $W$ and $a$ using gradient descent (GD) with the hinge loss (recall the hinge loss is defined as $\ell(z) = \max\{0, 1 - z\}$). Using the minimum norm subgradient at zero (as is standard in most automatic differentiation software) we let $\sigma'(z) = 1$ for $z > 0$ and $\sigma'(z) = \alpha$ otherwise; equivalently we can write this as $\sigma'(z) = \alpha + (1 - \alpha)\mathbf{1}(z > 0)$. Similarly $\ell'(z) = -1$ for $z < 1$ and $\ell'(z) = 0$ otherwise. Consider training the network with gradient descent in order to minimize the loss $L(W, a) = \frac{1}{n} \sum_{i=1}^n \ell(y_i f(x_i; W, a))$: a straightforward calculation gives the gradient updates for each neuron as
$$ w_j(t+1) = w_j(t) + \frac{\eta}{n} \sum_{i :\, y_i f(x_i; W(t), a(t)) < 1} a_j(t)\, \sigma'(\langle w_j(t), x_i \rangle)\, y_i x_i, $$
$$ a_j(t+1) = a_j(t) + \frac{\eta_a}{n} \sum_{i :\, y_i f(x_i; W(t), a(t)) < 1} \sigma(\langle w_j(t), x_i \rangle)\, y_i, $$
where $\eta, \eta_a$ are the learning rates for the inner and outer weights respectively. Inspecting these equations we see that shallow networks trained with hinge loss and GD have a similar update rule to the Perceptron algorithm! Inspired by this observation we pursue the idea of comparing the growth rate of $\langle w_j(t), w^* \rangle$ versus $\|w_j(t)\|_2$ for each neuron $j$. In particular, analogous to studying the growth of $\langle w_t, w^* \rangle$ for the Perceptron algorithm, we instead study an alignment quantity $A(t)$ between the output-weight-scaled rows of $W(t)$ and $w^*$. Here we use A for alignment and $w^*$ again denotes the unit norm weights of the max margin linear classifier. Similarly, instead of considering $\|w_t\|_2$ we study $F(t)$ (we use F for Frobenius), the Frobenius norm of $W(t)$ with each row scaled by the corresponding output weight, i.e., the norm of $D W(t)$ where $D$ is a diagonal square matrix whose diagonal entries are the output weights. Again by the CS inequality, and assuming the data is linearly separable with max margin unit vector $w^*$, it must hold that $A(t)$ is bounded above by $F(t)$ up to a constant factor.
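To make the update rule concrete, here is a hedged numpy sketch of the forward pass and one full-batch (sub)gradient step for the Leaky ReLU network with hinge loss. The $1/n$ normalization of the loss and the default values of `alpha`, `eta` and `eta_a` are my own illustrative choices rather than the exact conventions of the post.

```python
def forward(X, W, a, alpha=0.5):
    """f(x) = sum_j a_j * leaky_relu(<w_j, x>), evaluated on every row of X."""
    pre = X @ W.T                                # (n, m) pre-activations
    act = np.where(pre > 0, pre, alpha * pre)    # Leaky ReLU
    return act @ a                               # (n,) network outputs

def gd_step(X, y, W, a, alpha=0.5, eta=1.0, eta_a=0.0):
    """One full-batch subgradient step on L = (1/n) sum_i max(0, 1 - y_i f(x_i)).
    Setting eta_a = 0 freezes the outer weights."""
    n = len(y)
    pre = X @ W.T
    act = np.where(pre > 0, pre, alpha * pre)
    f = act @ a
    active = (y * f < 1).astype(float)           # points with non-zero hinge loss
    sigma_prime = np.where(pre > 0, 1.0, alpha)  # minimum-norm subgradient convention at 0
    # inner weights: w_j += (eta/n) * sum_{active i} a_j * sigma'(<w_j, x_i>) * y_i * x_i
    coeff = (active * y)[:, None] * sigma_prime * a[None, :]   # (n, m)
    W_new = W + (eta / n) * coeff.T @ X
    # outer weights: a_j += (eta_a/n) * sum_{active i} sigma(<w_j, x_i>) * y_i
    a_new = a + (eta_a / n) * act.T @ (active * y)
    return W_new, a_new
```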
Returning to the CS inequality: if we can show that $F(t)$ grows more slowly than $A(t)$ under the outlined gradient update rule, then there must be a finite iteration after which the update is zero and the parameters no longer change. We will do this by lower bounding $A(t)$ and upper bounding $F(t)$ in terms of the total number of times points in the training set have contributed to an update of the weights up to time $t$. This is useful as we can bound the number of update iterations by considering the worst case in which only one point contributes to the update at each time step, in which case this count is at least $t$.
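The precise definitions of $A(t)$ and $F(t)$ can be found in Karhadkar et al. (2024); purely as an illustration (my own guess at one natural instantiation, not necessarily the exact quantities used there), the sketch below tracks an alignment of the form $\sum_j a_j \langle w_j, w^* \rangle$ together with the Frobenius norm of the output-weight-scaled first layer. By Cauchy-Schwarz the former is at most $\sqrt{m}$ times the latter, so their ratio gives a rough sense of how much the first layer has rotated towards $w^*$ during training.

```python
def alignment(W, a, w_star):
    """One plausible alignment measure: sum_j a_j * <w_j, w_star>."""
    return float(a @ (W @ w_star))

def scaled_frobenius(W, a):
    """Frobenius norm of diag(a) @ W, i.e. each row of W scaled by its output weight."""
    return float(np.linalg.norm(a[:, None] * W))
```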
Training with frozen outer weights
First we consider a simplified setting where we freeze the outer weights by setting the outer learning rate $\eta_a = 0$: in this setting we are able to prove results analogous to the Perceptron setting.
Lemma 2 Suppose the training data is linearly separable with max margin $\gamma > 0$, the outer weights are frozen, the Leaky-ReLU parameter satisfies $\alpha \in (0, 1)$, and the initialization scale and step-size satisfy the conditions discussed below. Then gradient descent will terminate after a finite number of iterations, bounded in terms of $\gamma$, $R$, $\alpha$, the step-size and the initialization scale, at which point the network will have achieved zero training (hinge) loss.
Proof. First we lower bound . Let denote the set of training points that do not have zero hinge loss at the previous iterate (alternatively we can view this set as the points that participate in the update of the weights at the current iterate ). For convenience using then recalling the gradient update rule we have As and then Under our assumptions on the initialization scale Therefore, defining we have Before proceeding it is worth emphasizing that we need to make this bound non-trivial and increasing in . To extend this technique to ReLU networks one might instead explore bounding for each neuron the set , where .
For the upper bound on we again plug in the GD update rule, The first term in the above is simply . The second term can be simplified by observing that for a piecewise linear function . Therefore where the final inequality follows from the fact that implies . Finally, the third term in the previous bound on can be simplified using the inequalities , , , , which imply Note the bound placed on the step-size means we avoid having to work with . Proceeding, it follows that By the CS inequality and dividing both sides by then Plugging in the bounds for and , using and rearranging produces the following sequence of inequalities, To conclude, observe if is the final iterate then
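Putting the pieces together, the sketch below (reusing the hypothetical helpers from earlier) trains only the first layer by setting `eta_a = 0` and stops once the hinge loss reaches zero; on linearly separable data the result above predicts this happens after finitely many iterations. The width, initialization scale and step-size are illustrative choices.

```python
def hinge_loss(X, y, W, a, alpha=0.5):
    return float(np.mean(np.maximum(0.0, 1.0 - y * forward(X, W, a, alpha))))

m, d, alpha = 50, X.shape[1], 0.5
W = 0.01 * rng.normal(size=(m, d))                 # small first-layer initialization
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)   # outer weights, frozen throughout

t = 0
while hinge_loss(X, y, W, a, alpha) > 0 and t < 100_000:
    W, a = gd_step(X, y, W, a, alpha=alpha, eta=5.0, eta_a=0.0)  # eta_a = 0: outer layer frozen
    t += 1

print("iterations:", t, "| all points classified correctly:",
      bool(np.all(y * forward(X, W, a, alpha) > 0)))
```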
A few reflections on this result are as follows.
Comparing the bound for the shallow network with the bound for the Perceptron: as a sense check, note both bounds have terms proportional to $R$ and $1/\gamma$ (in the case of the shallow network these enter through the bounds on $A(t)$ and $F(t)$). In addition, for the shallow network we see that with a smaller $\alpha$, i.e., a Leaky ReLU activation closer to ReLU, we get a weaker bound, while at the other end, if we chose a linear activation we would get a bound very similar to the Perceptron's! Perhaps the key difference is that the upper bound on the number of iterations of GD scales proportionally with $n$, while the bound for the Perceptron has no dependence on $n$. Given that we use full-batch GD versus single-sample SGD for the Perceptron, one might expect the bound for the shallow network to improve with $n$; however, instead the upper bound grows proportionally with $n$ for the shallow network! This arises as a result of two artefacts of the proof: first, to bound the number of iterations in terms of the number of point-wise updates we use the lower bound that at least one point is involved in the update of the parameters at each time step, when the true number could be as large as $n$. Second, we removed a further factor by using a step-size which is proportional to $n$, cancelling the $1/n$ normalization of the loss. It is worth highlighting that the bound for the Perceptron is in terms of the number of non-zero updates as opposed to the total number of updates. One could use exactly the same proof to bound the number of non-zero iterations of mini-batch SGD with batches of size, say, $b$, which would only require a step-size proportional to $b$ and would thereby remove the dependence on $n$.
Width of the network does not matter: the particular choice of the width $m$ does not actually matter in terms of impacting the upper bound on the number of iterations. This should perhaps not be overly surprising as the target function is linear and could be solved by a single neuron!
All neurons are treated the same: the bounds we have written down are derived without having to study the individual dynamics of neurons, or even sub-groups of neurons. Equivalently put, we used the same bound for every neuron and every time step and avoided considering the activation patterns of each neuron on different data points at different time steps. Analysis of activation patterns is in general quite challenging and current works in this regard typically require additional assumptions on the data, such as the input features being nearly orthogonal, see e.g., (George et al. 2023).
Does not work for ReLU: we require $\alpha > 0$ as otherwise the lower bound we derived for $A(t)$ does not grow with $t$. This condition means that our results do not cover ReLU. In order to extend to ReLU, a similar technique could be deployed if one were to instead bound, for each neuron, an alignment computed only over the subset of training points on which that neuron is active. For ReLU networks it is worth highlighting that points which lie in the intersection of every neuron's inactive halfspace are zeroed by the network and cannot be fitted: this property clearly introduces sub-optimal stationary points! By comparison, Leaky ReLU networks never zero an input unless it is exactly orthogonal to the weight vectors of every neuron, which is a null set.
Scale of the initialization: for linearly separable data the scale of the initialization does not prevent convergence; however, a larger initialization scale increases the upper bound on the number of iterations, which perhaps suggests that it slows down convergence.
Step-size $\eta$: the condition placed on the step-size simplifies the upper bound on the number of iterations; without this assumption the upper bound would take a more complicated form. The dependence of the step-size on $n$ is primarily for convenience in simplifying the expressions encountered in the proof and is not necessary. Observing the role of the step-size in the upper bound, and in agreement with our intuition, the smaller the step-size the larger the upper bound, which in turn perhaps suggests GD will take longer to converge. If $\eta$ is sufficiently large relative to the other problem parameters, then the inequality can be reduced to the form $c_1 t^2 \leq c_2 t$ for constants $c_1, c_2 > 0$ independent of $t$ and $\eta$. As the right-hand-side grows linearly with $t$ while the left-hand-side grows quadratically, there must be some finite $t$ at which the inequality no longer holds. At this point GD must have terminated, despite the large step-size!
Training both the inner and outer layers
A natural question one might ask is what happens if you unfreeze the outer layer weights, i.e., take $\eta_a > 0$? Practically we should expect nothing to change (you can try this experimentally if you like; a sketch follows below!), however from the perspective of deriving guarantees our job becomes harder. The first challenge is actually deriving bounds for $A(t)$ and $F(t)$ which are non-trivial: generally, given bounded data, it will be possible to upper bound $F(t)$ at least crudely; however, as we will shortly observe, obtaining a positive lower bound on $A(t)$ is trickier. Second, even if we are able to derive upper and lower bounds, this technique only works if the lower bound on $A(t)$ grows faster with $t$ than the upper bound on $F(t)$.
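To try this experiment, one can rerun the previous sketch with a non-zero outer-layer step-size so that both layers now update (again an illustrative configuration, not the post's experiment).

```python
# Re-initialize and train both layers by taking eta_a > 0.
W = 0.01 * rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

t = 0
while hinge_loss(X, y, W, a, alpha) > 0 and t < 100_000:
    W, a = gd_step(X, y, W, a, alpha=alpha, eta=5.0, eta_a=5.0)  # both layers train
    t += 1

print("iterations with trainable outer layer:", t,
      "| zero hinge loss reached:", hinge_loss(X, y, W, a, alpha) == 0.0)
```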
Consider the lower bound on $A(t)$: expanding the alignment at iteration $t+1$ using the update rules we derived previously, it decomposes into four terms. The first of these terms is simply $A(t)$, while the second term can be checked to be positive. However, bounding the other two terms from below is challenging and requires a more fine-grained understanding of the dynamics of each neuron. Note that in the case where the outer weights were frozen we found ourselves in the easy position where we could apply the same bound to each neuron at each time step. In this sense we were actually able to treat all neurons identically, as we could uniformly (both in time and neuron index) bound the alignment with $w^*$ as well as the growth of the norm.
References
Brutzkus, Alon, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. 2018. "SGD Learns Over-Parameterized Networks That Provably Generalize on Linearly Separable Data." In International Conference on Learning Representations. https://openreview.net/forum?id=rJ33wwxRb.

George, Erin, Michael Murray, William Swartworth, and Deanna Needell. 2023. "Training Shallow ReLU Networks on Noisy Data Using Hinge Loss: When Do We Overfit and Is It Benign?" In Advances in Neural Information Processing Systems, edited by A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, 36:35139–89. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2023/file/6e73c39cc428c7d264d9820319f31e79-Paper-Conference.pdf.

Karhadkar, Kedar, Erin George, Michael Murray, Guido Montúfar, and Deanna Needell. 2024. "Benign Overfitting in Leaky ReLU Networks with Moderate Input Dimension." https://arxiv.org/abs/2403.06903.