Global optimization guarantees for shallow neural networks via the Perceptron algorithm

Author

Michael Murray

Published

June 17, 2024

In this post we will look at how the proof of convergence for the Perceptron algorithm can be used to derive strong training guarantees for a shallow neural network assuming linearly separable data.

Introduction

For neural networks, deriving training guarantees for reaching global optima is challenging. In another blog post we showed how such guarantees can be achieved by ensuring the network in question is sufficiently wide (this is the NTK approach). However, the typical width requirements are often unrealistic, and moreover are designed to keep the parameters within some ball around their initialization throughout training, and therefore cannot really explain the rich feature learning observed in practice. In this post we remove any width requirements at the price of having to impose significant structure on the target function and data. In particular, we consider the problem of learning a binary classifier $f_\theta: \mathbb{R}^d \rightarrow \{\pm 1\}$ on linearly separable data. To this end, let $(x_i, y_i)_{i \in [n]}$ denote the training data, where $x_i \in \mathbb{R}^d$ are the input features and $y_i \in \{\pm 1\}$ the output labels. If the problem is linearly separable then there exists a $u \in \mathbb{R}^d$ such that $\mathrm{sign}(\langle u, x_i \rangle) = y_i$, or equivalently $y_i \langle u, x_i \rangle > 0$, for all $i \in [n]$. We assume the input data is bounded, i.e., $\|x_i\| \leq R$ for all $i \in [n]$. Letting $f(x; \theta)$ denote the map of a shallow neural network, our goal is to understand whether gradient descent can find parameters $\theta$ such that $\mathrm{sign}(f(x_i; \theta)) = y_i$ for all $i \in [n]$. To achieve this we deploy a neat trick based on the Perceptron algorithm which can be seen in (Brutzkus et al. 2018). Our presentation will most closely follow that given in ().
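As a concrete illustration of this setup, the sketch below samples bounded data that is linearly separable by construction and checks the two assumptions above; the sampling scheme, the ground-truth direction `u_true` and the margin threshold are assumptions of the example rather than part of the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5

# Sample bounded features and label them with a fixed unit-norm direction,
# discarding points close to the decision boundary so that a positive margin exists.
u_true = np.zeros(d)
u_true[0] = 1.0                                  # ||u_true|| = 1
X = rng.uniform(-1.0, 1.0, size=(n, d))
X = X[np.abs(X @ u_true) > 0.1]                  # enforce a margin of at least 0.1
y = np.sign(X @ u_true)

R = np.max(np.linalg.norm(X, axis=1))            # bound on ||x_i||
assert np.all(y * (X @ u_true) > 0)              # linear separability: y_i <u, x_i> > 0
print(f"n = {len(X)}, R = {R:.2f}")
```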

Convergence of the Perceptron algorithm for linearly separable data

Before we consider a shallow neural network, let’s first recall the Perceptron algorithm.

  • Inputs: $(x_i, y_i)_{i \in [n]}$
  • Algorithm:
    • Initialize $w(0) = 0_d$, $t = 1$
    • While there exists a $\pi(t) \in [n]$ s.t. $y_{\pi(t)} \langle w(t-1), x_{\pi(t)} \rangle \leq 0$ do
      • $w(t) = w(t-1) + y_{\pi(t)} x_{\pi(t)}$
      • $t = t + 1$
    • Return $w(t-1)$

In the above, the map $\pi: \mathbb{N} \rightarrow [n]$ selects a point to update at each iteration. Furthermore, observe that at each iteration the Perceptron algorithm improves its performance on at least one feature-label pair in the training set, as $y_{\pi(t)} \langle w(t), x_{\pi(t)} \rangle = y_{\pi(t)} \langle w(t-1), x_{\pi(t)} \rangle + \|x_{\pi(t)}\|^2$.
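To make the update rule concrete, here is a minimal NumPy sketch of the loop above; choosing $\pi(t)$ to be the first misclassified point and capping the number of updates are assumptions of the sketch, not part of the algorithm's specification.

```python
import numpy as np

def perceptron(X, y, max_updates=100_000):
    """Perceptron algorithm: add y_i * x_i for a misclassified point until none remain."""
    w = np.zeros(X.shape[1])                         # w(0) = 0_d
    for t in range(1, max_updates + 1):
        mistakes = np.flatnonzero(y * (X @ w) <= 0)  # points with y_i <w(t-1), x_i> <= 0
        if mistakes.size == 0:
            return w, t - 1                          # final weights and number of updates made
        i = mistakes[0]                              # pi(t): here simply the first mistake
        w = w + y[i] * X[i]                          # w(t) = w(t-1) + y_{pi(t)} x_{pi(t)}
    raise RuntimeError("no convergence within max_updates; is the data linearly separable?")
```

On linearly separable data this loop terminates after finitely many updates, which is exactly the content of Lemma 1 below.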

For linearly separable data it is possible to show the Perceptron algorithm converges to a global minimizer. Recall that linearly separable means there exists a $u \in \mathbb{R}^d$, $\|u\| = 1$, such that $y_i \langle x_i, u \rangle > 0$ for all $i \in [n]$ (the requirement that $u$ is unit norm is made for convenience, as we can always re-scale by dividing by $\|u\|$). As the decision boundary of the linear classifier $u$ is the vector subspace orthogonal to $u$, the shortest distance from $x_i$ to this decision boundary is simply $|\langle x_i, u \rangle|$. Let $\gamma(u) := \min_{i \in [n]} |\langle x_i, u \rangle|$ denote the margin of the linear classifier $u$. Without loss of generality we may as well consider the linear classifier with the maximum margin: to this end, moving forward we let $\gamma := \max_{\|u\|=1} \min_{i \in [n]} |\langle x_i, u \rangle|$ denote the max margin and $u$ the max-margin classifier. For linearly separable data it is possible to show that the Perceptron algorithm terminates after a finite number of steps and successfully classifies all points in the training sample.
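Numerically, the margin of a given unit vector is straightforward to evaluate, and the max-margin direction can be approximated by a hard-margin linear SVM; the use of scikit-learn's `LinearSVC` with a large value of `C` below is merely a convenient stand-in for that maximizer and is an assumption of this sketch.

```python
import numpy as np
from sklearn.svm import LinearSVC

def margin(X, u):
    """gamma(u) = min_i |<x_i, u>| for a unit-norm direction u (meaningful when u separates the data)."""
    u = u / np.linalg.norm(u)
    return np.min(np.abs(X @ u))

def approx_max_margin_direction(X, y):
    """Approximate the (bias-free) max-margin direction via a large-C linear SVM."""
    clf = LinearSVC(C=1e6, fit_intercept=False, max_iter=200_000).fit(X, y)
    w = clf.coef_.ravel()
    return w / np.linalg.norm(w)

# For any unit-norm separator u we have gamma(u) <= gamma, with equality at the max-margin direction.
```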

Lemma 1 If $(x_i, y_i)_{i \in [n]}$ is linearly separable then the Perceptron algorithm terminates after at most $T \leq \frac{R^2}{\gamma^2}$ iterations and $y_i \langle x_i, w(T) \rangle > 0$ for all $i \in [n]$.

Proof. The key idea of the proof is to show that $|\langle w(t), u \rangle|^2$ grows faster than $\|w(t)\|^2$. By the Cauchy-Schwarz (CS) inequality, and since $\|u\| = 1$, for any iteration $t \geq 1$ it must hold that

$$|\langle w(t), u \rangle|^2 \leq \|w(t)\|^2.$$

Therefore, there must exist an iteration $T$ after which no further updates take place: if this were not true then at some sufficiently large iteration the CS inequality would be violated! To lower bound $|\langle w(t), u \rangle|^2$ observe
$$\langle w(t), u \rangle = \langle w(t-1), u \rangle + y_{\pi(t)} \langle x_{\pi(t)}, u \rangle \geq \langle w(t-1), u \rangle + \gamma \geq \langle w(t-2), u \rangle + 2\gamma \geq \dots \geq t\gamma.$$
To upper bound $\|w(t)\|^2$ note
$$\|w(t)\|^2 = \|w(t-1) + y_{\pi(t)} x_{\pi(t)}\|^2 = \|w(t-1)\|^2 + 2 y_{\pi(t)} \langle x_{\pi(t)}, w(t-1) \rangle + \|x_{\pi(t)}\|^2 \leq \|w(t-1)\|^2 + \|x_{\pi(t)}\|^2 \leq \|w(t-1)\|^2 + R^2 \leq \|w(t-2)\|^2 + 2R^2 \leq \dots \leq tR^2,$$
where the first inequality uses the fact that the selected point satisfies $y_{\pi(t)} \langle w(t-1), x_{\pi(t)} \rangle \leq 0$. Let $T$ denote the iteration at which the Perceptron algorithm terminates, where $T = \infty$ indicates the Perceptron algorithm never terminates. Then for any iteration $t \leq T$ at which an update occurs, the CS inequality gives
$$t^2 \gamma^2 \leq |\langle w(t), u \rangle|^2 \leq \|w(t)\|^2 \leq t R^2,$$
and so $t \leq \frac{R^2}{\gamma^2}$. In particular $T$ is finite and $T \leq \frac{R^2}{\gamma^2}$ as claimed.
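As a quick numerical sanity check of the lemma, the self-contained sketch below runs the Perceptron loop on synthetic separable data (an illustrative assumption, as before) and compares the number of updates against $R^2 / \gamma(u)^2$; since $\gamma(u) \leq \gamma$ for any particular separator $u$, this quantity upper bounds $R^2/\gamma^2$ and so the check is a safe, if weaker, version of the bound.

```python
import numpy as np

rng = np.random.default_rng(0)
u = np.array([1.0, 0.0, 0.0])                     # known unit-norm separator (illustrative)
X = rng.uniform(-1.0, 1.0, size=(500, 3))
X = X[np.abs(X @ u) > 0.2]                        # enforce a margin of at least 0.2
y = np.sign(X @ u)

R = np.max(np.linalg.norm(X, axis=1))
gamma_u = np.min(np.abs(X @ u))                   # gamma(u) <= gamma (the max margin)

w, updates = np.zeros(3), 0
while np.any(y * (X @ w) <= 0):                   # Perceptron loop
    i = np.flatnonzero(y * (X @ w) <= 0)[0]
    w = w + y[i] * X[i]
    updates += 1

print(updates, "<=", R**2 / gamma_u**2)           # Lemma 1: T <= R^2/gamma^2 <= R^2/gamma(u)^2
assert updates <= R**2 / gamma_u**2
```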

From the Perceptron to shallow neural networks

Consider now a shallow neural network
$$f(x; W, v) = v^T \sigma(Wx) = \sum_{j=1}^{2m} v_j \sigma(\langle w_j, x \rangle),$$
where $W \in \mathbb{R}^{2m \times d}$ is the weight matrix of the first layer, $w_j$ is the $j$th row of $W$, $v \in \mathbb{R}^{2m}$ is the vector of output weights and $v_j$ is the $j$th entry of $v$. We focus on the Leaky-ReLU non-linearity $\sigma(z) = \max\{\alpha z, z\}$ for some $\alpha \in (0,1)$. Consider learning the parameters $W$ and $v$ using gradient descent (GD) with the hinge loss (recall the hinge loss is defined as $\ell(z) = \max\{0, 1-z\}$). Using the minimum-norm subgradient at the non-differentiable point (as is standard in most automatic differentiation software), we let $\ell'(z) = -1$ for $z < 1$ and $0$ otherwise; equivalently, $\ell'(z) = -\mathbb{1}(z < 1)$. Similarly, $\sigma'(z) = \alpha$ for $z \leq 0$ and $1$ otherwise. A straightforward calculation gives
$$\frac{\partial f}{\partial w_j}(x; W, v) = v_j \sigma'(\langle w_j, x \rangle)\, x, \qquad \frac{\partial f}{\partial v_j}(x; W, v) = \sigma(\langle w_j, x \rangle).$$
Consider training the network with gradient descent in order to minimize the loss $L(W, v) = \sum_{i=1}^n \ell(y_i f(x_i; W, v))$: as
$$\frac{\partial L}{\partial \theta} = \sum_{i \in [n]} \ell'(y_i f(x_i; W, v))\, y_i \frac{\partial f}{\partial \theta}(x_i; W, v),$$
the gradient descent updates for each neuron $j \in [2m]$ are
$$w_j(t) = w_j(t-1) + \eta_w \sum_{i=1}^n \mathbb{1}[y_i f(x_i; W(t-1), v(t-1)) < 1]\, v_j(t-1)\, \sigma'(\langle w_j(t-1), x_i \rangle)\, y_i x_i,$$
$$v_j(t) = v_j(t-1) + \eta_v \sum_{i=1}^n \mathbb{1}[y_i f(x_i; W(t-1), v(t-1)) < 1]\, y_i\, \sigma(\langle w_j(t-1), x_i \rangle),$$
where $\eta_w, \eta_v > 0$ are the learning rates for the inner and outer weights respectively. Inspecting these equations we see that shallow networks trained with the hinge loss and GD have a similar update rule to the Perceptron algorithm! Inspired by this observation we pursue the idea of comparing the growth rate of $|v_j(t) \langle w_j(t), u \rangle|^2$ versus $\|v_j(t) w_j(t)\|^2$ for each $j \in [2m]$. In particular, analogous to studying the growth of $\langle w(t), u \rangle$ for the Perceptron algorithm, we instead study
$$A(t) = \sum_{j=1}^{2m} v_j(t) \langle w_j(t), u \rangle.$$
Here we use $A$ for alignment and $u$ again denotes the unit-norm weights of the max-margin linear classifier. Similarly, instead of considering $\|w(t)\|^2$ we study $F(t) = \|\mathrm{diag}(v(t)) W(t)\|_F^2$ (we use $F$ for Frobenius), where $\mathrm{diag}(v(t))$ is a diagonal square matrix with $[\mathrm{diag}(v(t))]_{jj} = v_j(t)$. Again by the CS inequality, and assuming the data is linearly separable with max-margin unit vector $u$, it must hold that
$$A^2(t) = \big|\big\langle \mathrm{vec}\big(\mathrm{diag}(v(t)) W(t)\big),\, \mathbf{1}_{2m} \otimes u \big\rangle\big|^2 \leq \|\mathrm{diag}(v(t)) W(t)\|_F^2\, \|\mathbf{1}_{2m} \otimes u\|^2 = 2m\, F(t),$$
where $\mathbf{1}_{2m} \otimes u \in \mathbb{R}^{2md}$ denotes the stacking of $2m$ copies of $u$ and $\mathrm{vec}$ the corresponding row-wise vectorization.
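To fix ideas, the following is a minimal NumPy sketch of the forward map, of a single full-batch (sub)gradient step as derived above, and of the two quantities $A(t)$ and $F(t)$; the function names and array conventions are choices of this sketch rather than anything prescribed by the analysis.

```python
import numpy as np

def forward(X, W, v, alpha):
    """f(x; W, v) = v^T sigma(W x) with Leaky ReLU sigma(z) = max(alpha*z, z), evaluated for each row of X."""
    Z = X @ W.T                                    # Z[i, j] = <w_j, x_i>
    return np.maximum(alpha * Z, Z) @ v            # shape (n,)

def gd_step(X, y, W, v, alpha, eta_w, eta_v):
    """One full-batch (sub)gradient step on L(W, v) = sum_i max(0, 1 - y_i f(x_i; W, v))."""
    Z = X @ W.T
    coeff = (y * forward(X, W, v, alpha) < 1.0) * y        # 1[y_i f(x_i) < 1] * y_i
    sig = np.maximum(alpha * Z, Z)                         # sigma(<w_j, x_i>)
    sig_prime = np.where(Z > 0, 1.0, alpha)                # sigma'(<w_j, x_i>)
    W_new = W + eta_w * ((v[:, None] * (sig_prime.T * coeff)) @ X)   # inner-weight update
    v_new = v + eta_v * (sig.T @ coeff)                              # outer-weight update
    return W_new, v_new

def alignment_and_frobenius(W, v, u):
    """A(t) = sum_j v_j <w_j, u> and F(t) = ||diag(v) W||_F^2."""
    return float(v @ (W @ u)), float(np.sum((v[:, None] * W) ** 2))
```

Note that `coeff` is zero for points with zero hinge loss, so such points contribute nothing to either update, exactly as in the equations above.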

Therefore, if we can show that $F(t)$ grows more slowly than $A^2(t)$ under the outlined gradient update rule, there must be a finite iteration after which the update is zero and the parameters no longer change. We will do this by lower bounding $A(t)$ and upper bounding $F(t)$ in terms of the total number of times points in the training set have contributed to the update of the weights up to time $t$, which we denote $U(t) := \sum_{\tau=1}^t |\mathcal{F}(\tau)|$; here $\mathcal{F}(\tau) \subseteq [n]$ is the set of points contributing to the update at iteration $\tau$, defined precisely in the proof of Lemma 2 below. This is useful as we can bound the number of updates $T$ by considering the setting where only one point contributes to the update at each time step, leading to the bound $T \leq U(T)$.

Training with frozen outer weights

First we consider a simplified setting where we freeze the outer weights by setting $\eta_v = 0$: in this setting we are able to prove results analogous to those for the Perceptron.

Lemma 2 Suppose the training data $(x_i, y_i)_{i \in [n]}$ is linearly separable, and assume $\eta_v = 0$, $\eta_w \leq \frac{1}{nR^2}$, $v_j(0) = (-1)^j$ and $\|w_j(0)\| \leq \lambda_w$ for all $j \in [2m]$. Then gradient descent will terminate after $T < \frac{2(\lambda_w \alpha \gamma + 1)}{\eta_w \alpha^2 \gamma^2}$ iterations, at which point the network will have achieved zero training (hinge) loss.

Proof. First we lower bound $A(t)$. Let $\mathcal{F}(t) = \{i \in [n] : y_i f(x_i; \theta(t-1)) < 1\}$ denote the set of training points that do not have zero hinge loss at the previous iterate $t-1$ (alternatively, we can view this set as the points that participate in the update of the weights at the current iterate $t$). For convenience, write $\sigma_{ji}(t) := \sigma(\langle w_j(t), x_i \rangle)$ and $\sigma'_{ji}(t) := \sigma'(\langle w_j(t), x_i \rangle)$. Recalling the gradient update rule and that the outer weights are frozen at $v_j(t) = (-1)^j$, we have
$$w_j(t) = w_j(t-1) + \eta_w \sum_{i \in \mathcal{F}(t)} (-1)^j \sigma'_{ji}(t-1)\, y_i x_i.$$
As $y_i \langle x_i, u \rangle \geq \gamma$ and $\sigma'_{ji}(t) \geq \alpha > 0$, then
$$A(t) = \sum_{j=1}^{2m} (-1)^j \langle w_j(t), u \rangle = A(t-1) + \eta_w \sum_{j=1}^{2m} \sum_{i \in \mathcal{F}(t)} \sigma'_{ji}(t-1)\, y_i \langle x_i, u \rangle \geq A(t-1) + \eta_w \gamma \alpha \sum_{j=1}^{2m} \sum_{i \in \mathcal{F}(t)} 1 = A(t-1) + 2m \eta_w \gamma \alpha\, |\mathcal{F}(t)| \geq \dots \geq A(0) + 2m \eta_w \gamma \alpha \sum_{\tau=1}^t |\mathcal{F}(\tau)|.$$
Under our assumptions on the initialization scale,
$$A(0) = \sum_{j=1}^{2m} (-1)^j \langle w_j(0), u \rangle \geq -\sum_{j=1}^{2m} |\langle w_j(0), u \rangle| \geq -2m\lambda_w.$$
Therefore, recalling $U(t) = \sum_{\tau=1}^t |\mathcal{F}(\tau)|$, we have
$$A(t) \geq -2m\lambda_w + 2m\eta_w \alpha \gamma\, U(t).$$
Before proceeding it is worth emphasizing that we need $\alpha > 0$ to make this bound non-trivial and increasing in $t$. To extend this technique to ReLU networks one might instead explore bounding, for each neuron $j \in [2m]$, the size of the set $\mathcal{F}(t) \cap \mathcal{A}_j(t)$, where $\mathcal{A}_j(t) = \{i \in [n] : \langle w_j(t), x_i \rangle > 0\}$.

For the upper bound on $F(t)$ we again plug in the GD update rule,
$$F(t) = \sum_{j=1}^{2m} \|w_j(t)\|^2 = \sum_{j=1}^{2m} \Big\| w_j(t-1) + \eta_w \sum_{i \in \mathcal{F}(t)} (-1)^j \sigma'_{ji}(t-1)\, y_i x_i \Big\|^2 = \sum_{j=1}^{2m} \|w_j(t-1)\|^2 + 2\eta_w \sum_{i \in \mathcal{F}(t)} \sum_{j=1}^{2m} (-1)^j \sigma'_{ji}(t-1)\, y_i \langle w_j(t-1), x_i \rangle + \eta_w^2 \sum_{j=1}^{2m} \sum_{i, k \in \mathcal{F}(t)} (-1)^{2j} \sigma'_{ji}(t-1) \sigma'_{jk}(t-1)\, y_i y_k \langle x_i, x_k \rangle.$$
The first term in the above is simply $F(t-1)$. The second term can be simplified by observing that $\sigma(z) = \sigma'(z) z$ for our piecewise linear $\sigma$. Therefore
$$2\eta_w \sum_{i \in \mathcal{F}(t)} \sum_{j=1}^{2m} (-1)^j \sigma'_{ji}(t-1)\, y_i \langle w_j(t-1), x_i \rangle = 2\eta_w \sum_{i \in \mathcal{F}(t)} y_i \sum_{j=1}^{2m} (-1)^j \sigma(\langle w_j(t-1), x_i \rangle) = 2\eta_w \sum_{i \in \mathcal{F}(t)} y_i f(x_i; W(t-1), v(t-1)) < 2\eta_w |\mathcal{F}(t)|,$$
where the final inequality follows from the fact that $i \in \mathcal{F}(t)$ implies $y_i f(x_i; W(t-1), v(t-1)) < 1$. Finally, the third term can be bounded using the inequalities $\sigma'_{ji}(t-1) \leq 1$, $\|x_i\| \leq R$, $\eta_w \leq 1/(nR^2)$ and $|\mathcal{F}(t)| \leq n$, which imply
$$\eta_w^2 \sum_{j=1}^{2m} \sum_{i, k \in \mathcal{F}(t)} (-1)^{2j} \sigma'_{ji}(t-1) \sigma'_{jk}(t-1)\, y_i y_k \langle x_i, x_k \rangle \leq 2m \eta_w^2 R^2 |\mathcal{F}(t)|^2 \leq 2m \eta_w |\mathcal{F}(t)|.$$
Note the bound placed on the step-size means we avoid having to work with $|\mathcal{F}(t)|^2$. Proceeding, it follows that
$$F(t) < F(t-1) + (2 + 2m)\eta_w |\mathcal{F}(t)| \leq F(t-1) + 4m\eta_w |\mathcal{F}(t)| < \dots < F(0) + 4m\eta_w \sum_{\tau=1}^t |\mathcal{F}(\tau)| = F(0) + 4m\eta_w U(t) = \sum_{j=1}^{2m} \|w_j(0)\|^2 + 4m\eta_w U(t) \leq 2m\lambda_w^2 + 4m\eta_w U(t).$$
By the CS inequality, and dividing both sides by $4m^2$, then $\frac{A^2(t)}{4m^2} \leq \frac{F(t)}{2m}$. Plugging in the bounds for $A(t)$ and $F(t)$, using $\lambda_w \leq 1$, and rearranging produces the following sequence of inequalities (if $\eta_w \alpha \gamma U(t) < \lambda_w$ then the final bound on $U(t)$ below holds trivially, so we may assume $\eta_w \alpha \gamma U(t) \geq \lambda_w$ when squaring),
$$(\eta_w \alpha \gamma U(t) - \lambda_w)^2 < \lambda_w^2 + 2\eta_w U(t),$$
$$\eta_w^2 \alpha^2 \gamma^2 U^2(t) - 2\lambda_w \eta_w \alpha \gamma U(t) + \lambda_w^2 < \lambda_w^2 + 2\eta_w U(t),$$
$$\eta_w \alpha^2 \gamma^2 U^2(t) < 2(\lambda_w \alpha \gamma + 1) U(t),$$
$$U(t) < \frac{2(\lambda_w \alpha \gamma + 1)}{\eta_w \alpha^2 \gamma^2}.$$
To conclude, observe that if $T$ is the final iterate then
$$T = \sum_{\tau=1}^T 1 \leq \sum_{\tau=1}^T |\mathcal{F}(\tau)| = U(T) < \frac{2(\lambda_w \alpha \gamma + 1)}{\eta_w \alpha^2 \gamma^2}.$$
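As with the Perceptron, the lemma is easy to check numerically. The self-contained sketch below trains a Leaky-ReLU network with frozen outer weights on illustrative synthetic data (the data-generating process and hyper-parameter values are assumptions of this sketch) and compares the number of iterations against the bound evaluated at $\gamma(u)$, which, since $\gamma(u) \leq \gamma$, is only larger than the bound in the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linearly separable data with a known separator u.
u = np.array([1.0, 0.0, 0.0])
X = rng.uniform(-1.0, 1.0, size=(400, 3))
X = X[np.abs(X @ u) > 0.3]                          # enforce a margin of at least 0.3
y = np.sign(X @ u)
n, d = X.shape
R = np.max(np.linalg.norm(X, axis=1))
gamma_u = np.min(np.abs(X @ u))                     # gamma(u) <= gamma

# Initialization and hyper-parameters matching Lemma 2 (eta_v = 0 freezes the outer layer).
m, alpha, lam_w = 8, 0.5, 0.1
eta_w = 1.0 / (n * R**2)
v = np.array([(-1.0) ** j for j in range(1, 2 * m + 1)])   # v_j(0) = (-1)^j
W = rng.normal(size=(2 * m, d))
W *= lam_w / np.linalg.norm(W, axis=1, keepdims=True)      # ||w_j(0)|| = lam_w

def f(X, W, v):
    Z = X @ W.T
    return np.maximum(alpha * Z, Z) @ v

T = 0
while np.any(y * f(X, W, v) < 1.0):                        # run until every hinge loss is zero
    Z = X @ W.T
    coeff = (y * f(X, W, v) < 1.0) * y                     # 1[y_i f(x_i) < 1] * y_i
    sig_prime = np.where(Z > 0, 1.0, alpha)
    W = W + eta_w * ((v[:, None] * (sig_prime.T * coeff)) @ X)
    T += 1

bound = 2 * (lam_w * alpha * gamma_u + 1) / (eta_w * alpha**2 * gamma_u**2)
print(f"zero hinge loss after T = {T} iterations; bound evaluated at gamma(u): {bound:.0f}")
assert T < bound
```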

A few reflections on this result are as follows.

  • Comparing the bound for the shallow network with the bound for the Perceptron: as a sanity check, note both bounds have a term proportional to $R^2$ and $\gamma^{-2}$ (observe that in the case of the shallow network the $R^2$ dependence enters through $\eta_w$). In addition, for the shallow network a smaller $\alpha$, i.e., a Leaky ReLU activation closer to ReLU, gives a weaker bound, while at the other end if we chose a linear activation ($\alpha = 1$) we would get a bound which is very similar to the Perceptron's! Perhaps the key difference is that, through $\eta_w$, the upper bound on the number of iterations of GD scales proportionally to $n$, while the bound for the Perceptron has no dependence on $n$. Given that we use full-batch GD versus single-sample updates for the Perceptron, one might expect the bound for the shallow network to be proportional somehow to $1/n$; instead the upper bound grows proportionally to $n$! This arises as a result of two artefacts of the proof: first, to bound $T$ in terms of $U(T)$ we use the lower bound $|\mathcal{F}(t)| \geq 1$ for all $t \leq T$, i.e., we assume only one point is involved in the update of the parameters at each time step, when it could be as large as $n$. Second, we removed a factor of $|\mathcal{F}(t)|$ by using a step-size which is proportional to $1/n$. It is worth highlighting that the bound for the Perceptron is in terms of the number of non-zero updates as opposed to the number of iterations. One could use exactly the same proof to bound the number of non-zero updates of mini-batch SGD with batch size $k \in [n]$, which would only require $\eta_w$ to be proportional to $1/k$ and would thereby remove the dependence on $n$.

  • Width of the network does not matter: the particular choice of $m$ has no impact on the upper bound for the number of iterations. This should perhaps not be overly surprising, as the target function is linear and could be represented by a single neuron!

  • All neurons are treated the same: the bounds we have written down are derived without having to study the individual dynamics of neurons, or even sub-groups of neurons. Equivalently put, we used the same bound for every neuron and every time step and avoided considering the activation patterns of each neuron on different data points at different time steps. Analysis of activation patterns is in general quite challenging and current works in this regard typically require additional assumptions on the data, such as the input features being nearly orthogonal, see e.g., ().

  • Does not work for ReLU: we require $\alpha > 0$ as otherwise the lower bound we derived for $A(t)$ does not grow with $t$. This condition means that our results do not cover ReLU. In order to extend to ReLU, a similar technique could be deployed if one were to bound $|\mathcal{F}(t) \cap \mathcal{A}_j(t)|$ instead of $|\mathcal{F}(t)|$. For ReLU networks it is worth highlighting that points which lie in the intersection of every neuron's inactive halfspace are zeroed by the network and cannot be fitted: this property clearly introduces sub-optimal stationary points (see the sketch after this list)! By comparison, Leaky ReLU networks never zero an input unless it is exactly orthogonal to the weight vectors of every neuron, which is a null set.

  • Scale of the initialization $\lambda_w$: for linearly separable data the scale of the initialization does not prevent convergence; however, a larger initialization scale increases the upper bound on $U(t)$, which perhaps suggests that it slows down convergence.

  • Step-size $\eta_w$: the condition $\eta_w \leq 1/(nR^2)$ simplifies the upper bound on $F(t)$; in particular, without this assumption the upper bound would depend on $\sum_{\tau=1}^t |\mathcal{F}(\tau)|^2$ instead of $U(t)$. The dependence of $\eta_w$ on $R^2$ is primarily for convenience in simplifying the expressions encountered in the proof and is not necessary. Observing the role of the step-size in the upper bound, and agreeing with our intuition, the smaller the step-size the larger the upper bound, which in turn perhaps suggests GD will take longer to converge. If $\eta_w$ is sufficiently large relative to $\alpha$, $\gamma$ and $\lambda_w$, then the inequality $A^2(t) \leq 2m F(t)$ can be reduced to the form $\big(\sum_{\tau=1}^t |\mathcal{F}(\tau)|\big)^2 \leq C \sum_{\tau=1}^t |\mathcal{F}(\tau)|^2$ for some constant $C$ independent of $n$ and $t$. As the right-hand side grows at most linearly with $t$ while the left-hand side grows quadratically, there must be some finite $T$ at which the inequality no longer holds. At this point GD must have terminated, despite the large step-size!
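To illustrate the ReLU failure mode mentioned above, the toy sketch below (all numbers are purely illustrative) constructs a point lying in every neuron's inactive halfspace: a ReLU network outputs exactly zero on it and receives zero gradient from it, so it can never be fitted, whereas a Leaky ReLU network still produces a non-zero, trainable output.

```python
import numpy as np

# Three neurons whose inactive halfspaces all contain the point x below.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
v = np.array([1.0, -1.0, 1.0])
x = np.array([-1.0, -1.0])          # <w_j, x> < 0 for every neuron j

relu_out  = v @ np.maximum(W @ x, 0.0)            # = 0.0, and the gradient w.r.t. each w_j is zero too
leaky_out = v @ np.maximum(0.1 * (W @ x), W @ x)  # Leaky ReLU with alpha = 0.1: non-zero output

print(relu_out, leaky_out)                        # 0.0 vs. -0.2
```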

Training both the inner and outer layers

A natural question one might ask is what happens if we unfreeze the outer layer weights, i.e., set $\eta_v \neq 0$? Practically we should expect nothing much to change (you can try this experimentally if you like — see the sketch at the end of this section!), however from the perspective of deriving guarantees our job becomes harder. The first challenge is actually deriving non-trivial bounds for $A(t)$ and $F(t)$: generally, given bounded data, it will be possible to upper bound $F(t)$ at least crudely; however, a positive lower bound on $A(t)$, as we will shortly observe, is trickier. Second, even if we are able to derive upper and lower bounds, this technique only works if the lower bound on $A^2(t)$ grows faster with $t$ than the upper bound on $2m F(t)$.

Consider the lower bound on $A(t)$: for convenience we use the notation $\phi_{ji}(t) = \phi(\langle w_j(t), x_i \rangle)$ where $\phi \in \{\sigma, \sigma'\}$. From the update rules we derived previously we have the following:
$$A(t+1) = \sum_{j=1}^{2m} v_j(t+1) \langle w_j(t+1), u \rangle = \sum_{j=1}^{2m} \Big( v_j(t) + \eta_v \sum_{i \in \mathcal{F}(t)} y_i \sigma_{ji}(t) \Big) \Big( \langle w_j(t), u \rangle + \eta_w \sum_{k \in \mathcal{F}(t)} v_j(t) \sigma'_{jk}(t)\, y_k \langle x_k, u \rangle \Big)$$
$$= \sum_{j=1}^{2m} v_j(t) \langle w_j(t), u \rangle + \eta_w \sum_{k \in \mathcal{F}(t)} \sum_{j=1}^{2m} v_j^2(t) \sigma'_{jk}(t)\, y_k \langle x_k, u \rangle + \eta_v \sum_{i \in \mathcal{F}(t)} \sum_{j=1}^{2m} y_i \sigma_{ji}(t) \langle w_j(t), u \rangle + \eta_v \eta_w \sum_{i, k \in \mathcal{F}(t)} \sum_{j=1}^{2m} v_j(t)\, y_i \sigma_{ji}(t) \sigma'_{jk}(t)\, y_k \langle x_k, u \rangle.$$
The first of the above four terms is simply $A(t)$, while the second term is non-negative as
$$\eta_w \sum_{k \in \mathcal{F}(t)} \sum_{j=1}^{2m} v_j^2(t) \sigma'_{jk}(t)\, y_k \langle x_k, u \rangle \geq \eta_w \alpha \gamma\, |\mathcal{F}(t)|\, \|v(t)\|^2 \geq 0.$$
However, bounding the other two terms from below is challenging and requires a more fine-grained understanding of the dynamics of each neuron. Note that in the case where the outer weights were frozen we found ourselves in the easy position of being able to apply the same bound to each neuron at each time step. In this sense we were actually able to treat all neurons identically, as we could uniformly (both in time and neuron index) bound the alignment with $u$ as well as the growth of the norm.
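As suggested above, in practice unfreezing the outer layer changes very little; the self-contained sketch below (with illustrative data, initialization and step-sizes that are assumptions of the example, not prescriptions) runs the full update rule for both layers and tracks $A(t)$ and $F(t)$ along the way.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative linearly separable data.
u = np.array([1.0, 0.0, 0.0, 0.0])
X = rng.uniform(-1.0, 1.0, size=(300, 4))
X = X[np.abs(X @ u) > 0.3]
y = np.sign(X @ u)
n, d = X.shape

m, alpha = 8, 0.5
eta_w = eta_v = 1.0 / (n * np.max(np.linalg.norm(X, axis=1)) ** 2)
v = np.array([(-1.0) ** j for j in range(1, 2 * m + 1)])
W = 0.1 * rng.normal(size=(2 * m, d)) / np.sqrt(d)

for t in range(1, 200_001):
    Z = X @ W.T
    out = np.maximum(alpha * Z, Z) @ v
    active = y * out < 1.0                                    # the set F(t)
    if not active.any():
        print(f"zero hinge loss at iteration {t}")
        break
    coeff = active * y                                        # 1[y_i f(x_i) < 1] * y_i
    sig, sig_prime = np.maximum(alpha * Z, Z), np.where(Z > 0, 1.0, alpha)
    W = W + eta_w * ((v[:, None] * (sig_prime.T * coeff)) @ X)   # inner-layer update
    v = v + eta_v * (sig.T @ coeff)                              # outer-layer update (no longer frozen)
    A = float(v @ (W @ u))                                       # alignment A(t)
    F = float(np.sum((v[:, None] * W) ** 2))                     # F(t) = ||diag(v) W||_F^2
    if t % 500 == 0:
        print(t, A, F, int(active.sum()))
```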

References

Brutzkus, Alon, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. 2018. “SGD Learns over-Parameterized Networks That Provably Generalize on Linearly Separable Data.” In International Conference on Learning Representations. https://openreview.net/forum?id=rJ33wwxRb.
George, Erin, Michael Murray, William Swartworth, and Deanna Needell. 2023. “Training Shallow ReLU Networks on Noisy Data Using Hinge Loss: When Do We Overfit and Is It Benign?” In Advances in Neural Information Processing Systems, edited by A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, 36:35139–89. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2023/file/6e73c39cc428c7d264d9820319f31e79-Paper-Conference.pdf.
Karhadkar, Kedar, Erin George, Michael Murray, Guido Montúfar, and Deanna Needell. 2024. “Benign Overfitting in Leaky ReLU Networks with Moderate Input Dimension.” https://arxiv.org/abs/2403.06903.