Instigated by Zhang et al. 2016, there has been considerable consternation in the deep learning community about why models capable of fitting random labels generalize well. It should be noted immediately that simple linear regression with more features than data points will also generalize well when appropriately regularized (ridge regression). What is different about deep models is that the regularization seems to be implicit in SGD, as described earlier by Neyshabur et al. 2014. Recent theoretical work on the regularization implicit in SGD, such as Arora et al. 2019, focuses on quenching shallow networks. In my previous post I argued that annealing seems more powerful than quenching. In this post I will give a PAC-Bayesian bound (a formal generalization guarantee) appropriate for annealed SGD. I’m calling it the residual entropy bound.
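To see the ridge regression point concretely, here is a minimal sketch using a toy setup of my own: the many features are assumed to be driven by a low-dimensional latent signal, and all dimensions and noise levels are made-up constants. With ten times more features than training points, the regularized model still reaches test error close to the label-noise floor.

# Toy illustration: ridge regression with many more features than data points.
# The setup (500 correlated features driven by a 5-dimensional latent signal,
# 50 training points) and all constants are illustrative choices.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_train, n_test, n_features, n_latent = 50, 2000, 500, 5

A = rng.normal(size=(n_latent, n_features))   # mixing of latent factors into features
beta = rng.normal(size=n_latent)              # true dependence of y on the latent factors

def make_data(n):
    z = rng.normal(size=(n, n_latent))
    X = z @ A + 0.01 * rng.normal(size=(n, n_features))  # features plus small feature noise
    y = z @ beta + 0.1 * rng.normal(size=n)              # labels plus label noise
    return X, y

X_train, y_train = make_data(n_train)
X_test, y_test = make_data(n_test)

model = Ridge(alpha=1.0).fit(X_train, y_train)
print("train MSE:", np.mean((model.predict(X_train) - y_train) ** 2))
print("test MSE: ", np.mean((model.predict(X_test) - y_test) ** 2))
print("baseline (predict the mean):", np.var(y_test))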
First some comments on PAC-Bayesian theory. I coined the term “PAC-Bayes” in the late 90s to describe a class of theorems giving PAC generalization guarantees in terms of arbitrarily chosen prior distributions. Some such theorems (Occam bounds) pre-date my work. Over the last twenty years there has been significant refinement of these bounds by various authors. A concise general presentation of PAC-Bayesian theory can be found in my PAC-Bayes tutorial.
After about 2005, PAC-Bayesian analysis largely fell out of use in the learning theory community in favor of more “sophisticated” concepts. However, PAC-Bayesian bounds are now having a resurgence — their conceptual simplicity is paying off in the analysis of deep networks. Attempts to apply VC dimension or Rademacher complexity to deep networks yield extremely vacuous guarantees — guarantees on binary error rates that are tens of orders of magnitude larger than 1. PAC-Bayesian theorems, on the other hand, can produce non-vacuous guarantees — guarantees less than 1. Non-vacuous PAC-Bayesian guarantees for deep networks were first computed for MNIST by Dziugaite et al. and recently for ImageNet by Zhou et al.
In annealed SGD the learning rate acts as a temperature parameter which is set high initially and then cooled. As we cool the temperature we can think of the model parameters as a glass that is cooled to some finite temperature at which it becomes solid — becomes committed to a particular local basin of the energy landscape. For a given parameter initialization $\Phi_0$, annealed SGD defines a probability density $p(\Phi \mid S, \Phi_0)$ over the final model $\Phi$ given initialization $\Phi_0$ and training data $S$. Under Langevin dynamics we have that $p(\Phi \mid S, \Phi_0)$ is a smooth density. If we are concerned that Langevin dynamics is only an approximation, we can add some small Gaussian noise to SGD to ensure that $p(\Phi \mid S, \Phi_0)$ is smooth.
The entropy of the distribution $p(\Phi \mid S, \Phi_0)$ is the residual entropy of the parameter vector $\Phi$. Note that if there was one final global optimum which was always found by SGD (quartz crystal) then the residual entropy would be zero. The distribution $p(\Phi \mid S, \Phi_0)$, and hence the residual entropy, includes all the possible local basins (all the possible solid structures of the glass).
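To make the glass picture concrete, here is a minimal sketch of annealed noisy gradient descent on a toy one-dimensional double-well loss; the loss, cooling schedule, and noise scale are all illustrative choices. Running it many times from the same initialization gives an empirical sample from $p(\Phi \mid S, \Phi_0)$: the runs split between the two basins, which is exactly a nonzero residual entropy.

# Annealed noisy gradient descent on a one-dimensional double-well loss
# L(phi) = (phi^2 - 1)^2, standing in for a non-convex training objective.
# Repeated runs from the same initialization phi_0 = 0 freeze into different
# basins, so the distribution over final parameters has nonzero residual entropy.
import numpy as np

def grad(phi):
    # derivative of the double-well loss (phi^2 - 1)^2
    return 4.0 * phi * (phi ** 2 - 1.0)

def annealed_sgd(phi_0, steps=2000, lr=0.01, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    phi = phi_0
    for t in range(steps):
        temperature = 1.0 / (1.0 + 0.01 * t)                    # cooled over time
        noise = rng.normal() * np.sqrt(2.0 * lr * temperature)  # Langevin-style noise
        phi = phi - lr * grad(phi) + noise
    return phi

rng = np.random.default_rng(0)
finals = np.array([annealed_sgd(0.0, rng=rng) for _ in range(500)])
p_right = np.mean(finals > 0.0)   # fraction of runs frozen into the right-hand basin
entropy = -(p_right * np.log2(p_right) + (1.0 - p_right) * np.log2(1.0 - p_right))
print(f"P(right basin) = {p_right:.2f}, basin-choice entropy = {entropy:.2f} bits")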
To state the residual entropy bound we also need to define a parameter distribution in terms of the population, independent of any particular sample. Let $N$ be the number of data points in the sample. Let $p(\Phi \mid \Phi_0, N)$ be the distribution defined by first drawing an IID sample $S$ of size $N$ and then sampling $\Phi$ from $p(\Phi \mid S, \Phi_0)$. The entropy of the distribution $p(\Phi \mid \Phi_0, N)$ is the residual entropy of annealing as defined by $\Phi_0$ and $N$, independent of the draw of any particular sample. The residual entropy bound is governed by $E_S\, KL(p(\Phi \mid S, \Phi_0),\, p(\Phi \mid \Phi_0, N))$. This is an expected KL-divergence between two residual entropy distributions. It is also equal to the mutual information between $\Phi$ and $S$ under the joint distribution $p(S, \Phi \mid \Phi_0, N)$.
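The equality with mutual information is immediate once one notes that $p(\Phi \mid \Phi_0, N)$ is, by construction, the marginal of $p(\Phi \mid S, \Phi_0)$ over the draw of $S$:

$$E_S\, KL\!\left(p(\Phi \mid S, \Phi_0),\, p(\Phi \mid \Phi_0, N)\right) \;=\; E_S\, E_{\Phi \sim p(\Phi \mid S, \Phi_0)}\!\left[\ln \frac{p(\Phi \mid S, \Phi_0)}{p(\Phi \mid \Phi_0, N)}\right] \;=\; I(\Phi;\, S \mid \Phi_0, N).$$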
To formally state the bound we assume that for a data point $z$ we have a loss $\mathcal{L}(\Phi, z)$. I will write $\mathcal{L}(\Phi)$ for the expectation of $\mathcal{L}(\Phi, z)$ when $z$ is drawn from the population, and write $\hat{\mathcal{L}}(\Phi)$ for the average of $\mathcal{L}(\Phi, z)$ over $z$ drawn from the training sample. The following residual entropy bound is a corollary of theorem 4 in my PAC-Bayes tutorial.
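Assuming the loss is bounded, say $\mathcal{L}(\Phi, z) \in [0, L_{\max}]$, it takes (up to the exact constants) the following form, for every initialization $\Phi_0$ and every $\lambda > 1/2$:

$$E_S\, E_{\Phi \sim p(\Phi \mid S, \Phi_0)}\, \mathcal{L}(\Phi) \;\le\; \frac{1}{1 - \frac{1}{2\lambda}}\; E_S\!\left[\, E_{\Phi \sim p(\Phi \mid S, \Phi_0)}\, \hat{\mathcal{L}}(\Phi) \;+\; \frac{\lambda\, L_{\max}}{N}\, KL\!\left(p(\Phi \mid S, \Phi_0),\, p(\Phi \mid \Phi_0, N)\right)\right] \qquad (1)$$

The quantity controlling the gap between training and population loss is the expected KL term, which is the mutual information between $\Phi$ and $S$ discussed above.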
There are good reasons to believe that (1) is extremely tight. This is a PAC-Bayesian bound where the “prior” is $p(\Phi \mid \Phi_0, N)$. PAC-Bayesian bounds already yield non-vacuous values under Gaussian priors. It can be shown that $p(\Phi \mid \Phi_0, N)$ is the optimal prior for posteriors of the form $p(\Phi \mid S, \Phi_0)$.
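The optimality claim follows from a standard decomposition: for any alternative prior $q(\Phi)$,

$$E_S\, KL\!\left(p(\Phi \mid S, \Phi_0),\, q(\Phi)\right) \;=\; E_S\, KL\!\left(p(\Phi \mid S, \Phi_0),\, p(\Phi \mid \Phi_0, N)\right) \;+\; KL\!\left(p(\Phi \mid \Phi_0, N),\, q(\Phi)\right),$$

so the expected KL penalty is minimized exactly when $q$ is the marginal $p(\Phi \mid \Phi_0, N)$, where it equals the mutual information between $\Phi$ and $S$.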
Unfortunately, analysis of (1) seems intractable. But difficulty of theoretical analysis does not imply that a conceptual framework is wrong. The strong case that (1) is extremely tight would seem to cast doubt on analyses selected for tractability at the expense of realism.
It seems to me that there is a third possibility, intermediate between annealed and quenched, which is studied in Arora et al. 2019, the Jaehoon Lee et al. 2019 paper (and other Google Brain papers), and many others. It is to take the initialization \Phi_0 as a random variable and then do GD. Maybe your (1) would be tractable in this setting?
I agree that the random initialization \Phi_0 is important and is ignored in my post. I will interpret your suggestion as replacing the universal quantifier over \Phi_0 in my (1) by an expectation over an appropriate distribution on initializations. To analyze this I think we would still have to use a quenching version of SGD. The analysis in Arora et al. 2019 relies on SGD staying close to gradient flow. It seems to be an empirical question whether the performance of deep networks requires learning rates large enough to support a “thermal search” over the parameter space of a non-convex optimization problem. The residual entropy bound holds under either quenching or annealing but appears extremely difficult to analyze for annealing — the case where thermal noise in the SGD is significant. Arora et al. 2019 analyze two-layer networks with unreasonably large network width and, I believe, unreasonably small learning rates.
Consider a fixed neural network architecture with no dropout. For mini-batch gradient descent, if we choose the batches in a fixed order with no shuffling, then for fixed \Phi_0 and S mini-batch gradient descent will always find the same solution, and therefore the residual entropy will be zero. Is this true?
If the SGD is deterministic then indeed the residual entropy of p(Phi | S, Phi_0) is zero. However, p(Phi | Phi_0, N) will still be a smooth distribution as long as the data distribution is smooth over the data features. In this case E_S KL(p(Phi | S, Phi_0), p(Phi | Phi_0, N)) will be infinite and the bound becomes vacuous. The bound is strong when p(Phi | S, Phi_0) has only a mild dependence on the sample S.
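To spell out why the KL term blows up in the deterministic case: if $p(\Phi \mid S, \Phi_0)$ is a point mass at the single solution $\Phi(S)$ while $p(\Phi \mid \Phi_0, N)$ is a smooth density, then

$$KL\!\left(\delta_{\Phi(S)},\, p(\Phi \mid \Phi_0, N)\right) = +\infty,$$

since a point mass is not absolutely continuous with respect to any smooth density, and averaging over S cannot make this finite.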