I believe there is currently a struggle for the soul of SGD. The issue is whether our theoretical understanding should be based on quenching or on annealing. The quenching school assumes that the goal of SGD is to reach a local minimum as quickly as possible. The annealing school views SGD as a nontrivial local search over parameter space in which “thermal fluctuations” play an important role in escaping from inferior local optima. Quenching seems easier to analyze theoretically. But as I have claimed in my earlier post on the role of theory in deep learning, the desire for rigor can mislead. We should not assume that quenching is the correct conceptual framework simply because it is easier to analyze. Intuitively, annealing would seem to be the more powerful method of searching parameter space. Also there are reasons to believe that annealing yields better generalization than quenching. The generalization issues will be the subject of my next post. This post is about modeling SGD as an annealing process using Langevin dynamics as a “statistical mechanics” of SGD.

I would like to make an analogy between deep architectures and physical materials. Different physical materials have different sensitivities to annealing. Steel is a mixture of iron and carbon. At temperatures below 727 °C the carbon precipitates out into carbon sheets between grains of iron crystal. But above 727 °C the iron forms a different crystal structure, causing the carbon sheets to dissolve into the iron. If we heat a small piece of steel above 727 °C and then drop it into cold water, it makes a sound like “quench”. When it is quenched, the high-temperature crystal structure is preserved and we get hardened steel. Hardened steel can be used as a cutting blade in a drill bit to drill into soft steel. Annealing is a process of gradually reducing the temperature. Gradual annealing produces soft grainy steel. Tempering is a process of re-heating quenched steel to temperatures high enough to change its properties but below the original pre-quenching temperature. This can make the steel less brittle while preserving its hardness. (Acknowledgments to my ninth grade shop teacher.)

Molten glass (silicon dioxide) can never be cooled slowly enough to reach its global energy minimum which is quartz crystal. Instead, silicon dioxide at atmospheric pressure always cools to a glassy (disordered) local optimum with residual entropy but with statistically reliable properties. Minute levels of impurities in the glass (skip connections?) can act as catalysts allowing the cooling process to achieve glassy states with lower energy and different properties. (Acknowledgements to discussions with Akira Ikushima of TTI in Nagoya about the nature of glass.)

The remainder of this post is fairly technical. The quenching school tends to focus on gradient flow as defined by the differential equation

$$\frac{d\Phi}{dt} = -g(\Phi) \tag{1}$$

where $g(\Phi) = \nabla_\Phi L(\Phi)$ is the gradient of the average loss over the training data. This defines a deterministic continuous path through parameter space which we can try to analyze.
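As a toy sketch (my example, with hypothetical curvatures, not from the post): on a separable quadratic loss the gradient flow ODE has a closed-form solution, making the deterministic path explicit.

```python
import numpy as np

# Toy sketch: on a separable quadratic loss L(Phi) = 0.5 * sum(h * Phi**2)
# the gradient is g(Phi) = h * Phi, and gradient flow dPhi/dt = -g(Phi) has
# the closed-form solution Phi(t) = exp(-h * t) * Phi(0): a deterministic
# path decaying toward the minimum at 0, fastest in high-curvature directions.
h = np.array([2.0, 0.5])        # hypothetical per-coordinate curvatures
phi0 = np.array([1.0, 1.0])     # initial parameters

def flow(t):
    return np.exp(-h * t) * phi0

for t in (0.0, 1.0, 4.0):
    print(t, flow(t))
```

Note that the path depends only on the loss surface; there is no noise and no notion of temperature.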

The annealing school views SGD as an MCMC process defined by the stochastic state transition

$$\Phi_{t+1} = \Phi_t - \eta\, \hat{g}_t \tag{2}$$

where $\hat{g}_t$ is a random variable equal to the loss gradient of a random data point. Langevin dynamics yields a formal statistical mechanics for SGD as defined by (2). In this blog post I want to try to explain Langevin dynamics as intuitively as I can, using abbreviated material from my lecture slides on the subject.

First, I want to consider numerical integration of gradient flow (1). A simple numerical integration can be written as

$$\Phi(t + \Delta t) = \Phi(t) - \Delta t\; g(\Phi(t)) \tag{3}$$

Comparing (3) with (2), it is natural to interpret $\eta$ in (2) as $\Delta t$. For the stochastic process defined by (2) I will define $t$ by

$$t(n) = n\eta$$

or

$$\Delta t = \eta\, \Delta n$$

where $n$ is the number of SGD updates. This provides a notion of “time” for SGD as defined by (2) that is consistent with gradient flow in the limit $\eta \rightarrow 0$.
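This notion of time can be sketched concretely (a toy one-dimensional example of mine, not from the slides), with $\hat{g}$ standing in for the gradient of a random data point:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sketch: SGD update Phi <- Phi - eta * ghat on L(phi) = 0.5 * phi**2,
# where ghat is the true gradient plus unit-variance noise (a stand-in for
# the gradient of a random data point). "Time" advances by eta per update.
eta = 0.01
phi, t = 3.0, 0.0
while t < 5.0:                   # run until time t = n * eta = 5
    ghat = phi + rng.normal()    # stochastic gradient
    phi -= eta * ghat
    t += eta

print(phi)  # hovers near the minimum at 0, fluctuating on a scale sqrt(eta/2)
```

Unlike gradient flow, the path does not converge to the minimum; it keeps fluctuating at a level set by $\eta$.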

We now assume that $\eta$ is sufficiently small so that for large $N$ we still have $\Delta t = N\eta$ small, and that $g(\Phi)$ is essentially constant over the range $t$ to $t + \Delta t$. Assuming the total gradient is essentially constant at $g$ over the interval from $t$ to $t + \Delta t$, we can rewrite (2) in terms of time as

$$\Phi(t + \Delta t) = \Phi(t) - \eta \sum_{i=1}^{N} \hat{g}_i, \qquad N = \frac{\Delta t}{\eta} \tag{4}$$

By the multivariate central limit theorem, a random vector that is a large sum of IID random vectors is approximately Gaussian. This allows us to rewrite (4) as

$$\Phi(t + \Delta t) = \Phi(t) - g\, \Delta t + \sqrt{\eta\, \Delta t}\;\epsilon, \qquad \epsilon \sim \mathcal{N}(0, \Sigma) \tag{5}$$

where $\Sigma$ is the covariance matrix of the random variable $\hat{g}$. A derivation of (5) from (4) is given in the slides. Note that the noise term vanishes in the limit $\eta \rightarrow 0$ and we are back to gradient flow. However, in Langevin dynamics we do not take $\eta$ to zero but instead hold $\eta$ at a fixed nonzero value small enough that (5) is accurate even for small $\Delta t$. Langevin dynamics is formally a continuous-time stochastic process under which (5) also holds but where the equation is taken to hold at arbitrarily small values of $\Delta t$. The Langevin dynamics can be denoted by the notation

$$d\Phi = -g(\Phi)\, dt + \sqrt{\eta}\; \Sigma^{1/2}\, dB_t \tag{6}$$
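The Gaussian aggregation behind (5) can be checked numerically in a toy one-dimensional setting (my sketch, with hypothetical constants): $N$ small SGD steps over an interval $\Delta t = N\eta$ should match a single step with drift $-g\,\Delta t$ and noise of scale $\sqrt{\eta\,\Delta t}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: constant total gradient g and scalar gradient-noise
# variance Sigma. The sum of N per-example SGD increments -eta * ghat_i over
# an interval dt = N * eta should be approximately Gaussian with
#   mean  -g * dt   and   std  sqrt(eta * dt * Sigma).
eta, N, trials = 0.001, 1000, 4000
dt = eta * N
g, Sigma = 0.5, 1.0

# each row holds the N per-example increments of one interval
increments = -eta * (g + rng.normal(0.0, np.sqrt(Sigma), size=(trials, N)))
samples = increments.sum(axis=1)

print(samples.mean())  # ~ -g * dt = -0.5
print(samples.std())   # ~ sqrt(eta * dt * Sigma) ~ 0.032
```

The noise scale $\sqrt{\eta\,\Delta t}$ (rather than $\Delta t$) is the fingerprint of diffusion: halving $\eta$ at fixed $\Delta t$ shrinks the noise by $\sqrt{2}$ while leaving the drift unchanged.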

If $g = 0$ independent of $\Phi$, and $\Sigma$ is constant, we get Brownian motion.

The importance of the Langevin dynamics is that it allows us to solve for, and think in terms of, a stationary density. Surprisingly, if the covariance matrix $\Sigma$ is not isotropic (if the eigenvalues are not all the same) then the stationary density is not Gibbs. Larger noise will yield a broader (hotter) distribution. When different directions in parameter space have different levels of noise we get different “temperatures” in the different directions. A simple example is given in the slides. However, we can correct for the anisotropy of the noise by replacing (2) with the update

$$\Phi_{t+1} = \Phi_t - \eta\, \Sigma^{-1} \hat{g}_t \tag{7}$$

For this update (and assuming that $\Sigma$ remains constant over parameter space) we get

$$p(\Phi) \propto e^{-\beta L(\Phi)/\eta} \tag{8}$$

for a universal constant $\beta$.
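The effect of the rescaled update above can be checked on a toy two-dimensional quadratic (my sketch, hypothetical constants): with plain SGD the stationary variances would be $\eta\sigma_j^2/2$, differing by the ratio of the noise variances, while the preconditioned update should equilibrate every direction at the same temperature $\eta/2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy check of the preconditioned update Phi[j] -= (eta / sigma[j]^2) * ghat[j]
# on L = 0.5 * (phi0**2 + phi1**2), with very different gradient noise in the
# two coordinates. Both stationary second moments should come out near eta/2.
eta = 0.01
sigma2 = np.array([1.0, 25.0])          # per-coordinate gradient-noise variance
burn, steps = 20_000, 400_000
noise = rng.normal(size=(burn + steps, 2)) * np.sqrt(sigma2)

phi = np.zeros(2)
second_moment = np.zeros(2)
for i in range(burn + steps):
    ghat = phi + noise[i]               # stochastic gradient of L
    phi = phi - (eta / sigma2) * ghat   # preconditioned SGD step
    if i >= burn:
        second_moment += phi ** 2
second_moment /= steps

print(second_moment)  # both entries ~ eta/2 = 0.005
```

With the plain update the noisy coordinate would be 25 times hotter; here both directions explore at the same temperature, so the escape search is governed by the loss surface rather than the noise.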

I would argue that the Gibbs stationary distribution (8) is desirable because it helps in escaping from a local minimum. More specifically, we would like to avoid artificially biasing the search for an escape toward high-noise directions in parameter space at the expense of exploring low-noise directions. The search for an escape direction should be determined by the loss surface rather than by the noise covariance.

Note that (8) indicates that as we reduce the temperature toward zero the loss will approach the loss of the (local) minimum. Let $\bar{L}(\eta)$ denote the average loss under the stationary distribution of SGD around a local minimum at temperature $\eta$. The slides contain a proof of the following simple relationship.

$$\bar{L}(\eta) = L_{\min} + \frac{\eta}{4}\, E\!\left[\|\hat{g}\|^2\right] \tag{9}$$

where the random values $\hat{g}$ are drawn at the local optimum (at the local optimum the average gradient is zero). This equation is proved without the use of Langevin dynamics and holds independent of the shape of the stationary distribution. Once we are in the linear region, halving the learning rate corresponds to moving half way to the locally minimal loss. This seems at least qualitatively consistent with empirical learning curves such as those in the original ResNet paper by He et al.
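This linear relationship can be checked numerically (a toy one-dimensional sketch of mine, with hypothetical constants): near the minimum the average stationary loss should sit at roughly $(\eta/4)$ times the second moment of the stochastic gradient, so halving $\eta$ halves the excess loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy check: SGD around the minimum of L(phi) = 0.5 * phi**2 with additive
# gradient noise. The average stationary loss should be approximately
#   L_min + (eta/4) * E[|ghat|^2],  here L_min = 0 and E[|ghat|^2] = 4.
Sigma = 4.0                             # E[|ghat|^2] at the minimum
avg_loss = {}
for eta in (0.02, 0.01):
    burn, steps = 20_000, 300_000
    noise = rng.normal(0.0, np.sqrt(Sigma), size=burn + steps)
    phi, acc = 0.0, 0.0
    for i in range(burn + steps):
        phi -= eta * (phi + noise[i])   # SGD step with noisy gradient
        if i >= burn:
            acc += 0.5 * phi * phi
    avg_loss[eta] = acc / steps

print(avg_loss)  # ~ (eta/4) * Sigma for each eta, i.e. ~0.02 then ~0.01
```

Halving the learning rate in the sketch roughly halves the average excess loss, matching the "move half way to the minimum" reading of the learning curves.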

Getting back to the soul of SGD, different deep learning models, like different physical materials, will have different properties. Intuitively, annealing would seem to be the more powerful search method. One might expect that as models become more sophisticated — as they become capable of expressing arbitrary knowledge about the world — annealing will be essential.

The quenching/annealing dichotomy is an interesting way to put it. I usually say that it’s wise not to be too greedy in optimisation, especially when learning hidden structure. Could we investigate this empirically, e.g., with batch algorithms like CG or LBFGS, at least with toy examples like MNIST? (Surely someone must have done this).

Another comment: I thought there were fancy optimisers (like Adagrad?) that are designed to correct for the difference in variance of the parameters. It’s impressive that while many other optimisers have come and gone, SGD just keeps on being state of the art. (A cynic might say that by adopting things like residual connections, we are optimising our models for our optimisation algorithm.)

I would love to get comments with references to relevant empirical results.

The most popular adaptive algorithm is Adam. Adam combines RMSProp and momentum with “bias correction”.

A first issue is that it is generally claimed that the RMSProp update Phi[j] -= (eta/sigma[j]) g[j] is empirically superior to the theoretically better motivated (by various analyses) Phi[j] -= (eta/sigma[j]^2) g[j]. I have no explanation.
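For concreteness, the two scalings can be sketched side by side (a hypothetical minimal sketch; `s` is the standard RMSProp running average of the squared gradient, and the function name is mine):

```python
import numpy as np

def rmsprop_step(phi, g, s, eta=0.001, beta=0.9, eps=1e-8, power=1):
    # s: running average of the squared per-coordinate gradient, roughly E[g^2]
    s = beta * s + (1 - beta) * g ** 2
    sigma = np.sqrt(s) + eps
    # power=1: the empirically preferred update  phi[j] -= (eta / sigma[j])   * g[j]
    # power=2: the theoretically motivated one   phi[j] -= (eta / sigma[j]^2) * g[j]
    phi = phi - (eta / sigma ** power) * g
    return phi, s
```

With `power=2` the effective step size scales like one over the second moment rather than one over its square root, which is the variant the analyses favor.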

A second issue (discussed in my previous post) is that for Adam and RMSprop (as provided by the current frameworks) there is no easy way to make the learning rate conjugate to the batch size. Another way to say this is that it is not possible to give a semantics to the adaptation that is independent of batch size.

A third issue is that theoretical analyses involve the inverse of the covariance matrix of the sample gradients while in practice we make crude diagonal approximations.

It is not clear whether these three issues are connected in some way.

I see no problem with designing the models so as to facilitate SGD. Batch normalization is explicitly motivated by this — that good old internal covariate shift :).

Thanks for the posts — they are thought-provoking, as always!

1. What would be a good empirical test of the annealing vs. quenching hypotheses?

I suppose we could compare SGD to a hill-climbing batch algorithm (gradient descent, or perhaps CG or L-BFGS?). But how exactly? Compare the log likelihood of the training corpus as well as a held-out corpus? (That’s the classic way of detecting over-fitting).

And many of our current algorithmic tricks are aimed at stochastic optimisation (e.g., dropout), so it would be hard to do a “fair” comparison of batch vs stochastic optimisation. (The importance of dropout seems to support the annealing hypothesis, I think).

2. Your point about making hyperparameters conjugate is very reasonable. Of course if we are only doing random search for optimal hyperparameters (and it’s not clear any hyperparameter optimisation methods are really doing much more than random search, AFAIK) then maybe conjugacy won’t help much.

3. What’s going on with these crazy learning rate schedules people are using these days? I can see why starting with a large learning rate and lowering it makes sense (following your argument above), but what’s going on with these saw-tooth schedules? Maybe your conjugacy ideas can help here (maybe adjusting the learning rate is in fact adjusting something else as well?).

4. My point about SGD dominating more advanced optimisation algorithms (yes, I meant Adam, not Adagrad) is that if your argument is correct, you’d expect Adam to not be worse than SGD and occasionally to be better (at least when the covariance matrix is approximately diagonal). But I don’t see Adam used much these days …