I believe there is currently a struggle for the soul of SGD. The issue is whether our theoretical understanding should be based on quenching or on annealing. The quenching school assumes that the goal of SGD is to reach a local minimum as quickly as possible. The annealing school views SGD as a nontrivial local search over parameter space in which “thermal fluctuations” play an important role in escaping from inferior local optima. Quenching seems easier to analyze theoretically. But as I have claimed in my earlier post on the role of theory in deep learning, the desire for rigor can mislead. We should not assume that quenching is the correct conceptual framework simply because it is easier to analyze. Intuitively, annealing would seem to be the more powerful method of searching parameter space. There are also reasons to believe that annealing yields better generalization than quenching. The generalization issues will be the subject of my next post. This post is about modeling SGD as an annealing process using Langevin dynamics as a “statistical mechanics” of SGD.
I would like to make an analogy between deep architectures and physical materials. Different physical materials have different sensitivities to annealing. Steel is a mixture of iron and carbon. At temperatures below 727 °C the carbon precipitates out into carbon sheets between grains of iron crystal. But above 727 °C the iron forms a different crystal structure, causing the carbon sheets to dissolve into the iron. If we heat a small piece of steel above 727 °C and then drop it into cold water it makes a sound like “quench”. When it is quenched the high-temperature crystal structure is preserved and we get hardened steel. Hardened steel can be used as a cutting blade in a drill bit to drill into soft steel. Annealing is a process of gradually reducing the temperature. Gradual annealing produces soft grainy steel. Tempering is a process of re-heating quenched steel to temperatures high enough to change its properties but below the original pre-quenching temperature. This can make the steel less brittle while preserving its hardness. (Acknowledgments to my ninth grade shop teacher.)
Molten glass (silicon dioxide) can never be cooled slowly enough to reach its global energy minimum which is quartz crystal. Instead, silicon dioxide at atmospheric pressure always cools to a glassy (disordered) local optimum with residual entropy but with statistically reliable properties. Minute levels of impurities in the glass (skip connections?) can act as catalysts allowing the cooling process to achieve glassy states with lower energy and different properties. (Acknowledgements to discussions with Akira Ikushima of TTI in Nagoya about the nature of glass.)
The remainder of this post is fairly technical. The quenching school tends to focus on gradient flow as defined by the differential equation

$$\frac{d\Phi}{dt} \;=\; -\,g(\Phi) \;\;\;\;\;\; (1)$$

where $g(\Phi)$ is the gradient of the average loss over the training data. This defines a deterministic continuous path through parameter space which we can try to analyze.
The annealing school views SGD as an MCMC process defined by the stochastic state transition

$$\Phi_{t+1} \;=\; \Phi_t - \eta\,\hat{g}_t \;\;\;\;\;\; (2)$$

where $\hat{g}_t$ is a random variable equal to the loss gradient of a random data point and $\eta$ is the learning rate. Langevin dynamics yields a formal statistical mechanics for SGD as defined by (2). In this blog post I want to try to explain Langevin dynamics as intuitively as I can, using abbreviated material from my lecture slides on the subject.
First, I want to consider numerical integration of gradient flow (1). A simple numerical integration can be written as

$$\Phi_{t+\Delta t} \;=\; \Phi_t - \Delta t\; g(\Phi_t) \;\;\;\;\;\; (3)$$

Comparing (3) with (2), it is natural to interpret $\eta$ in (2) as $\Delta t$. For the stochastic process defined by (2) I will define the time $t$ after $n$ updates by

$$t \;=\; n\,\eta$$

This provides a notion of “time” for SGD as defined by (2) that is consistent with gradient flow in the limit $\eta \rightarrow 0$.
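To make this notion of time concrete, here is a minimal sketch (in Python, my own toy example rather than material from the slides) comparing Euler integration (3) of gradient flow with SGD (2) run to the same time $t = n\eta$ on a one-dimensional quadratic loss. The loss, the noise model, and the function names are illustrative choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(phi):
    # Gradient of the average loss; here a toy quadratic L(phi) = phi^2 / 2.
    return phi

def g_hat(phi):
    # Stochastic gradient: the average gradient plus per-example noise.
    return g(phi) + rng.normal(scale=1.0)

def gradient_flow_euler(phi0, dt, T):
    # Numerical integration (3) of gradient flow (1) with step size dt.
    phi = phi0
    for _ in range(int(T / dt)):
        phi = phi - dt * g(phi)
    return phi

def sgd(phi0, eta, T):
    # SGD (2) run until the "time" t = n * eta reaches T.
    phi = phi0
    for _ in range(int(T / eta)):
        phi = phi - eta * g_hat(phi)
    return phi

# As eta shrinks, SGD run to the same time T tracks the gradient-flow path.
for eta in [0.1, 0.01, 0.001]:
    print(eta, sgd(2.0, eta, T=3.0), gradient_flow_euler(2.0, dt=1e-4, T=3.0))
```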
We now assume that $\eta$ is sufficiently small so that for large $N$ we still have $N\eta$ small, and that $g(\Phi)$ is essentially constant over the range $t$ to $t + N\eta$. Assuming the total gradient is essentially constant at $g$ over the interval from $t$ to $t + \Delta t$, with $\Delta t = N\eta$, we can rewrite (2) in terms of time as

$$\Phi_{t+\Delta t} \;=\; \Phi_t - \eta\sum_{i=1}^{N}\hat{g}_i \;=\; \Phi_t - \Delta t\; g \;-\; \eta\sum_{i=1}^{N}\left(\hat{g}_i - g\right) \;\;\;\;\;\; (4)$$
By the multivariate central limit theorem, a random vector that is a large sum of IID random vectors is approximately Gaussian. This allows us to rewrite (4) as

$$\Phi_{t+\Delta t} \;=\; \Phi_t - \Delta t\; g \;+\; \sqrt{\eta\,\Delta t}\;\epsilon, \;\;\;\; \epsilon \sim \mathcal{N}(0,\Sigma) \;\;\;\;\;\; (5)$$
where $\Sigma$ is the covariance matrix of the random variable $\hat{g}$. A derivation of (5) from (4) is given in the slides. Note that the noise term vanishes in the limit $\eta \rightarrow 0$ and we are back to gradient flow. However, in Langevin dynamics we do not take $\eta$ to zero but instead hold $\eta$ at a fixed nonzero value small enough that (5) is accurate even for small $\Delta t$. Langevin dynamics is formally a continuous-time stochastic process under which (5) also holds, but where the equation is taken to hold at arbitrarily small values of $\Delta t$. The Langevin dynamics can be denoted by the notation

$$d\Phi \;=\; -\,g(\Phi)\,dt \;+\; \sqrt{\eta}\;\epsilon\,\sqrt{dt}, \;\;\;\; \epsilon \sim \mathcal{N}(0,\Sigma) \;\;\;\;\;\; (6)$$
If $g(\Phi) = 0$ independent of $\Phi$, and $\Sigma$ is constant, we get Brownian motion.
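To see why the annealing school cares about the noise term, here is a minimal sketch (again my own toy example, not from the slides) that simulates the discretization (5) on a one-dimensional double-well loss. With $\eta = 0$ the dynamics reduces to gradient flow and stays in its initial basin; with nonzero $\eta$ the “thermal” noise lets the parameter hop between the two minima.

```python
import numpy as np

rng = np.random.default_rng(1)

def g(phi):
    # Gradient of a toy double-well loss L(phi) = (phi^2 - 1)^2 / 4,
    # which has local minima at phi = -1 and phi = +1.
    return phi * (phi**2 - 1)

def langevin(phi0, eta, sigma, dt, T):
    # Discretization (5) of the Langevin dynamics (6):
    # Phi <- Phi - dt * g(Phi) + sqrt(eta * dt) * epsilon, epsilon ~ N(0, sigma^2).
    phi = phi0
    path = []
    for _ in range(int(T / dt)):
        phi = phi - dt * g(phi) + np.sqrt(eta * dt) * rng.normal(scale=sigma)
        path.append(phi)
    return np.array(path)

frozen = langevin(phi0=-1.0, eta=0.0, sigma=2.0, dt=0.01, T=200.0)   # gradient flow
heated = langevin(phi0=-1.0, eta=0.5, sigma=2.0, dt=0.01, T=200.0)   # SGD-like noise
print("fraction of time near +1 (eta = 0):  ", np.mean(frozen > 0))
print("fraction of time near +1 (eta = 0.5):", np.mean(heated > 0))
```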
The importance of the Langevin dynamics is that it allows us to solve for, and think in terms of, a stationary density. Surprisingly, if the covariance matrix $\Sigma$ is not isotropic (if its eigenvalues are not all the same) then the stationary density is not Gibbs. Larger noise yields a broader (hotter) distribution, so when different directions in parameter space have different levels of noise we get different “temperatures” in the different directions. A simple example is given in the slides. However, we can make the noise effectively isotropic by replacing (2) with the preconditioned update

$$\Phi_{t+1} \;=\; \Phi_t - \eta\,\Sigma^{-1}\,\hat{g}_t \;\;\;\;\;\; (7)$$

For this update (and assuming that $\Sigma$ remains constant over parameter space) we get the Gibbs stationary density

$$p(\Phi) \;\propto\; \exp\!\left(-\frac{\mathcal{L}(\Phi)}{\beta\,\eta}\right) \;\;\;\;\;\; (8)$$

for a universal constant $\beta$, where $\mathcal{L}(\Phi)$ is the average training loss.
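As a small numerical illustration of the effect of the preconditioner (my own toy example, not from the slides), consider an isotropic quadratic loss with anisotropic gradient noise. Plain SGD (2) equilibrates to a different spread, i.e. a different temperature, in each direction, while the preconditioned update (7) equilibrates to the same spread in every direction.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup: L(phi) = 0.5 * ||phi||^2 in 2D, so g(phi) = phi, and the
# stochastic gradient noise has an anisotropic covariance Sigma.
Sigma = np.diag([4.0, 0.25])              # high noise along axis 0, low along axis 1
Sigma_inv = np.linalg.inv(Sigma)
noise_std = np.sqrt(np.diag(Sigma))

def run(eta, precondition, steps=200_000):
    phi = np.zeros(2)
    samples = []
    for _ in range(steps):
        g_hat = phi + rng.normal(scale=noise_std)     # stochastic gradient
        step = Sigma_inv @ g_hat if precondition else g_hat
        phi = phi - eta * step                         # update (7) or update (2)
        samples.append(phi.copy())
    return np.array(samples[steps // 2:])              # discard burn-in

plain   = run(eta=0.01, precondition=False)
precond = run(eta=0.01, precondition=True)
print("plain SGD variance per direction:         ", plain.var(axis=0))
print("preconditioned SGD variance per direction:", precond.var(axis=0))
```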
I would argue that the Gibbs stationary distribution (8) is desirable because it helps in escaping from local minima. More specifically, we would like to avoid artificially biasing the search for an escape toward high-noise directions in parameter space at the expense of exploring low-noise directions. The search for an escape direction should be determined by the loss surface rather than by the noise covariance.
Note that (8) indicates that as we reduce the temperature toward zero the loss will approach the loss of the (local) minimum. Let $\mathcal{L}(\eta)$ denote the average loss under the stationary distribution of SGD around a local minimum at learning rate (temperature) $\eta$. The slides contain a proof of the following simple relationship:

$$\mathcal{L}(\eta) \;=\; \mathcal{L}(0) \;+\; \frac{\eta}{4}\;E\!\left[\,\|\hat{g}\|^2\,\right] \;\;\;\;\;\; (9)$$

where the random gradients $\hat{g}$ are drawn at the local optimum (at the local optimum the average gradient is zero). This equation is proved without the use of Langevin dynamics and holds independent of the shape of the stationary distribution. Once we are in the linear region, halving the learning rate corresponds to moving half way to the locally minimal loss. This seems at least qualitatively consistent with empirical learning curves such as the following from the original ResNet paper by He et al.
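The linear relationship is easy to check numerically. Below is a quick sketch (my own toy example; the quadratic loss and noise scale are illustrative choices) that runs plain SGD (2) at several learning rates around a one-dimensional quadratic minimum and compares the measured stationary loss with the prediction of (9).

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model: L(phi) = 0.5 * h * phi^2 with minimum value 0 at phi = 0, and
# stochastic gradients g_hat = h * phi + xi with xi ~ N(0, s^2), so that
# at the optimum E[||g_hat||^2] = s^2.
h, s = 2.0, 3.0

def stationary_loss(eta, steps=500_000):
    # Average loss under the stationary distribution of SGD at learning rate eta.
    phi, total = 0.0, 0.0
    for _ in range(steps):
        g_hat = h * phi + rng.normal(scale=s)
        phi -= eta * g_hat
        total += 0.5 * h * phi**2
    return total / steps

# Relation (9) predicts L(eta) - L(0) ~ (eta / 4) * E[||g_hat||^2] = eta * s^2 / 4,
# so halving the learning rate should halve the excess loss.
for eta in [0.04, 0.02, 0.01]:
    print(f"eta={eta:.2f}  measured={stationary_loss(eta):.4f}  predicted={eta * s**2 / 4:.4f}")
```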
Getting back to the soul of SGD, different deep learning models, like different physical materials, will have different properties. Intuitively, annealing would seem to be the more powerful search method. One might expect that as models become more sophisticated — as they become capable of expressing arbitrary knowledge about the world — annealing will be essential.