This year I realized that VAEs are non-parametrically consistent as models of the observed data even when the encoder is held fixed and arbitrary. This is best demonstrated with a nonstandard derivation of VAEs bypassing the ELBO.

Let $y$ range over observable data and let $z$ range over latent values. Let the encoder be defined by a probability distribution $P_\Psi(z \mid y)$. We then have a joint distribution where $y$ is drawn from the population ($y \sim \mathrm{Pop}$) and $z$ is drawn from the encoder. We let the entropies $H(y)$, $H(z)$, $H(y \mid z)$ and $H(z \mid y)$ and the mutual information $I(y,z)$ all be defined by this joint distribution. To derive the VAE objective we start with the following basic information theoretic equalities

$$I(y,z) = H(y) - H(y \mid z) = H(z) - H(z \mid y)$$

which give

$$H(y) = H(z) + H(y \mid z) - H(z \mid y). \quad (1)$$

Assuming that we can sample from the encoder distribution $P_\Psi(z \mid y)$, and that we can compute $P_\Psi(z \mid y)$ for any $y$ and $z$, the conditional entropy $H(z \mid y)$ can be estimated by sampling. However that is not true of $H(z)$ or $H(y \mid z)$ because we have no way of computing the population quantities $P(z)$ or $P(y \mid z)$. However, entropies defined in terms of the population can be upper bounded (and estimated) by cross-entropies, and we introduce two models, a prior $P_\Phi(z)$ and a decoder $P_\Theta(y \mid z)$, with which to define cross-entropies.

$$H(z) \le H(z, P_\Phi) = E_{y,z}\left[-\log P_\Phi(z)\right] \quad (2)$$

$$H(y \mid z) \le H(y \mid z, P_\Theta) = E_{y,z}\left[-\log P_\Theta(y \mid z)\right] \quad (3)$$
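The fact that cross-entropy upper bounds entropy, with equality exactly when the model matches the true distribution, is easy to check numerically for discrete distributions. A minimal NumPy sketch with made-up distributions `p` and `q`:

```python
import numpy as np

def entropy(p):
    """Entropy H(p) in nats of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = E_{x~p}[-log q(x)] in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

p = np.array([0.7, 0.2, 0.1])   # stand-in for the population distribution
q = np.array([0.5, 0.3, 0.2])   # stand-in for the model distribution

# Cross-entropy upper bounds entropy, with equality iff q == p.
assert cross_entropy(p, q) >= entropy(p)
assert abs(cross_entropy(p, p) - entropy(p)) < 1e-12
```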

Inserting these two cross-entropy upper bounds (or entropy estimators) into (1) gives

$$H(y) \le E_{y,z}\left[-\log P_\Phi(z)\right] + E_{y,z}\left[-\log P_\Theta(y \mid z)\right] - H(z \mid y). \quad (4)$$

The right hand side of (4) is the standard VAE objective function in terms of the prior, the encoder and the decoder. However, this derivation of the upper bound (4) from the exact equality (1) shows that we get a consistent non-parametric estimator of $H(y)$ by optimizing the prior and the decoder according to (2) and (3) while holding the encoder fixed. This follows directly from the fact that cross-entropy is a consistent non-parametric estimator of entropy in the sense that $H(P) = \inf_Q H(P, Q)$. Furthermore, we expect that $P_\Phi(z)$ estimates $P(z)$ and that $P_\Theta(y \mid z)$ estimates $P(y \mid z)$, again independent of the choice of $\Psi$.

This observation gives us freedom in designing latent variable objective functions that produce useful or interpretable latent variables. We can train the prior and decoder by (2) and (3) and train the encoder by **any choice of an encoder objective function**. For example, a natural choice might be

$$\Psi^* = \mathop{\mathrm{argmin}}_\Psi \;\; \beta\left(E_{y,z}\left[-\log P_\Phi(z)\right] - H(z \mid y)\right) + E_{y,z}\left[-\log P_\Theta(y \mid z)\right]. \quad (5)$$

The weight $\beta$ in (5) can be interpreted as providing a rate-distortion trade-off where the mutual information upper bound $E_{y,z}[-\log P_\Phi(z)] - H(z \mid y) \ge I(y,z)$ expresses the channel capacity (information rate) of $z$ as a communication channel for the message $y$. This is exactly the $\beta$-VAE, which weights the rate by $\beta$ rather than weighting the distortion.
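As a concrete illustration of the $\beta$-weighted rate-distortion form, here is a minimal sketch, assuming (purely for illustration) a diagonal Gaussian encoder, a standard normal prior so the rate term has the familiar closed form, and L2 distortion:

```python
import numpy as np

def gaussian_kl(mu, sigma):
    """KL(N(mu, diag(sigma^2)) || N(0, I)): the closed-form rate term."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

def beta_weighted_loss(y, y_hat, mu, sigma, beta):
    """Rate-distortion trade-off: beta * rate + distortion (L2)."""
    rate = gaussian_kl(mu, sigma)
    distortion = np.sum((y - y_hat) ** 2)
    return beta * rate + distortion

# With a perfect reconstruction and the encoder matching the prior,
# both terms vanish and the loss is zero.
y = np.ones(4)
assert abs(beta_weighted_loss(y, y, np.zeros(2), np.ones(2), beta=4.0)) < 1e-12

# Increasing beta penalizes the rate term more heavily.
lo = beta_weighted_loss(y, y, np.ones(2), np.ones(2), beta=1.0)
hi = beta_weighted_loss(y, y, np.ones(2), np.ones(2), beta=4.0)
assert hi > lo
```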

However, there can be different encoders achieving the same rate and distortion. The consistency of (4) independent of $\Psi$ allows additional desiderata to be placed on the encoder. For example, we might want $z$ to be a sequence $z_1, \ldots, z_k$ with the $z_i$ independent and the mutual information with $y$ evenly balanced between the $z_i$, yielding a VAE similar to an InfoGAN.

Here we are designing different objectives for different model components — the objectives defined by (2) and (3) for $\Phi$ and $\Theta$ are intended to be independent of any designed objective for $\Psi$, and the objective for $\Psi$ can be designed independent of (2) and (3). Multiple objective functions yield a multiplayer game with Nash equilibria. In practice we will need to insert stop gradients to prevent, for example, the objective for $\Phi$ (player $\Phi$) from interfering with the objective for $\Psi$ (player $\Psi$) and vice-versa.

The bottom line is that we can select any objective for the encoder while preserving non-parametric consistency of the VAE as a model of the observed data.

A fundamental question is whether belief gradient is a better conceptual framework for RL than policy gradient. I have always felt that there is something wrong with policy gradient methods. The optimal policy is typically deterministic while policy gradient methods rely on significant exploration (policy stochasticity) to compute a gradient. It just seems paradoxical. The belief gradient approach seems to resolve this paradox.

**The action belief function and a belief gradient algorithm.** For a given state $s$ in a given Markov decision process (MDP) with a finite action space there exists a deterministic optimal policy mapping each state $s$ to an optimal action $a^*(s)$. For each decision (in life?) there is some best choice that is generally impossible to know. I propose reinterpreting AlphaZero's policy network $\pi_\Theta(a \mid s)$ as giving an estimator for the probability that $a$ is the best choice, that is, that $a = a^*(s)$. To make this reinterpretation explicit I will change notation and write $B_\Theta(a \mid s)$ for the belief that $a$ is the optimal choice. The notation $\pi_\Theta(a \mid s)$ is generally interpreted as specifying stochastic (erratic?) behavior.

To make the mathematics clearer I will assume that the belief $B_\Theta(a \mid s)$ is computed by a softmax

$$B_\Theta(a \mid s) = \mathop{\mathrm{softmax}}_a \; \left[W f_\Theta(s)\right]_a$$

where $f_\Theta(s)$ is a vector representation of the state $s$. The softmax is not as important here as the idea that $B_\Theta(a \mid s)$ provides only incomplete knowledge of $a^*(s)$. Further observations on $a^*(s)$ are possible. For example, we can grow a search tree $T(s)$ rooted at $s$. For a given distribution on states we then get a distribution on the pairs $(s, T(s))$. We assume that we also have some way of computing a more informed belief $B(a \mid s, T(s))$. AlphaZero computes a replay buffer containing pairs $(s, B)$ where the stored distribution $B$ is $B(a \mid s, T(s))$. Ideally the belief $B_\Theta(a \mid s)$ would match the marginal beliefs over the more informed search tree results. This motivates the following update, where $R$ denotes the replay buffer.

$$\Theta \mathrel{-}= \eta \, \nabla_\Theta \; E_{(s,B) \sim R} \; E_{a \sim B}\left[-\log B_\Theta(a \mid s)\right] \quad (1)$$

This is just the gradient of a cross-entropy loss from the replay buffer beliefs to the belief $B_\Theta(a \mid s)$.
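For a softmax belief this gradient has the well-known closed form: the gradient of the cross-entropy with respect to the action scores is simply the belief minus the target belief. A minimal sketch, checked against a numerical derivative (the specific logits and search belief are made up):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()

def ce_grad(logits, target):
    """Gradient of H(target, softmax(logits)) w.r.t. the logits:
    for a softmax belief this is just belief - target."""
    return softmax(logits) - target

logits = np.array([2.0, 0.5, -1.0])          # action scores for one state
search_belief = np.array([0.8, 0.15, 0.05])  # more informed belief from search

g = ce_grad(logits, search_belief)
assert abs(g.sum()) < 1e-12                  # gradient components sum to zero

# Verify against a central-difference derivative of the cross-entropy loss.
def ce(lg):
    return -np.sum(search_belief * np.log(softmax(lg)))

eps = 1e-6
d0 = np.array([eps, 0.0, 0.0])
numerical = (ce(logits + d0) - ce(logits - d0)) / (2 * eps)
assert abs(numerical - g[0]) < 1e-6
```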

**Some AlphaZero details.** For the sake of completeness I will describe a few more details of the AlphaZero algorithm. The algorithm computes rollouts to the end of an episode (game) where each action in the rollout is selected based on tree search. Each rollout adds a set of pairs $(s, B)$ to the replay buffer where $s$ is a state in that rollout and $B$ is a recording of $B(a \mid s, T(s))$. Each rollout also has a reward (was the game won or lost). AlphaZero stores a second set of replay values $(s, R)$ for each rollout state $s$ where $R$ is the reward of that rollout. The replay pairs $(s, R)$ are used to train a value function $V_\Theta(s)$ estimating the reward that will be achieved from state $s$. The value function acts as a kind of static board evaluator in growing the tree $T(s)$.

**Improved generality.** The abstract mathematical formulation given in equation (1) is more general than tree search. In the belief $B(a \mid s, o)$ we can have that $o$ is any information about $a^*(s)$ that usefully augments the information provided in $s$.

**A belief gradient theorem.** While I have reinterpreted policies as beliefs, and have recast AlphaZero's algorithm in that light, I have not provided a belief gradient theorem. The simplest such theorem is for imitation learning. Assume an expert that labels states with actions. Optimizing the cross-entropy loss for this labeled data yields a probability $B(a \mid s)$. Learning a probability does not imply that the agent should make a random choice. Belief is not the same as choice.

Marvin Minsky: We need common-sense knowledge – and programs that can use it. Common sense computing needs several ways of representing knowledge. It is harder to make a computer housekeeper than a computer chess-player … There are [currently] very few people working with common sense problems. … [one such person is] John McCarthy, at Stanford University, who was the first to formalize common sense using logics. …

The desire to understand human reasoning was the philosophical motivation for mathematical logic in the 19th and early 20th centuries. Indeed, the situation calculus of McCarthy, 1963 was a seminal point in the development of logical approaches to commonsense. But today the logicist approach to AI is generally viewed as a failure and, in spite of the recent advances with deep learning, understanding commonsense remains a daunting roadblock on the path to AGI.

Today we look to BERT and GPT and their descendants for some kind of implicit understanding of semantics. Do these huge models, trained merely on raw text, contain some kind of implicit knowledge of commonsense? I have always believed that truly minimizing the entropy of English text requires an understanding of whatever the text is about — that minimizing the entropy of text about everyday events requires an understanding of the events themselves (whatever that means). Before the BERT era I generally got resistance to the idea that language modeling could do anything very interesting. Indeed it remains unclear to what extent current language models actually embody semantics and to what extent semantics can actually be extracted from raw text.

In recent times the Winograd schema challenge (WSC) has been the most stringent test of common sense reasoning. The currently most prestigious version is the SuperGLUE-WSC. In this version a sentence is presented with both a pronoun and a noun tagged. The task is to determine whether the tagged pronoun refers to the tagged noun. The sentences are selected with the intention of requiring commonsense understanding. For example we have

The trophy would not fit in the **suitcase** because **it** was too big.

The task is to determine whether the tagged pronoun “it” refers to the tagged noun “suitcase” — here “it” actually refers to “trophy”. If we replace “big” by “small” the referent flips from “trophy” to “suitcase”. It is perhaps shocking (disturbing?) that the state of the art (SOTA) for this problem is 94% — close to human performance. This is done by fine-tuning a huge language model on a very modest number of training sentences from WSC.

But do the language models really embody common sense? A paper from Jackie Cheung’s group at McGill found that a fraction of the WSC problems can be answered by simple n-gram statistics. For example consider

I’m sure that my **map** will show this building; **it** is very famous.

A language model easily determines that the phrase “my map is very famous” is less likely than “this building is very famous”, so the system can guess that the referent is “building” and not “map”. But they report that this phenomenon accounts for only 13.5% of the test problems. Presumably there are other “tells” that allow correct guessing without understanding. But hidden tells are hard to identify and we have no way of measuring what fraction of answers are generated legitimately.

A very fun, and also shocking, example of language model commonsense is the COMET system from Yejin Choi’s group at the University of Washington. In COMET one presents the system with unstructured text describing an event and the system fills in answers to nine standard questions about events such as “before, the agent needed …”. This can be viewed as asking for the action prerequisites in McCarthy’s situation calculus. COMET gives an answer as unstructured text. There is a web interface where one can play with it. As a test I took the first sentence from the jacket cover of the novel “Machines like me”.

Charlie, drifting through life, is in love with Miranda, a brilliant student with a dark secret.

I tried “Charlie is drifting through life”. I was shocked to see that the system suggested that Charlie had lost his job, which was true in the novel. I recommend playing with COMET. I find it kind of creepy.

But, as Turing suggested, the real test of understanding is dialogue. This brings me to the chatbot Meena. This is a chatbot derived from a very large language model — 2.6 billion parameters trained on 341 gigabytes of text. The authors devised various human evaluations and settled on an average of a human-judged sensibleness score and a human-judged specificity score — the Sensibleness and Specificity Average, or SSA. They show that they perform significantly better than previous chatbots by this metric. Perhaps most interestingly, they give a figure relating the human-judged SSA performance to the perplexity achieved by the underlying language model.

They suggest that a much more coherent chatbot could be achieved if the perplexity could be reduced further. This supports the belief that minimal perplexity requires understanding.

But in spite of the success of difficult-to-interpret language models, I still believe that interpretable entities and relations are important for commonsense. It also seems likely that we will find ways to greatly reduce the data requirements of our language models. A good place to look for such improvements is to somehow improve our understanding of the relationship, assuming there is one, between language modeling and interpretable entities and relations.

The classical proof of consistency for pseudo-likelihood assumes that the actual population distribution is defined by some setting of the MRF weights. For BERT I will replace this assumption with the assumption that the deep model is capable of exactly modeling the various conditional distributions. Because deep models are intuitively much more expressive than linear MRFs over hand-designed features, this deep expressivity assumption seems much weaker than the classical assumption.

In addition to assuming universal expressivity, I will assume that training finds a global optimum. Assumptions of complete optimization currently underlie much of our intuitive understanding of deep learning. Consider the GAN consistency theorem. This theorem assumes both universal expressivity and complete optimization of both the generator and the discriminator. While these assumptions seem outrageous, the GAN consistency theorem is the source of the design of the GAN architecture. The value of such outrageous assumptions in architecture design should not be under-estimated.

For training BERT we assume a population distribution over blocks (or sentences) of $N$ words $w_1, \ldots, w_N$. I will assume that BERT is trained by blanking a single word in each block. This single-blank assumption is needed for the proof but seems unlikely to matter in practice. Also, I believe that the proof can be modified to handle XLNet, which predicts a single held-out subsequence per block rather than multiple independently modeled blanks.

Let $\Phi$ be the BERT parameters and let $P_\Phi(w_i \mid w_{\setminus i})$ be the distribution on words that BERT assigns to the $i$th word when the $i$th word is blanked, where $w_{\setminus i}$ denotes the block with the $i$th word removed. The training objective for BERT is

$$\Phi^* = \mathop{\mathrm{argmin}}_\Phi \; E_{w \sim \mathrm{Pop}} \; E_i \; H_{w_{\setminus i}}\!\left(P_{\mathrm{Pop}}(w_i \mid w_{\setminus i}),\; P_\Phi(w_i \mid w_{\setminus i})\right)$$

where $H_{w_{\setminus i}}$ denotes cross-entropy conditioned on $w_{\setminus i}$. Each cross-entropy term is individually minimized when $P_\Phi(w_i \mid w_{\setminus i}) = P_{\mathrm{Pop}}(w_i \mid w_{\setminus i})$. Our universality assumption is that there exists a $\Phi^*$ satisfying all of these conditional distributions simultaneously. Under this assumption we have

$$P_{\Phi^*}(w_i \mid w_{\setminus i}) = P_{\mathrm{Pop}}(w_i \mid w_{\setminus i})$$

for all $i$ and $w_{\setminus i}$.

I must now define the language model (full sentence distribution) defined by $\Phi^*$. For this I consider Gibbs sampling — the stochastic process defined on blocks $w$ by randomly selecting a position $i$ and replacing $w_i$ by a sample from $P_{\Phi^*}(w_i \mid w_{\setminus i})$. The language model is now defined to be the stationary distribution of this Gibbs sampling process. But this Gibbs process is the same process as Gibbs sampling using the population conditionals. Therefore the stationary distribution must be $P_{\mathrm{Pop}}$. Q.E.D.
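The argument is easy to verify numerically on a toy example (not BERT itself): build the exact random-scan Gibbs kernel from the conditionals of a small, made-up joint distribution over two-word blocks and check that iterating it recovers that joint as the stationary distribution.

```python
import numpy as np

# A toy "population" over two-word blocks (w1, w2), each word binary.
# States are indexed 0..3 as (w1, w2) = (s >> 1, s & 1).
P = np.array([0.4, 0.1, 0.2, 0.3])

def gibbs_transition(P):
    """Random-scan Gibbs kernel built from the exact conditionals of P."""
    T = np.zeros((4, 4))
    for s in range(4):
        w1, w2 = s >> 1, s & 1
        # With probability 1/2, resample w1 from P(w1 | w2).
        col = np.array([P[(b << 1) | w2] for b in (0, 1)])
        col /= col.sum()
        for b in (0, 1):
            T[s, (b << 1) | w2] += 0.5 * col[b]
        # With probability 1/2, resample w2 from P(w2 | w1).
        row = np.array([P[(w1 << 1) | b] for b in (0, 1)])
        row /= row.sum()
        for b in (0, 1):
            T[s, (w1 << 1) | b] += 0.5 * row[b]
    return T

T = gibbs_transition(P)
assert np.allclose(T.sum(axis=1), 1.0)   # a valid stochastic matrix

# Iterate the chain from a uniform start; it converges back to P.
mu = np.full(4, 0.25)
for _ in range(500):
    mu = mu @ T
assert np.allclose(mu, P, atol=1e-6)
```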

Hinton’s grand vision of AI has always been that there are simple general principles of learning, analogous to the Navier-Stokes equations of fluid flow, from which complex general intelligence emerges. I think Hinton under-estimates the complexity required for a general learning mechanism, but I agree that we are searching for some general (i.e., minimal-bias) architecture. For the following reasons I believe that vector quantization is an inevitable component of the architecture we seek.

**A better learning bias.** Do the objects of reality fall into categories? If so, shouldn’t a learning architecture be designed to categorize? A standard theory of language learning is that the child learns to recognize certain things, like mommy and doggies, and then later attaches these learned categories to the words of language. It seems natural to assume that categorization precedes language in both development and evolution. The objects of reality do fall into categories and every animal must identify potential mates, edible objects, and dangerous predators.

It is not clear that the vector quanta used in VQ-VAE-2 correspond to meaningful categories. It is true, however, that the only meaningful distribution models of ImageNet images are class-conditional. A VQ-VAE with a vector quantum for the image as a whole at least has the potential to allow class-conditioning to emerge from the data.

**Interpretability.** Vector quantization shifts the interpretability question from that of interpreting linear threshold units to that of interpreting emergent symbols — the embedded tokens that are the emergent vector quanta. A VQ-VAE with a vector-quantized full-image representation would cry out for a class interpretation of the emergent image symbol.

A fundamental issue is whether the vectors being quantized actually fall into natural discrete clusters. Today this form of interpretation is often done with t-SNE. But if vectors naturally fall into clusters then it seems that our models should seek out and utilize that clustering. Interpretation can then focus on the meaning of the emergent symbols.

**The rate-distortion training objective.** VQ-VAEs support rate-distortion training as discussed in my previous blog post on rate-distortion metrics for GANs. I have always been skeptical of GANs because of the lack of meaningful performance metrics for generative models lacking an encoder. While the new CAS metric does seem more meaningful than previous metrics, I still feel that training on cross-entropy loss (negative log likelihood) should ultimately be more effective than adversarial training. Rate-distortion metrics assume a discrete compressed representation defining a “rate” (the size of a compressed image file) and some measure of distortion between the original image and its reconstruction from the compressed file. The rate is measured by negative log-likelihood (a kind of cross-entropy loss).

Rate-distortion training is also different from differential cross-entropy training as used in flow networks (invertible generative networks such as GLOW). Differential cross-entropy can be unboundedly negative. To avoid minimizing an unboundedly negative quantity, when training on differential cross-entropy one must introduce a scale parameter for the real numbers in the output (the parameter “a” under equation (2) in the GLOW paper). This scale parameter effectively models the output numbers as discrete integers such as the bytes in the color channels of an image. The unboundedly negative differential cross-entropy then becomes a non-negative discrete cross-entropy. However, this treatment of differential cross-entropy still fails to support a rate-distortion tradeoff parameter.

**Unifying vision and language architectures.** A fundamental difference between language models and image models involves word vectors. Language models use embedded symbols where the symbols are manifest in the data. The vast majority of image models do not involve embedded symbols. Vector quantization is a way for embedded symbols to emerge from continuous signal data.
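The vector-quantization step itself is just nearest-neighbor assignment to a codebook of embedded symbols. A minimal sketch with a hypothetical two-entry codebook:

```python
import numpy as np

def quantize(vectors, codebook):
    """Assign each vector to its nearest codebook entry (its emergent symbol)
    and return both the symbol indices and the quantized vectors."""
    # Squared distances between each vector and each codebook entry.
    d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]

codebook = np.array([[0.0, 0.0],      # emergent symbol 0
                     [1.0, 1.0]])     # emergent symbol 1

vectors = np.array([[0.1, -0.1],
                    [0.9, 1.2],
                    [1.1, 0.8]])

symbols, quantized = quantize(vectors, codebook)
assert symbols.tolist() == [0, 1, 1]
assert np.allclose(quantized[0], codebook[0])
```

In a VQ-VAE the codebook entries are themselves learned, and the straight-through trick passes gradients through the non-differentiable `argmin`; the sketch above shows only the forward assignment.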

**Preserving parameter values under retraining.** When we learn to ski we do not forget how to ride a bicycle. However, when a deep model is trained on a first task (riding a bicycle) and then on a second task (skiing), optimizing the parameters for the second task degrades the performance on the first. But note that when a language model is trained on a new topic, the word embeddings of words not used in the new topic will not change. Similarly a model based on vector quanta will not change the vectors for the bicycle control symbols not invoked when training on skiing.

**Improved transfer learning.** Transfer learning and few-shot learning (meta-learning) may be better supported with embedded symbols for the same reason that embedded symbols reduce forgetting — embedded symbols can be class or task specific. The adaptation to a new domain can be restricted to the symbols that arise (under some non-parametric nearest neighbor scheme) in the new domain, class or task.

**Emergent symbolic representations.** The historical shift from symbolic logic-based representations to distributed vector representations is typically viewed as one of the cornerstones of the deep learning revolution. The dramatic success of distributed (vector) representations in a wide variety of applications cannot be disputed. But it also seems true that mathematics is essential to science. I personally believe that logical symbolic reasoning is necessary for AGI. Vector quantization seems to be a minimal-bias way for symbols to enter into deep models.


Cross-entropy loss is typically disregarded for GANs in spite of the fact that it is the de-facto metric for modeling distributions and in spite of its success in pre-training for NLP tasks. In this post I argue that rate-distortion metrics — a close relative of cross-entropy loss — should be a major component of GAN evaluation (in addition to discrimination loss). Furthermore, evaluating GANs by rate-distortion metrics leads to a conceptual unification of GANs, VAEs and signal compression. This unification is already emerging from image compression applications of GANs such as the work of Agustsson et al. 2018. The following figure by Julien Despois can be interpreted in terms of VAEs, signal compression, or GANs.

The VAE interpretation is defined by

$$\Phi^*, \Psi^*, \Theta^* = \mathop{\mathrm{argmin}}_{\Phi,\Psi,\Theta} \; E_{y \sim \mathrm{Pop}} \; E_{z \sim P_\Psi(z \mid y)} \left[-\log \frac{P_\Phi(z)\, P_\Theta(y \mid z)}{P_\Psi(z \mid y)}\right] \quad (1)$$

Now define $\Psi^*(\Phi, \Theta)$ as the minimizer of (1) over $\Psi$ while holding $\Phi$ and $\Theta$ fixed. Using this to express the objective as a function of $\Phi$ and $\Theta$, and assuming universal expressiveness of $P_\Psi$, the standard ELBO analysis shows that (1) reduces to minimizing the cross-entropy loss of the generative model $P_{\Phi,\Theta}(y) = \sum_z P_\Phi(z)\, P_\Theta(y \mid z)$.

It should be noted, however, that differential entropies and cross-entropies suffer from the following conceptual difficulties.

- The numerical value of entropy and cross entropy depends on an arbitrary choice of units. For a distribution on lengths, probability per inch is numerically very different from probability per mile.
- Shannon’s source coding theorem fails for continuous densities — it takes an infinite number of bits to specify a single real number.
- The data processing inequality fails for differential entropy — $2y$ has a different differential entropy than $y$ even though it carries the same information.
- Differential entropies can be negative.
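The first and last points are easy to check numerically: for a uniform density of a given width the differential entropy is simply $\log(\text{width})$, which is negative for widths below one and shifts by a constant under a change of units.

```python
import numpy as np

def diff_entropy_uniform(width):
    """Differential entropy (in nats) of a uniform density of given width."""
    return np.log(width)

# A uniform density narrower than 1 has NEGATIVE differential entropy.
assert diff_entropy_uniform(0.5) < 0

# The value depends on the units: the same physical distribution measured
# in inches vs. miles (1 mile = 63360 inches) differs by log(63360).
h_inches = diff_entropy_uniform(63360.0)  # a one-mile-wide uniform, in inches
h_miles = diff_entropy_uniform(1.0)       # the same distribution, in miles
assert np.isclose(h_inches - h_miles, np.log(63360.0))
```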

For continuous data we can replace the differential cross-entropy objective with a more conceptually meaningful rate-distortion objective. Independent of conceptual objections to differential entropy, a rate-distortion objective allows for greater control of the model through a rate-distortion tradeoff parameter $\beta$ as is done in $\beta$-VAEs (Higgins et al. 2017, Alemi et al. 2017). A special case of a $\beta$-VAE is defined by

$$\Phi^*, \Psi^*, \Theta^* = \mathop{\mathrm{argmin}}_{\Phi,\Psi,\Theta} \; E_{y \sim \mathrm{Pop}}\left[ KL\!\left(P_\Psi(z \mid y),\, P_\Phi(z)\right) + \beta \, E_{z \sim P_\Psi(z \mid y)}\, \mathrm{Dist}(y,\, y_\Theta(z)) \right] \quad (3)$$

The VAE optimization (1) can be transformed into the rate-distortion equation (3) by taking

$$P_\Theta(y \mid z) = \frac{1}{Z}\, e^{-\beta\, \mathrm{Dist}(y,\, y_\Theta(z))}$$

and taking the normalizer $Z$ to be a fixed constant. In this case (1) transforms into (3) up to an additive constant $\log Z$. Distortion measures such as L1 and L2 preserve the units of the signal and are more conceptually meaningful than differential cross-entropy. But see the comments below on other obvious issues with L1 and L2 distortion measures. KL-divergence is defined in terms of a ratio of probability densities and, unlike differential entropy, is conceptually well-formed.

Equation (3) leads to the signal compression interpretation of the figure above. It turns out that the KL term in (3) can be interpreted as a compression rate. Let $\Psi^*(\Phi, \Theta)$ be the optimum in (3) for a fixed value of $\Phi$ and $\Theta$. Assuming universality of $P_\Psi$, the resulting optimization of $\Phi$ and $\Theta$ becomes the following, where $z$ is drawn from $P_{\Psi^*}(z \mid y)$:

$$\Phi^*, \Theta^* = \mathop{\mathrm{argmin}}_{\Phi,\Theta} \; E_{y \sim \mathrm{Pop}}\left[ KL\!\left(P_{\Psi^*}(z \mid y),\, P_\Phi(z)\right) + \beta\, E_z\, \mathrm{Dist}(y,\, y_\Theta(z)) \right] \quad (4)$$

The KL term can now be written as a mutual information between $y$ and $z$:

$$E_{y \sim \mathrm{Pop}}\; KL\!\left(P_{\Psi^*}(z \mid y),\, P_{\Phi^*}(z)\right) = I(y, z)$$

Hence (4) can be rewritten as

$$\Phi^*, \Theta^* = \mathop{\mathrm{argmin}}_{\Phi,\Theta} \; I(y, z) + \beta\, E_{y,z}\, \mathrm{Dist}(y,\, y_\Theta(z)) \quad (5)$$

A more explicit derivation can be found in slides 17 through 21 in my lecture slides on rate-distortion auto-encoders.

By Shannon’s channel capacity theorem, the mutual information $I(y,z)$ is the number of bits transmitted through a noisy channel from $y$ to $z$ — it is the number of bits from $y$ that reach the decoder. In the figure $z$ is defined by the equation $z = z_\Psi(y) + \epsilon$ for some fixed noise distribution on $\epsilon$. Adding noise can be viewed as limiting precision. For standard data compression, where $z$ must be a compressed file with a definite number of bits, the equation $z = \mathrm{round}(z_\Psi(y))$ can be interpreted as a rounding operation that rounds $z_\Psi(y)$ to integer coordinates. See Agustsson et al. 2018.

We have now unified VAEs with data-compression rate-distortion models. To unify these with GANs we can take $P_\Phi(z)$ and $y_\Theta(z)$ to be the generator of a GAN. We can train the GAN generator in the traditional way using only adversarial discrimination loss and then measure a rate-distortion metric by training $\Psi$ to minimize (3) while holding $\Phi$ and $\Theta$ fixed. Alternatively, we can add a discrimination loss to (3) based on the discrimination between $y$ and $y_\Theta(z)$ and train all the parameters together. It seems intuitively clear that a low rate-distortion value on test data indicates an absence of mode collapse — it indicates that the model can efficiently represent novel images drawn from the population. Ideally, the rate-distortion metric should not increase much as we add weight to a discrimination loss.

A standard objection to L1 or L2 distortion measures is that they do not represent “perceptual distortion” — the degree of difference between two images as perceived by a human observer. One interpretation of perceptual distortion is that two images are perceptually similar if they are both “natural” and carry “the same information”. In defining what we mean by the same information we might invoke predictive coding or the information bottleneck method. The basic idea is to find an image representation that achieves compression while preserving mutual information with other (perhaps future) images. This can be viewed as an information theoretic separation of “signal” from “noise”. When we define the information in an image we should be disregarding noise. So while it is nice to have a unification of GANs, VAEs and signal compression, it would seem better to have a theoretical framework providing a distinction between signal and noise. Ultimately we would like a rate-utility metric for perceptual representations.

First some comments on PAC-Bayesian theory. I coined the term “PAC-Bayes” in the late 90’s to describe a class of theorems giving PAC generalization guarantees in terms of arbitrarily chosen prior distributions. Some such theorems (Occam bounds) pre-date my work. Over the last twenty years there has been significant refinement of these bounds by various authors. A concise general presentation of PAC-Bayesian theory can be found in my PAC-Bayes tutorial.

After about 2005, PAC-Bayesian analysis largely fell out of usage in the learning theory community in favor of more “sophisticated” concepts. However, PAC-Bayesian bounds are now having a resurgence — their conceptual simplicity is paying off in the analysis of deep networks. Attempts to apply VC dimension or Rademacher complexity to deep networks yield extremely vacuous guarantees — guarantees on binary error rates tens of orders of magnitude larger than 1. PAC-Bayesian theorems, on the other hand, can produce non-vacuous guarantees — guarantees less than 1. Non-vacuous PAC-Bayesian guarantees for deep networks were first computed for MNIST by Dziugaite et al. and recently for ImageNet by Zhou et al.

In annealed SGD the learning rate acts as a temperature parameter which is set high initially and then cooled. As we cool the temperature we can think of the model parameters as a glass that is cooled to some finite temperature at which it becomes solid — becomes committed to a particular local basin of the energy landscape. For a given parameter initialization $\Phi_0$, annealed SGD defines a probability density $p(\Phi \mid \Phi_0, S)$ over the final model $\Phi$ given the initialization $\Phi_0$ and training data $S$. Under Langevin dynamics we have that $p(\Phi \mid \Phi_0, S)$ is a smooth density. If we are concerned that Langevin dynamics is only an approximation, we can add some small Gaussian noise to SGD to ensure that $p(\Phi \mid \Phi_0, S)$ is smooth.

The entropy of the distribution $p(\Phi \mid \Phi_0, S)$ is the residual entropy of the parameter vector. Note that if there were one final global optimum which was always found by SGD (quartz crystal) then the residual entropy would be zero. The distribution $p(\Phi \mid \Phi_0, S)$, and hence the residual entropy, includes all the possible local basins (all the possible solid structures of the glass).

To state the residual entropy bound we also need to define a parameter distribution in terms of the population, independent of any particular sample. Let $N$ be the number of data points in the sample. Let $p(\Phi \mid \Phi_0)$ be the distribution defined by first drawing an IID sample $S$ of size $N$ and then sampling $\Phi$ from $p(\Phi \mid \Phi_0, S)$. The entropy of the distribution $p(\Phi \mid \Phi_0)$ is the residual entropy of annealing as defined by the population, independent of the draw of any particular sample. The residual entropy bound is governed by $E_S\, KL\!\left(p(\Phi \mid \Phi_0, S),\, p(\Phi \mid \Phi_0)\right)$. This is an expected KL-divergence between two residual entropy distributions. It is also equal to the mutual information between $\Phi$ and $S$ under the joint distribution $p(S, \Phi \mid \Phi_0)$.

To formally state the bound we assume that for a data point $x$ we have a bounded loss $\mathcal{L}(\Phi, x) \in [0, L_{\max}]$. I will write $\mathcal{L}(\Phi)$ for the expectation of $\mathcal{L}(\Phi, x)$ when $x$ is drawn from the population, and write $\hat{\mathcal{L}}(\Phi)$ for the average of $\mathcal{L}(\Phi, x)$ over $x$ drawn from the training sample. The following residual entropy bound is a corollary of theorem 4 in my PAC-Bayes tutorial.
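In rough form (my paraphrase of the general shape of such PAC-Bayes bounds, not the exact statement of the corollary), the residual entropy bound says that for any $\lambda > 1/2$:

```latex
E_S \; E_{\Phi \sim p(\Phi \mid \Phi_0, S)} \left[\mathcal{L}(\Phi)\right]
\;\le\;
\frac{1}{1 - \frac{1}{2\lambda}}
\left(
  E_S \; E_{\Phi}\left[\hat{\mathcal{L}}(\Phi)\right]
  \;+\;
  \frac{\lambda\, L_{\max}}{N}\;
  E_S \; KL\!\left(p(\Phi \mid \Phi_0, S),\; p(\Phi \mid \Phi_0)\right)
\right) \quad (1)
```

The point is that the only complexity term is the expected KL-divergence, which is exactly the residual entropy gap discussed above.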

There are good reasons to believe that (1) is extremely tight. This is a PAC-Bayesian bound where the “prior” is $p(\Phi \mid \Phi_0)$. PAC-Bayesian bounds yield non-vacuous values under Gaussian priors. It can be shown that $p(\Phi \mid \Phi_0)$ is the optimal prior for posteriors of the form $p(\Phi \mid \Phi_0, S)$.

Unfortunately analysis of (1) seems intractable. But difficulty of theoretical analysis does not imply that a conceptual framework is wrong. The clear case that (1) is extremely tight would seem to cast doubt on analyses selected for tractability at the expense of realism.

I would like to make an analogy between deep architectures and physical materials. Different physical materials have different sensitivities to annealing. Steel is a mixture of iron and carbon. At temperatures below 727 °C the carbon precipitates out into carbon sheets between grains of iron crystal. But above 727 °C the iron forms a different crystal structure causing the carbon sheets to dissolve into the iron. If we heat a small piece of steel above 727 °C and then drop it into cold water it makes a sound like “quench”. When it is quenched the high temperature crystal structure is preserved and we get hardened steel. Hardened steel can be used as a cutting blade in a drill bit to drill into soft steel. Annealing is a process of gradually reducing the temperature. Gradual annealing produces soft grainy steel. Tempering is a process of re-heating quenched steel to temperatures high enough to change its properties but below the original pre-quenching temperature. This can make the steel less brittle while preserving its hardness. (Acknowledgments to my ninth grade shop teacher.)

Molten glass (silicon dioxide) can never be cooled slowly enough to reach its global energy minimum which is quartz crystal. Instead, silicon dioxide at atmospheric pressure always cools to a glassy (disordered) local optimum with residual entropy but with statistically reliable properties. Minute levels of impurities in the glass (skip connections?) can act as catalysts allowing the cooling process to achieve glassy states with lower energy and different properties. (Acknowledgements to discussions with Akira Ikushima of TTI in Nagoya about the nature of glass.)

The remainder of this post is fairly technical. The quenching school tends to focus on gradient flow as defined by the differential equation

$$\frac{d\Phi}{dt} = -g(\Phi) \quad (1)$$

where $g(\Phi) = \nabla_\Phi \hat{\mathcal{L}}(\Phi)$ is the gradient of the average loss over the training data. This defines a deterministic continuous path through parameter space which we can try to analyze.

The annealing school views SGD as an MCMC process defined by the stochastic state transition

$$\Phi_{n+1} = \Phi_n - \eta\, \hat{g}_n \quad (2)$$

where $\hat{g}_n$ is a random variable equal to the loss gradient of a random data point. Langevin dynamics yields a formal statistical mechanics for SGD as defined by (2). In this blog post I want to try to explain Langevin dynamics as intuitively as I can using abbreviated material from my lecture slides on the subject.

First, I want to consider numerical integration of gradient flow (1). A simple numerical integration can be written as

Φ(t + Δt) = Φ(t) − Δt g(Φ(t))        (3)

Comparing (3) with (2) it is natural to interpret η in (2) as Δt. For the stochastic process defined by (2) I will define the time t by

t = iη

or

Δt = Nη

for N updates of (2). This provides a notion of “time” for SGD as defined by (2) that is consistent with gradient flow in the limit η → 0.
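As a quick numerical illustration (a toy quadratic loss of my own choosing, not from the slides), Euler steps of size η converge to the exact gradient flow as η shrinks, which is what makes (number of updates) × η a sensible notion of time:

```python
import numpy as np

# Toy illustration: gradient flow for L(w) = 0.5*h*w^2 has the exact
# solution w(t) = w0 * exp(-h*t).  Euler steps of size eta approximate it,
# and the approximation improves as eta shrinks.
h, w0, T = 1.0, 1.0, 2.0

def euler(eta):
    """Numerical integration of dw/dt = -g(w) with step size eta up to time T."""
    w = w0
    for _ in range(int(round(T / eta))):
        w -= eta * h * w            # w(t + eta) = w(t) - eta * g(w(t))
    return w

exact = w0 * np.exp(-h * T)
err_coarse = abs(euler(0.1) - exact)
err_fine = abs(euler(0.001) - exact)
print(err_coarse, err_fine)         # the smaller step size tracks the flow better
```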

We now assume that η is sufficiently small so that for large N we still have Nη small and that g(Φ) is essentially constant over the range from update i to update i + N. Assuming the total gradient is essentially constant at g over the interval from t to t + Δt we can rewrite (2) in terms of time as

Φ(t + Δt) = Φ(t) − Δt g − η ∑_{i=1}^{N} (ĝ_i − g)        (4)
By the multivariate law of large numbers a random vector that is a large sum of IID random vectors is approximately Gaussian. This allows us to rewrite (4) as

Φ(t + Δt) = Φ(t) − Δt g + √(ηΔt) ε,   ε ∼ N(0, Σ)        (5)

where Σ is the covariance matrix of the random variable ĝ. A derivation of (5) from (4) is given in the slides. Note that the noise term vanishes in the limit η → 0 and we are back to gradient flow. However in Langevin dynamics we do not take η to zero but instead we hold η at a fixed nonzero value small enough that (5) is accurate even for Δt small. Langevin dynamics is formally a continuous time stochastic process under which (5) also holds but where the equation is taken to hold at arbitrarily small values of Δt. The Langevin dynamics can be denoted by the notation

dΦ = −g dt + √η ε √dt,   ε ∼ N(0, Σ)        (6)

If g = 0 independent of Φ, and Σ is constant, we get Brownian motion.
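The variance scaling behind the Gaussian noise term can be checked numerically. In this sketch (my own toy setup, not from the slides) per-sample gradients are the total gradient plus noise with a known covariance, and the accumulated deviation of N updates from the deterministic step has covariance close to ηΔtΣ:

```python
import numpy as np

rng = np.random.default_rng(0)
eta, N = 0.01, 400                   # learning rate and number of SGD updates
dt = N * eta                         # elapsed "time" Delta-t
Sigma = np.array([[2.0, 0.5],        # assumed covariance of g-hat around g
                  [0.5, 1.0]])
A = np.linalg.cholesky(Sigma)

# For each trial, sum the N noise terms -eta*(g_hat_i - g) and measure the
# covariance of the result across trials.
trials = 5000
noise = rng.standard_normal((trials, N, 2)) @ A.T
deviation = -eta * noise.sum(axis=1)
emp_cov = np.cov(deviation.T)

print(emp_cov)
print(eta * dt * Sigma)              # the two matrices roughly agree
```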

The importance of the Langevin dynamics is that it allows us to solve for, and think in terms of, a stationary density. Surprisingly, if the covariance matrix Σ is not isotropic (if the eigenvalues are not all the same) then the stationary density is not Gibbs. Larger noise will yield a broader (hotter) distribution. When different directions in parameter space have different levels of noise we get different “temperatures” in the different directions. A simple example is given in the slides. However we can make the noise isotropic (after a fixed linear change of coordinates) by replacing (2) with the update

Φ_{i+1} = Φ_i − η Σ⁻¹ ĝ_i        (7)

For this update (and assuming that Σ remains constant over parameter space) we get

p(Φ) ∝ e^{−L(Φ)/(αη)}        (8)

for a universal constant α.
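A one-dimensional sanity check of the Gibbs claim (my own toy setup, not from the slides): for L(Φ) = ½hΦ² with unit-variance gradient noise the inverse-covariance correction is trivial, and a Gibbs density of the form exp(−L(Φ)/(αη)) is Gaussian with variance αη/h. Simulating plain SGD and estimating α gives a value near 1/2, up to O(η) corrections:

```python
import numpy as np

rng = np.random.default_rng(0)
eta, h = 0.05, 1.0                  # learning rate, curvature of L = 0.5*h*x^2
chains, steps = 20000, 600          # independent SGD runs; 600 steps mixes fully

x = np.zeros(chains)
for _ in range(steps):
    g_hat = h * x + rng.standard_normal(chains)   # per-sample gradient, unit noise
    x -= eta * g_hat                              # SGD update

# If the stationary density is exp(-L(x)/(alpha*eta)) it is Gaussian with
# variance alpha*eta/h, so alpha can be estimated from the sample variance.
alpha_hat = x.var() * h / eta
print(alpha_hat)                    # near 0.5
```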

I would argue that the Gibbs stationary distribution (8) is desirable because it helps in escaping from a local minimum. More specifically, we would like to avoid artificially biasing the search for an escape toward high-noise directions in parameter space at the expense of exploring low-noise directions. The search for an escape direction should be determined by the loss surface rather than the noise covariance.

Note that (8) indicates that as we reduce the temperature η toward zero the loss will approach the loss of the (local) minimum. Let L(η) denote the average loss under the stationary distribution of SGD around a local minimum at temperature (learning rate) η. The slides contain a proof of the following simple relationship.

L(η) = L(0) + (η/4) E[‖ĝ‖²]

where the random values ĝ are drawn at the local optimum (at the local optimum the average gradient is zero). This equation is proved without the use of Langevin dynamics and holds independent of the shape of the stationary distribution. Once we are in the linear region, halving the learning rate corresponds to moving half way to the locally minimal loss. This seems at least qualitatively consistent with empirical learning curves such as the following from the original ResNet paper by He et al.
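The linearity in η can be seen in simulation. In this toy check (a one-dimensional quadratic of my own choosing), the per-sample gradient at parameter x is x plus unit-variance noise, so the expected squared gradient at the optimum is 1; the excess stationary loss comes out close to η/4 and halves when η halves:

```python
import numpy as np

rng = np.random.default_rng(1)

def avg_stationary_loss(eta, chains=20000, steps=800):
    """Average of L(x) = 0.5*x^2 under SGD's stationary distribution when the
    per-sample gradient is x + n with n ~ N(0,1), so E[g_hat^2] = 1 at x = 0."""
    x = np.zeros(chains)
    for _ in range(steps):
        x -= eta * (x + rng.standard_normal(chains))
    return float(0.5 * (x ** 2).mean())

eta = 0.04
excess = avg_stationary_loss(eta)        # L(eta) - L(0), and L(0) = 0 here
half = avg_stationary_loss(eta / 2)
print(excess, eta / 4)                   # close: the slope is E[g_hat^2] / 4
print(excess / half)                     # roughly 2: the excess is linear in eta
```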

Getting back to the soul of SGD, different deep learning models, like different physical materials, will have different properties. Intuitively, annealing would seem to be the more powerful search method. One might expect that as models become more sophisticated — as they become capable of expressing arbitrary knowledge about the world — annealing will be essential.

While this post is about hyperparameter search, I want to mention in passing some issues that do not seem to rise to the level of a full post.

**Teaching backprop.** When we teach backprop we should stop talking about “computational graphs” and talk instead about programs defined by a sequence of in-line assignments. I present backprop as an algorithm that runs on in-line code and I give a loop invariant for the backward loop over the assignment statements.

**Teaching Frameworks.** My class provides a 150 line implementation of a framework (the educational framework EDF) written in Python/NumPy. The idea is to teach what a framework is at a conceptual level rather than teaching the details of any actual industrial strength framework. There are no machine problems in my class — the class is an “algorithms course” in contrast to a “programming course”.

I also have a complaint about the way that PyTorch handles parameters. In EDF, modules take “parameter packages” (Python objects) as arguments. This simplifies parameter sharing and maintains object structure over the parameters rather than reducing them to lists of tensors.

**Einstein Notation.** When we teach deep learning we should be using Einstein notation. This means that we write out all the indices. This goes very well with “loop notation”. For example we can apply a convolution filter W[Δx, Δy, i, j] to a layer L[x, y, i] using the following program.

for x, y, j:   L′[x, y, j] = 0

for x, y, Δx, Δy, i, j:   L′[x, y, j] += W[Δx, Δy, i, j] L[x + Δx, y + Δy, i]

for x, y, j:   L″[x, y, j] = σ(L′[x, y, j])

Of course we can **also** draw pictures. The initialization to zero can be left implicit — all data tensors other than the input are implicitly initialized to zero. The body of the “tensor contraction loop” is just a product of two scalars. The back propagation on a product of two scalars is trivial. To back-propagate to the filter we have

for x, y, Δx, Δy, i, j:   W.grad[Δx, Δy, i, j] += L′.grad[x, y, j] L[x + Δx, y + Δy, i]

Since essentially all deep learning models consist of tensor contractions and scalar nonlinearities, we do not have to discuss Jacobian matrices. Also, in Einstein notation we have mnemonic variable names such as x, y, i and j for tensor indices which, for me, greatly clarifies the notation. Yet another point is that we can easily insert a batch index b into the data tensors when explaining minibatching.
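To make the loop notation concrete, here is a direct (deliberately naive) NumPy transcription of a convolution loop and its filter backprop loop, with made-up tensor sizes; the backward loop is checked against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
X, Y, DX, DY, I, J = 5, 5, 3, 3, 2, 4     # output sizes, filter sizes, channels
L = rng.standard_normal((X + DX - 1, Y + DY - 1, I))   # input layer
W = rng.standard_normal((DX, DY, I, J))                # convolution filter

def conv(W):
    """for x, y, dx, dy, i, j: L2[x,y,j] += W[dx,dy,i,j] * L[x+dx, y+dy, i]"""
    L2 = np.zeros((X, Y, J))              # the implicit initialization to zero
    for x in range(X):
      for y in range(Y):
        for dx in range(DX):
          for dy in range(DY):
            for i in range(I):
              for j in range(J):
                L2[x, y, j] += W[dx, dy, i, j] * L[x + dx, y + dy, i]
    return L2

# Backprop to the filter:
# for x, y, dx, dy, i, j: W_grad[dx,dy,i,j] += L2_grad[x,y,j] * L[x+dx, y+dy, i]
L2_grad = rng.standard_normal((X, Y, J))  # loss gradient arriving at the output
W_grad = np.zeros((DX, DY, I, J))
for x in range(X):
  for y in range(Y):
    for dx in range(DX):
      for dy in range(DY):
        for i in range(I):
          for j in range(J):
            W_grad[dx, dy, i, j] += L2_grad[x, y, j] * L[x + dx, y + dy, i]

# Check one entry against a finite difference of the scalar objective
# sum(L2 * L2_grad) with respect to that filter weight.
eps = 1e-6
Wp = W.copy(); Wp[0, 0, 0, 0] += eps
numeric = ((conv(Wp) - conv(W)) * L2_grad).sum() / eps
print(numeric, W_grad[0, 0, 0, 0])        # the two values agree
```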

Of course NumPy is most effective when using the index-free tensor operations. The students see that style in the EDF code. However, when explaining models conceptually I greatly prefer “Einstein loop notation”.

**Hyperparameter Conjugacy.** We now come to the real topic of this post. A lot has been written about hyperparameter search. I believe that hyperparameter search can be greatly facilitated by simple reparameterizations that greatly improve hyperparameter conjugacy. Conjugacy means that changing the value of a hyperparameter y does not change (or only minimally influences) the optimal value of another hyperparameter x. More formally, for a loss L(x, y) we would like to have

(∂/∂y) argmin_x L(x, y) = 0
Perhaps the simplest example is the relationship between batch size and learning rate. It seems well established now that the learning rate should be scaled up linearly in batch size as one moves to larger batches. See Don’t Decay the Learning Rate, Increase the Batch Size by Smith et al. But note that if we simply redefine the learning rate parameter to be the learning rate appropriate for a batch size of 1, and simply change the minibatching convention to update the model by the sum of gradients rather than the average gradient, then the learning rate and the batch size become conjugate — we can optimize the learning rate on a small machine and leave it the same when we move to a bigger machine allowing a bigger batch. We can also think of this as giving the learning rate a semantics independent of the batch size. A very simple argument for this particular conjugacy is given in my SGD slides.
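A tiny numerical illustration of this conjugacy (my own toy example, not from the paper): with per-example loss ½(w − d)² and per-example learning rate η, thirty-two sequential batch-size-1 updates nearly coincide with a single batch-32 update under the sum-of-gradients convention, while the average convention would require rescaling η:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.standard_normal(32)        # a batch of B = 32 data points
eta, w0 = 0.001, 5.0                  # per-example learning rate, initial parameter

# Per-example loss 0.5*(w - d)^2 gives the gradient (w - d) on example d.
# Batch size 1: apply the 32 examples one at a time.
w = w0
for d in data:
    w -= eta * (w - d)

# Batch size 32 with the SUM-of-gradients convention: one update, same eta.
w_sum = w0 - eta * np.sum(w0 - data)

# Batch size 32 with the AVERAGE convention: eta would have to be rescaled.
w_avg = w0 - eta * np.mean(w0 - data)

print(abs(w - w_sum))                 # tiny: eta transfers across batch sizes
print(abs(w - w_avg))                 # large: the average convention breaks this
```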

The most serious loss of conjugacy is the standard parameterization of momentum. The standard parameterization strongly couples the momentum parameter with the learning rate. For most frameworks we have

v_{t+1} = μ v_t − η ĝ_t ;   Φ_{t+1} = Φ_t + v_{t+1}

where μ is the momentum parameter, η is the learning rate, ĝ_t is the gradient of a single minibatch, and Φ_t is the system of model parameters. This can be rewritten as

v_{t+1} = μ v_t + (1 − μ) (−(η/(1 − μ)) ĝ_t) ;   Φ_{t+1} = Φ_t + v_{t+1}

A recurrence of the form y_{t+1} = μ y_t + (1 − μ) x_t yields that y is a running average of x. The running average of x is linear in x so the above formulation of momentum can be rewritten as

g̃_{t+1} = μ g̃_t + (1 − μ) ĝ_t ;   Φ_{t+1} = Φ_t − (η/(1 − μ)) g̃_{t+1}

Now we begin to see the problem. It can be shown that each individual gradient ĝ_t makes a total contribution to Φ of size η/(1 − μ). If the parameter vector remains relatively constant on the time scale of 1/(1 − μ) updates (where 1/(1 − μ) is typically 10) then all we have done by adding momentum is to change the learning rate from η to η/(1 − μ). PyTorch starts from a superficially different but actually equivalent definition of the momentum and suffers from the same coupling of the momentum term with the learning rate. But a trivial patch is to use the following more conjugate formulation of momentum.

g̃_{t+1} = μ g̃_t + (1 − μ) ĝ_t ;   Φ_{t+1} = Φ_t − η g̃_{t+1}

The slides summarize a fully conjugate (or approximately so) parameterization of the learning rate, batch size and momentum parameters.
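The coupling can be verified directly (a self-contained sketch; the gradient sequence is arbitrary made-up data). The standard framework update with learning rate η and momentum μ produces exactly the same trajectory as the running-average form run at learning rate η/(1 − μ), exposing the effective learning rate hidden in the standard parameterization:

```python
import numpy as np

rng = np.random.default_rng(3)
grads = rng.standard_normal(100)      # an arbitrary fixed gradient sequence
eta, mu = 0.1, 0.9

# Standard framework momentum: v = mu*v - eta*g ; w += v
w_std, v, trace_std = 0.0, 0.0, []
for g in grads:
    v = mu * v - eta * g
    w_std += v
    trace_std.append(w_std)

# Running-average form: g_avg = mu*g_avg + (1-mu)*g ; w -= lr*g_avg.
# With lr = eta/(1 - mu) the trajectory is identical, so the standard
# parameterization is really running at learning rate eta/(1 - mu).
w_avg, g_avg, trace_avg = 0.0, 0.0, []
for g in grads:
    g_avg = mu * g_avg + (1 - mu) * g
    w_avg -= (eta / (1 - mu)) * g_avg
    trace_avg.append(w_avg)

gap = float(np.max(np.abs(np.array(trace_std) - np.array(trace_avg))))
print(gap)                            # zero up to floating point rounding
```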

There appears to be no simple reparameterization of the adaptive algorithms RMSProp and Adam that is conjugate to batch size. The problem is that the gradient variances are computed from batch gradients, and variance estimates computed from large batches, even if “corrected”, contain less information than variances estimated at batch size 1. A patch would be to keep track of the individual gradients ĝ (or their squares) for each data point within a batch. But this involves a modification to the backpropagation calculation. This is all discussed in more detail in the slides.

The next post will be on Langevin dynamics.


I have talked in previous posts about the “servant mission” — the AI mission of serving or advocating for a particular human. In an email to Eric Horvitz I later suggested that such agents be called “advobots”. An advobot advocates in a variety of professional and personal ways for a particular individual human — its “client”. I find the idea of a superintelligent personal advocate quite appealing. If every AI is an advobot for a particular client, and every person has an advobot, then power is distributed and the AI dangers seem greatly reduced. But there are subtleties here involving the relationship between advocacy and truth.

I will call an advobot *strongly truthful* if it judges truth independent of its advocacy. Strongly truthful advobots face some dilemmas:

**Religion.** A strongly truthful advobot will likely disagree with its client on various religious beliefs, for example evolution or the authorship of the Bible.

**Politics.** A strongly truthful advobot will likely disagree with its client concerning politically charged statements. Does immigration lead to increased crime? What will the level of sea rise due to CO2 emissions be over the next 50 years? Does socialized health care lead to longer life expectancy than privatized health care?

**Competence.** A strongly truthful advobot would almost certainly disagree with its client over the client’s level of competence. Should an advobot be required to be strongly truthful when speaking to an employer about the appropriateness of its client for a particular job opening?

These examples demonstrate the strong interaction between truth and advocacy. Much of human speaking or writing involves advocacy. Freshmen are taught rhetoric — the art of writing *persuasively*. Advocating for a cause inevitably involves advocating for a belief.

It just does not seem workable to require advobots to be strongly truthful. But if advobot statements must be biased, it might be nice to have some other AI agents as a source of unbiased judgements. We could call these “judgebots”. A judgebot’s mission is simply to judge truth as accurately as possible independent of any external mission or advocacy. I do believe that the truth, or degree of truth, or truth under different interpretations, can be judged objectively. This is certainly true of the statements of mathematics. Presumably this is true of most scientific hypotheses. I think that it is also true of many of the statements argued over in politics and religion. Of course judgebots need not have any legal authority — defendants could still be tried by a jury of human peers or have cases decided by a human judge. But the judgements of superintelligent judgebots would still presumably influence people.

In addition to judging truth, judgebots could directly judge decisions relative to missions. Consider a corporation with a mission statement and a choice — say opening a plant in either city A or city B or hiring either A or B as the new CEO. We could ask a judgebot which of A or B is most faithful to the mission — which choice is best for the corporation as judged by its stated mission. This kind of judgement is admittedly difficult. But choices have to be made in any case. Who is better able to judge choices than a superintelligent judgebot? A human corporate CEO or board of directors could retain legal control of the corporation. The board could also control and change the mission statement, or refuse to publish a mission at all. But the judgements of superintelligent judgebots relative to various mission statements (published or not) would be available to the public and the stockholders. Judgebots would likely have influence precisely because they themselves have no agenda other than the truth.

It is possible that different judgebots would have access to different data and computational resources. Advobots would undoubtedly try to influence the judgebots by controlling the data and computation resources. But it would also be possible to require that the resources underlying every judgebot judgement be public information. A judgebot with all available data and large computational resources would intuitively seem most reliable — a good judge listens to all sides and thinks hard.

But as Pilate said to Jesus, what is truth? What, at an operational level, is the mission of a judgebot? That is a hard one. But if we are going to build a superintelligence I believe we will need an answer.
