Thoughts from TTIC31230: Rate-Distortion Metrics for GANs.

Throughout my career I have believed that progress in AI arises from objective evaluation metrics.  When I first learned of GANs I was immediately skeptical because of the apparent lack of meaningful metrics.  While GANs have made enormous progress in generating realistic images, the problem of metrics remains (see Borji 2018).  I suspect that the lack of meaningful metrics is related to a failure of GANs to play an important role in unsupervised pre-training (feature learning) for discriminative applications.  This is in stark contrast to language models (ELMO, BERT and GPT-2) where simple cross-entropy loss has proved extremely effective for pre-training.

Cross-entropy loss is typically disregarded for GANs in spite of the fact that it is the de-facto metric for modeling distributions and in spite of its success in pre-training for NLP tasks. In this post I argue that rate-distortion metrics — a close relative of cross entropy loss — should be a major component of GAN evaluation (in addition to discrimination loss).  Furthermore, evaluating GANs by rate-distortion metrics leads to a conceptual unification of GANs, VAEs and signal compression.  This unification is already emerging from image compression applications of GANS such as in the work of Agustsson et al 2018. The following figure by Julien Despois can be interpreted in terms of VAEs, signal compression, or GANs.

$y \hspace{13em} z = z_\Psi(y) + \epsilon \hspace{10em}||y - \hat{y}_\Theta(z)||^2$

$p_\Phi(z)$

The VAE interpretation is defined by

$\Phi^*, \Psi^*,\Theta^* = \mathrm{argmin}_{\Phi,\Psi,\Theta}\;E_{y\sim \mathrm{Pop}}\;\left(\;KL(p_\Psi(z|y),p_\Phi(z)) + E_{z \sim P_\Psi(z|y)}\; -\ln p_\Theta(y|z)\right)\;\;\;\;\;\;(1)$

Now define $\Psi^*(\Phi,\Theta)$ as the minimum of (1) over $\Psi$ while holding $\Phi$ and $\Theta$  fixed. Using this to express the objective as a function of $\Phi$ and $\Theta$ , and assume universal expressiveness of $p_\Psi(z|y)$, the standard ELBO analysis shows that (1) reduces to minimizing cross-entropy loss of $p_{\Phi,\Theta}(y) =\int dz\;\;p_\Phi(z)p_\Theta(y|z)$.

$\Phi^*,\Theta^* = \mathrm{argmin}_{\Phi,\Theta}\; E_{y \sim \mathrm{Pop}}\;-\ln p_{\Phi,\Theta}(y)\;\;\;\;\;\;(2)$

It should be noted, however, that differential entropies and cross-entropies suffer from the following conceptual difficulties.

1. The numerical value of entropy and cross entropy depends on an arbitrary choice of units. For a distribution on lengths, probability per inch is numerically very different from probability per mile.
2. Shannon’s source coding theorem fails for continuous densities — it takes an infinite number of bits to specify a single real number.
3. The data processing inequality fails for differential entropy — $x/2$ has a different differential entropy than $x$.
4. Differential entropies can be negative.

For continuous data we can replace the differential cross-entropy objective with a more conceptually meaningful rate-distortion objective. Independent of conceptual objections to differential entropy, a rate-distortion objective allows for greater control of the model through a rate-distortion tradeoff parameter $\beta$ as is done in $\beta$-VAEs (Higgens et al. 2017Alemi et al 2017).  A special case of a $\beta$-VAE  is defined by

$\Phi^*, \Psi^*,\Theta^* = \mathrm{argmin}_{\Phi,\Psi}\;E_{y\sim \mathrm{Pop}}\;\left(\beta \;KL(p_\Psi(z|y),p_\Phi(z)) + E_{z \sim P_\Psi(z|y)}\;||y - \hat{y}_\Theta(z)||^2\right)\;\;\;\;(3)$

The VAE optimization (1) can be transformed into the rate-distortion equation (3) by taking

$p_\Theta(y|z) = \frac{1}{Z(\sigma)} \exp\left(\frac{-||y-z||^2}{2\sigma^2}\right)$

and taking $\sigma$ to be a fixed constant.   In this case (1) transforms into (3) with $\beta = 2\sigma^2$.  Distortion measures such as L1 and L2 preserve the units of the signal and are more conceptually meaningful than differential cross-entropy.  But see the comments below on other obvious issues with L1 and L2 distortion measures. KL-divergence is defined in terms of a ratio of probability densities and, unlike differential entropy, is conceptually well-formed.

Equation (3) leads to the signal compression interpretation of the figure above. It turns out that the KL term in (3) can be interpreted as a compression rate.  Let $\Phi^*(\Psi,\Theta)$ be the optimum $\Phi$ in (3) for a fixed value of $\Psi$ and $\Theta$. Assuming universality of $p_\Phi(z)$, the resulting optimization of $\Psi$ and $\Theta$ becomes the following where $p_\Psi(z) = \int dy \;\mathrm{Pop}(y)p_\Psi(z|y)$

$\Psi^*,\Theta^* = \mathrm{argmin}_{\Psi,\Theta}\;\beta E_y\; KL(p_\Psi(z|y), \;p_\Psi(z)) + E_y\;||y - \hat{y}_\Theta(z)||^2\;\;\;\;\;\;(4)$

The KL term can now be written as a mutual information between $y$ and $z$.

$E_{y \sim \mathrm{Pop}} KL(p_\Psi(z|y), \;p_\Psi(z)) = H(z) - H(z|y) = I(y,z)$

Hence (4) can be rewritten as

$\Psi^*,\Theta^* = \mathrm{argmin}_{\Psi,\Theta}\;\beta I(z,y) + E_{y,z}\;||y - \hat{y}_\Theta(z)||^2\;\;\;\;\;\;(5)$

A more explicit derivation can be found in slides 17 through 21 in my lecture slides on rate-distortion auto-encoders.

By Shannon’s channel capacity theorem, the mutual information $I(y,z)$ is the number of bits transmitted through a noisy channel from $y$ to $z$ — it is the number of bits from $y$ than reach the decoder $\hat{y}_\Theta(z)$.  In the figure $P_\Psi(z|y)$ is defined by the equation $z = z_\Psi(y) + \epsilon$ for some fixed noise distribution on $\epsilon$.  Adding noise can be viewed as limiting precision. For standard data compression, where $z$ must be a compressed file with a definite number of bits, the equation $z = z_\Psi(y) + \epsilon$ can be interpreted as a rounding operation that rounds $z_\Psi(y)$ to integer coordinates.  See Agustsson et al 2018.

We have now unified VAEs with data-compression rate-distortion models.  To unify these with GANs we can take $\Phi$ and $\Theta$ to be the generator of a GAN.  We can train the GAN generator $(\Phi,\Theta)$ in the traditional way using only adversarial discrimination loss and then measure a rate-distortion metric by training $\Psi$ to minimize (3) while holding $\Phi$ and $\Theta$ fixed.  Alternatively, we can add a discrimination loss to (3) based on the discrimination between $y$ and $\hat{y}_\Theta(z)$ and train all the parameters together. It seems intuitively clear that a low rate-distortion value on test data indicates an absence of mode collapse — it indicates that the model can efficiently represent novel images drawn from the population.  Ideally, the rate-distortion metric should not increase much as we add weight to a discrimination loss.

A standard objection to L1 or L2 distortion measures is that they do not represent “perceptual distortion” — the degree of difference between two images as perceived by a human observer.  One interpretation of perceptual distortion is that two images are perceptually similar if the are both “natural” and carry “the same information”.  In defining what we mean by the same information we might invoke predictive coding or the information bottleneck method. The basic idea is to find an image representation that achieves compression while preserving mutual information with other (perhaps future) images.  This can be viewed as an information theoretic separation of “signal” from “noise”.  When we define the information in an image we should be disregarding noise.  So while it is nice to have a unification of GANs, VAEs and signal compression, it would seem better to have a theoretical framework providing a distinction between signal and noise. Ultimately we would like a rate-utility metric for perceptual representations.

Thoughts from TTIC31230: Rethinking Generalization.

Instigated by Zhang et al 2016, there has been considerable consternation in the deep learning community about why models capable of fitting random labels generalize well. It should be immediately noted that simple linear regression with more features than data points will also generalize well when appropriately regularized (ridge regression).  What is different about deep models is that the regularization seems to be implicit in SGD as described earlier by Neyshebur et al. 2014. Recent theoretical work on the regularization implicit in SGD, such as Arora et al. 2019, focuses on quenching shallow networks. In my previous post I argued that annealing seems more powerful than quenching. In this post I will give a PAC-Bayesian bound (a formal generalization guarantee) appropriate for annealed SGD.  I’m calling it the residual entropy bound.

First some comments on PAC-Bayesian theory. I coined the term “PAC-Bayes” in the late 90’s to describe a class of theorems giving PAC generalization guarantees in terms of arbitrarily chosen prior distributions.  Some such theorems (Occam bounds) pre-date my work.  Over last twenty years there has been significant refinement of these bounds by various authors. A concise general presentation of PAC-Bayesian theory can be found in my PAC-Bayes tutorial.

After about 2005, PAC-Bayesian analysis largely fell out of usage in the learning theory community in favor of more “sophisticated” concepts.  However, PAC-Bayesian bounds are now having a resurgence — their conceptual simplicity is paying off in the analysis of deep networks.  Attempts to apply VC dimension or Rademacher complexity to deep networks yield extremely vacuous guarantees — guarantees on binary error rates tens of orders of magnitude larger than 1.  PAC-Bayesian theorems, on the other hand, can produce non-vacuous guarantees — guarantees less than 1. Non-vacuous PAC-Bayesian guarantees for deep networks were first computed for MNIST by Dziugaite et al. and recently for ImageNet by Zhou et al.

In annealed SGD the learning rate acts as a temperature parameter which is set high initially and then cooled.  As we cool the temperature we can think of the model parameters as a glass that is cooled to some finite temperature at which it becomes solid — becomes committed to a particular local basin of the energy landscape. For a given parameter initialization $\Phi_0$, annealed SGD defines a probability density $p(\Phi|\;S,\Phi_0)$ over the final model $\Phi$ given initialization $\Phi_0$ and training data $S$.  Under Langevin dynamics we have that $p(\Phi\;|\;S,\Phi_0)$ is a smooth density.  If we are concerned that Langevin dynamics is only an approximation, we can add some small Gaussian noise to SGD to ensure that $p(\Phi\;|\;S,\Phi_0)$ is smooth.

The entropy of the distribution $p(\Phi\;|\;S,\Phi_0)$ is the residual entropy of the parameter vector. Note that if there was one final global optimum which was always found by SGD (quartz crystal) then the residual entropy would be zero. The distribution $p(\Phi\;|S,\Phi_0)$, and hence the residual entropy, includes all the possible local basins (all the possible solid structures of the glass).

To state the residual entropy bound we also need to define a parameter distribution in terms of the population independent of any particular sample. Let $N$ be the number of data points in the sample. Let $p(\Phi\;|\;\Phi_0,N)$ be the distribution defined by first drawing an IID sample $S$ of  size $N$ and then sampling $\Phi$ from $p(\Phi\;|\;S,\Phi_0)$.  The entropy of the distribution $p(\Phi\;|\;\Phi_0,N)$ is the residual entropy of annealing as defined by $\Phi_0$ and $N$ independent of the draw of any particular sample.  The residual entropy bound is governed by $E_S\;\mathrm{KL}(p(\Phi\;|\;S,\Phi_0),p(\Phi\;|\;\Phi_0,N))$.  This is an expected KL-divergence between two residual entropy distributions.  It is also equal to the mutual information between $\Phi$ and $S$ under the distribution $p(\Phi,S\;|\;\Phi_0,N)$.

To formally state the bound we assume that for a data point $z$ we have a loss ${\cal L}(\Phi,z) \in [0,L_{\mathrm{max}}]$.  I will write ${\cal L}(\Phi)$ for expectation of ${\cal L}(\Phi,z)$ when $z$ is drawn from the population, and write $\hat{\cal L}(\Phi)$ for the average of ${\cal L}(\Phi,z)$ over $z$ drawn from the training sample. The following residual entropy bound is a corollary of theorem 4 in my PAC-Bayes tutorial.

$\forall\Phi_0,N\;\;\;E_{S,\;\Phi \sim p(\Phi\;|\;S,\Phi_0)} \;{\cal L}(\Phi) \leq E_{S,\;\Phi\sim p(\Phi\;|\;S,\Phi_0)}\;\frac{10}{9}\left(\hat{\cal L}(\Phi) + \frac{5L_{\mathrm{max}}}{N}\mathrm{KL}(p(\Phi\;|\;S,\Phi_0),p(\Phi\;|\;\Phi_0,N))\right)\;\;\;\;(1)$

There are good reasons to believe that (1) is extremely tight. This is a PAC-Bayesian bound where the “prior” is $p(\Phi\;|\;\Phi_0,N)$.   PAC-Bayesian bounds yield non-vacuous values under Gaussian priors.  It can be shown that $p(\Phi\;|\;\Phi_0,N)$ is the optimal prior for posteriors of the form $p(\Phi\;|\;S,\Phi_0)$.

Unfortunately  analysis of (1) seems intractable. But difficulty of theoretical analysis does not imply that a conceptual framework is wrong.  The clear case that (1) is extremely tight would seem to cast doubt on analyses selected for tractability at the expense of realism.

Posted in Uncategorized | 4 Comments

Thoughts from TTIC31230: Langevin Dynamics and the Struggle for the Soul of SGD.

I believe there is currently a struggle for the soul of SGD.  The issue is whether our theoretical understanding should be based on quenching or on annealing.  The quenching school assumes that the goal of SGD is to reach a local minimum as quickly as possible.  The annealing school views SGD as a nontrivial local search over parameter space in which “thermal fluctuations” play an important role in escaping from inferior local optima.  Quenching seems easier to analyze theoretically. But as I have claimed in my earlier post on the role of theory in deep learning, the desire for rigor can mislead.  We should not assume that quenching is the correct conceptual framework simply because it is easier to analyze.  Intuitively, annealing would seem to to be the more powerful method of searching parameter space.  Also there are reasons to believe that annealing yields better generalization than quenching.  The generalization issues will be the subject of my next post.  This post is about modeling SGD as an annealing process using Langevin dynamics as a “statistical mechanics” of SGD.

I would like to make an analogy between deep architectures and physical materials. Different physical materials have different sensitivities to annealing.  Steel is a mixture of iron and carbon. At temperatures below 727$^\circ$ C the carbon precipitates out into carbon sheets between grains of iron crystal.  But above 727$^o$ the iron forms a different crystal structure causing the carbon sheets to dissolve into the iron. If we heat a small piece of steel above 727$^\circ$ and then drop it into cold water it makes a sound like “quench”.  When it is quenched the high temperature crystal structure is preserved and we get hardened steel. Hardened steel can be used as a cutting blade in a drill bit to drill into soft steel. Annealing is a process of gradually reducing the temperature. Gradual annealing produces soft grainy steel. Tempering is a process of re-heating quenched steel to temperatures high enough to change its properties but below the original pre-quenching temperature. This can make the steel less brittle while preserving its hardness. (Acknowledgments to my ninth grade shop teacher.)

Molten glass (silicon dioxide) can never be cooled slowly enough to reach its global energy minimum which is quartz crystal.  Instead, silicon dioxide at atmospheric pressure always cools to a glassy (disordered) local optimum with residual entropy but with statistically reliable properties.  Minute levels of impurities in the glass (skip connections?) can act as catalysts allowing the cooling process to achieve glassy states with lower energy and different properties. (Acknowledgements to discussions with Akira Ikushima of TTI in Nagoya about the nature of glass.)

The remainder of this post is fairly technical. The quenching school tends to focus on gradient flow as defined by the differential equation

$\frac{d\Phi}{dt} = - g(\Phi)\;\;\;(1)$

where $g(\Phi)$ is the gradient of the average loss over the training data. This defines a deterministic continuous path through parameter space which we can try to analyze.

The annealing school views SGD as an MCMC process defined by the stochastic state transition

$\Phi_{i+1} = \Phi_i - \eta \hat{g}_i\;\;\;\;(2)$

where $\hat{g}_i$ is a random variable equal to the loss gradient of a random data point.  Langevin dynamics yields a formal statistical mechanics for SGD as defined by (2).   In this blog post I want to try to explain Langevin dynamics  as intuitively as I can using abbreviated material from My lecture slides on the subject.

First, I want to consider numerical integration of gradient flow (1). A simple numerical integration can be written as

$\Phi_{t + \Delta t} = \Phi_t - \Delta t\; g(\Phi)\;\;\;\;(3)$

Comparing (3) with (2) it is natural to interpret $\eta$ in (2) as $\Delta t$. For the stochastic process defined by (2) I will define $t$ by

$t = i\eta$ or $i = t/\eta$

This provides a notion of “time” for SGD as defined by (2) that is consistent with gradient flow in the limit $\eta \rightarrow 0$.

We now assume that $\eta$ is sufficiently small so that for large $\Delta i$ we still have small $\Delta t$ and that $\Phi_i$ is essentially constant over the range $i$ to $i + \Delta i$.  Assuming the total gradient is essentially constant at $g_t$ over the interval from $t$ to $t+ \Delta t$ we can rewrite (2) in terms of time as

$\Phi_{t + \Delta t} = \Phi_t - g_t\Delta t - \sum_{i = i_0}^{i_0 + \Delta t/\eta} \eta (\hat{g}_i - g_t)\;\;\;\;\;(4)$

By the multivariate law of large numbers a random vector that is a large sum of IID random vectors is approximately Gaussian.  This allows us to rewrite (4) as

$\begin{array}{rcl}\Phi_{t + \Delta t} & = & \Phi_t - g_t\Delta t + \sqrt{\Delta t}\;\epsilon \\ \\ \epsilon & \sim & {\cal N}(0,\eta\Sigma)\end{array}\;\;\;\;(5)$

where $\Sigma$ is the covariance matrix of the random variable $\hat{g}_i - g$.  A derivation of (5) from (4) is given in the slides.  Note that the noise term vanishes in the limit $\eta \rightarrow 0$ and we are back to gradient flow.  However in Langevin dynamics we do not take $\eta$ to zero but instead we hold $\eta$ at a fixed nonzero value small enough that (5) is accurate even for $\Delta t$ small.  Langevin dynamics is formally a continuous time stochastic process under which (5) also holds but where the equation is taken to hold at arbitrarily small values of $\Delta t$.  The Langevin dynamics can be denoted by the notation

$\begin{array}{rcl}d\Phi & = & - g\;dt + \epsilon\sqrt{dt} \\ \\ \epsilon & \sim & {\cal N}(0,\eta\Sigma)\end{array}\;\;\;\;(6)$

If $g = 0$ independent of $\Phi$, and $\Sigma$ is constant,  we get Brownian motion.

The importance of the Langevin dynamics is that it allows us to solve for, and think in terms of, a stationary density.  Surprisingly, if the covariance matrix $\Sigma$ is not isotropic (if the eigenvalues are not all the same) then the stationary density is not Gibbs. Larger noise will yield a broader (hotter) distribution. When different directions in parameter space have different levels of noise we get different “temperatures” in the different directions.  A simple example is given in the slides. However we can make the noise isotropic by replacing (2) with the update

$\Phi_{i+1} = \Phi_i - \eta \Sigma^{-1} \hat{g}_i\;\;\;\;\;\;(7)$

For this update (and assuming that $\Sigma$ remains constant over parameter space) we get

$p(\Phi) \propto \exp\left( \frac{- \mathrm{Loss}(\Phi)}{\alpha \eta}\right)\;\;\;\;\;(8)$

for a universal constant $\alpha$.

I would argue that the Gibbs stationary distribution (8) is desirable because it helps in escaping from a local minima.  More specifically, we would like to avoid artificially biasing the search for an escape toward high noise directions in parameter space at the expanse of exploring low-noise directions. The search for an escape direction should be determined by the loss surface rather than the noise covariance.

Note that (8) indicates that as we reduce the temperature $\eta$ toward zero the loss will approach the loss of the (local) minimum.   Let $\mathrm{Loss}(\eta)$ denote the average loss under the stationary distribution of SGD around a local minimum at temperature $\eta$. The slides contain a proof of the following simple relationship.

$\left.\frac{\partial \mathrm{Loss}(\eta)}{\partial \eta}\right|_{\eta = 0}\; = \;\frac{1}{4} E_i \;||\hat{g}_i||^2\;\;\;\;(9)$

where the random values $\hat{g}_i$ are drawn at the local optimum (at the local optimum the average gradient is zero).  This equation is proved without the use of Langevin dynamics and holds independent of the shape of the stationary distribution.  Once we are in the linear region, halving the learning rate corresponds to moving half way to the locally minimal loss.  This seems at least qualitatively consistent with empirical learning curves such as the following from the original ResNet paper by He et al.

Getting back to the soul of SGD, different deep learning models, like different physical materials, will have different properties.  Intuitively, annealing would seem to be the more powerful search method.  One might expect that as models become more sophisticated — as they become capable of expressing arbitrary knowledge about the world — annealing will be essential.

Posted in Uncategorized | 3 Comments

Thoughts from TTIC31230: Hyperparameter Conjugacy

I have just finished teaching TTIC31230: fundamental of deep learning.  This is the first of a series of posts on thoughts arising from teaching this class.

While this post is about hyperparameter search, I want to mention in passing some issues that do not seem to rise to level of a full post.

Teaching backprop. When we teach backprop we should stop talking about “computational graphs” and talk instead about  programs defined by a sequence of in-line assignments.  I present backprop as an algorithm that runs on in-line code and I give a loop invariant for the backward loop over the assignment statements.

Teaching Frameworks. My class provides a 150 line implementation of a framework (the educational framework EDF) written in Python/NumPy.  The idea is to teach what a framework is at a conceptual level rather than teaching the details of any actual industrial strength framework.  There are no machine problems in my class — the class is an “algorithms course” in contrast to a “programming course”.

I also have a complaint about the way that PyTorch handles parameters. In EDF modules take “parameter packages” (python objects) as arguments.  This simplifies parameter sharing and maintains object structure over the parameters rather than reducing them to lists of tensors.

Einstein Notation. When we teach deep learning we should be using Einstein notation.  This means that we write out all the indices.  This goes very well with “loop notation”.  For example we can apply a convolution filter $W[\Delta x,\Delta y,i,j]$ to a layer $L_{\mathrm{in}}[x,y,j]$ using the following program.

for $x$, $y$, $i$:    $L'_{\mathrm{out}}[x,y,i]=0$

for $x$, $y$, $\Delta x$, $\Delta y$, $i$, $j$:   $L'_{\mathrm{out}}[x,y,i] \;+\!\!= \; W[\Delta x,\Delta y,i,j]L_{\mathrm{in}}[x+\Delta x,y+\Delta y,j]$

for $x$, $y$, $i$:   $L_{\mathrm{out}}[x,y,i] \;= \sigma(L'[x,y,i])$

Of course we can also draw pictures. The initialization to zero can be left implicit — all data tensors other than the input are implicitly initialized to zero.  The body of the “tensor contraction loop” is just a product of two scalers. The back propagation on a product of two scalars is trivial.  To back-propagate to the filter we have.

for $x$, $y$, $\Delta x$, $\Delta y$, $i$, $j$:   $W.\mathrm{grad}[\Delta x,\Delta y,i,j,] \;+\!\!=\;L'_{\mathrm{out}}.\mathrm{grad}[x,y,i]L_{\mathrm{in}}[x+\Delta x,y+\Delta y,j]$

Since essentially all deep learning models consist of tensor contractions and scalar nonlinearities, we do not have to discuss Jacobian matrices.  Also, in Einstein notation we have mnemonic variable names such as $x$, $y$ and $i$ for tensor indices which, for me, greatly clarifies the notation.  Yet another point is that we can easily insert a batch index into the data tensors when explaining minibatching.

Of course NumPy is most effective when using the index-free tensor operations.  The students see that style in the EDF code.  However, when explaining models conceptually I greatly prefer “Einstein loop notation”.

Hyperparameter Conjugacy.  We now come to the real topic of this post. A lot has been written about hyperparameter search.  I believe that hyperparameter search can be greatly facilitated by simple  reparameterizations that greatly improve hyperparameter conjugacy.  Conjugacy means that changing the value of a parameter $x$ does not change (or only minimally influences) the optimal value of another parameter $y$.  More formally, for a loss ${\cal L}$ we would like to have

$\frac{\partial^2 {\cal L}}{\partial x \partial y} = 0.$

Perhaps the simplest example is the relationship between batch size and learning rate.  It seems well established now that the learning rate should be scaled up linearly in batch size as one moves to larger batches.  See Don’t Decay the Learning Rate, Increase the Batch Size by Smith et al. But note that if we simply redefine the learning rate parameter to be the learning rate appropriate for a batch size of 1, and simply change the minibatching convention to update the model by the sum of gradients rather than the average gradient, then the learning rate and the batch size become conjugate — we can optimize the learning rate on a small machine and leave it the same when we move to a bigger machine allowing a bigger batch. We can also think of this as giving the learning rate a semantics independent of the batch size.  A very simple argument for this particular conjugacy is given in my SGD slides.

The most serious loss of conjugacy is the standard parameterization of momentum. The standard parameterization strongly couples the momentum parameter with the learning rate. For most fameworks we have

$\begin{array}{rcl} v & = & \left(1-\frac{1}{N}\right) v + \eta \hat{g} \\ \Phi & -\!\!= & v\end{array}$

where $(1- 1/N)$ is the momentum parameter, $\eta$ is the learning rate, $\hat{g}$ is the gradient of a single minibatch, and $\Phi$ is the system of model parameters.  This can be rewritten as

$\begin{array}{rcl} v & = & \left(1-\frac{1}{N}\right) v + \frac{1}{N}(N\eta \hat{g}) \\ \Phi & -\!\!= & v\end{array}$

A recurrence of the form $y_{t+1}= (1-1/N)y_t + (1/N)x_t$ yields that $y$ is a running average of $x$.  The running average of $x$ is linear in $x$ so the above formulation of momentum can be rewritten as

$\begin{array}{rcl} \mu & = & \left(1-\frac{1}{N}\right)\mu + \frac{1}{N}\;\hat{g} \\ \Phi & -\!\!= & N \eta \mu\end{array}$

Now we begin to see the problem.  It can be shown that each individual gradient makes a total contribution to $\Phi$ of size $N \eta$.  If the parameter vector $\Phi$ remains relatively constant on the time scale of $N$ updates (where $N$ is typically 10) then all we have done by adding momentum is to change the learning rate from $\eta$ to $N\eta$.  Pytorch starts from a superficially different but actually equivalent definition of the momentum and suffers from the same coupling of the momentum term with the learning rate. But a trivial patch is to use the following more conjugate formulation of momentum.

$\begin{array}{rcl} \mu & = & \left(1-\frac{1}{N}\right)\mu + \frac{1}{N}\;\hat{g} \\ \Phi & -\!\!= & \eta \mu\end{array}$

The slides summarize a fully conjugate (or approximately so) parameterization of the learning rate, batch size and momentum parameters.

There appears to be no simple reparameterization of the adaptive algorithms RMSProp and Adam that is conjugate to batch size.  The problem is that the gradient variances are computed from batch gradients and variance estimates computed from large batches, even if “corrected”, contain less information that variances estimated at batch size 1.  A patch would be to keep track of $\sum \;\hat{g}^2$ for each each individual $\hat{g}$ within a batch.  But this involves a modification to the backpropagation calculation. This is all discussed in more detail in the slides.

The next post will be on Langevin dynamics.

Posted in Uncategorized | 3 Comments

Superintelligence and The Truth

A superintelligence would presumably be capable of judging truth.  But what are the human consequences of superintelligent truth judgements? Should a friendly AI avoid taking sides in religion and politics?  Should a friendly AI avoid hiding the truth?  Are these two objectives compatible?

I have talked in previous posts about the “servant mission” — the AI mission of serving or advocating for a particular human.  In a email to Eric Horovitz I later suggested that such agents be called “advobots”.  An advobot  advocates in a variety of professional and personal ways for a particular individual human — its “client”.  I find the idea of a superintelligent personal advocate quite appealing.  If every AI is an advobot for a particular client, and every person has an advobot, then power is distributed and the AI dangers seem greatly reduced.  But there are subtleties here involving the relationship between advocacy and truth.

I will call an advobot strongly truthful if it judges truth independent of  its advocacy.  Strongly truthful advobots face some dilemmas:

Religion. A strongly truthful advobot will likely disagree with its client on various religious beliefs.  For example evolution or the authorship of the Bible.

Politics. A strongly truthful advobot will likely disagree with its client concerning politically charged statements.  Does immigration lead to increased crime? What will the level of sea rise due to CO2 emissions be over the next 50 years?   Does socialized health care lead to longer life expectancy than privatized health care?

Competence.  A strongly truthful advobot would almost certainly disagree with its client over the client’s level of competence. Should an advobot be required to be strongly truthful when speaking to an employer about the appropriateness of their client for a particular job opening?

These examples demonstrate the strong interaction between truth and advocacy.  Much of human speaking or writing involves advocacy.  Freshman are taught rhetoric — the art of writing persuasively.  Advocating for a cause inevitably involves advocating for a belief.

It just does not seem workable to require advobots to be strongly truthful. But if advobot statements must be biased,  it might be nice to have some other AI agents as a source of unbiased judgements.  We could call these “judgebots”.  A judgebot’s mission is simply to judge truth as accurately as possible independent of any external mission or advocacy.  I do believe that the truth, or degree of truth, or truth under different interpretations, can be judged objectively.  This is certainly true of the statements of mathematics. Presumably this is true of most scientific hypotheses.  I think that it is also true of many of the statements argued over in politics and religion.  Of course judgebots need not have any legal authority — defendants could still be tried by a jury of human peers or have cases decided by a human judge.  But the judgements of superintelligent judgebots would still presumably influence people.

In addition to judging truth, judgebots could directly judge decisions relative to missions. Consider a corporation with a mission statement and a choice — say opening a plant in either city A or city B or hiring either A or B as the new CEO.  We could ask a judgebot which of A or B is most faithful to the mission — which choice is best for the corporation as judged by its stated mission.  This kind of judgement is admittedly difficult.  But choices have to be made in any case.  Who is better able to judge choices than a superintelligent judgebot?  A human corporate CEO or board of directors could retain legal control of the corporation. The board could also control and change the mission statement, or refuse to publish a mission at all. But the judgements of superintelligent judgebots relative to various mission statements (published or not) would be available to the public and the stockholders.  Judgebots would likely have influence precisely because they themselves have no agenda other than the truth.

It is possible that different judgebots would have access to different data and computational resources.  Advobots would undoubtedly try to influence the judgebots by controlling the data and computation resources.  But it would also be possible to require that the resources underlying every judgebot judgement be public information.  A judgebot with all available data and large computational resources would intuitively seem most reliable — a good judge listens to all sides and thinks hard.

But as Pilot said to Jesus, what is truth?  What, at an operational level, is the mission of a judgebot?  That is a hard one.  But if we are going to build a superintelligence I believe we will need an answer.

Predictive Coding and Mutual Information

Many people believe that unsupervised and predictive learning are currently the most important problems in deep learning.  From an AGI perspective we have to find some way of getting the machines to learn NLP semantics (and knowledge representation and reasoning) from unlabeled data. We do not know how to label semantics.

This post is based on two papers, my own note from February, Information-Theoretic Co-Training, and a paper from July,  Representation Learning with Contrastive Predictive Coding by Aaron van den Oord, Yazhe Li and Oriol Vinyals.  These two papers both focus on mutual information for predictive coding.   van den Oord et al. give empirical results which include what appears to be the best ImageNet pre-training results to date by a large margin.  Speech and NLP results are also reported. My paper gives a different model and is less developed.  Each approach has pros and cons.

Why Information Theory?

For me, the fundamental equation of deep learning is

$\Phi^* = \mathrm{argmin}_\Phi \;E_{(x,y)\sim{\cal P}}\; -\log P_\Phi(y|x)$

This is the classical log-loss where ${\cal P}$ is a population distribution and $P_\Phi(y|x)$ is the conditional probability of a model with parameters $\Phi$.  Here we think of $y$ as a label for input $x$.  For example, $x$ can be an image and $y$ some kind of image label. These days log loss goes by the name of cross-entropy loss.

$E_{(x,y)\sim {\cal P}} \;- \log P_\Phi(y|x) = E_{x \sim {\cal P}} \;H({\cal P}(y|x),\;P_\Phi(y|x))$

$H(P,Q) \equiv E_{y \sim P} -\log Q(y)$

$= H(P) + KL(P,Q)$

Here $H(P,Q)$ is the cross-entropy from $P$ to $Q$. If information theoretic training objectives are empirically effective in supervised learning, they should also be effective in unsupervised learning.

Why Not Direct Cross-Entropy Density Estimation?

An obvious approach to unsupervised learning is density estimation by direct cross-entropy minimization.

$\Phi^* = \mathrm{argmin}_\Phi \;E_{x \sim {\cal P}}\; -\log P_{\Phi}(x)$

$= \mathrm{argmin}_\Phi\;H({\cal P},P_\Phi)$

This is the perplexity objective of language models.  Language modeling remains extremely relevant to NLP applications.  The hidden states of LSTM language models yield “contextual word vectors” (see ELMO).  Contextual word vectors seem to be  replacing more conventional word vectors in a variety of NLP applications including leaders on the SQUAD question answering leader board.  Contextual word embeddings based on the hidden states of a transformer network language model seem to be an improvement on ELMO and lend further support to the transformer network architecture. (Deep learning is advancing at an incredible rate.)

The success of language modeling for pre-training would seem to support direct density estimation by cross-entropy minimization.  So what is the problem?

The problem is noise.  An image contains a large quantity of information that we do not care about.  We do not care about each blade of grass or each raindrop or the least significant bit of each color value.  Under the notion of signal and noise discussed below, sentences have higher signal-to-noise ratios than do images.  But sentences still have some noise. A sentence can be worded in different ways and we typically do not remember exact phrasing or word choice. From a data-compression viewpoint we want to store only the important information.  Density estimation by direct cross-entropy minimization models noise.  It seems intuitively clear that the modeling of the signal can get lost in the modeling of the noise, especially in low signal-to-noise settings like images and audio signals.

Mutual Information: Separating Signal from Noise

The basic idea is that we define “signal” to be a function of sensory input that carries mutual information with future sensation. This is apparently a very old idea.  van den Oord et al. cite a paper on “predictive coding” by Peter Elias from 1955.  This was a time shortly after Shannon’s seminal paper when information theory research was extremely active.  van den Oord et al. give the following figure to visualize their formulation of predictive coding.

Here $x_t$ is some kind of perceptual input like a frame of video, or a frame of speech, or a sentence in a document, or an article in a thread of related articles.  Here $c_t$ is intended to be  a high level semantic representation of the past perception and $z_{t+k}$ is intended to be high level semantic representation of a future perception.   My paper and van den Oord et al. both define training objectives centered on maximizing the mutual information

$I(c_t,z_{t+k}|k) = H(z_{t+k}) - H(z_{t+k}|c_t,k)\;\;\;\;\;\;\mathrm{(1)}$

Intuitively, we want to establish predictive power — demonstrate that observations at the present time provide information (predictive power) about the future and hence reduces the uncertainty (entropy) of the future observations.  The difference between the entropy before the observation and the entropy after the observation is one way of defining mutual information.  If we think of $I(x,y)$ as the amount of “signal” in $x$ about $y$ then we might define a purely information theoretic signal-to-noise ratio as $I(x,y)/H(x)$.  But I am on shaky ground here — there should be a reference for this.

There is an immediate problem with the mutual information training objective (1). The objective $I(c_t,z_{t+k})$ can be maximized simply by taking $c_t$ to be the input sequence $x_0,\;\ldots,\;x_t$ and taking $z_{t+k}$ to be the input $x_{t+k}$.  So we need additional constraints that force $c_t$ and $z_t$ to be higher level semantic representations. The two papers take different approaches to this problem.

My paper suggests bounding the entropy of  $z_t$ by restricting it to a finite set, such as a finite length sequence of tokens from a finite vocabulary.  ven den Oord et al. assume that both $c_t$ and $z_{t+k}$ are vectors and require that a certain discriminative task be solved by a linear classifier.  Requiring that $z_t$ be used in linear classifiers should force it to expose semantic properties.

A Naive Approach

My paper takes a naive approach.  I work directly with equation (1).  The model includes a network defining a coding probability $P_\Psi(z_t|x_t)$ where $z_t$ is restricted to a (possibly combinatorially large) finite set.  Equation (1) can then be written as follows where the probability distribution on the sequence $z_1,\ldots,z_{t+k}$ is determined by the coding distribution parameter $\Psi$.

$I((z_1,\ldots,z_t),\;z_{t+k}) = H_\Psi(z) - H_{\Psi}(z_{t+k}\;|\;z_1,\ldots,z_t)$

We now introduce two more models used to estimate the above entropies.  A model $P_\Theta(z)$ for estimating the first entropy  and and a model $P_\Phi(z_{t+k}\;|\;(z_1,\ldots,z_t))$ for the second.  We then define a nested optimization problem

$\Psi^* = \mathrm{argmax}_\Psi\;\hat{H}_{\Psi,\Theta^*(\Psi)} - \hat{H}_{\Psi,\Phi^*(\Psi)}$

$\Theta^*(\Psi) = \mathrm{argmin}_\Theta\; \hat{H}_{\Psi,\Theta}$

$\Phi^*(\Psi) = \mathrm{argmin}_\Phi\; \hat{H}_{\Psi,\Phi}$

$\hat{H}_{\Psi,\Theta} = E_{z\sim P_\Psi(z|x)}\;-\log P_\Theta(z)$

$\hat{H}_{\Psi,\Phi} = E_{(z_1,\ldots,z_{t+k}) \sim P_\Psi} \;-\log P_\Phi(z_{t+k}\;|\;z_1,\ldots z_t)$

Holding $\Psi$ fixed, the optimization objectives for $\Theta^*(\Psi)$ and $\Phi^*(\Psi)$ are just classical cross-entropy loss functions (log loss).  If $\Theta$ and $\Phi$ have been fully optimized for $\Psi$ fixed, then the gradients with respect to $\Theta$ and $\Phi$ must be zero and $\Psi$ can be updated ignoring the effect of that update on $\Theta$ and $\Phi$.  If the models are universally expressive, and the nested optimization can be done exactly, then we maximize the desired mutual information.

This naive approach has some issues.  First, the optimization problem is adversarial.  Note that the optimizations of $\Psi$ and $\Theta$ are pulling $\hat{H}_{\Psi,\Theta}$ in opposite directions. This raises all the issues in adversarial optimization.  Second, $P_\Psi(z|x)$ defines a discrete distribution raising the specter of REINFORCE (although see VQ-VAE for a more sane treatment of discrete latent variables).   Third, the model entropy estimates do not provide any guarantee on the true mutual information.  This is because model estimates of entropy are larger than the true entropy and the model estimate of the first entropy can be large due to modeling error.

The Discriminator Trick

The van den Oord et al. paper is based on a “discriminator trick” — a novel form of contrastive estimation (see also this) where a discriminator must distinguish in-context items from out-of-context items.  The formal relationship between contrastive estimation and mutual information appears to be novel in this paper. Suppose that we want to measure the mutual information $I(x,y)$ between two arbitrary random variables $x$ and $y$.  We draw $N$ pairs $(x_1,y_1),\ldots,\;(x_N,y_N)$ from the joint distribution.  We then randomly shuffle the $y$‘s and present a “discriminator” with $x_1$ and the shuffled $y$ values.  The discriminator’s job is to find $y_1$ (the value paired with $x_1$) among the shuffled values of $y$.  ven den Oord et al. show the following lower bound on $I(x,y)$ where $Y$ is the shuffled set of $y$‘s; the index $i$ is the position of $y_1$ in $Y$; and $\Phi$ is the model parameters of the discriminator.

$I(x,y) \geq \log N -L\;\;\;\;\;\mathrm{(2)}$

$L = E_{x_1,Y\sim{\cal P}}\;-\log P_\Phi(i\;|\;x_1,Y)$

I refer readers to the paper for the derivation. The discriminator is simply trained to minimize the above log loss (cross-entropy loss).  In the paper the discriminator is forced to be linear in $z$ which should force $z$ to be high level.

$P_\Phi(i\;|\;c_t,(z_1,\ldots,z_N)) = \mathrm{softmax}_i \;\; (c_t)^\top M_k z_i$

Here the networks producing $c_t$ and $z_t$ are part of the discriminator so training the discriminator trains the network for $z_t$.  The vector $z_t$ is then used as a pre-trained feature vector for the input $x_t$.

The main issue I see with the discriminator trick is that the discrimination task might be too easy when the mutual information is large.  If the discrimination task is easy the loss will be close to zero.  But by equation (2), the best possible guarantee on mutual information is $\log N$. This requires $N$ to be exponential in the size of the guarantee.

Summary

Predictive coding by maximizing mutual information feels like a very principled and promising approach to unsupervised and predictive learning.  A naive approach to maximizing mutual information involves adversarial objectives, discrete latent variables, and a lack of any formal guarantee. The discriminator trick avoids these issues but seems problematic in the case where the true mutual information is larger than a few bits.

Deep learning is progressing at an unprecedented rate.  I expect the next few years to yield considerable progress, one way or another, in unsupervised and predictive learning.

Posted in Uncategorized | 1 Comment

RL and the Game of Mathematics

I recently gave a lecture on Alphago in a deep learning class.  I had of course heard of the recent accomplishments — learning in a matter of days, or even hours, and entirely from self-play, to become superhuman in go, chess and shogi.  I had not, however, appreciated the magnitude of this advance until I read the papers (Alphgo Zero and AlphaZero). Alphago Zero is about 1400 ELO points above Alphago Lee — the program that defeated Lee Sedol. The ELO scale is the one used to rate chess players — 200 ELO points corresponds to 75% chance of victory by the stronger player.  Measuring a huge gap like 1400 points requires using a sequence of versions of the program of increasing strength.  Also, Alphago Zero runs on a single machine with four GPUs.  Alphago Lee runs on a supercomputer.  Also, the algorithm underlying Alphago Lee is new and simpler with no rollouts or hand designed features.  Also, the same algorithm does the same trick in chess and shogi — quickly learning to be far superior to any previous program for these games.  Also, the introduction of a fundamentally new algorithm seems likely to open up yet more progress in RL. The new algorithm can be viewed as tree-search bootstrapping  applied to policy gradient.

Ten years ago Jonathan Schaeffer at the University of Alberta proved that checkers is a draw.  His group computed and stored full drawing strategies for each player.  This was listed by Science magazine as one of the ten greatest breakthroughs of 2007. The game of chess is also believed to be a draw, although computing full drawing strategies is completely infeasible.   Empirically, as the level of play increases draws become more common.  Computers have been super-human in chess for the last 20 years.  The prevalence of draws at super-human levels causes the ELO scale to break down. (In practice go games are never drawn.) The strongest computer program, stockfish, was thought to be unbeatable — equivalent to perfect play.  AlphaZero defeated Stockfish 25/50 games playing from white (and lost none) and defeated Stockfish 3/50 games from black (and lost none).

DeepMind’s success in RL is based on perfect simulation.  Perfect simulation is possible in Atari games and in board games such as go and chess.  Perfect simulation allows large-scale on-policy learning.  Large-scale on-policy learning is not feasible in the “game” of human dialogue (the Turing test).  We don’t know how to accurately simulate human dialogue.  Mathematics, on the other hand, can be formulated as a game with rules defined by a formal foundational logic.  This immediately raises the question of whether the recent dramatic results in RL for games might soon lead to super-human open-domain mathematicians.

The foundations of mathematics has been largely ignored by AI researchers.  I have always felt that the foundations — the rules of “correct thought” — reflect an innate cognitive architecture.  But independent of such beliefs, it does seem possible to formulate inference rules defining the general notion of proof accepted by the mathematics community.  For this we have ZFC or type theory.  More on this later.  But in addition to rules defining moves, we need an objective— a reward function.  What is the objective of mathematics?

I believe that a meaningful quantitative objective for mathematics must involve an appropriate formal notion of a mathematical ontology — a set of concepts and statement made in terms of them. It is easy to list off  basic mathematical concepts — groups, rings, fields, vector spaces, Banach spaces, Hilbert spaces, manifolds, Lie algebras, and many more.  For some concepts, such as the finite simple groups or the compact two manifolds, complete classification is possible — it is possible to give a systematic enumeration of the instances of the concept up to isomorphism. Formulating an objective in terms of a concept ontology requires, of course, being clear (formal) about what we mean by a “concept”.  This brings us to the importance of type theory.

Since the 1920’s (or certainly the 1950s) the mathematics community has generally accepted Zermello-Fraenkel set theory with the axiom of choice (ZFC) as the formal foundation.  However, the ontology of mathematics is clearly organized around the apparent fact that the instances of a concept are only classified up to the notion of isomorphism associated with that concept. The classification of finite simple groups or compact two manifolds is done up to isomorphism. The fact that each concept comes with an associated notion of isomorphism has never been formalized in the framework of ZFC.  To formalize concepts and their associated notion of isomorphism we need type theory.

I have been working for years on a set-theoretic type theory — a type theory fully compatible with the ZFC foundations.  Alternatively there is constructive type theory and its extension to homotopy type theory (HoTT).  I will not discuss the pros and cons of different type theories here except to say that I expect mathematicians will be more accepting of a type theory fully compatible with the grammar of informal mathematical language as well as being compatible with the currently accepted content of mathematics.

Type theory gives a set of formal rules defining the “well-formed” concepts and statements.  I believe that a meaningful objective for mathematics can then be defined by an a-prior probability distribution over type-theoretically well-formed mathematical questions.  The well-formedness constraint ensures that each question is meaningful. We do not want to ask if 2 is a member of 3. The objective is to be able to answer the largest possible fraction of these well-formed questions.

Of course the questions under investigation in human mathematics evolve over time and this evolution is clearly governed by a social process.  But I claim that this is consistent with the existence of an objective notion of mathematical significance.  A socially determined distribution over questions can be importance-corrected (in the statistical sense) in judging objective significance. The use of a sampling distribution different from the prior, which is then importance corrected, seems likely to be useful in any case.

Of course I am not the only one to be considering deep learning for mathematical inference.  There have now been three meetings of the Conference on AI and Theorem Proving.  Many of the papers in this conference propose neural architectures for “premise selection” — the problem of selecting facts from a mathematical library that are likely to be useful for a given problem.  Christian Szegedy, a co-author of batch-normalization and a principal developer of Google’s inception network for image classification, has been attending this conference and working on deep learning for theorem proving [1,2,3].  I myself have attended the last two meetings and have found them useful in framing my own thinking on deep architectures for inference.

In summary, DeepMind’s dramatic recent success in RL raises the question of whether RL can produce a super-human player of the game of mathematics.  I hereby throw this out as a challenge to DeepMind and to the AI community generally.  A super-human open-domain mathematician would seem a major step toward AGI.

Posted in Uncategorized | 4 Comments