## Predictive Coding and Mutual Information

Many people believe that unsupervised and predictive learning are currently the most important problems in deep learning.  From an AGI perspective we have to find some way of getting the machines to learn NLP semantics (and knowledge representation and reasoning) from unlabeled data. We do not know how to label semantics.

This post is based on two papers: my own note from February, Information-Theoretic Co-Training, and a paper from July, Representation Learning with Contrastive Predictive Coding by Aaron van den Oord, Yazhe Li and Oriol Vinyals.  Both papers focus on mutual information for predictive coding.  van den Oord et al. give empirical results, including what appear to be the best ImageNet pre-training results to date by a large margin; speech and NLP results are also reported.  My paper gives a different model and is less developed.  Each approach has pros and cons.

### Why Information Theory?

For me, the fundamental equation of deep learning is

$\Phi^* = \mathrm{argmin}_\Phi \;E_{(x,y)\sim{\cal P}}\; -\log P_\Phi(y|x)$

This is the classical log loss, where ${\cal P}$ is a population distribution and $P_\Phi(y|x)$ is the conditional probability of a model with parameters $\Phi$.  Here we think of $y$ as a label for input $x$.  For example, $x$ can be an image and $y$ some kind of image label.  These days log loss goes by the name of cross-entropy loss:

$E_{(x,y)\sim {\cal P}} \;- \log P_\Phi(y|x) = E_{x \sim {\cal P}} \;H({\cal P}(y|x),\;P_\Phi(y|x))$

$H(P,Q) \equiv E_{y \sim P} -\log Q(y) = H(P) + KL(P,Q)$

Here $H(P,Q)$ is the cross-entropy from $P$ to $Q$. If information theoretic training objectives are empirically effective in supervised learning, they should also be effective in unsupervised learning.
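These identities can be checked numerically in a few lines; the three-class distributions below are made up purely for illustration.

```python
import numpy as np

def entropy(p):
    """H(P) = E_{y~P} -log P(y), in nats."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(P, Q) = E_{y~P} -log Q(y)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

def kl(p, q):
    """KL(P, Q) = E_{y~P} log(P(y)/Q(y))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

P = np.array([0.7, 0.2, 0.1])   # "population" label distribution
Q = np.array([0.6, 0.3, 0.1])   # model's predicted distribution

# The decomposition H(P, Q) = H(P) + KL(P, Q): since KL >= 0,
# log loss is minimized exactly when the model matches the population.
assert np.isclose(cross_entropy(P, Q), entropy(P) + kl(P, Q))
```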

### Why Not Direct Cross-Entropy Density Estimation?

An obvious approach to unsupervised learning is density estimation by direct cross-entropy minimization.

$\Phi^* = \mathrm{argmin}_\Phi \;E_{x \sim {\cal P}}\; -\log P_{\Phi}(x) = \mathrm{argmin}_\Phi\;H({\cal P},P_\Phi)$

This is the perplexity objective of language models.  Language modeling remains extremely relevant to NLP applications.  The hidden states of LSTM language models yield "contextual word vectors" (see ELMo).  Contextual word vectors seem to be replacing more conventional word vectors in a variety of NLP applications, including the leaders of the SQuAD question-answering leaderboard.  Contextual word embeddings based on the hidden states of a transformer network language model seem to be an improvement on ELMo and lend further support to the transformer architecture. (Deep learning is advancing at an incredible rate.)
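As a toy illustration of the perplexity objective, the sketch below fits a unigram "language model" by maximum likelihood, which is exactly cross-entropy minimization against the empirical distribution; the corpus is made up.

```python
import numpy as np
from collections import Counter

# Toy corpus; a unigram model fit by maximum likelihood is the
# cross-entropy minimizer against the empirical word distribution.
corpus = "the cat sat on the mat the cat ran".split()
counts = Counter(corpus)
total = sum(counts.values())
P_model = {w: c / total for w, c in counts.items()}

# Empirical cross-entropy (log loss) of the corpus under the model, in nats.
log_loss = -np.mean([np.log(P_model[w]) for w in corpus])
perplexity = np.exp(log_loss)   # the standard language-modeling metric
```

Perplexity is just the exponentiated cross-entropy, so minimizing one minimizes the other.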

The success of language modeling for pre-training would seem to support direct density estimation by cross-entropy minimization.  So what is the problem?

The problem is noise.  An image contains a large quantity of information that we do not care about.  We do not care about each blade of grass or each raindrop or the least significant bit of each color value.  Under the notion of signal and noise discussed below, sentences have higher signal-to-noise ratios than do images.  But sentences still have some noise. A sentence can be worded in different ways and we typically do not remember exact phrasing or word choice. From a data-compression viewpoint we want to store only the important information.  Density estimation by direct cross-entropy minimization models noise.  It seems intuitively clear that the modeling of the signal can get lost in the modeling of the noise, especially in low signal-to-noise settings like images and audio signals.

### Mutual Information: Separating Signal from Noise

The basic idea is that we define "signal" to be a function of sensory input that carries mutual information with future sensation. This is apparently a very old idea: van den Oord et al. cite a 1955 paper on "predictive coding" by Peter Elias, from the period shortly after Shannon's seminal paper when information theory research was extremely active.  van den Oord et al. give a figure (see their paper) visualizing their formulation of predictive coding.  Here $x_t$ is some kind of perceptual input, like a frame of video, a frame of speech, a sentence in a document, or an article in a thread of related articles.  The variable $c_t$ is intended to be a high-level semantic representation of past perception, and $z_{t+k}$ a high-level semantic representation of a future perception.  My paper and van den Oord et al. both define training objectives centered on maximizing the mutual information

$I(c_t,z_{t+k}|k) = H(z_{t+k}) - H(z_{t+k}|c_t,k)\;\;\;\;\;\;\mathrm{(1)}$

Intuitively, we want to establish predictive power: to demonstrate that observations at the present time provide information about the future and hence reduce the uncertainty (entropy) of future observations.  The difference between the entropy before the observation and the entropy after the observation is one way of defining mutual information.  If we think of $I(x,y)$ as the amount of "signal" in $x$ about $y$, then we might define a purely information-theoretic signal-to-noise ratio as $I(x,y)/H(x)$.  But I am on shaky ground here; there should be a reference for this.
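Both the entropy-difference definition of mutual information and the proposed signal-to-noise ratio can be checked on a small discrete example; the joint distribution below is made up for illustration.

```python
import numpy as np

def H(p):
    """Shannon entropy in nats, ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Made-up joint distribution P(x, y); rows index x, columns index y.
joint = np.array([[0.30, 0.05],
                  [0.05, 0.30],
                  [0.15, 0.15]])
p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)

# Conditional entropy H(y | x) = sum_x P(x) H(y | x = x).
H_y_given_x = sum(p_x[i] * H(joint[i] / p_x[i]) for i in range(len(p_x)))

# Mutual information as entropy reduction: I(x, y) = H(y) - H(y | x).
I = H(p_y) - H_y_given_x

# The purely information-theoretic "signal-to-noise ratio" I(x, y) / H(x).
snr = I / H(p_x)
```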

There is an immediate problem with the mutual information training objective (1). The objective $I(c_t,z_{t+k})$ can be maximized simply by taking $c_t$ to be the input sequence $x_0,\;\ldots,\;x_t$ and taking $z_{t+k}$ to be the input $x_{t+k}$.  So we need additional constraints that force $c_t$ and $z_t$ to be higher level semantic representations. The two papers take different approaches to this problem.

My paper suggests bounding the entropy of $z_t$ by restricting it to a finite set, such as a finite-length sequence of tokens from a finite vocabulary.  van den Oord et al. assume that both $c_t$ and $z_{t+k}$ are vectors and require that a certain discriminative task be solved by a linear classifier.  Requiring that $z_t$ be used in linear classifiers should force it to expose semantic properties.

### A Naive Approach

My paper takes a naive approach: I work directly with equation (1).  The model includes a network defining a coding probability $P_\Psi(z_t|x_t)$, where $z_t$ is restricted to a (possibly combinatorially large) finite set.  Equation (1) can then be written as follows, where the probability distribution on the sequence $z_1,\ldots,z_{t+k}$ is determined by the coding distribution parameter $\Psi$.

$I((z_1,\ldots,z_t),\;z_{t+k}) = H_\Psi(z) - H_{\Psi}(z_{t+k}\;|\;z_1,\ldots,z_t)$

We now introduce two more models, used to estimate the above entropies: a model $P_\Theta(z)$ for estimating the first entropy and a model $P_\Phi(z_{t+k}\;|\;z_1,\ldots,z_t)$ for the second.  We then define a nested optimization problem.

$\Psi^* = \mathrm{argmax}_\Psi\;\hat{H}_{\Psi,\Theta^*(\Psi)} - \hat{H}_{\Psi,\Phi^*(\Psi)}$

$\Theta^*(\Psi) = \mathrm{argmin}_\Theta\; \hat{H}_{\Psi,\Theta}$

$\Phi^*(\Psi) = \mathrm{argmin}_\Phi\; \hat{H}_{\Psi,\Phi}$

$\hat{H}_{\Psi,\Theta} = E_{z\sim P_\Psi(z|x)}\;-\log P_\Theta(z)$

$\hat{H}_{\Psi,\Phi} = E_{(z_1,\ldots,z_{t+k}) \sim P_\Psi} \;-\log P_\Phi(z_{t+k}\;|\;z_1,\ldots,z_t)$

Holding $\Psi$ fixed, the optimization objectives for $\Theta^*(\Psi)$ and $\Phi^*(\Psi)$ are just classical cross-entropy loss functions (log loss).  If $\Theta$ and $\Phi$ have been fully optimized for $\Psi$ fixed, then the gradients with respect to $\Theta$ and $\Phi$ must be zero and $\Psi$ can be updated ignoring the effect of that update on $\Theta$ and $\Phi$.  If the models are universally expressive, and the nested optimization can be done exactly, then we maximize the desired mutual information.
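The inner optimizations can be sketched on a toy problem.  Below, everything is made up for illustration: $x_t$ is a two-state Markov chain, the coder $P_\Psi(z|x)$ is a fixed noisy copy, and empirical count models stand in for the networks $P_\Theta$ and $P_\Phi$ (counting is exactly the cross-entropy minimizer for tabular models, so only the outer update of $\Psi$ is omitted).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy input: a sticky two-state Markov chain x_1, ..., x_T.
def sample_x_sequence(T=5000):
    x = [rng.integers(2)]
    for _ in range(T - 1):                 # x_{t+1} copies x_t with prob 0.9
        x.append(x[-1] if rng.random() < 0.9 else 1 - x[-1])
    return np.array(x)

def code(x, flip=0.1):                     # fixed P_Psi(z|x): z = x with noise
    return np.where(rng.random(len(x)) < flip, 1 - x, x)

x = sample_x_sequence()
z = code(x)

# Inner optimization for fixed Psi: the cross-entropy minimizers are the
# empirical marginal P_Theta(z) and the empirical bigram P_Phi(z_{t+1}|z_t).
p_z = np.bincount(z, minlength=2) / len(z)
H_hat_theta = -np.mean(np.log(p_z[z]))     # estimate of H(z)

pairs = np.stack([z[:-1], z[1:]], axis=1)
cond = np.array([[np.mean(pairs[pairs[:, 0] == a][:, 1] == b)
                  for b in (0, 1)] for a in (0, 1)])
H_hat_phi = -np.mean([np.log(cond[a, b]) for a, b in pairs])

# The quantity Psi would be updated to increase: an estimate of I.
objective = H_hat_theta - H_hat_phi
```

Because the chain is sticky and the code is only mildly noisy, $z_t$ predicts $z_{t+1}$ and the objective comes out positive.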

This naive approach has some issues.  First, the optimization problem is adversarial: the optimizations of $\Psi$ and $\Theta$ pull $\hat{H}_{\Psi,\Theta}$ in opposite directions, which raises all the usual issues of adversarial optimization.  Second, $P_\Psi(z|x)$ defines a discrete distribution, raising the specter of REINFORCE (although see VQ-VAE for a more sane treatment of discrete latent variables).  Third, the model entropy estimates provide no guarantee on the true mutual information: model estimates of entropy are upper bounds on the true entropy (cross-entropy is at least entropy), so the estimate of the first entropy can be large simply due to modeling error.

### The Discriminator Trick

The van den Oord et al. paper is based on a "discriminator trick", a novel form of contrastive estimation in which a discriminator must distinguish in-context items from out-of-context items.  The formal relationship between contrastive estimation and mutual information appears to be novel in this paper.  Suppose that we want to measure the mutual information $I(x,y)$ between two arbitrary random variables $x$ and $y$.  We draw $N$ pairs $(x_1,y_1),\ldots,(x_N,y_N)$ from the joint distribution.  We then randomly shuffle the $y$'s and present a "discriminator" with $x_1$ and the shuffled $y$ values.  The discriminator's job is to find $y_1$ (the value paired with $x_1$) among the shuffled values of $y$.  van den Oord et al. show the following lower bound on $I(x,y)$, where $Y$ is the shuffled set of $y$'s, the index $i$ is the position of $y_1$ in $Y$, and $\Phi$ is the parameter vector of the discriminator.

$I(x,y) \geq \log N - L\;\;\;\;\;\mathrm{(2)}$

$L = E_{x_1,Y\sim{\cal P}}\;-\log P_\Phi(i\;|\;x_1,Y)$

I refer readers to the paper for the derivation.  The discriminator is simply trained to minimize the above log loss (cross-entropy loss).  In the paper the discriminator is forced to be linear in $z$, which should force $z$ to be high level.

$P_\Phi(i\;|\;c_t,(z_1,\ldots,z_N)) = \mathrm{softmax}_i \;\; (c_t)^\top M_k z_i$

Here the networks producing $c_t$ and $z_t$ are part of the discriminator, so training the discriminator trains the network for $z_t$.  The vector $z_t$ is then used as a pre-trained feature vector for the input $x_t$.

The main issue I see with the discriminator trick is that the discrimination task might be too easy when the mutual information is large.  If the discrimination task is easy, the loss $L$ will be close to zero.  But by equation (2), the best possible guarantee on the mutual information is then $\log N$, so certifying a large mutual information requires $N$ exponential in the size of the guarantee.
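The bound (2) can be checked numerically.  In the sketch below everything is made up for illustration: a small discrete joint with known mutual information, with the optimal critic (the log density ratio) standing in for the learned discriminator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up joint distribution over (x, y) with known mutual information.
joint = np.array([[0.40, 0.10],
                  [0.10, 0.40]])
p_x, p_y = joint.sum(1), joint.sum(0)
I_true = np.sum(joint * np.log(joint / np.outer(p_x, p_y)))

# Optimal critic score: the log density ratio log P(x,y) / (P(x)P(y)).
score = np.log(joint / np.outer(p_x, p_y))

def sample_pairs(n):
    idx = rng.choice(4, size=n, p=joint.ravel())
    return idx // 2, idx % 2            # n independent (x_i, y_i) pairs

N, batches = 8, 2000
losses = []
for _ in range(batches):
    xs, ys = sample_pairs(N)            # y_0 is in-context for x_0; the other
    s = score[xs[0], ys]                # y's play the shuffled out-of-context role
    # Log loss of picking index 0 under the softmax over the N scores.
    losses.append(-(s[0] - np.log(np.sum(np.exp(s)))))
L = np.mean(losses)

bound = np.log(N) - L                   # equation (2): I(x, y) >= log N - L
```

Rerunning the same computation with a near-deterministic joint (large $I$) and small $N$ makes the loss go to zero and the estimate saturate near $\log N$, illustrating the limitation above.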

### Summary

Predictive coding by maximizing mutual information feels like a very principled and promising approach to unsupervised and predictive learning.  A naive approach to maximizing mutual information involves adversarial objectives, discrete latent variables, and a lack of any formal guarantee. The discriminator trick avoids these issues but seems problematic in the case where the true mutual information is larger than a few bits.

Deep learning is progressing at an unprecedented rate.  I expect the next few years to yield considerable progress, one way or another, in unsupervised and predictive learning.
