Many people believe that unsupervised and predictive learning are currently the most important problems in deep learning. From an AGI perspective we have to find some way of getting the machines to learn NLP semantics (and knowledge representation and reasoning) from unlabeled data. We do not know how to label semantics.
This post is based on two papers: my own note from February, Information-Theoretic Co-Training, and a paper from July, Representation Learning with Contrastive Predictive Coding by Aaron van den Oord, Yazhe Li and Oriol Vinyals. Both papers focus on mutual information for predictive coding. van den Oord et al. give empirical results which include what appear to be the best ImageNet pre-training results to date by a large margin. Speech and NLP results are also reported. My paper gives a different model and is less developed. Each approach has pros and cons.
Why Information Theory?
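For me, the fundamental equation of deep learning is
$$\Phi^* = \operatorname*{argmin}_\Phi \; E_{(x,y) \sim \mathrm{Pop}} \left[ -\log P_\Phi(y|x) \right]$$
This is the classical log-loss where $\mathrm{Pop}$ is a population distribution and $P_\Phi(y|x)$ is the conditional probability of a model with parameters $\Phi$. Here we think of $y$ as a label for input $x$. For example, $x$ can be an image and $y$ some kind of image label. These days log loss goes by the name of cross-entropy loss.
$$\Phi^* = \operatorname*{argmin}_\Phi \; E_{x \sim \mathrm{Pop}} \; H(\mathrm{Pop}(y|x),\; P_\Phi(y|x))$$
Here $H(P,Q)$ is the cross-entropy from $P$ to $Q$. If information theoretic training objectives are empirically effective in supervised learning, they should also be effective in unsupervised learning.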
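To make the equivalence of log loss and cross-entropy loss concrete, here is a minimal sketch, assuming NumPy and a made-up batch of model probabilities. The function names `log_loss` and `cross_entropy` are mine, not taken from either paper. It checks that the average of the negative log probability of the label equals the average cross-entropy from the one-hot label distribution to the model distribution.

```python
import numpy as np

def log_loss(probs, labels):
    """Average of -log P(y|x) over a batch.

    probs:  (batch, classes) array of model probabilities P(y|x).
    labels: (batch,) array of integer labels y.
    """
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked)))

def cross_entropy(p, q):
    """H(p, q) = -sum_y p(y) log q(y) for two distributions over labels."""
    return float(-np.sum(p * np.log(q)))

# Hypothetical model outputs for three inputs and three classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
labels = np.array([0, 1, 2])

# The empirical label distribution for each input is one-hot.
one_hot = np.eye(3)[labels]
per_example = [cross_entropy(one_hot[i], probs[i]) for i in range(len(labels))]

loss_a = log_loss(probs, labels)
loss_b = float(np.mean(per_example))
print(loss_a, loss_b)  # the two numbers agree
```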
Why Not Direct Cross-Entropy Density Estimation?
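An obvious approach to unsupervised learning is density estimation by direct cross-entropy minimization: fit a model density $P_\Phi$ to the population distribution $\mathrm{Pop}$ over inputs $x$.
$$\Phi^* = \operatorname*{argmin}_\Phi \; H(\mathrm{Pop}, P_\Phi) = \operatorname*{argmin}_\Phi \; E_{x \sim \mathrm{Pop}} \left[ -\log P_\Phi(x) \right]$$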
This is the perplexity objective of language models. Language modeling remains extremely relevant to NLP applications. The hidden states of LSTM language models yield “contextual word vectors” (see ELMo). Contextual word vectors seem to be replacing more conventional word vectors in a variety of NLP applications, including the leaders on the SQuAD question answering leaderboard. Contextual word embeddings based on the hidden states of a transformer network language model seem to be an improvement on ELMo and lend further support to the transformer network architecture. (Deep learning is advancing at an incredible rate.)
The success of language modeling for pre-training would seem to support direct density estimation by cross-entropy minimization. So what is the problem?
The problem is noise. An image contains a large quantity of information that we do not care about. We do not care about each blade of grass or each raindrop or the least significant bit of each color value. Under the notion of signal and noise discussed below, sentences have higher signal-to-noise ratios than do images. But sentences still have some noise. A sentence can be worded in different ways and we typically do not remember exact phrasing or word choice. From a data-compression viewpoint we want to store only the important information. Density estimation by direct cross-entropy minimization models noise. It seems intuitively clear that the modeling of the signal can get lost in the modeling of the noise, especially in low signal-to-noise settings like images and audio signals.
Mutual Information: Separating Signal from Noise
The basic idea is that we define “signal” to be a function of sensory input that carries mutual information with future sensation. This is apparently a very old idea. van den Oord et al. cite a paper on “predictive coding” by Peter Elias from 1955. This was a time shortly after Shannon’s seminal paper when information theory research was extremely active. van den Oord et al. give the following figure to visualize their formulation of predictive coding.
Here $x$ is some kind of perceptual input like a frame of video, or a frame of speech, or a sentence in a document, or an article in a thread of related articles. Here $c$ is intended to be a high level semantic representation of the past perception and $z$ is intended to be a high level semantic representation of a future perception. My paper and van den Oord et al. both define training objectives centered on maximizing the mutual information
$$I(c, z) = H(z) - H(z|c) \quad\quad (1)$$
Intuitively, we want to establish predictive power: to demonstrate that observations at the present time provide information about the future and hence reduce the uncertainty (entropy) of future observations. The difference between the entropy before the observation and the entropy after the observation is one way of defining mutual information. If we think of the mutual information $I(c,z)$ as the amount of “signal” in $c$ about $z$ then we might define a purely information theoretic signal-to-noise ratio as $I(c,z)/H(c)$. But I am on shaky ground here; there should be a reference for this.
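The definition of mutual information as the entropy of the future representation minus its entropy given the past can be checked on a tiny discrete example. The sketch below assumes NumPy and a made-up joint distribution table; `entropy` and `mutual_information` are illustrative names of my own. It uses the chain rule, conditional entropy equals joint entropy minus the entropy of the conditioning variable, and reports values in bits.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability table (zeros are skipped)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(joint):
    """I(c, z) = H(z) - H(z|c), where joint[i, j] = P(c = i, z = j)."""
    joint = np.asarray(joint, dtype=float)
    p_c = joint.sum(axis=1)                      # marginal of the past
    p_z = joint.sum(axis=0)                      # marginal of the future
    h_z_given_c = entropy(joint) - entropy(p_c)  # chain rule: H(c, z) - H(c)
    return entropy(p_z) - h_z_given_c

# Hypothetical joint distributions over a binary past c and binary future z.
correlated = np.array([[0.4, 0.1],
                       [0.1, 0.4]])
independent = np.outer([0.5, 0.5], [0.3, 0.7])   # past says nothing about future
copy = np.eye(2) / 2                             # future is a copy of the past

for name, joint in [("correlated", correlated),
                    ("independent", independent),
                    ("copy", copy)]:
    print(name, round(mutual_information(joint), 4))
```

The copy case attains the largest possible value for a binary variable (one bit), which is the sense in which simply copying the input is an uninteresting way to score well on this objective.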
There is an immediate problem with the mutual information training objective (1). The objective can be maximized simply by taking the past representation $c$ to be the input sequence $x_1, \ldots, x_t$ and taking the future representation $z$ to be the future input $x_{t+k}$. So we need additional constraints that force $c$ and $z$ to be higher level semantic representations. The two papers take different approaches to this problem.
My paper suggests bounding the entropy of $z$ by restricting it to a finite set, such as a finite length sequence of tokens from a finite vocabulary. van den Oord et al. assume that both $c$ and $z$ are vectors and require that a certain discriminative task be solved by a linear classifier. Requiring that $z$ be used in linear classifiers should force it to expose semantic properties.
A Naive Approach
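My paper takes a naive approach. I work directly with equation (1). The model includes a network defining a coding probability $P_\Psi(z|x)$ where $z$ is restricted to a (possibly combinatorially large) finite set. Equation (1) can then be written as follows where the probability distribution on the sequence $z_1, \ldots, z_{t+1}$ is determined by the coding distribution parameter $\Psi$.
$$\Psi^* = \operatorname*{argmax}_\Psi \; H_\Psi(z_{t+1}) - H_\Psi(z_{t+1} \mid z_1, \ldots, z_t)$$
We now introduce two more models used to estimate the above entropies: a model $P_\Theta$ for estimating the first entropy and a model $P_\Phi$ for the second. We then define a nested optimization problem
$$\Psi^* = \operatorname*{argmax}_\Psi \; \hat{H}_{\Psi,\Theta^*(\Psi)}(z_{t+1}) - \hat{H}_{\Psi,\Phi^*(\Psi)}(z_{t+1} \mid z_1, \ldots, z_t)$$
$$\Theta^*(\Psi) = \operatorname*{argmin}_\Theta \; \hat{H}_{\Psi,\Theta}(z_{t+1}) = \operatorname*{argmin}_\Theta \; E_{z \sim P_\Psi} \left[ -\log P_\Theta(z_{t+1}) \right]$$
$$\Phi^*(\Psi) = \operatorname*{argmin}_\Phi \; \hat{H}_{\Psi,\Phi}(z_{t+1} \mid z_1, \ldots, z_t) = \operatorname*{argmin}_\Phi \; E_{z \sim P_\Psi} \left[ -\log P_\Phi(z_{t+1} \mid z_1, \ldots, z_t) \right]$$
Holding $\Psi$ fixed, the optimization objectives for $\Theta$ and $\Phi$ are just classical cross-entropy loss functions (log loss). If $\Theta$ and $\Phi$ have been fully optimized for $\Psi$ fixed, then the gradients with respect to $\Theta$ and $\Phi$ must be zero and $\Psi$ can be updated ignoring the effect of that update on $\Theta$ and $\Phi$. If the models are universally expressive, and the nested optimization can be done exactly, then we maximize the desired mutual information.
This naive approach has some issues. First, the optimization problem is adversarial. Note that the optimizations of $\Theta$ and $\Psi$ are pulling in opposite directions: $\Psi$ is chosen to increase the estimate of the first entropy while $\Theta$ is chosen to decrease it. This raises all the issues in adversarial optimization. Second, $P_\Psi$ defines a discrete distribution, raising the specter of REINFORCE (although see VQ-VAE for a more sane treatment of discrete latent variables). Third, the model entropy estimates do not provide any guarantee on the true mutual information. This is because a model's cross-entropy estimate of an entropy is never smaller than the true entropy, so the model estimate of the first entropy can be large purely due to modeling error.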
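The third issue can be seen in a toy calculation. The sketch below assumes NumPy and made-up distributions; the names are illustrative only. The future code is independent of the past, so the true mutual information is zero, but a badly fit model of the marginal inflates the estimate of the first entropy, and the difference of the two model estimates comes out well above zero.

```python
import numpy as np

def cross_entropy_bits(p, q):
    """H(p, q) = -sum p log2 q: the entropy estimate a model q gives on data from p."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log2(q[mask])))

# True distribution of a binary future code, independent of the past,
# so the true mutual information between past and future is 0 bits.
p_future = np.array([0.5, 0.5])

# Hypothetical fitted models: a poorly fit marginal model and a well fit
# conditional model (which, under independence, should also predict 50/50).
q_marginal = np.array([0.9, 0.1])
q_conditional = np.array([0.5, 0.5])

true_entropy = cross_entropy_bits(p_future, p_future)
est_first = cross_entropy_bits(p_future, q_marginal)      # estimate of H(z)
est_second = cross_entropy_bits(p_future, q_conditional)  # estimate of H(z | past)

estimated_mi = est_first - est_second
print(true_entropy, est_first, estimated_mi)
```

The estimate of the first entropy exceeds the true entropy, as cross-entropy always does, and the resulting "mutual information" is an artifact of modeling error.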
The Discriminator Trick
The van den Oord et al. paper is based on a “discriminator trick”, a novel form of contrastive estimation (see also this) where a discriminator must distinguish in-context items from out-of-context items. The formal relationship between contrastive estimation and mutual information appears to be novel in this paper. Suppose that we want to measure the mutual information between two arbitrary random variables $x$ and $y$. We draw $N$ pairs $(x_1, y_1), \ldots, (x_N, y_N)$ from the joint distribution. We then randomly shuffle the $y$'s and present a “discriminator” with $x_1$ and the shuffled $y$ values. The discriminator’s job is to find $y_1$ (the value paired with $x_1$) among the shuffled values of $y$. van den Oord et al. show the following lower bound on $I(x,y)$ where $Y$ is the shuffled set of $y$'s; the index $i$ is the position of $y_1$ in $Y$; and $\Phi$ is the model parameters of the discriminator.
$$I(x, y) \geq \log N - E \left[ -\log P_\Phi(i \mid x_1, Y) \right] \quad\quad (2)$$
I refer readers to the paper for the derivation. The discriminator is simply trained to minimize the above log loss (cross-entropy loss). In the paper the discriminator is forced to be linear in $z$, which should force $z$ to be high level. With the context $c$ in the role of $x$, the candidate codes $z_1, \ldots, z_N$ in the role of $Y$, and a weight matrix $W$ (one per prediction offset in the paper), the discriminator has the form
$$P_\Phi(i \mid c, z_1, \ldots, z_N) = \frac{\exp(z_i^\top W c)}{\sum_{j=1}^{N} \exp(z_j^\top W c)}$$
Here the networks producing $c$ and $z$ are part of the discriminator so training the discriminator trains the network for $z$. The vector $z$ is then used as a pre-trained feature vector for the input $x$.
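A minimal NumPy sketch of the discriminator loss and the resulting lower bound follows. It assumes a bilinear score between a context vector and a candidate code, and it uses the other items in a batch as the out-of-context candidates, which is a common batch formulation and not necessarily the paper's exact sampling scheme. The function name, the synthetic data, and the choice of weight matrix are all assumptions made for illustration.

```python
import numpy as np

def contrastive_loss_and_bound(c, z, W):
    """Log loss of a bilinear discriminator and the bound log N - loss (in nats).

    c: (N, d_c) context vectors; z: (N, d_z) codes, with z[i] paired with c[i].
    W: (d_z, d_c) weight matrix, so that scores[i, j] = z[j]^T W c[i].
    """
    scores = c @ W.T @ z.T                                 # shape (N, N)
    scores = scores - scores.max(axis=1, keepdims=True)    # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    loss = float(-np.mean(np.diag(log_probs)))             # -log P(correct index)
    bound = float(np.log(len(c)) - loss)
    return loss, bound

# Synthetic data: each context is a noisy copy of its paired code.
rng = np.random.default_rng(0)
N, d = 64, 16
z = rng.normal(size=(N, d))
c = z + 0.1 * rng.normal(size=(N, d))

# A discriminator that ignores its input (all scores equal) guesses uniformly.
loss_untrained, bound_untrained = contrastive_loss_and_bound(c, z, np.zeros((d, d)))

# A hand-picked W (identity) stands in for a trained discriminator here.
loss_trained, bound_trained = contrastive_loss_and_bound(c, z, np.eye(d))

print(loss_untrained, bound_untrained)
print(loss_trained, bound_trained, np.log(N))
```

Because the loss is never negative, the bound returned here can never exceed the logarithm of the number of candidates, however informative the context is.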
The main issue I see with the discriminator trick is that the discrimination task might be too easy when the mutual information is large. If the discrimination task is easy the loss will be close to zero. But by equation (2), the best possible guarantee on mutual information is $\log N$, where $N$ is the number of candidates presented to the discriminator. This requires $N$ to be exponential in the size of the guarantee.
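The exponential cost is easy to tabulate. The short sketch below is plain arithmetic under the assumption that the loss is driven all the way to zero, so that the guarantee equals the logarithm of the number of candidates; it works in bits, and the function names are my own.

```python
import math

def max_guarantee_bits(num_candidates):
    """Largest mutual information (in bits) the bound can certify with this many candidates."""
    return math.log2(num_candidates)

def candidates_needed(bits):
    """Number of candidates required before the bound can certify `bits` bits."""
    return 2 ** bits

for bits in (1, 4, 8, 16, 32):
    print(f"{bits:>2} bits needs {candidates_needed(bits):,} candidates")
```

Certifying even a few dozen bits this way would need more candidates per discrimination problem than is practical.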
Predictive coding by maximizing mutual information feels like a very principled and promising approach to unsupervised and predictive learning. A naive approach to maximizing mutual information involves adversarial objectives, discrete latent variables, and a lack of any formal guarantee. The discriminator trick avoids these issues but seems problematic in the case where the true mutual information is larger than a few bits.
Deep learning is progressing at an unprecedented rate. I expect the next few years to yield considerable progress, one way or another, in unsupervised and predictive learning.