This blog post observes that the expectation maximization algorithm (EM) is exactly alternating maximization of the variational autoencoder (VAE) objective function provided that the VAE encoder is sufficiently expressive. This observation is mathematically straightforward but I have not seen it pointed out before (references welcome). It is hoped that being aware of this equivalence will aid the development of new forms of VAEs.

First some background:

The expectation maximization (EM) algorithm is a time-honored cornerstone of machine learning. Its general formulation was given in a classic 1977 paper by Dempster, Laird and Rubin, although many special cases had been developed prior to that. The algorithm is used for estimating unknown parameters of a probability distribution. The algorithm is usually introduced by looking at the special case of modeling a collection of points by a mixture of Gaussians. Given the points one must estimate a component weight, a mean and a covariance matrix for each component of the mixture.

This is a non-convex optimization problem. EM iteratively re-estimates the parameters with the guarantee that each iteration improves the probability of the points under the model, unless one is already at a stationary point (a parameter setting where the gradients of the probability are zero). There are numerous important special cases, such as learning the parameters of a hidden Markov model or a probabilistic context-free grammar.
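As a concrete instance of the mixture-of-Gaussians case above, here is a minimal one-dimensional EM sketch. This is illustrative only; the quantile initialization and the fixed iteration count are my own choices, not part of the standard presentation.

```python
import numpy as np

def em_gmm_1d(x, k, iters=100):
    """EM for a one-dimensional mixture of k Gaussians (illustrative sketch).

    Returns the mixture weights, means, and variances estimated from the
    points x.  Initializing the means at quantiles of the data is a simple
    heuristic choice."""
    n = len(x)
    pi = np.full(k, 1.0 / k)                        # component weights
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # spread initial means
    var = np.full(k, np.var(x))                     # initial variances
    for _ in range(iters):
        # E step: responsibilities r[i, j] = P(component j | point i).
        log_p = (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
                 - 0.5 * (x[:, None] - mu) ** 2 / var)
        log_p -= log_p.max(axis=1, keepdims=True)   # for numerical stability
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M step: re-estimate weights, means, and variances from the
        # responsibilities.
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var
```

On data drawn from two well-separated Gaussians this recovers the component means and variances to within sampling error.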

Variational autoencoders (VAEs) were introduced in 2014 by Kingma and Welling in the context of deep learning. Autoencoding is related to compression. JPEG compresses an image into an encoded form that uses less memory but can then be decompressed to get the image back with some loss of fidelity.

[Figure: an image is compressed to a smaller encoded form and then decompressed back to an approximation of the original image. Credit: Kevin Franz]

Information theoretically, compression is closely related to distribution modeling. Shannon’s source coding theorem states that for any probability distribution, such as the distribution on “natural” images, one can (block) code the elements using a number of bits per element equal to the information-theoretic entropy of the distribution. In the above figure the average number of bits in the compressed form in the middle should be equal to the entropy of the probability distribution over images.
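Shannon’s bound is easy to compute for a discrete distribution. Here is a small sketch; the distributions are arbitrary examples of my own.

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits of a discrete distribution p: the optimal
    average number of bits per symbol achievable by a (block) code."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# A uniform distribution over 4 symbols needs exactly 2 bits per symbol,
# while a skewed distribution can be coded with fewer bits on average.
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # 2.0
print(entropy_bits([0.7, 0.1, 0.1, 0.1]))      # about 1.36
```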

VAEs, however, do not directly address the compression problem. Instead they focus on fitting the parameters of the distribution so as to maximize the marginal probability of a given set of samples (such as a set of points or a set of images). The encoder is just a tool to make this parameter fitting more efficient. The problem being solved by a VAE is the same as the problem being solved by EM — fitting the parameters of a probability distribution to given data where the model includes latent variables not specified in the data.

The objective function of a VAE can be written as follows, where we let $y$ denote the collection of all the empirical data (the collection of all points or images), $z$ denote the latent variable, $P_\Phi(z,y)$ denote the generative model (the decoder), and $P_\Psi(z|y)$ denote the encoder.

$$\Phi^*, \Psi^* \;=\; \operatorname*{argmax}_{\Phi,\Psi}\; E_{z \sim P_\Psi(z|y)}\left[\ln \frac{P_\Phi(z,y)}{P_\Psi(z|y)}\right] \;\;\;\;(1)$$

We now consider optimizing this objective by alternating maximization. Given settings $\Phi_t$ and $\Psi_t$ we compute $\Psi_{t+1}$ and $\Phi_{t+1}$ using

$$\Psi_{t+1} \;=\; \operatorname*{argmax}_{\Psi}\; E_{z \sim P_\Psi(z|y)}\left[\ln \frac{P_{\Phi_t}(z,y)}{P_\Psi(z|y)}\right]$$

$$\Phi_{t+1} \;=\; \operatorname*{argmax}_{\Phi}\; E_{z \sim P_{\Psi_{t+1}}(z|y)}\left[\ln \frac{P_\Phi(z,y)}{P_{\Psi_{t+1}}(z|y)}\right]$$

We will show that setting $\Psi_{t+1}$ gives the E step of EM while setting $\Phi_{t+1}$ gives the M step. We first consider the E step of computing $\Psi_{t+1}$. For this we use the following identity.

$$E_{z \sim P_\Psi(z|y)}\left[\ln \frac{P_\Phi(z,y)}{P_\Psi(z|y)}\right] \;=\; \ln P_\Phi(y) \;-\; KL(P_\Psi(z|y),\, P_\Phi(z|y))$$

This allows us to rewrite (1) as

$$\Phi^*, \Psi^* \;=\; \operatorname*{argmax}_{\Phi,\Psi}\; \ln P_\Phi(y) \;-\; KL(P_\Psi(z|y),\, P_\Phi(z|y))$$

Since we are maximizing, we want to minimize the KL divergence, which is achieved when $P_{\Psi_{t+1}}(z|y) = P_{\Phi_t}(z|y)$. We consider the case where this can be achieved exactly. This is just the requirement that the encoder $P_\Psi(z|y)$ is sufficiently expressive to represent the sufficient statistics of the E step of EM. To compute the M step for setting $\Phi_{t+1}$ we can use (1): the term $\ln P_{\Psi_{t+1}}(z|y)$ does not depend on $\Phi$ and drops out, and $P_{\Psi_{t+1}}(z|y) = P_{\Phi_t}(z|y)$, giving

$$\Phi_{t+1} \;=\; \operatorname*{argmax}_{\Phi}\; E_{z \sim P_{\Phi_t}(z|y)}\left[\ln P_\Phi(z,y)\right]$$

which is in fact the M step of EM.
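The identity relating the objective to $\ln P_\Phi(y)$ minus a KL divergence can be checked numerically on a toy discrete model; the joint distribution below is an arbitrary example of my own, not anything from the VAE literature.

```python
import numpy as np

# Toy discrete model: the joint P(z, y) for a single observed y and
# three latent values z (numbers chosen arbitrarily for illustration).
p_joint = np.array([0.1, 0.3, 0.2])   # P(z, y) for z = 0, 1, 2
p_y = p_joint.sum()                   # marginal P(y)
posterior = p_joint / p_y             # exact posterior P(z | y)

def elbo(q):
    """E_{z~q}[ln P(z,y) - ln q(z)], the quantity maximized over the encoder."""
    return float(np.sum(q * (np.log(p_joint) - np.log(q))))

# At q = posterior the objective equals ln P(y); for any other q it is
# smaller, by exactly KL(q, posterior).
print(elbo(posterior), np.log(p_y))           # the two values agree
print(elbo(np.full(3, 1 / 3)) < np.log(p_y))  # True
```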

EM and VAEs are typically used in very different settings. For EM we typically have that the latent variable $z$ has nontrivial semantics. For example, it is the unknown parse of a given sentence or the hidden state sequence of a hidden Markov model. For VAEs we typically have that $z$ is just a source of noise: a semantics-free random noise vector that can be “decompressed” into a data point by the decoder $P_\Phi$. The relationship between VAEs and EM might provide intuitions for making $z$ more semantic in the case of VAEs. Also, making $z$ continuous undermines the data compression aspect of distribution modeling, since a continuous variable contains an infinite amount of information. In a typical EM application $z$ is discrete. Making $z$ discrete for VAEs would yield a measure of compression and hence a measure of the degree to which the model is capturing regularities in the data.

In any case, the mathematical relationship between VAEs and EM seems worth noting.