As in previous years, teaching my course on the fundamentals of deep learning has inspired some blog posts.
This year I realized that VAEs are non-parametrically consistent as models of the observed data even when the encoder is held fixed and arbitrary. This is best shown with a nonstandard derivation of VAEs that bypasses the ELBO.
Let $y$ range over observable data and let $z$ range over latent values. Let the encoder be defined by a probability distribution $P_\mathrm{enc}(z|y)$. We then have a joint distribution $P(y,z)$ where $y$ is drawn from the population ($y \sim \mathrm{Pop}$) and $z$ is drawn from the encoder. We let the entropies and the mutual information $H(y)$, $H(z)$, $H(z|y)$, $H(y|z)$ and $I(y,z)$ all be defined by this joint distribution. To derive the VAE objective we start with the following basic information-theoretic equalities.

$$I(y,z) = H(y) - H(y|z) = H(z) - H(z|y)$$

$$H(y) = H(z) - H(z|y) + H(y|z) \quad\quad (1)$$
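As a sanity check, the identity $H(y) = H(z) - H(z|y) + H(y|z)$ can be verified numerically on a small discrete joint distribution. The population and encoder below are made up purely for illustration:

```python
import numpy as np

# Toy discrete setup: y from a 3-value population, z from a
# hypothetical 2-value encoder P_enc(z|y).  All numbers are illustrative.
pop = np.array([0.5, 0.3, 0.2])            # P(y)
enc = np.array([[0.9, 0.1],                # P_enc(z|y), rows indexed by y
                [0.4, 0.6],
                [0.2, 0.8]])
joint = pop[:, None] * enc                 # P(y, z)

def H(p):
    """Entropy in nats of a flattened distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

Hy  = H(pop)                               # H(y)
Hz  = H(joint.sum(axis=0))                 # H(z)
Hyz = H(joint.ravel())                     # H(y, z)
H_z_given_y = Hyz - Hy                     # H(z|y) = H(y,z) - H(y)
H_y_given_z = Hyz - Hz                     # H(y|z) = H(y,z) - H(z)

# Identity (1): H(y) = H(z) - H(z|y) + H(y|z)
assert np.isclose(Hy, Hz - H_z_given_y + H_y_given_z)
```

The same check passes for any choice of population and encoder, since (1) is an exact identity of the joint distribution.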
Assuming that we can sample from the encoder distribution $P_\mathrm{enc}(z|y)$, and that we can compute $P_\mathrm{enc}(z|y)$ for any $y$ and $z$, the conditional entropy $H(z|y)$ can be estimated by sampling. That is not true of $H(z)$ or $H(y|z)$, because we have no way of computing $P(z)$ or $P(y|z)$. However, entropies defined in terms of the population can be upper bounded (and estimated) by cross-entropies, and we introduce two models $P_\mathrm{pri}(z)$ and $P_\mathrm{dec}(y|z)$ with which to define cross-entropies.

$$H(z) \leq E_{y,z}\left[-\ln P_\mathrm{pri}(z)\right] \quad\quad (2)$$

$$H(y|z) \leq E_{y,z}\left[-\ln P_\mathrm{dec}(y|z)\right] \quad\quad (3)$$
Inserting these two cross-entropy upper bounds (or entropy estimators) into (1) gives

$$H(y) \leq E_{y \sim \mathrm{Pop},\, z \sim P_\mathrm{enc}(z|y)}\left[\ln \frac{P_\mathrm{enc}(z|y)}{P_\mathrm{pri}(z)} - \ln P_\mathrm{dec}(y|z)\right] \quad\quad (4)$$
The right-hand side of (4) is the standard VAE objective function in terms of the prior, the encoder and the decoder. However, this derivation of the upper bound (4) from the exact equality (1) shows that we get a consistent non-parametric estimator of $H(y)$ by optimizing the prior and the decoder according to (2) and (3) while holding the encoder fixed. This follows directly from the fact that cross entropy is a consistent non-parametric estimator of entropy in the sense that $H(y) = \inf_Q\, E_{y \sim \mathrm{Pop}}\left[-\ln Q(y)\right]$. Furthermore, we expect that $P_\mathrm{pri}(z)$ estimates $P(z)$ and that $P_\mathrm{dec}(y|z)$ estimates $P(y|z)$, again independent of the choice of $P_\mathrm{enc}(z|y)$.
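In the discrete case this consistency can be checked exactly: for an arbitrary fixed encoder, plugging the true marginal $P(z)$ into the prior and the true conditional $P(y|z)$ into the decoder makes the bound (4) tight at $H(y)$. A minimal numpy sketch, with all distributions invented for illustration:

```python
import numpy as np

# An arbitrary population and an arbitrary fixed encoder.
rng = np.random.default_rng(0)
pop = rng.dirichlet(np.ones(4))            # P(y), 4 observable values
enc = rng.dirichlet(np.ones(3), size=4)    # P_enc(z|y), 3 latent values
joint = pop[:, None] * enc                 # P(y, z)

# Optimal prior and decoder under cross-entropy objectives (2) and (3):
pri = joint.sum(axis=0)                    # marginal P(z)
dec = (joint / pri).T                      # conditional P(y|z), rows indexed by z

# Right-hand side of (4): E[ ln P_enc(z|y)/P_pri(z) - ln P_dec(y|z) ]
rhs = np.sum(joint * (np.log(enc) - np.log(pri)[None, :] - np.log(dec.T)))

Hy = -np.sum(pop * np.log(pop))            # true entropy H(y)
assert np.isclose(rhs, Hy)                 # bound (4) is tight for ANY encoder
```

Rerunning with a different seed (i.e., a different arbitrary encoder) still yields equality, which is the claimed encoder-independence.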
This observation gives us freedom in designing latent-variable objective functions that produce useful or interpretable latent variables. We can train the prior and the decoder by (2) and (3) and train the encoder by any choice of an encoder objective function. For example, a natural choice might be the rate-distortion objective

$$\mathrm{enc}^* = \operatorname*{argmin}_\mathrm{enc}\; \hat{I}(y,z) + \lambda\, E_{y,z}\left[\mathrm{Dist}(y, \hat{y}(z))\right] \quad\quad (5)$$

where $\hat{I}(y,z) = E_{y,z}\left[\ln \frac{P_\mathrm{enc}(z|y)}{P_\mathrm{pri}(z)}\right]$ is the upper bound on $I(y,z)$ given by (2), and $\mathrm{Dist}(y, \hat{y}(z))$ is a distortion between $y$ and a reconstruction $\hat{y}(z)$.
The weight $\lambda$ in (5) can be interpreted as providing a rate-distortion trade-off where the mutual-information upper bound $\hat{I}(y,z)$ expresses the channel capacity (information rate) of $z$ as a communication channel for the message $y$. This is exactly the $\beta$-VAE, which weights the rate by $\beta$ rather than the distortion by $\lambda$.
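The two weightings pick out the same encoders: the objective $\mathrm{rate} + \lambda \cdot \mathrm{distortion}$ is $\lambda$ times the objective $\beta \cdot \mathrm{rate} + \mathrm{distortion}$ when $\beta = 1/\lambda$, so they share minimizers. A toy check over a hypothetical one-parameter trade-off curve (the curves are invented, not derived from any model):

```python
import numpy as np

# Hypothetical rate/distortion curves over a one-parameter family of
# encoders indexed by t.  Only the proportionality argument matters.
t = np.linspace(0.01, 5.0, 500)
rate, dist = t, 1.0 / t

lam = 0.25
obj_lambda = rate + lam * dist             # objective (5): rate + lambda*dist
obj_beta   = (1.0 / lam) * rate + dist     # beta-VAE with beta = 1/lambda

# The objectives differ by the positive factor lambda, hence same argmin.
assert np.argmin(obj_lambda) == np.argmin(obj_beta)
```

The scale of the optimal objective value differs, but the selected encoder does not, which is why the post treats the two parameterizations as the same trade-off.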
However, there can be different encoders achieving the same rate and distortion. The consistency of (4) independent of $P_\mathrm{enc}(z|y)$ allows additional desiderata to be placed on the encoder. For example, we might want $z$ to be a sequence $z_1, \ldots, z_k$ with the $z_i$ independent and the mutual information with $y$ evenly balanced between the $z_i$, yielding a VAE similar to an InfoGAN.
Here we are designing different objectives for different model components: the objectives defined by (2) and (3) for $P_\mathrm{pri}$ and $P_\mathrm{dec}$ are intended to be independent of any designed objective for $P_\mathrm{enc}$, and the objective for $P_\mathrm{enc}$ can be designed independent of (2) and (3). Multiple objective functions yield a multiplayer game with Nash equilibria. In practice we will need to insert stop gradients to prevent, for example, the objective for the encoder (player $\mathrm{enc}$) from interfering with the objective for the prior (player $\mathrm{pri}$) and vice versa.
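The multiplayer dynamics can be sketched in a few lines: each player does gradient descent on its own objective over its own parameter only, and holding the other player's parameter constant during an update plays the role of a stop gradient. The quadratic objectives below are illustrative stand-ins, not the VAE losses:

```python
import numpy as np

# Two players, each with one scalar parameter and its own objective.
a, b = 2.0, -3.0
lr = 0.1
for _ in range(200):
    # Player 1 minimizes (a - b)^2, updating a only; b is treated as a
    # constant here, which is the stop-gradient on player 2's parameter.
    grad_a = 2.0 * (a - b)
    a -= lr * grad_a
    # Player 2 minimizes (b - 1)^2, updating b only; a is held fixed.
    grad_b = 2.0 * (b - 1.0)
    b -= lr * grad_b

# The unique Nash equilibrium of this game is b = 1 and a = b = 1.
assert abs(b - 1.0) < 1e-6 and abs(a - 1.0) < 1e-6
```

In an actual VAE implementation the same pattern appears as detaching the relevant tensors so that, e.g., gradients of the prior's cross-entropy loss do not flow into the encoder's parameters.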
The bottom line is that we can select any objective for the encoder while preserving non-parametric consistency of the VAE as a model of the observed data.