Encoder Autonomy

As in previous years, teaching my course on the fundamentals of deep learning has inspired some blog posts.

This year I realized that VAEs are non-parametrically consistent as models of the observed data even when the encoder is held fixed and arbitrary. This is best demonstrated with a nonstandard derivation of VAEs bypassing the ELBO.

Let y range over observable data and let z range over latent values. Let the encoder be defined by a probability distribution P_\Theta(z|y). We then have a joint distribution P_{\mathrm{Pop},\Theta}(y,z) where y is drawn from the population (\mathrm{Pop}) and z is drawn from the encoder. We let the entropies and the mutual information H_{\mathrm{Pop}}(y), H_{\mathrm{Pop},\Theta}(z), H_{\mathrm{Pop},\Theta}(y|z), H_{\mathrm{Pop},\Theta}(z|y) and I_{\mathrm{Pop},\Theta}(y,z) all be defined by this joint distribution. To derive the VAE objective we start with the following basic information theoretic equalities.

\begin{array}{rcl} I_{\mathrm{Pop},\Theta}(y,z) & = & H_{\mathrm{Pop}}(y) - H_{\mathrm{Pop},\Theta}(y|z) \\ \\ H_{\mathrm{Pop}}(y) & = & I_{\mathrm{Pop},\Theta}(y,z) + H_{\mathrm{Pop},\Theta}(y|z) \\ \\ & = & H_{\mathrm{Pop},\Theta}(z) - H_{\mathrm{Pop},\Theta}(z|y) + H_{\mathrm{Pop},\Theta}(y|z) \;\;(1)\end{array}
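Equality (1) is easy to check numerically on a toy problem. The following is a minimal sketch, assuming a three-valued y, a two-valued z and an arbitrary hand-picked encoder table; the names `pop`, `enc` and `entropy` are made up for illustration.

```python
import math

# Hypothetical toy population over y in {0,1,2} and an arbitrary encoder, z in {0,1}.
pop = [0.5, 0.3, 0.2]
enc = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]  # enc[y][z] = P(z|y)

def entropy(p):
    """Shannon entropy in nats of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)

# Joint P(y,z) = Pop(y) P(z|y) and its marginal over z.
joint = [[pop[y] * enc[y][z] for z in range(2)] for y in range(3)]
pz = [sum(joint[y][z] for y in range(3)) for z in range(2)]

H_y = entropy(pop)
H_z = entropy(pz)
# H(z|y) = sum_y Pop(y) H(P(z|y))
H_z_given_y = sum(pop[y] * entropy(enc[y]) for y in range(3))
# H(y|z) = sum_z P(z) H(P(y|z))
H_y_given_z = sum(
    pz[z] * entropy([joint[y][z] / pz[z] for y in range(3)]) for z in range(2)
)

rhs = H_z - H_z_given_y + H_y_given_z  # right hand side of (1)
print(H_y, rhs)
```

The two printed numbers should agree up to floating point error, whatever encoder table is used.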

Assuming that we can sample z from the encoder distribution P_\Theta(z|y), and that we can compute P_\Theta(z|y) for any y and z, the conditional entropy H_{\mathrm{Pop},\Theta}(z|y) can be estimated by sampling. That is not true of H_{\mathrm{Pop},\Theta}(z) or H_{\mathrm{Pop},\Theta}(y|z), because we have no way of computing P_{\mathrm{Pop},\Theta}(z) or P_{\mathrm{Pop},\Theta}(y|z). However, entropies defined in terms of the population can be upper bounded (and estimated) by cross-entropies, so we introduce two models, P_\Phi(z) and P_\Psi(y|z), with which to define cross-entropies.

\begin{array}{rcl} \hat{H}_{\mathrm{Pop},\Theta,\Phi}(z) & = & E_{\mathrm{Pop}, \Theta}\;-\ln P_\Phi(z) \\ \\ \Phi^* & = & \mathrm{argmin}_\Phi \;\hat{H}_{\mathrm{Pop},\Theta,\Phi}(z)\;\;(2)\\ \\\hat{H}_{\mathrm{Pop},\Theta,\Psi}(y|z) & = & E_{\mathrm{Pop}, \Theta}\;-\ln P_\Psi(y|z) \\ \\ \Psi^* & = & \mathrm{argmin}_\Psi \;\hat{H}_{\mathrm{Pop},\Theta,\Psi}(y|z)\;\;(3)\end{array}
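Both H_{\mathrm{Pop},\Theta}(z|y) and the cross-entropies above are expectations over pairs (y,z) drawn from the population and the encoder, so they can be estimated from samples. Here is a minimal sketch, assuming a toy discrete population, a hand-picked encoder table and a hypothetical, unoptimized prior model `prior`; none of these values come from a real dataset.

```python
import math
import random

random.seed(0)
pop = [0.5, 0.3, 0.2]                        # hypothetical population over y
enc = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]   # enc[y][z] = P_Theta(z|y)
prior = [0.6, 0.4]                           # hypothetical P_Phi(z), not optimized

n = 20000
h_zy = 0.0   # Monte Carlo estimate of H(z|y)
ce_z = 0.0   # Monte Carlo estimate of the cross-entropy E[-ln P_Phi(z)]
for _ in range(n):
    y = random.choices(range(3), weights=pop)[0]      # y ~ Pop
    z = random.choices(range(2), weights=enc[y])[0]   # z ~ P_Theta(z|y)
    h_zy += -math.log(enc[y][z]) / n                  # needs only the encoder density
    ce_z += -math.log(prior[z]) / n                   # needs only the model density

# Exact values, available here only because the toy problem is tiny.
pairs = [(y, z) for y in range(3) for z in range(2)]
exact_h_zy = -sum(pop[y] * enc[y][z] * math.log(enc[y][z]) for y, z in pairs)
exact_ce_z = -sum(pop[y] * enc[y][z] * math.log(prior[z]) for y, z in pairs)
print(h_zy, exact_h_zy, ce_z, exact_ce_z)
```

Neither estimate requires the intractable marginal P_{\mathrm{Pop},\Theta}(z), which is why the cross-entropy can stand in for the entropy.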

Inserting these two cross-entropy upper bounds (or entropy estimators) into (1) gives

H_{\mathrm{Pop}}(y) \leq \hat{H}_{\mathrm{Pop},\Theta,\Phi}(z) - H_{\mathrm{Pop},\Theta}(z|y) + \hat{H}_{\mathrm{Pop},\Theta,\Psi}(y|z). \;\;\;(4)
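The slack in (4) can be made explicit. A cross-entropy equals the corresponding entropy plus a KL divergence (the \mathrm{KL} notation is introduced only for this remark), so

\begin{array}{rcl} \hat{H}_{\mathrm{Pop},\Theta,\Phi}(z) - H_{\mathrm{Pop},\Theta}(z) & = & \mathrm{KL}(P_{\mathrm{Pop},\Theta}(z),\;P_\Phi(z)) \;\geq\; 0 \\ \\ \hat{H}_{\mathrm{Pop},\Theta,\Psi}(y|z) - H_{\mathrm{Pop},\Theta}(y|z) & = & E_{\mathrm{Pop},\Theta}\;\mathrm{KL}(P_{\mathrm{Pop},\Theta}(y|z),\;P_\Psi(y|z)) \;\geq\; 0 \end{array}

The difference between the right hand side of (4) and H_{\mathrm{Pop}}(y) is the sum of these two terms. Both vanish when the prior and decoder match the marginal and conditional induced by the encoder, whatever that encoder is.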

The right hand side of (4) is the standard VAE objective function in terms of the prior, the encoder and the decoder. However, this derivation of the upper bound (4) from the exact equality (1) shows that we get a consistent non-parametric estimator of H_{\mathrm{Pop}}(y) by optimizing the prior and decoder according to (2) and (3) while holding the encoder fixed. This follows directly from the fact that cross-entropy is a consistent non-parametric estimator of entropy in the sense that \inf_Q\;H(P,Q) = H(P), so with unrestricted model classes both bounds become tight. Furthermore, we expect that P_\Phi(z) estimates P_{\mathrm{Pop},\Theta}(z) and that P_\Psi(y|z) estimates P_{\mathrm{Pop},\Theta}(y|z), again independent of the choice of \Theta.
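The claim can be illustrated on a toy discrete problem. This is a minimal sketch, assuming tabular (unrestricted) prior and decoder models, for which the minimizers of (2) and (3) are available in closed form; the function `vae_bound` and all numeric values are hypothetical.

```python
import math

pop = [0.5, 0.3, 0.2]                        # hypothetical population over y
enc = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]   # fixed, arbitrary encoder enc[y][z]
Y, Z = range(3), range(2)

def vae_bound(prior, dec):
    """Right hand side of (4), with prior[z] = P_Phi(z) and dec[z][y] = P_Psi(y|z)."""
    total = 0.0
    for y in Y:
        for z in Z:
            w = pop[y] * enc[y][z]  # joint weight P(y,z)
            # -ln prior(z) + ln enc(z|y) - ln dec(y|z); the middle term gives -H(z|y)
            total += w * (-math.log(prior[z]) + math.log(enc[y][z]) - math.log(dec[z][y]))
    return total

H_y = -sum(p * math.log(p) for p in pop)

# An arbitrary, suboptimal prior and decoder give a loose upper bound.
uniform_dec = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]
loose = vae_bound([0.5, 0.5], uniform_dec)

# For tabular models the minimizers of (2) and (3) are the marginal and
# conditional of the joint induced by the fixed encoder.
pz = [sum(pop[y] * enc[y][z] for y in Y) for z in Z]
post = [[pop[y] * enc[y][z] / pz[z] for y in Y] for z in Z]
tight = vae_bound(pz, post)

print(H_y, loose, tight)
```

With the optimal prior and decoder the bound should coincide with H_{\mathrm{Pop}}(y) up to floating point error, even though the encoder table was never trained.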

This observation gives us freedom in designing latent variable objective functions that produce useful or interpretable latent variables. We can train the prior and decoder by (2) and (3) and train the encoder by any choice of an encoder objective function. For example, a natural choice might be

\begin{array}{rcl} \Theta^* & = & \mathrm{argmin}_\Theta\; \hat{I}_{\mathrm{Pop},\Theta,\Phi}(z,y) + \lambda \hat{H}_{\mathrm{Pop},\Theta,\Psi}(y|z) \;\;(5)\\ \\ \hat{I}_{\mathrm{Pop},\Theta,\Phi}(z,y) & = & \hat{H}_{\mathrm{Pop},\Theta,\Phi}(z) - H_{\mathrm{Pop},\Theta}(z|y)\end{array}

The weight \lambda in (5) can be interpreted as providing a rate-distortion trade-off, where the mutual information (upper bound) expresses the information rate of z as a communication channel for the message y and the decoder cross-entropy plays the role of the distortion. This is the \beta-VAE objective, except that the \beta-VAE weights the rate by \beta rather than the distortion by \lambda.
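The correspondence between \lambda and \beta can be written out. Assuming \lambda > 0, dividing the objective in (5) by \lambda does not change the minimizer, which gives

\begin{array}{rcl} \Theta^* & = & \mathrm{argmin}_\Theta\; \hat{I}_{\mathrm{Pop},\Theta,\Phi}(z,y) + \lambda \hat{H}_{\mathrm{Pop},\Theta,\Psi}(y|z) \\ \\ & = & \mathrm{argmin}_\Theta\; \beta\,\hat{I}_{\mathrm{Pop},\Theta,\Phi}(z,y) + \hat{H}_{\mathrm{Pop},\Theta,\Psi}(y|z),\;\;\;\beta = 1/\lambda \end{array}

so a large \lambda (distortion weighted heavily) corresponds to a small \beta (rate weighted lightly).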

However, there can be different encoders achieving the same rate and distortion. The consistency of (4) independent of \Theta allows additional desiderata to be placed on the encoder. For example, we might want z to be a sequence z_1,\ldots,z_k with the z_i independent and the mutual information with y evenly balanced between the z_i yielding a VAE similar to an InfoGAN.

Here we are designing different objectives for different model components: the objectives defined by (2) and (3) for \Phi and \Psi are intended to be independent of any designed objective for \Theta, and the objective for \Theta can be designed independently of (2) and (3). Multiple objective functions yield a multiplayer game with Nash equilibria. In practice we will need to insert stop gradients to prevent, for example, the objective for (player) \Theta from interfering with the objective for (player) \Phi and vice-versa.
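The three-player structure can be sketched on a toy discrete problem. This is only an illustration under stated assumptions: the models are softmax tables, the gradients are central finite differences taken with respect to one player's parameters at a time, and the names `terms`, `step`, `LAM` and all numeric values are hypothetical. In an autodiff framework the same separation would instead be obtained with a stop-gradient operation (for example `jax.lax.stop_gradient` in JAX or `.detach()` in PyTorch).

```python
import math

POP = [0.5, 0.3, 0.2]   # hypothetical population over y in {0,1,2}; z in {0,1}
LAM = 2.0               # hypothetical weight lambda in (5)

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def terms(theta, phi, psi):
    """Return (H_hat(z), H(z|y), H_hat(y|z)) for flat logit lists
    theta (3x2 encoder), phi (2 prior) and psi (2x3 decoder)."""
    prior = softmax(phi)
    h_z = h_zy = h_yz = 0.0
    for y in range(3):
        enc = softmax(theta[2 * y:2 * y + 2])
        for z in range(2):
            dec = softmax(psi[3 * z:3 * z + 3])
            w = POP[y] * enc[z]
            h_z -= w * math.log(prior[z])
            h_zy -= w * math.log(enc[z])
            h_yz -= w * math.log(dec[y])
    return h_z, h_zy, h_yz

def step(loss, own, lr=0.5, eps=1e-5):
    """One gradient step on `own` only; the other players' parameters are
    constants captured inside `loss`, which plays the role of a stop gradient."""
    grad = []
    for i in range(len(own)):
        up = own[:i] + [own[i] + eps] + own[i + 1:]
        dn = own[:i] + [own[i] - eps] + own[i + 1:]
        grad.append((loss(up) - loss(dn)) / (2 * eps))
    return [p - lr * g for p, g in zip(own, grad)]

theta = [1.0, 0.0, 0.0, 0.5, -0.5, 0.3]
phi = [0.0, 0.0]
psi = [0.0] * 6

for _ in range(500):
    # Each player descends its own objective in its own parameters only.
    new_phi = step(lambda p: terms(theta, p, psi)[0], phi)   # objective (2)
    new_psi = step(lambda p: terms(theta, phi, p)[2], psi)   # objective (3)

    def enc_loss(t):                                         # objective (5)
        h_z, h_zy, h_yz = terms(t, phi, psi)
        return h_z - h_zy + LAM * h_yz

    new_theta = step(enc_loss, theta)
    theta, phi, psi = new_theta, new_phi, new_psi            # simultaneous update

h_z, h_zy, h_yz = terms(theta, phi, psi)
bound = h_z - h_zy + h_yz   # right hand side of (4)
H_y = -sum(p * math.log(p) for p in POP)
print(H_y, bound)
```

The printed bound can never fall below H_{\mathrm{Pop}}(y), and it approaches H_{\mathrm{Pop}}(y) to the extent that the prior and decoder players have converged. Replacing `enc_loss` with a different encoder objective changes which encoder is found but not this property. Simultaneous gradient play is not guaranteed to converge in general, so this is a sketch of the structure and not a training recipe.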

The bottom line is that we can select any objective for the encoder while preserving non-parametric consistency of the VAE as a model of the observed data.
