## Encoder Autonomy

As in previous years, teaching my course on the fundamentals of deep learning has inspired some blog posts.

This year I realized that VAEs are non-parametrically consistent as models of the observed data even when the encoder is held fixed and arbitrary. This is best demonstrated with a nonstandard derivation of VAEs bypassing the ELBO.

Let $y$ range over observable data and let $z$ range over latent values. Let the encoder be defined by a probability distribution $P_\Theta(z|y)$. We then have a joint distribution $P_{\mathrm{Pop},\Theta}(y,z)$ where $y$ is drawn from the population ($\mathrm{Pop}$) and $z$ is drawn from the encoder. We let the entropies and the mutual information $H_{\mathrm{Pop}}(y)$, $H_{\mathrm{Pop},\Theta}(z)$, $H_{\mathrm{Pop},\Theta}(y|z)$, $H_{\mathrm{Pop},\Theta}(z|y)$ and $I_{\mathrm{Pop},\Theta}(y,z)$ all be defined by this joint distribution. To derive the VAE objective we start with the following basic information-theoretic equalities. $\begin{array}{rcl} I_{\mathrm{Pop},\Theta}(y,z) & = & H_{\mathrm{Pop}}(y) - H_{\mathrm{Pop},\Theta}(y|z) \\ \\ H_{\mathrm{Pop}}(y) & = & I_{\mathrm{Pop},\Theta}(y,z) + H_{\mathrm{Pop},\Theta}(y|z) \\ \\ & = & H_{\mathrm{Pop},\Theta}(z) - H_{\mathrm{Pop},\Theta}(z|y) + H_{\mathrm{Pop},\Theta}(y|z) \;\;(1)\end{array}$
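Equality (1) is easy to check numerically in a toy discrete setting. The sketch below assumes a hypothetical three-value population and an arbitrary hand-picked encoder table (neither is from the derivation above) and computes each term of (1) directly from the joint distribution.

```python
import math

# Hypothetical population over three values of y (an assumption for illustration).
pop = [0.5, 0.3, 0.2]
# Arbitrary encoder table P(z|y): one row per y, one column per z.
encoder = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]

def entropy(p):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(v * math.log(v) for v in p if v > 0)

n_y, n_z = len(pop), len(encoder[0])

# Joint distribution: y drawn from the population, z drawn from the encoder.
joint = [[pop[y] * encoder[y][z] for z in range(n_z)] for y in range(n_y)]
p_z = [sum(joint[y][z] for y in range(n_y)) for z in range(n_z)]

h_y = entropy(pop)
h_z = entropy(p_z)
# H(z|y) averages the encoder entropy over the population.
h_z_given_y = sum(pop[y] * entropy(encoder[y]) for y in range(n_y))
# H(y|z) averages the entropy of the true posterior P(y|z) over P(z).
h_y_given_z = sum(
    p_z[z] * entropy([joint[y][z] / p_z[z] for y in range(n_y)])
    for z in range(n_z)
)
mutual_info = h_y - h_y_given_z

# Right-hand side of (1).
rhs = h_z - h_z_given_y + h_y_given_z
print("H(y) =", h_y, " right-hand side of (1) =", rhs)
```

Changing the entries of `encoder` changes the individual terms but not the fact that the two printed numbers agree.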

Assuming that we can sample $z$ from the encoder distribution $P_\Theta(z|y)$, and that we can compute $P_\Theta(z|y)$ for any $y$ and $z$, the conditional entropy $H_{\mathrm{Pop},\Theta}(z|y)$ can be estimated by sampling. The same is not true of $H_{\mathrm{Pop},\Theta}(z)$ or $H_{\mathrm{Pop},\Theta}(y|z)$ because we have no way of computing $P_{\mathrm{Pop},\Theta}(z)$ or $P_{\mathrm{Pop},\Theta}(y|z)$. Entropies defined in terms of the population can, however, be upper bounded (and estimated) by cross-entropies, so we introduce two models, a prior $P_\Phi(z)$ and a decoder $P_\Psi(y|z)$, with which to define cross-entropies. $\begin{array}{rcl} \hat{H}_{\mathrm{Pop},\Theta,\Phi}(z) & = & E_{\mathrm{Pop}, \Theta}\left[-\ln P_\Phi(z)\right] \\ \\ \Phi^* & = & \mathrm{argmin}_\Phi \;\hat{H}_{\mathrm{Pop},\Theta,\Phi}(z)\;\;(2)\\ \\ \hat{H}_{\mathrm{Pop},\Theta,\Psi}(y|z) & = & E_{\mathrm{Pop}, \Theta}\left[-\ln P_\Psi(y|z)\right] \\ \\ \Psi^* & = & \mathrm{argmin}_\Psi \;\hat{H}_{\mathrm{Pop},\Theta,\Psi}(y|z)\;\;(3)\end{array}$

Inserting these two cross-entropy upper bounds (or entropy estimators) into (1) gives $H_{\mathrm{Pop}}(y) \leq \hat{H}_{\mathrm{Pop},\Theta,\Phi}(z) - H_{\mathrm{Pop},\Theta}(z|y) + \hat{H}_{\mathrm{Pop},\Theta,\Psi}(y|z). \;\;\;(4)$

The right-hand side of (4) is the standard VAE objective function in terms of the prior, the encoder and the decoder. However, this derivation of the upper bound (4) from the exact equality (1) shows that we get a consistent non-parametric estimator of $H_{\mathrm{Pop}}(y)$ by optimizing the prior and decoder according to (2) and (3) while holding the encoder fixed. This follows directly from the fact that cross entropy is a consistent non-parametric estimator of entropy in the sense that $\inf_Q\;H(P,Q) = H(P)$. Furthermore, we expect that $P_\Phi(z)$ estimates $P_{\mathrm{Pop},\Theta}(z)$ and that $P_\Psi(y|z)$ estimates $P_{\mathrm{Pop},\Theta}(y|z)$, again independent of the choice of $\Theta$.
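The consistency claim can be illustrated with unrestricted (tabular) models in a small discrete setting. The sketch below is only an illustration under assumed values: the population, the latent size and the random encoders are hypothetical. For each fixed random encoder it sets the prior and decoder to the minimizers of (2) and (3), which for tables are the marginal and conditional of the joint distribution, and checks that the right-hand side of (4) then equals $H_{\mathrm{Pop}}(y)$. A mismatched prior leaves a nonnegative gap.

```python
import math
import random

def entropy(p):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(v * math.log(v) for v in p if v > 0)

def bound(pop, enc, prior, dec):
    """Right-hand side of (4): H_hat(z) - H(z|y) + H_hat(y|z)."""
    total = 0.0
    for y, p_y in enumerate(pop):
        for z, p_zy in enumerate(enc[y]):
            w = p_y * p_zy  # joint probability of (y, z)
            total += w * (-math.log(prior[z]) + math.log(p_zy) - math.log(dec[z][y]))
    return total

def best_prior_and_decoder(pop, enc):
    """Tabular minimizers of (2) and (3) for a fixed encoder."""
    n_y, n_z = len(pop), len(enc[0])
    p_z = [sum(pop[y] * enc[y][z] for y in range(n_y)) for z in range(n_z)]
    dec = [[pop[y] * enc[y][z] / p_z[z] for y in range(n_y)] for z in range(n_z)]
    return p_z, dec

def random_dist(rng, n):
    """Random strictly positive distribution over n outcomes."""
    w = [rng.random() + 0.05 for _ in range(n)]
    s = sum(w)
    return [v / s for v in w]

rng = random.Random(0)
pop = [0.5, 0.3, 0.2]   # hypothetical population over y
h_y = entropy(pop)

gaps_opt, gaps_bad = [], []
for _ in range(5):
    enc = [random_dist(rng, 4) for _ in pop]        # arbitrary fixed encoder
    prior, dec = best_prior_and_decoder(pop, enc)
    gaps_opt.append(bound(pop, enc, prior, dec) - h_y)
    bad_prior = random_dist(rng, 4)                 # prior not trained by (2)
    gaps_bad.append(bound(pop, enc, bad_prior, dec) - h_y)

print("gap with optimal prior and decoder:", gaps_opt)
print("gap with a mismatched prior:", gaps_bad)
```

The gap with a mismatched prior is the KL divergence from $P_{\mathrm{Pop},\Theta}(z)$ to that prior, which is why it can never be negative.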

This observation gives us freedom in designing latent variable objective functions that produce useful or interpretable latent variables. We can train the prior and decoder by (2) and (3) and train the encoder by any choice of encoder objective function. For example, a natural choice might be $\begin{array}{rcl} \Theta^* & = & \mathrm{argmin}_\Theta\; \hat{I}_{\mathrm{Pop},\Theta,\Phi}(z,y) + \lambda \hat{H}_{\mathrm{Pop},\Theta,\Psi}(y|z) \;\;(5)\\ \\ \hat{I}_{\mathrm{Pop},\Theta,\Phi}(z,y) & = & \hat{H}_{\mathrm{Pop},\Theta,\Phi}(z) - H_{\mathrm{Pop},\Theta}(z|y)\end{array}$

The weight $\lambda$ in (5) can be interpreted as providing a rate-distortion trade-off where the mutual information (upper bound) expresses the channel capacity (information rate) of $z$ as a communication channel for the message $y$. This is exactly the $\beta$-VAE, which weights the rate by $\beta$ rather than the distortion by $\lambda$.
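To spell out the correspondence, here is a short worked step. It assumes $\lambda > 0$ and writes the $\beta$-VAE encoder objective in the notation of this post, with the KL term $E_{\mathrm{Pop}}\,KL(P_\Theta(z|y), P_\Phi(z))$ equal to $\hat{I}_{\mathrm{Pop},\Theta,\Phi}(z,y)$. Dividing the objective in (5) by $\lambda$ does not change the minimizer, so $\begin{array}{rcl} \Theta^* & = & \mathrm{argmin}_\Theta\; \hat{I}_{\mathrm{Pop},\Theta,\Phi}(z,y) + \lambda \hat{H}_{\mathrm{Pop},\Theta,\Psi}(y|z) \\ \\ & = & \mathrm{argmin}_\Theta\; \hat{H}_{\mathrm{Pop},\Theta,\Psi}(y|z) + \beta\, \hat{I}_{\mathrm{Pop},\Theta,\Phi}(z,y), \;\;\;\beta = 1/\lambda \end{array}$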

However, there can be different encoders achieving the same rate and distortion. The consistency of (4) independent of $\Theta$ allows additional desiderata to be placed on the encoder. For example, we might want $z$ to be a sequence $z_1,\ldots,z_k$ with the $z_i$ independent and the mutual information with $y$ evenly balanced among the $z_i$, yielding a VAE similar to an InfoGAN.

Here we are designing different objectives for different model components. The objectives defined by (2) and (3) for $\Phi$ and $\Psi$ are intended to be independent of any designed objective for $\Theta$, and the objective for $\Theta$ can be designed independently of (2) and (3). Multiple objective functions yield a multiplayer game with Nash equilibria. In practice we will need to insert stop gradients to prevent, for example, the objective for (player) $\Theta$ from interfering with the objective for (player) $\Phi$ and vice versa.
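As a rough sketch of this multiplayer structure, the pure-Python toy below gives each player its own objective and differentiates that objective only with respect to the player's own parameters, which is the effect stop gradients have in an autodiff framework. Everything concrete here is an assumption for illustration: the tabular softmax models, the three-value population, $\lambda = 2$, the finite-difference gradients and the step size. The encoder objective is (5), but any other encoder objective could be substituted for `encoder_loss`.

```python
import math

POP = [0.5, 0.3, 0.2]   # hypothetical population over three values of y

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def terms(theta, phi, psi):
    """Return (H_hat(z), H(z|y), H_hat(y|z)) for tabular softmax models."""
    enc = [softmax(r) for r in theta]   # P_Theta(z|y), one row per y
    prior = softmax(phi[0])             # P_Phi(z)
    dec = [softmax(r) for r in psi]     # P_Psi(y|z), one row per z
    h_z = h_zy = h_yz = 0.0
    for y, p_y in enumerate(POP):
        for z, p_zy in enumerate(enc[y]):
            w = p_y * p_zy              # joint probability of (y, z)
            h_z -= w * math.log(prior[z])
            h_zy -= w * math.log(p_zy)
            h_yz -= w * math.log(dec[z][y])
    return h_z, h_zy, h_yz

def prior_loss(theta, phi, psi):        # objective (2) for player Phi
    return terms(theta, phi, psi)[0]

def decoder_loss(theta, phi, psi):      # objective (3) for player Psi
    return terms(theta, phi, psi)[2]

def encoder_loss(theta, phi, psi, lam=2.0):   # objective (5) for player Theta
    h_z, h_zy, h_yz = terms(theta, phi, psi)
    return (h_z - h_zy) + lam * h_yz

def own_grad(loss, players, who, eps=1e-5):
    """Central-difference gradient of `loss` w.r.t. players[who] only.

    The other players' parameters are held fixed, mimicking a stop gradient.
    """
    grad = []
    for row in players[who]:
        g_row = []
        for j in range(len(row)):
            old = row[j]
            row[j] = old + eps
            up = loss(*players)
            row[j] = old - eps
            down = loss(*players)
            row[j] = old
            g_row.append((up - down) / (2 * eps))
        grad.append(g_row)
    return grad

def train(steps=500, lr=0.5):
    theta = [[0.3, -0.2], [-0.1, 0.4], [0.2, 0.1]]   # encoder logits
    phi = [[0.0, 0.0]]                               # prior logits
    psi = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]         # decoder logits
    players = [theta, phi, psi]
    losses = [encoder_loss, prior_loss, decoder_loss]
    for _ in range(steps):
        # Simultaneous update: every player follows only its own objective.
        grads = [own_grad(losses[k], players, k) for k in range(3)]
        for k in range(3):
            for row, g_row in zip(players[k], grads[k]):
                for j in range(len(row)):
                    row[j] -= lr * g_row[j]
    return theta, phi, psi

theta, phi, psi = train()
h_z, h_zy, h_yz = terms(theta, phi, psi)
bound = h_z - h_zy + h_yz                 # right-hand side of (4)
h_y = -sum(p * math.log(p) for p in POP)
print("H(y) =", h_y, " bound (4) =", bound)
```

Whatever the encoder player does, the printed bound stays at or above $H_{\mathrm{Pop}}(y)$, and the gap is driven only by how well the prior and decoder players have minimized (2) and (3).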

The bottom line is that we can select any objective for the encoder while preserving non-parametric consistency of the VAE as a model of the observed data.
