## Thoughts from TTIC31230: Rate-Distortion Metrics for GANs.

Throughout my career I have believed that progress in AI arises from objective evaluation metrics.  When I first learned of GANs I was immediately skeptical because of the apparent lack of meaningful metrics.  While GANs have made enormous progress in generating realistic images, the problem of metrics remains (see Borji 2018).  I suspect that the lack of meaningful metrics is related to a failure of GANs to play an important role in unsupervised pre-training (feature learning) for discriminative applications.  This is in stark contrast to language models (ELMo, BERT and GPT-2) where simple cross-entropy loss has proved extremely effective for pre-training.

Cross-entropy loss is typically disregarded for GANs in spite of the fact that it is the de-facto metric for modeling distributions and in spite of its success in pre-training for NLP tasks. In this post I argue that rate-distortion metrics — a close relative of cross-entropy loss — should be a major component of GAN evaluation (in addition to discrimination loss).  Furthermore, evaluating GANs by rate-distortion metrics leads to a conceptual unification of GANs, VAEs and signal compression.  This unification is already emerging from image compression applications of GANs such as in the work of Agustsson et al 2018. The following figure by Julien Despois can be interpreted in terms of VAEs, signal compression, or GANs.

*Figure (Julien Despois): an autoencoder pipeline.  The input $y$ is encoded as $z = z_\Psi(y) + \epsilon$, the code $z$ is modeled by a prior $p_\Phi(z)$, and the decoder output $\hat{y}_\Theta(z)$ incurs distortion $||y - \hat{y}_\Theta(z)||^2$.*

The VAE interpretation is defined by

$\Phi^*, \Psi^*,\Theta^* = \mathrm{argmin}_{\Phi,\Psi,\Theta}\;E_{y\sim \mathrm{Pop}}\;\left(\;KL(p_\Psi(z|y),p_\Phi(z)) + E_{z \sim p_\Psi(z|y)}\; -\ln p_\Theta(y|z)\right)\;\;\;\;\;\;(1)$

Now define $\Psi^*(\Phi,\Theta)$ as the minimum of (1) over $\Psi$ while holding $\Phi$ and $\Theta$ fixed. Using this to express the objective as a function of $\Phi$ and $\Theta$, and assuming universal expressiveness of $p_\Psi(z|y)$, the standard ELBO analysis shows that (1) reduces to minimizing the cross-entropy loss of $p_{\Phi,\Theta}(y) =\int dz\;\;p_\Phi(z)p_\Theta(y|z)$.

$\Phi^*,\Theta^* = \mathrm{argmin}_{\Phi,\Theta}\; E_{y \sim \mathrm{Pop}}\;-\ln p_{\Phi,\Theta}(y)\;\;\;\;\;\;(2)$
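The ELBO analysis behind this reduction can be checked numerically.  The following sketch uses a tiny discrete model (the numbers are hypothetical, chosen only for illustration): for any encoder $q(z|y)$ the inner term of (1) upper-bounds $-\ln p_{\Phi,\Theta}(y)$, and the exact Bayes posterior achieves equality.

```python
import math

# Toy discrete model: two latent values, two observations.
# The probabilities below are hypothetical illustration values.
p_z = {0: 0.6, 1: 0.4}                       # prior p_Phi(z)
p_y_given_z = {0: {0: 0.9, 1: 0.1},           # likelihood p_Theta(y|z)
               1: {0: 0.2, 1: 0.8}}

def marginal(y):
    """p_{Phi,Theta}(y) = sum_z p_Phi(z) p_Theta(y|z)."""
    return sum(p_z[z] * p_y_given_z[z][y] for z in p_z)

def elbo_loss(q, y):
    """The inner term of (1): KL(q(z|y) || p(z)) + E_q[-ln p(y|z)]."""
    kl = sum(q[z] * math.log(q[z] / p_z[z]) for z in q if q[z] > 0)
    rec = sum(q[z] * -math.log(p_y_given_z[z][y]) for z in q if q[z] > 0)
    return kl + rec

y = 0
cross_entropy = -math.log(marginal(y))

# Exact posterior p(z|y) by Bayes' rule.
post = {z: p_z[z] * p_y_given_z[z][y] / marginal(y) for z in p_z}

# Any q upper-bounds -ln p(y); the true posterior achieves equality,
# so minimizing (1) over an expressive encoder recovers (2).
assert elbo_loss({0: 0.5, 1: 0.5}, y) > cross_entropy
assert abs(elbo_loss(post, y) - cross_entropy) < 1e-12
```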

It should be noted, however, that differential entropies and cross-entropies suffer from the following conceptual difficulties.

1. The numerical value of entropy and cross entropy depends on an arbitrary choice of units. For a distribution on lengths, probability per inch is numerically very different from probability per mile.
2. Shannon’s source coding theorem fails for continuous densities — it takes an infinite number of bits to specify a single real number.
3. The data processing inequality fails for differential entropy — $x/2$ has a different differential entropy than $x$.
4. Differential entropies can be negative.

For continuous data we can replace the differential cross-entropy objective with a more conceptually meaningful rate-distortion objective. Independent of conceptual objections to differential entropy, a rate-distortion objective allows for greater control of the model through a rate-distortion tradeoff parameter $\beta$ as is done in $\beta$-VAEs (Higgins et al. 2017; Alemi et al. 2017).  A special case of a $\beta$-VAE is defined by

$\Phi^*, \Psi^*,\Theta^* = \mathrm{argmin}_{\Phi,\Psi,\Theta}\;E_{y\sim \mathrm{Pop}}\;\left(\beta \;KL(p_\Psi(z|y),p_\Phi(z)) + E_{z \sim p_\Psi(z|y)}\;||y - \hat{y}_\Theta(z)||^2\right)\;\;\;\;(3)$

The VAE optimization (1) can be transformed into the rate-distortion equation (3) by taking

$p_\Theta(y|z) = \frac{1}{Z(\sigma)} \exp\left(\frac{-||y-\hat{y}_\Theta(z)||^2}{2\sigma^2}\right)$

and taking $\sigma$ to be a fixed constant.   In this case (1) transforms into (3) with $\beta = 2\sigma^2$.  Distortion measures such as L1 and L2 preserve the units of the signal and are more conceptually meaningful than differential cross-entropy.  But see the comments below on other obvious issues with L1 and L2 distortion measures. KL-divergence is defined in terms of a ratio of probability densities and, unlike differential entropy, is conceptually well-formed.
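The $\beta = 2\sigma^2$ correspondence is a one-line calculation that can be verified numerically.  With the Gaussian observation model, $-\ln p_\Theta(y|z)$ is the squared distortion scaled by $1/(2\sigma^2)$ plus a constant, so multiplying objective (1) by $2\sigma^2$ yields objective (3) up to an additive constant.  A 1-D sketch with hypothetical numbers:

```python
import math

sigma = 0.3
Z = math.sqrt(2 * math.pi) * sigma   # normalizer of a 1-D Gaussian

def neg_log_lik(y, y_hat):
    """-ln p_Theta(y|z) for p_Theta(y|z) = N(y; y_hat, sigma^2)."""
    return (y - y_hat)**2 / (2 * sigma**2) + math.log(Z)

# Hypothetical values for one data point: reconstruction and KL term.
y, y_hat, kl = 1.7, 1.2, 0.8
vae_objective = kl + neg_log_lik(y, y_hat)          # inner term of (1)
beta = 2 * sigma**2
rate_distortion = beta * kl + (y - y_hat)**2        # inner term of (3)

# Scaling (1) by 2*sigma^2 gives (3) plus the constant beta*ln Z(sigma).
assert abs(beta * vae_objective - rate_distortion - beta * math.log(Z)) < 1e-12
```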

Equation (3) leads to the signal compression interpretation of the figure above. It turns out that the KL term in (3) can be interpreted as a compression rate.  Let $\Phi^*(\Psi,\Theta)$ be the optimum $\Phi$ in (3) for a fixed value of $\Psi$ and $\Theta$. Assuming universality of $p_\Phi(z)$, the resulting optimization of $\Psi$ and $\Theta$ becomes the following where $p_\Psi(z) = \int dy \;\mathrm{Pop}(y)p_\Psi(z|y)$

$\Psi^*,\Theta^* = \mathrm{argmin}_{\Psi,\Theta}\;\beta\, E_{y \sim \mathrm{Pop}}\; KL(p_\Psi(z|y), \;p_\Psi(z)) + E_{y \sim \mathrm{Pop}}\,E_{z \sim p_\Psi(z|y)}\;||y - \hat{y}_\Theta(z)||^2\;\;\;\;\;\;(4)$

The KL term can now be written as a mutual information between $y$ and $z$.

$E_{y \sim \mathrm{Pop}} KL(p_\Psi(z|y), \;p_\Psi(z)) = H(z) - H(z|y) = I(y,z)$
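This identity holds for any joint distribution and can be checked on a small discrete example (the joint probabilities below are hypothetical):

```python
import math

# Hypothetical joint distribution p(y, z) on a 2x2 space.
p_joint = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}
p_y = {y: sum(v for (yy, z), v in p_joint.items() if yy == y) for y in (0, 1)}
p_z = {z: sum(v for (y, zz), v in p_joint.items() if zz == z) for z in (0, 1)}

# Left side: E_y KL( p(z|y) || p(z) ).
avg_kl = 0.0
for y in (0, 1):
    for z in (0, 1):
        p_z_given_y = p_joint[(y, z)] / p_y[y]
        avg_kl += p_y[y] * p_z_given_y * math.log(p_z_given_y / p_z[z])

# Right side: I(y, z) = sum_{y,z} p(y,z) ln( p(y,z) / (p(y) p(z)) ).
mi = sum(v * math.log(v / (p_y[y] * p_z[z])) for (y, z), v in p_joint.items())

assert abs(avg_kl - mi) < 1e-12
```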

Hence (4) can be rewritten as

$\Psi^*,\Theta^* = \mathrm{argmin}_{\Psi,\Theta}\;\beta\, I(y,z) + E_{y,z}\;||y - \hat{y}_\Theta(z)||^2\;\;\;\;\;\;(5)$

A more explicit derivation can be found in slides 17 through 21 in my lecture slides on rate-distortion auto-encoders.

By Shannon’s channel capacity theorem, the mutual information $I(y,z)$ is the number of bits transmitted through a noisy channel from $y$ to $z$ — it is the number of bits from $y$ that reach the decoder $\hat{y}_\Theta(z)$.  In the figure $p_\Psi(z|y)$ is defined by the equation $z = z_\Psi(y) + \epsilon$ for some fixed noise distribution on $\epsilon$.  Adding noise can be viewed as limiting precision. For standard data compression, where $z$ must be a compressed file with a definite number of bits, the equation $z = z_\Psi(y) + \epsilon$ can be interpreted as a rounding operation that rounds $z_\Psi(y)$ to integer coordinates.  See Agustsson et al 2018.

We have now unified VAEs with data-compression rate-distortion models.  To unify these with GANs we can take $\Phi$ and $\Theta$ to be the generator of a GAN.  We can train the GAN generator $(\Phi,\Theta)$ in the traditional way using only adversarial discrimination loss and then measure a rate-distortion metric by training $\Psi$ to minimize (3) while holding $\Phi$ and $\Theta$ fixed.  Alternatively, we can add a discrimination loss to (3) based on the discrimination between $y$ and $\hat{y}_\Theta(z)$ and train all the parameters together. It seems intuitively clear that a low rate-distortion value on test data indicates an absence of mode collapse — it indicates that the model can efficiently represent novel images drawn from the population.  Ideally, the rate-distortion metric should not increase much as we add weight to a discrimination loss.
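The proposed evaluation can be sketched in one dimension.  Everything below is a hypothetical toy instance, not a real GAN: the "generator" is a fixed linear decoder $\hat{y} = cz$ with standard normal prior, and only the Gaussian encoder $q_\Psi(z|y) = N(ay + b, s^2)$ is varied.  A well-matched encoder achieves a much lower value of objective (3) than an uninformative (collapsed) encoding, which is the sense in which the metric detects mode collapse.

```python
import math
import random

random.seed(1)

c = 2.0       # fixed "generator" weight: decoder y_hat = c * z
beta = 0.1
data = [random.gauss(0.0, 2.0) for _ in range(2000)]   # toy population

def kl_to_std_normal(mu, s):
    """KL( N(mu, s^2) || N(0, 1) ) in closed form."""
    return 0.5 * (s**2 + mu**2 - 1.0) - math.log(s)

def objective(a, b, s):
    """Monte Carlo estimate of (3) for encoder q(z|y) = N(a*y + b, s^2)."""
    total = 0.0
    for y in data:
        z = a * y + b + s * random.gauss(0.0, 1.0)   # sample z ~ q(z|y)
        total += beta * kl_to_std_normal(a * y + b, s) + (y - c * z)**2
    return total / len(data)

# An encoder that inverts the decoder (a near 1/c) scores far better
# than one that ignores y entirely -- the signature of mode collapse.
good = objective(0.5, 0.0, 0.1)
collapsed = objective(0.0, 0.0, 0.1)
assert good < collapsed
```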

A standard objection to L1 or L2 distortion measures is that they do not represent “perceptual distortion” — the degree of difference between two images as perceived by a human observer.  One interpretation of perceptual distortion is that two images are perceptually similar if they are both “natural” and carry “the same information”.  In defining what we mean by the same information we might invoke predictive coding or the information bottleneck method. The basic idea is to find an image representation that achieves compression while preserving mutual information with other (perhaps future) images.  This can be viewed as an information theoretic separation of “signal” from “noise”.  When we define the information in an image we should be disregarding noise.  So while it is nice to have a unification of GANs, VAEs and signal compression, it would seem better to have a theoretical framework providing a distinction between signal and noise. Ultimately we would like a rate-utility metric for perceptual representations.
