I just read VQ-VAE-2 (Vector-Quantized Variational AutoEncoders 2) by Razavi et al. This paper gives very nice results on modeling image distributions with vector quantization. It dominates BigGANs under classification accuracy score (CAS) for class-conditional ImageNet image generation. For reasons listed below I believe that vector quantization is inevitable in deep architectures. This paper convinces me that the time of VQ-VAEs has arrived.

Hinton’s grand vision of AI has always been that there are simple general principles of learning, analogous to the Navier-Stokes equations of fluid flow, from which complex general intelligence emerges. I think Hinton underestimates the complexity required for a general learning mechanism, but I agree that we are searching for some general (i.e., minimal-bias) architecture. For the following reasons I believe that vector quantization is an inevitable component of the architecture we seek.

**A better learning bias.** Do the objects of reality fall into categories? If so, shouldn’t a learning architecture be designed to categorize? A standard theory of language learning is that the child learns to recognize certain things, like mommy and doggies, and then later attaches these learned categories to the words of language. It seems natural to assume that categorization precedes language in both development and evolution. The objects of reality do fall into categories and every animal must identify potential mates, edible objects, and dangerous predators.

It is not clear that the vector quanta used in VQ-VAE-2 correspond to meaningful categories. It is true, however, that the only meaningful distribution models of ImageNet images are class-conditional. A VQ-VAE with a single vector quantum for the image as a whole at least has the potential to allow class-conditioning to emerge from the data.

**Interpretability.** Vector quantization shifts the interpretability question from that of interpreting linear threshold units to that of interpreting emergent symbols: the embedded tokens that are the emergent vector quanta. A VQ-VAE with a vector-quantized full image representation would cry out for a class interpretation for the emergent image symbol.

A fundamental issue is whether the vectors being quantized actually fall into natural discrete clusters. The presence of such clusters is often examined by visualizing the vectors with t-SNE. But if vectors naturally fall into clusters then it seems that our models should seek and utilize that clustering. Interpretation can then focus on the meaning of the emergent symbols.

**The rate-distortion training objective.** VQ-VAEs support rate-distortion training as discussed in my previous blog post on rate-distortion metrics for GANs. I have always been skeptical of GANs because of the lack of meaningful performance metrics for generative models lacking an encoder. While the new CAS metric does seem more meaningful than previous metrics, I still feel that training on cross-entropy loss (negative log likelihood) should ultimately be more effective than adversarial training. Rate-distortion metrics assume a discrete compressed representation defining a “rate” (the size of a compressed image file) and some measure of distortion between the original image and its reconstruction from the compressed file. The rate is measured by negative log-likelihood (a kind of cross-entropy loss).

Rate-distortion training also differs from the differential cross-entropy training used in flow networks (invertible generative networks such as Glow). Differential cross-entropy can be unboundedly negative. To avoid minimizing an unboundedly negative quantity, when training on differential cross-entropy one must introduce a scale parameter for the real numbers in the output (the parameter “a” under equation (2) in the Glow paper). This scale parameter effectively models the output numbers as discrete integers such as the bytes in the color channels of an image. The unboundedly negative differential cross-entropy then becomes a non-negative discrete cross-entropy. However, this treatment of differential cross-entropy still fails to support a rate-distortion tradeoff parameter.
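To make the rate-distortion objective concrete, here is a minimal sketch in plain Python. It is not the training code of VQ-VAE-2. It assumes an independent prior over symbols (the paper uses an autoregressive prior), mean squared error as the distortion measure, and a hypothetical tradeoff weight `lam`. All names and numbers are made up for illustration.

```python
import math

def rate_bits(symbols, prior):
    """Rate: negative log-likelihood of the symbol sequence under `prior`, in bits."""
    return -sum(math.log2(prior[s]) for s in symbols)

def distortion_mse(original, reconstruction):
    """Distortion: mean squared error between the original and its reconstruction."""
    return sum((x - y) ** 2 for x, y in zip(original, reconstruction)) / len(original)

def rate_distortion_loss(symbols, prior, original, reconstruction, lam):
    """Weighted objective: rate + lam * distortion (lam is the tradeoff parameter)."""
    return rate_bits(symbols, prior) + lam * distortion_mse(original, reconstruction)

# Hypothetical toy values: a 3-symbol prior and a 4-symbol "compressed file".
prior = {0: 0.5, 1: 0.25, 2: 0.25}
symbols = [0, 0, 1, 2]
original = [0.0, 1.0, 2.0, 3.0]
reconstruction = [0.0, 1.0, 2.0, 2.0]

rate = rate_bits(symbols, prior)                 # 1 + 1 + 2 + 2 = 6 bits
dist = distortion_mse(original, reconstruction)  # (3 - 2)^2 / 4 = 0.25
loss = rate_distortion_loss(symbols, prior, original, reconstruction, lam=4.0)
print(rate, dist, loss)  # 6.0 0.25 7.0
```

Because the symbols are discrete, the rate is a sum of non-negative terms and cannot run off to negative infinity. Raising `lam` favors fidelity and lowering it favors shorter codes.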

**Unifying vision and language architectures.** A fundamental difference between language models and image models involves word vectors. Language models use embedded symbols where the symbols are manifest in the data. The vast majority of image models do not involve embedded symbols. Vector quantization is a way for embedded symbols to emerge from continuous signal data.
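As a rough illustration of how symbols can emerge from continuous data, the sketch below maps each continuous vector to the index of its nearest codebook entry under squared Euclidean distance. The codebook values and the "encoder outputs" are hypothetical; in a real VQ-VAE both are learned.

```python
def quantize(vector, codebook):
    """Return (index, code) for the codebook entry nearest to `vector`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    index = min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))
    return index, codebook[index]

# Hypothetical 2-D codebook with three entries (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# Continuous vectors, standing in for encoder outputs at four image positions.
encoder_outputs = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7], [0.2, 0.1]]

# The discrete symbol sequence is the list of selected codebook indices.
symbols = [quantize(v, codebook)[0] for v in encoder_outputs]
print(symbols)  # [0, 1, 2, 0]
```

The resulting index sequence plays the same role as a sequence of word tokens, and the codebook plays the role of the word embedding table.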

**Preserving parameter values under retraining.** When we learn to ski we do not forget how to ride a bicycle. However, when a deep model is trained on a first task (riding a bicycle) and then on a second task (skiing), optimizing the parameters for the second task degrades the performance on the first. But note that when a language model is trained on a new topic, the embeddings of words not used in the new topic will not change. Similarly, a model based on vector quanta will not change the vectors for the bicycle control symbols that are not invoked when training on skiing.
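The point about untouched symbols can be seen in a toy codebook update. This is only a sketch under simplifying assumptions: each training vector moves its nearest code toward itself by a hypothetical step size `lr`, and the "bicycle" and "skiing" labels are made up. It says nothing about shared encoder or decoder weights, which would still change.

```python
def nearest(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))

def train_step(vector, codebook, lr=0.5):
    """Move only the selected code toward the input; all other codes are untouched."""
    k = nearest(vector, codebook)
    codebook[k] = [c + lr * (x - c) for c, x in zip(codebook[k], vector)]
    return k

# Hypothetical codebook: code 0 was learned for "bicycle", code 1 for "skiing".
codebook = [[0.0, 0.0], [10.0, 10.0]]
bicycle_code_before = list(codebook[0])

# Training data drawn only from the new task.
skiing_data = [[9.0, 11.0], [10.5, 9.5], [11.0, 10.0]]
used = {train_step(v, codebook) for v in skiing_data}

print(used)                                # {1}
print(codebook[0] == bicycle_code_before)  # True
```

Only the code selected by the skiing data moves; the bicycle code keeps exactly the value it had before.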

**Improved transfer learning.** Transfer learning and few-shot learning (meta-learning) may be better supported with embedded symbols for the same reason that embedded symbols reduce forgetting: embedded symbols can be class- or task-specific. The adaptation to a new domain can be restricted to the symbols that arise (under some non-parametric nearest neighbor scheme) in the new domain, class or task.

**Emergent symbolic representations.** The historical shift from symbolic logic-based representations to distributed vector representations is typically viewed as one of the cornerstones of the deep learning revolution. The dramatic success of distributed (vector) representations in a wide variety of applications cannot be disputed. But it also seems true that mathematics is essential to science. I personally believe that logical symbolic reasoning is necessary for AGI. Vector quantization seems to be a minimal-bias way for symbols to enter into deep models.