I just read VQ-VAE-2 (Vector Quantized Variational AutoEncoders 2) by Razavi et al. This paper gives very nice results on modeling image distributions with vector quantization. It dominates BigGANs under the classification accuracy score (CAS) for class-conditional ImageNet image generation. For reasons listed below I believe that vector quantization is inevitable in deep architectures. This paper convinces me that the time of VQ-VAEs has arrived.

Hinton’s grand vision of AI has always been that there are simple general principles of learning, analogous to the Navier-Stokes equations of fluid flow, from which complex general intelligence emerges. I think Hinton under-estimates the complexity required for a general learning mechanism, but I agree that we are searching for some general (i.e., minimal-bias) architecture. For the following reasons I believe that vector quantization is an inevitable component of the architecture we seek.

**A better learning bias.** Do the objects of reality fall into categories? If so, shouldn’t a learning architecture be designed to categorize? A standard theory of language learning is that the child learns to recognize certain things, like mommy and doggies, and then later attaches these learned categories to the words of language. It seems natural to assume that categorization precedes language in both development and evolution. The objects of reality do fall into categories and every animal must identify potential mates, edible objects, and dangerous predators.

It is not clear that the vector quanta used in VQ-VAE-2 correspond to meaningful categories. It is true, however, that the only meaningful distribution models of ImageNet images are class-conditional. A VQ-VAE with a vector quantum for the image as a whole at least has the potential to allow class-conditioning to emerge from the data.
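The quantization step these models rely on can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: each encoder output vector is snapped to its nearest entry in a learned codebook, yielding a discrete symbol id together with that symbol's embedding. The function name `quantize` and the shapes are my own assumptions.

```python
import numpy as np

def quantize(z, codebook):
    """z: (n, d) encoder output vectors; codebook: (k, d) code vectors."""
    # Squared distance from every output vector to every code vector.
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)        # emergent discrete symbols
    return idx, codebook[idx]      # symbol ids and their embeddings

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # 8 symbols, 4-dimensional embeddings
z = rng.normal(size=(5, 4))          # 5 encoder outputs
ids, z_q = quantize(z, codebook)
print(ids.shape, z_q.shape)          # (5,) (5, 4)
```

The discrete `ids` are what makes the representation symbolic; the continuous `z_q` is what the decoder actually consumes.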

**Interpretability.** Vector quantization shifts the interpretability question from that of interpreting linear threshold units to that of interpreting emergent symbols — the embedded tokens that are the emergent vector quanta. A VQ-VAE with a vector-quantized full image representation would cry out for a class interpretation for the emergent image symbol.

A fundamental issue is whether the vectors being quantized actually fall into natural discrete clusters. This form of interpretation is often done with t-SNE. But if vectors naturally fall into clusters then it seems that our models should seek and utilize that clustering. Interpretation can then focus on the meaning of the emergent symbols.

**The rate-distortion training objective.** VQ-VAEs support rate-distortion training as discussed in my previous blog post on rate-distortion metrics for GANs. I have always been skeptical of GANs because of the lack of meaningful performance metrics for generative models lacking an encoder. While the new CAS metric does seem more meaningful than previous metrics, I still feel that training on cross-entropy loss (negative log-likelihood) should ultimately be more effective than adversarial training. Rate-distortion metrics assume a discrete compressed representation defining a “rate” (the size of a compressed image file) and some measure of distortion between the original image and its reconstruction from the compressed file. The rate is measured by negative log-likelihood (a kind of cross-entropy loss).

Rate-distortion training is also different from differential cross-entropy training as used in flow networks (invertible generative networks such as GLOW). Differential cross-entropy can be unboundedly negative. To avoid minimizing an unboundedly negative quantity, when training on differential cross-entropy one must introduce a scale parameter for the real numbers in the output (the parameter “a” under equation (2) in the GLOW paper). This scale parameter effectively models the output numbers as discrete integers, such as the bytes in the color channels of an image. The unboundedly negative differential cross-entropy then becomes a non-negative discrete cross-entropy. However, this treatment of differential cross-entropy still fails to support a rate-distortion tradeoff parameter.

**Unifying vision and language architectures.** A fundamental difference between language models and image models involves word vectors. Language models use embedded symbols where the symbols are manifest in the data. The vast majority of image models do not involve embedded symbols. Vector quantization is a way for embedded symbols to emerge from continuous signal data.

**Preserving parameter values under retraining.** When we learn to ski we do not forget how to ride a bicycle. However, when a deep model is trained on a first task (riding a bicycle) and then on a second task (skiing), optimizing the parameters for the second task degrades the performance on the first. But note that when a language model is trained on a new topic, the word embeddings of words not used in the new topic do not change. Similarly, a model based on vector quanta will not change the vectors for the bicycle control symbols not invoked when training on skiing.
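This resistance to forgetting can be demonstrated in a toy example: a training step on an embedding table only touches the rows indexed by the batch, so symbols unused by the new task keep their vectors exactly. The table, the index set, and the constant "gradient" below are all illustrative assumptions.

```python
import numpy as np

emb = np.ones((6, 3))            # embedding table: 6 symbols, 3 dims
used = np.array([1, 4])          # symbols invoked by the new task

# Gradient of an embedding lookup is nonzero only for the used rows.
grad = np.zeros_like(emb)
grad[used] = 0.1                 # toy gradient on the used rows
emb -= grad                      # one SGD step

print((emb[used] != 1.0).all())                      # used rows changed
print((np.delete(emb, used, axis=0) == 1.0).all())   # others untouched
```

The same structure holds for a codebook of vector quanta: training on skiing leaves the bicycle symbols' vectors bit-for-bit intact.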

**Improved transfer learning.** Transfer learning and few-shot learning (meta-learning) may be better supported with embedded symbols for the same reason that embedded symbols reduce forgetting — embedded symbols can be class or task specific. The adaptation to a new domain can be restricted to the symbols that arise (under some non-parametric nearest neighbor scheme) in the new domain, class or task.

**Emergent symbolic representations.** The historical shift from symbolic logic-based representations to distributed vector representations is typically viewed as one of the cornerstones of the deep learning revolution. The dramatic success of distributed (vector) representations in a wide variety of applications cannot be disputed. But it also seems true that mathematics is essential to science. I personally believe that logical symbolic reasoning is necessary for AGI. Vector quantization seems to be a minimal-bias way for symbols to enter into deep models.

an interpretation I wasn’t aware of when I first learned about the new generative model .. (all I learned was that it achieved higher metrics than BigGAN)

Thanks this is a perspective new to me.


I am led here through Vinyals’s tweet. Very inspiring interpretation of the VQ-VAE work!

I recently thought about language and

Language as a means to give names to things

data (including programs)..

external to the NN.. storage .. memory

keys in a hashmap or

building tools

sync with other agents

build knowledge base, automated research in math, auto-formalization .. acquired by running algorithm need to be verified formally .. also auto-programming, program synthesis..

discrete/digitized .. subject to encoding ..lossless communication.. error-correcting

condensed, compressed info, constant .. always ongoing synchronization process ..

if I understand correctly VQ-VAE incorporates a means (or inductive bias) of naming things, instead of relying on e.g. RL to optimize the process of naming.

recorded text .. can be deciphered .. fixture of knowledge/stored description of procedure

Can NN invent symbolic logic?

Can NN invent programming languages?

do you really need to name things in order to do math or to program?

definitely need a token for theorems .. apply a theorem, otherwise using the whole statement is too long ..? the point is that the NN only needs to remember the theorem; let the pattern matching to the exact theorem statement be handled by a mechanical procedure

chained memory tree? Knowing the rules of language allows one to traverse/access an exponential-sized storage?