## GPT-3 Semantic Parsing with Simple Sentence Graphs

For a couple years now Michael Douglas and I have been contemplating a meaning representation — simple sentence graphs — based on breaking sentences down into simple sentence components. Over the last week or so we have been playing with chain-of-thought prompting to get GPT-3 to do semantic parsing to generate simple sentence graphs. An example with just a single verb is

We are using a chain-of-thought prompt consisting of seven examples of simple sentence parses. The training parses do not contain the word “jealous”. We follow the training parses with the input sentence and a request for a simple sentence parse for the phrase headed by “being jealous”. GPT-3 then generates a semantic parse consisting of a generic “lexical entry” together with an “episodic instantiation”. The lexical entry consists of a simple sentence specifying roles as parenthesized generic definite noun phrases. The lexical entry also contains other sentences that are likely to be true of instances of this lexical entry. The episodic instantiation specifies the particular entities and events filling the roles of the semantic lexical entry.

One can also ask GPT-3 to decide for itself what lexical entries to retrieve. For example,

when given

INPUT: Russian lawmakers Monday ratified the illegal annexation of four partially occupied Ukrainian regions.

GPT-3 generates

Retrieving lexical entries for (ratifying), (annexing), and (occupying).

(ratifying) is when (the ratifier) ratifies (the treaty);
(the treaty) is a legal document that has been signed by (the ratifier) and (the other party);
(the treaty) creates an agreement between (the ratifier) and (the other party).

(annexing) is when (the annexer) annexes (the territory);
(the territory) is under the control of (the annexer).

(occupying) is when (the occupier) occupies (the territory);
(the territory) is under the control of (the occupier).

Episodic Instantiation:
(the ratifying):
(the ratifier) is Russian lawmakers
(the treaty) is the annexation of four partially occupied Ukrainian regions
(the other party) is unknown

(the annexing):
(the annexer) is Russian lawmakers
(the territory) is four partially occupied Ukrainian regions

(the occupying):
(the occupier) is Russian lawmakers
(the territory) is four partially occupied Ukrainian regions

when given

INPUT: Rex Tillerson testified that he never asked Tom Barrack to conduct diplomacy.

GPT-3 generates

Retrieving lexical entries for (testifying), (asking to do), (conducting diplomacy).

(testifying) is when (the witness) testifies (the information) to (the court); before (the testimony), (the witness) should know (the information); before (the testimony), (the witness) is under oath; (the testimony) causes (the court) to know (the information).

(asking to do) is when (the asker) asks (the hearer) to do (the action); before (the asking), (the asker) wants (the hearer) to do (the action); (the asking) causes (the hearer) to try to do (the action).

(conducting diplomacy) is when (the diplomat) conducts diplomacy with (the other party); before (the diplomacy), (the diplomat) wants (the diplomacy) to happen; during (the diplomacy), (the diplomat) tries to resolve (the issue) with (the other party).

Episodic Instantiation:

(the testimony): (the witness) is Rex Tillerson; (the information) is that he never asked Tom Barrack to conduct diplomacy; (the court) is unknown

(the asking): (the asker) is Rex Tillerson; (the hearer) is Tom Barrack; (the action) is to conduct diplomacy.

(the diplomacy): (the diplomat) is Tom Barrack (the other party) is unknown; (the issue) is unknown

A simple sentence parse can be viewed as a graph structure (a simple sentence graph) where episodic instantiations connect (instantiate) semantic lexical entries with particular entities and events. A given entity can occur in various different instantiations.

An advantage of simple sentence parsing (SSP) is that it is largely devoid of any formal linguistic theory. At the same time SSP seems to provide compelling semantic parses. At the end of this post we give the seven training parses appearing in the chain of thought text prepended to the above inputs.

I will consider three questions raised by SSP. First, do these semantic parses contribute the current debate between Yann LeCun and Gary Markus on Nativisim vs. Empiricism for symbolic processing? Second, does the fact that GPT-3 can do this suggest that GPT-3 is already doing some form of semantic parsing? Does SSP with GPT-3 in some way open GPT-3’s black box? Finally, does SSP suggest new memory-based transformer architectures?

Nativism vs. Empiricism. The issue here is whether we have to “bake in” some mechanism explicitly designed to handle symbolic reasoning. Does progress in AI require, as Marcus believes, something like the innate language faculty assumed by many linguists? Or, as LeCun believes, can symbolic reasoning emerge from deep learning alone without any prior bias? GPT-3’s ability to do semantic parsing would seem to support LeCun’s position. However GPT-3’s ability to do SSP, while good, is not perfect. Furthermore, GPT-3 training seems extremely data inefficient. Baking in symbolic reasoning might greatly reduce the data requirements. So I think the question remains open. My intuition is to agree with Marcus. However, I expect that the “language faculty” will turn out to be a deep learning architecture inspired by symbolic reasoning — a neuro-symbolic compromise between nativism and empiricism.

Does SSP open the back box of GPT-3? It is very tempting to view the above examples of semantic parses as a kind of demonstration of GPT-3’s understanding of words and how that understanding is applied to sentences. But this could easily be an illusion — an overly anthropomorphic view of GPT-3. On the other hand it seems intuitively plausible that factoring knowledge into semantic knowledge and episodic instantiations would reduce the number of parameters needed to store the knowledge. The semantic knowledge can be shared across many episodic instances. It would be somewhat surprising if GPT-3 was not doing some form of semantic-episodic factoring. Lexical semantics could be nothing more than word (or token) embeddings. GPT-3’s token embeddings have over 12 thousand dimensions — enough to store fairly elaborate lexical information.

Might SSP suggest new deep learning architectures? This is of course a difficult question. One idea is to try to design memory architectures around the separation of “semantic” and “episodic” memory. A semantic memory would hold lexical entries and facilitate few-shot learning of new words. The episodic memory would hold episodic instantiations of lexical entries. In both cases the thing remembered is the output of language processing rather than raw input text as in existing memory architectures. My intuition is that the set of memories should be a discrete set (symbols?) but that the contents of the memories should be combinations of vectors and pointers to other memories (defining a graph structure).

I will close with two simple examples showing word sense disambiguation.

When Given

INPUT: Mary painted the door red.

Retrieving lexical entry for (painting).

GPT-3 generates

(painting) is when (the painter) paints (the painting surface) with (the paint); before (the painting), (the painting surface) is a different color than (the paint); after (the painting), (the painting surface) is the same color as (the paint); during (the painting), (the painter) is applying (the paint) to (the painting surface).

But when given

Mary painted the flowers

Retrieving lexical entry for (painting).

GPT-3 generates:

(painting) is when (the painter) paints (the painting); before (the painting), (the painter) wants to paint (the painting); (the painting) is a picture of (the thing painted); after (the painting), (the painting) is finished.

The training parses prepended to these examples are given below. Davinci-002 is used with temperature 0, top P 0, frequency penalty 0 and presence penalty 0, no stop token, no start text and no restart text. The training examples can undoubtedly be improved.

ENJOY!

======== Training Parses ========

INPUT: Fred ate scrambles eggs.

Retrieving lexical entry for (eating) and (an egg)

(eating) is when (the eater) eats (the food);
before (the eating), (the eater) is hungry;
after (the eating), (the eater) is not hungry;
after (the eating), (the food) is inside (the eater);
after (the eating), (the food) is gone.

(an egg) is a means of reproduction for (the animal);
(the egg) is laid by (the animal);
(the egg) hatches to produce a new (the animal).

Episodic Instantiation:

(the eating):
(the eater) is Fred
(the food) is scrambled eggs

(the egg):
(the animal) is a chicken

INPUT: George Floyd was murdered by the police.

Rretrieving lexical entry for (murdering) and (police).

(murdering) is when (the victim) is murdered by (the murderer);
before (the murder), (the victim) is alive;
(the murder) causes (the victim) to be dead;
before (the murder), (the victim) does not want (the murder) to happen;
(the murder) is a crime.

(police) are people with the authority to enforce the law.

Episodic Instantiation:

(the murder):
(the victim) is George Floyd
(the murderer) is the police

(the police) are unknown

INPUT: Putin told the general to attack.

Retrieving lexical entries for (telling to do) and (a general).

(telling to do) is when (the teller) tells (the hearer) to do (the action);
(the teller) has authority over (the hearer);
after (the telling to do), (the hearer) should do (the action).

(a general) is a person who controls an army.

Episodic Instantiation:

(the telling to do):
(the teller) is Putin
(the hearer) is the general
(the action) is to attack

(the general) is unkown.

INPUT: Putin told the general to attack.

Retrieving lexical entry for (attacking).

(attacking) is when (the attacker) attacks (the attacked);
before (the attack), (the attacked) does not want (the attack) to happen;
during (the attack), (the attacker) tries to damage (the attacked);
after (the attack), (the attacked) is angy at (the attacker).

Episodic Instantiation:

(the attack):
(the attacker) is the general
(the one attacked) is unknown

INPUT: Mary told John about the party.

Retrieving lexical entries for (telling about) and (a party).

(a telling about) is when (the teller) tells (the hearer) about (the thing);
before (the telling about), (the teller) wants (the hearer) to know about (the thing);

(a party) is when (the partiers) get together at (the place) and (the time) to have fun;
during (the party), (the partiers) have fun.

Episodic Instantiation:

(the teller) is Mary
(the hearer) is John
(the thing) is the party

(the party):
(the parters) are unknown
(the place) is unknown
(the time) is unknown

INPUT: The principal said that the teacher had resigned when they were actually on vacation.

Retrieving lexical entries for (saying), (resigning), and (vacationing).

(saying) is when (the sayer) says (the statement) to (the hearer);
before (the saying), (the sayer) wants (the hearer) to hear (the statement);
(the saying) causes (the hearer) to hear (the statement);
after (the saying) (the hearer) has heard (the statement).

(resigning) is when (the resigner) resigns from (the job);
before (the resignation), (the resigner) has (the job);
before (the resignation), (the resigner) does not want (the job);
(the resignation) causes (the resigner) to not have (the job).

(vacationing) is when (the vacationer) is on vacation from (the job) and in (the location);
before (the vacation), (the vacationer) wants (the vacation);
before (the vacation), (the vacationer) is doing (the job);
during (the vacation), (the vacationer) is not doing (the job);
during (the vacation), (the vacationer) is at (the location);
after (the vacation), (the vacationer) is again doing (the job).

Episodic Instantiation:

(the saying):
(the sayer) is the principal
(the statement) is (the teacher had resigned)
(the statement) is false
(the hearer) is unknown

(the resigning):
(the resigner) is the teacher
(the job) is teaching

(the vacationing):
(the vacationer) is the teacher
(the job) is same as (the job of (the resignation))
(the location) is unknown

INPUT: Herschel Walker pledges to sue over the report that he paid for an abortion.

Retrieving lexical entries for (pledging), (suing), (reporting), (paying) and (an abortion).

(pledging) is when (the pledger) pledges to do (the action);
(the pledge) is a public promise by (the pledger) to do (the action).

(suing) is when (the plaintiff) sues (the defendant) over (the issue) and is decided by (the judge);
(the plaintiff) wins (the lawsuit) if (the judge) rules in favor of (the plaintiff);
(the defendant) wins (the lawsuit) if (the judge) rules in favor of (the defendant).

(reporting) is when (the reporter) reports (the information) to (the recipient);
before (the report), (the recipient) expects (the report);
before (the report), (the reporter) should make (the report);
(the report) causes the (the recipient) to know (the information).

(paying) is when (the payer) pays (the amount) to (the payee) for (the thing);
before (the payment), the payer has (the amount);
before (the payment), (the recipient) wants (the payment);
(the payment) causes (the recipient) to have (the amount).

(an abortion) is when (the woman) has (the abortion);
before (the abortion), (the woman) is pregnant;
before (the abortion), (the woman) wants to not be pregnant;
(the abortion) causes (the woman) to not be pregnant.

Episodic Instantiation:

(the pledge):
(the pledger) is Herschel Walker
(the action) is to sue

(the suit):
(the plaintiff) is Herschel Walker
(the defendant) is unknown
(the issue) is (the report)
(the judge) is unknown

(the report):
(the reporter) the same as (the defendant of (the report))
(the information) is that Herschel Walker paid for an abortion

(the payment):
(the payer) is Herschel Walker
(the payee) is unknown
(the amount) is unknown
(the thing) is (the abortion)

(the abortion):
(the pregnant woman) is unknown
(the procedure) is unknown
(the pregnancy) is unknown

Posted in Uncategorized | 4 Comments

## The Case Against Grounding

A recent NOEMA essay by Jacob Browning and Yann LeCun put forward the proposition that “an artificial intelligence system trained on words and sentences alone will never approximate human understanding”.  I will refer to this claim as the grounding hypothesis — the claim that understanding requires grounding in direct experience of the physical world with, say, vision or manipulation, or perhaps direct experience of emotions or feelings.  I have long argued that language modeling — modeling the probability distribution over texts — should in principle be adequate for language understanding. I therefore feel compelled to write a rebuttal to Browning and LeCun.

I will start with a clarification of what I mean by “understanding.” For the purposes of this essay I will define understanding (or maybe even AGI) as the ability to perform language-based tasks as well as humans. I think it is fair to say that this includes the tasks of lawyers, judges, CEOs, and university presidents. It also includes counselors and therapists whose job requires a strong understanding of the human condition. One might object that a judge, say, might in some cases want to examine some physical evidence for themself. However, the task of being a judge or therapist remains meaningful even when the interaction is limited to text and speech.[1]  I am defining understanding to be the ability to do language-in/language-out tasks as well as humans.

A language-in/language-out conception of understanding seems required to make the grounding hypothesis meaningful.  Of course we expect that learning to see requires training on images or that learning robot manipulation requires training on a robot arm.  So the grounding hypothesis seems trivially true unless we are talking about language-in/language-out tasks such as being a competent language-in/language-out therapist or judge.

The grounding hypothesis, as stated by Browning and LeCun, is not about how children learn language.  It seems clear that non-linguistic experience plays an important role in the early acquisition of language by toddlers.  But, as stated, the grounding hypothesis says that no learning algorithm, no matter how advanced, can learn to understand using only a corpus of text.  This is a claim about the limitations of (deep) learning.

It is also worth pointing out that the grounding hypothesis is about what training data is needed, not about what computation takes place in the end task. Performing any language-in/language-out task is, by definition, language processing independent of what kind of computation is done. Transformer models such as GPT-3 use non-symbolic deep neural networks. However, these models are clearly processing language.

Browning and LeCun argue that the knowledge underlying language understanding can only be acquired non-linguistically.  For example the meaning of the phrase “wiggly line” might only be learnable from image data. The inference that “wiggly lines are not straight” could be a linguistically observable consequence of image-acquired understanding. Similar arguments can be made for sounds such as “whistle” or “major chord” vs “minor chord”.

On the surface this position seems reasonable.  However, a first counterargument is simply the fact that most language involves concepts that cannot even be represented in images or sounds. As of this writing the first sentence of the first article of Google news is

FBI agents have already finished their examination of possibly privileged documents seized in an Aug. 8 search of Donald Trump’s Mar-a-Lago home, according to a Justice Department court filing Monday that could undercut the former president’s efforts to have a special master appointed to review the files.

I do not see a single word in this sentence, with the possible exception of the names Donald Trump and Mar-a-Lago, whose meaning is plausibly acquired through images. Even for the names, the sentence is completely understandable without having seen images of the person or place.

A second counterargument is that, while there may be some minor artifacts in the language of the congenitally blind[2], people who are blind from birth generally do not have any significant linguistic impairment.

Browning and LeCun discuss how people use various forms of visual input. For example IKEA assembly instructions have no words.  No one is arguing that vision is useless.  However, it is very limited. The old silent movies would be meaningless without the subtitles.  A book, on the other hand, can give the reader a vivid human experience with no visual input whatsoever. The above sentence about Trump cannot be represented by a non-linguistic image.

Another argument given for the grounding hypothesis is that language seems more suitable for communication than for understanding.  The idea is that language is good for communication because it is concise but is bad for understanding because understanding requires painstaking decompression. They go so far as to suggest that we need to study in literature classes to be able to “decompress” (deconstruct?) language.  But of course preliterate societies have well developed language.  Furthermore, in my opinion at least, it is exactly the abstract (concise) representations provided by language that makes understanding even possible.  Consider again the above sentence about Trump.

The essay also notes that intelligence seems to be present in nonhuman animals such as corvids, octopi and primates.  They seem to implicitly assume that nonhuman intelligence is non-symbolic. I find this to be speculative. Symbolic intelligence seems completely compatible with a lack of symbolic communication. Memories of discrete events and individuals seem fundamental to intelligence.  Discrete representation may be a prerequisite for, rather than a consequence of, external linguistic communication.

Much of their essay focuses on the details of current language models.  But we have no idea what models will appear in the future. The real question is whether large language corpora contain the information needed for understanding.  If the information is there then we may someday (maybe someday soon) discover novel architectures that allow it to be extracted.

There is, for me, a clear argument that language modeling alone should ultimately be sufficient for extracting understanding.  Presumably understanding is required to generate meaningful long texts such as novels. The objective of a language model is to determine the distribution of texts (such as novels).  If defining the true distribution of novels requires understanding, then fully solving the language modeling problem requires the ability to understand.

Even if extracting understanding from static text proves too challenging, it could be the case that meaning can be extracted by pure language-in/language-out interaction between people and machines.  Interaction is not discussed by Browning and LeCun but I would presume that they would argue that something other than language, even interactive language, is required for learning meaning.

I strongly expect that the grounding hypothesis will turn out to be false and that deep learning will prevail. However, I can accept the grounding hypothesis as an open empirical question — a question that seems very hard to settle. Constructing a competent artificial judge using image data would not prove that image data is required.  Ideally we would like to find some meaningful language-in/language-out evaluation of understanding.  We could then track the state-of-the-art performance of various systems and see if image data is needed. All the evidence thus far indicates that language data alone suffices for language-based measures of understanding.

[1] I think we should consider speech corpora to be “language data”. The emotional content of text should be learnable from text alone but to learn to express emotion in speech will require some amount of speech data.

[2]  It has been shown that the congenitally blind have similarity judgements between different kinds of fruit that do not take color into account as much as do sighted people.

Posted in Uncategorized | 5 Comments

## Quo Vadis Language Model? Will I ever be able to trust you?

This spring I had a disagreement with an old friend about AGI. They claimed there was essentially no chance of AGI arriving in the next, say, fifty years. I have always said we just don’t know. They also wanted a definition of “AGI”. I said I would consider AGI to have arrived when a majority of the US population takes AI agents to be sentient after speaking with them for at least several months. Soon after that discussion Blake Lemoine claimed that the language model LaMBDA was sentient. I don’t for a moment take that claim seriously. But it did get me thinking about whether language models might start fooling more people in the near future. What might a language model five years from now be like what dangers might it present?

There are two fundamental reasons to think that language models might become operationally sentient (fooling most people) some time this decade. First, progress up to now has been very fast and is possibly accelerating. In January a prominent paper introduced an influential “chain of thought” model. While earlier proposals were similar, this particular paper seems to have driven an increased interest in having language models generate explicit reasoning before answering a question (or generating a response). Chain of thought approaches have led to significant advances on various benchmarks in the last six months. The second reason for thinking that operational sentience might arrive sooner (five or ten years) rather than later (fifty years) is the enormous amount of research effort being devoted to this endeavor.

Let me try to paint a picture of a future large language model (LLM). I expect the LLM to have long term memory. This will including a memory of all the conversations it has had and when it had them. An enormous amount of research has been and is continuing to be done into the incorporation of memory into language models. I also expect the LLM to include some form of chain-of-thought. It will have a total memory of its internal thoughts (internally generated sentences) tagged with when those thoughts occurred. The LLM will be able to honestly say things like “I was thinking this morning about what you said last night”. Third, I expect future language models to be much better at maintaining consistency in what they say. This will include consistency in how they describe the world and themselves. Blake Lemoine’s “interview” of LaMDA showed that a language model can already generate a lot of compelling first person sentences — statements about what it believes and wants. Assuming memory, the things that a language model says or thinks about itself becomes part of its background knowledge — a kind of self model. The language model should be able to do a good job of seeming to be self-aware.

While I have always looked forward to the arrival of AGI, I am finding this picture of operationally sentient LLMs rather dystopian. The fundamental problem is the black-box nature of an LLM in combination with the scale of its training data. By definition a language model is trained to say what a person would say. Ultimately predicting what a person would say seems to require a model of human nature — what do we want and how does that influence what we say and believe. The language model’s self understanding will be based on its understanding of people. It seems likely, therefore, that its self model and its speech will exhibit human tendencies such as a drive for power and respect. The language model’s understanding of human nature, and hence its understanding of itself, will be buried in its many trillions of parameters and would seem to be impossible to control.

In the past I have always assumed that we could control intelligent machines by specifying a mission — the purpose of the machine. A machine with an explicit mission would not have all the self interests that complicate human relations. I have advocated the “servant mission” where each AI agent is given a mission of serving a particular individual. We could each have our own computer advocate or “advobot”. But if language models can become sufficiently human just by reading, with human nature woven into its many of trillions of parameters, control becomes much more subtle …

## Encoder Autonomy

As in previous years, teaching my course on the fundamentals of deep learning has inspired some blog posts.

This year I realized that VAEs are non-parametrically consistent as models of the observed data even when the encoder is held fixed and arbitrary. This is best demonstrated with a nonstandard derivation of VAEs bypassing the ELBO.

Let $y$ range over observable data and let $z$ range over latent values. Let the encoder be defined by a probability distribution $P_\Theta(z|y)$. We then have a joint distribution $P_{\mathrm{Pop},\Theta}(y,z)$ where $y$ is drawn from the population ($\mathrm{Pop}$) and $z$ is drawn from the encoder. We let the entropies and the mutual information $H_{\mathrm{Pop}}(y)$, $H_{\mathrm{Pop},\Theta}(z)$, $H_{\mathrm{Pop},\Theta}(y|z)$, $H_{\mathrm{Pop},\Theta}(z|y)$ and $I_{\mathrm{Pop},\Theta}(y,z)$ all be defined by this joint distribution. To derive the VAE objective we start with the following basic information theoretic equalities.

$\begin{array}{rcl} I_{\mathrm{Pop},\Theta}(y,z) & = & H_{\mathrm{Pop}}(y) - H_{\mathrm{Pop},\Theta}(y|z) \\ \\ H_{\mathrm{Pop}}(y) & = & I_{\mathrm{Pop},\Theta}(y,z) + H_{\mathrm{Pop},\Theta}(y|z) \\ \\ & = & H_{\mathrm{Pop},\Theta}(z) - H_{\mathrm{Pop},\Theta}(z|y) + H_{\mathrm{Pop},\Theta}(y|z) \;\;(1)\end{array}$

Assuming that we can sample $z$ from the encoder distribution $P_\Theta(z|y)$, and that we can compute $P_\Theta(z|y)$ for any $y$ and $z$, the conditional entropy $H_{\mathrm{Pop},\Theta}(z|y)$ can be estimated by sampling. However that is not true of $H_{\mathrm{Pop},\Theta}(z)$ or $H_{\mathrm{Pop},\Theta}(y|z)$ because we have no way of computing $P_{\mathrm{Pop},\Theta}(z)$ or $P_{\mathrm{Pop},\Theta}(y|z)$. However, entropies defined in terms of the population can be upper bounded (and estimated) by cross-entropies and we introduce two models $P_\Phi(z)$ and $P_\Psi(y|z)$ with which to define cross-entropies.

$\begin{array}{rcl} \hat{H}_{\mathrm{Pop},\Theta,\Phi}(z) & = & E_{\mathrm{Pop}, \Theta}\;-\ln P_\Phi(z) \\ \\ \Phi^* & = & \mathrm{argmin}_\Phi \;\hat{H}_{\mathrm{Pop},\Theta,\Phi}(z)\;\;(2)\\ \\\hat{H}_{\mathrm{Pop},\Theta,\Psi}(y|z) & = & E_{\mathrm{Pop}, \Theta}\;-\ln P_\Psi(y|z) \\ \\ \Psi^* & = & \mathrm{argmin}_\Psi \;\hat{H}_{\mathrm{Pop},\Theta,\Psi}(y|z)\;\;(3)\end{array}$

Inserting these two cross entropy upper bounds (or entropy estimators) into (1) gives

$H_{\mathrm{Pop}}(y) \leq \hat{H}_{\mathrm{Pop},\Theta,\Phi}(z) - H_{\mathrm{Pop},\Theta}(z|y) + \hat{H}_{\mathrm{Pop},\Theta,\Psi}(y|z). \;\;\;(4)$

The right hand side of (4) is the standard VAE objective function in terms of the prior, the encoder and the decoder. However, this derivation of the upper bound (4) from the exact equality (1) shows that we get a consistent non-parametric estimator of $H_{\mathrm{Pop}}(y)$ by optimizing the prior and posterior according to (2) and (3) while holding the encoder fixed. This follows directly from the fact that cross entropy is a consistent non-parametric estimator of entropy in the sense that $\inf_Q\;H(P,Q) = H(P)$. Furthermore, we expect that $P_\Phi(z)$ estimates $P_{\mathrm{Pop},\Theta}(z)$ and that $P_\Psi(y|z)$ estimates $P_{\mathrm{Pop},\Theta}(y|z)$ again independent of
the choice of $\Theta$.

This observation gives us freedom in designing latent variable objective functions that produce useful or interpretable latent variables. We can train the prior and posterior by (2) and (3) and train the encoder by any choice of an encoder objective function. For example, a natural choice might be

$\begin{array}{rcl} \Theta^* & = & \mathrm{argmin}_\Theta\; \hat{I}_{\mathrm{Pop},\Theta,\Phi}(z,y) + \lambda \hat{H}_{\mathrm{Pop},\Theta,\Psi}(y|z) \;\;(5)\\ \\ \hat{I}_{\mathrm{Pop},\Theta,\Phi}(z,y) & = & \hat{H}_{\mathrm{Pop},\Theta,\Phi}(z) - H_{\mathrm{Pop},\Theta}(z|y)\end{array}$

The weight $\lambda$ in (5) can be interpreted as providing a rate-distortion trade-off where the mutual information (upper bound) expresses the channel capacity (information rate) of $z$ as a communication channel for the message $y$. This is exactly the $\beta$-VAE which weights the rate by $\beta$ rather than the distortion by $\lambda$.

However, there can be different encoders achieving the same rate and distortion. The consistency of (4) independent of $\Theta$ allows additional desiderata to be placed on the encoder. For example, we might want $z$ to be a sequence $z_1,\ldots,z_k$ with the $z_i$ independent and the mutual information with $y$ evenly balanced between the $z_i$ yielding a VAE similar to an InfoGAN.

Here we are designing different objectives for different model components — the objectives defined by (2) and (3) for $\Phi$ and $\Psi$ are intended to be independent of any designed objective for $\Theta$ and the objective for $\Theta$ can be designed independent of (2) and (3). Multiple objective functions yield a multiplayer game with Nash equilibria. In practice we will need to insert stop gradients to prevent, for example, the objective for (player) $\Theta$ from interfering with the objective for (player) $\Phi$ and vice-versa.

The bottom line is that we can select any objective for the encoder while preserving non-parametric consistency of the VAE as a model of the observed data.

## Reinterpreting AlphaZero

While teaching reinforcement learning I kept asking myself what AlphaZero teaches us about RL. That question has lead to this post.  This post generalizes AlphaZero to a larger class of RL algorithms by reinterpreting AlphaZero’s policy network as a belief function — as the probability that $a$ is the optimal action at state $s$.  This gives rise to a “belief gradient” algorithm.

A fundamental question is whether belief gradient is a better conceptual framework for RL than policy gradient.  I have always felt that there is something wrong with policy gradient methods.  The optimal policy is typically deterministic while policy gradient methods rely on significant exploration (policy stochasticity) to compute a gradient.  It just seems paradoxical.  The belief gradient approach seems to resolve this paradox.

The action belief function and a belief gradient algorithm. For a given state in a given Markov decision process (MDP) with a finite action space ${\cal A}$ there exists a deterministic optimal policy $\pi^*$ mapping each state to an optimal action $\pi^*(s) \in {\cal A}$. For each decision (in life?) there is some best choice that is generally impossible to know. I propose reinterpreting AlphaZero’s policy network $\pi_\Phi(a|s)$ as giving an estimator for the probability that $a$ is the best choice — that $a = \pi^*(s)$.  To make this reinterpretation explicit I will change notation and write $B_\Phi(a|s)$ for the belief that $a$ is the optimal choice.  The notation $\pi_\Phi(a|s)$ is generally interpreted as specifying stochastic (erratic?) behavior.

To make the mathematics clearer I will assume that the belief $B_\Phi(a|s)$ is computed by a softmax

$B_\Phi(a|s) = \mathrm{softmax}_a \;s_\Phi(a|O_\Phi(s))$

where $O_\Phi(s)$ is a vector representation of the state $s$.  The softmax is not as important here as the idea that $O_\Phi(s)$ provides only incomplete knowledge of $s$.  Further observations on $s$ are possible.  For example, we can grow a search tree $T_\Phi(s)$ rooted as $s$.  For a given distribution on states $s$ we then get a distribution on the pairs $(O_\Phi(s),T_\Phi(s))$.  We assume that we also have some way of computing a more informed belief $B_\Phi(a\;|\;O_\Phi(s),T_\Phi(s))$. AlphaZero computes a replay buffer containing pairs $(s,B(a))$ where the stored distribution $B(a)$ is $B_\Phi(a\;|\;O_\Phi(s),\;T_\Phi(s))$.  Ideally the belief $B_\Phi(a\;|\;O_\Phi(s))$ would match the marginal beliefs over the more informed search tree results. This motivates the following were $R$ denotes the replay buffer.

(1) $\Delta \Phi \propto E_{s,B(a) \sim R,\;a\sim B(a)}\;\nabla_\Phi\;\ln B_\Phi(a\;|\;O_\Phi(s))$

This is just the gradient of a cross-entropy loss from the replay buffer to the belief $B_\Phi(a|s)$.

Some AlphaZero details. For the sake of completeness I will describe a few more details of the AlphaZero algorithm.  The algorithm computes rollouts to the end of an episode (game) where each action in the rollout is selected based on tree search. Each  rollout adds a set of pairs $(s,\;B(a))$ to the replay buffer where $s$ is a state in that rollout and $B(a)$ is a recording of $B_\Phi(a\;|\;O_\Phi(s),T_\Phi(s))$. Each rollout also has a reward $z \in \{-1,1\}$ (was the game won or lost).  AlphaZero stores a second set of replay values $(s,z)$ for each rollout state state $s$  where $z$ is the reward of that rollout. The replay pairs $(s,z)$ are used to train a value function $V_\Phi(s)$ estimating the reward that will be achieved from state $s$.  The value function $V_\Phi(s)$ acts as a kind of static board evaluator in growing the tree $T_\Phi(s)$.

Improved generality. The abstract mathematical formulation given in equation (1) is more general than tree search.  In the belief $B_\Phi(a\;|\;O_\Phi(s),\;T_\Phi(s))$ we can have that $T_\Phi(s)$ is any information about $s$ that usefully augments the information provided in $O_\Phi(s)$.

A belief gradient theorem. While I have reinterpreted policies as beliefs, and have recast AlphaZero’s algorithm in that light, I have not provided a belief gradient theorem.  The simplest such theorem is for imitation learning.  Assume an expert that labels states with actions.  Optimizing the cross-entropy loss for this labeled data yields a probability $B_\Phi(a|s)$.  Learning a probability does not imply that the agent should make a random choice. Belief is not the same as choice.

## Commonsense and Language Models

Modeling human commonsense reasoning has long been a goal of AI.  From a 1998 interview with Marvin Minsky we have

Marvin Minsky: We need common-sense knowledge – and programs that can use it. Common sense computing needs several ways of representing knowledge. It is harder to make a computer housekeeper than a computer chess-player …  There are [currently] very few people working with common sense problems.  … [one such person is] John McCarthy, at Stanford University, who was the first to formalize common sense using logics. …

The desire to understand human reasoning was the philosophical motivation for mathematical logic in the 19th and early 20th centuries. Indeed, the situation calculus of McCarthy, 1963  was a seminal point in the development of logical approaches to commonsense. But today the logicist approach to AI is generally viewed as a failure and, in spite of the recent advances with deep learning, understanding commonsense remains a daunting roadblock on the path to AGI.

Today we look to BERT and GPT and their descendants for some kind of implicit understanding of semantics.  Do these huge models, trained merely on raw text, contain some kind of implicit knowledge of commonsense?  I have always believed that truly minimizing the entropy of English text requires an understanding of whatever the text is about — that minimizing the entropy of text about everyday events requires an understanding of the events themselves (whatever that means).  Before the BERT era I generally got resistance to the idea that language modeling could do much interesting.  Indeed it remains unclear to what extend current language models actually embody semantics and to what extent semantics can actually be extracted from raw text.

In recent times the Winograd schema challenge (WSC) has been the most strenuous test of common sense reasoning.  The currently most prestigious version is the SuperGLUE-WSC.  In this version a sentence is presented with both a pronoun and a noun tagged. The task is to determine if the tagged pronoun is referring to the tagged noun. The sentences are selected with the intention of requiring commonsense understanding.  For example we have

The trophy would not fit in the suitcase because it was too big.

The task is to determine if tagged pronoun “it” refers to the tagged noun “suitcase” — here “it” actually refers to “trophy”. If we replace “big” by “small” the referent flips from “trophy” to “suitcase”. It is perhaps shocking (disturbing?) that the state of the art (SOTA) for this problem is 94% — close to human performance.  This is done by fine-tuning a huge language model on a very modest number of training sentences from WSC.

But do the language models really embody common sense?  A paper from Jackie Cheung’s group at McGill found that a fraction of the WSC problems can be answered by simple n-gram statistics.  For example consider

I’m sure that my map will show this building; it is very famous.

A language model easily determines that the phrase “my map is very famous” is less likely than “this building is very famous” so the system can guess that the referent is “building” and not “map”. But they report that this phenomenon only accounts for 13.5% of test problems.  Presumably there are other “tells” that allow correct guessing without understanding.  But hidden tells are hard to identify and we have no way of measuring what fraction of answers are generated legitimately.

A very fun, and also shocking, example of language model commonsense is the COMET  system from Yejin Choi’s group at the University of Washington. In COMET one presents the system with unstructured text describing an event and the system fills in answers to nine standard questions about events such as “before, the agent needed …”.  This can be viewed as asking for the action prerequisites in McCarthy’s situation calculus.  COMET gives an answer as unstructured text.  There is a web interface where one can play with it.  As a test I took the first sentence from the jacket cover of the novel “Machines like me”.

Charlie, drifting through life, is in love with Miranda, a brilliant student with a dark secret.

I tried “Charlie is drifting through life” and got I was shocked to see that the system suggested that Charlie had lost his job, which was true in the novel.  I recommend playing with COMET.  I find it kind of creepy.

But, as Turing suggested, the real test of understanding is dialogue. This brings me to the chatbot Meena. This is a chatbot derived from a very large language model — 2.6 billion parameters trained on 341 gigabytes of text.  The authors devised various human evaluations and settled on an average of a human-judged specificity score and a human-judged sensibility score — Specificity-Sensibility-Average or SSA.  They show that they perform significantly better than previous chatbots by this metric.  Perhaps most interestingly, they give the following figure relating the human-judged SSA performance to the perplexity achieved by the underlying language model.

They suggest that a much more coherent chatbot could be achieved if the perplexity could be reduced further.  This supports the belief that minimal perplexity requires understanding.

But in spite of the success of difficult-to-interpret language models, I still believe that interpretable entities and relations are important for commonsense.  It also seems likely that we will find ways to greatly reduce the data requirements of our language models.  A good place to look for such improvements is to somehow improve our understanding  of the relationship, assuming there is one, between language modeling and interpretable entities and relations.

## A Consistency Theorem for BERT

Last week I saw Noah Smith at a meeting in Japan where he mentioned that BERT seemed related to pseudo-likelihood.  After some thought I convinced myself that this observation should lead to a consistency theorem for BERT.  With an appropriate Google query I found 1902.04904 by Wang and Cho which also points out the connection between BERT and pseudo-likelihood. Wang and Cho view BERT as a Markov random field (MRF) and use Gibbs sampling to sample sentences. But Wang and Cho do not mention consistency.   This post explicitly observes that BERT is consistent as a language model — as a probability distribution over full sentences.

The classical proof of consistency for pseudo-likelihood assumes that the actual population distribution is defined by some setting of the MRF weights.  For BERT I will replace this assumption with the assumption that the deep model is capable of exactly modeling the various conditional distributions.  Because deep models are intuitively much more expressive than linear MRFs over hand-designed features,  this deep expressivity assumption seems much weaker than the classical assumption.

In addition to assuming universal expressivity,  I will assume that training finds a global optimum.  Assumptions of complete optimization currently underly much of our intuitive understanding of deep learning.  Consider the GAN consistency theorem. This theorem assumes both universal expressivity and complete optimization of both the generator and the discriminator.  While these assumptions seem outrageous, the GAN consistency theorem is the source of the design of the GAN architecture.  The value of such outrageous assumptions in architecture design should not be under-estimated.

For training BERT we assume a population distribution over blocks (or sentences) of $k$ words $\vec{y}=y_1,\;\ldots,y_k$.  I will assume that BERT is trained by blanking a single word in each block.  This single-blank assumption is needed for the proof but seems unlikely to matter in practice.  Also, I believe that the proof can be modified to handle XLNet which predicts a single held-out subsequence per block rather than multiple independently modeled blanks.

Let $\Phi$ be the BERT parameters and let $Q_\Phi(y_i|\vec{y}/i)$ be the distribution on words that BERT assigns to the $i$th word when the $i$th word is blanked.  The training objective for BERT is

$\begin{array}{rcl} \Phi^* & = & \mathrm{argmin}_\Phi\;\;E_{\vec{y} \sim \mathrm{Pop},\;i \sim 1,\ldots k}\;\;-\ln\;Q_\Phi(y_i|\vec{y}/i) \\ \\ & = & \mathrm{argmin}_\Phi \;\frac{1}{k} \sum_{i=1}^k\;E_{\vec{y}\sim\mathrm{Pop}}\;-\ln\;Q_\Phi(y_i|\vec{y}/i) \\ \\ & = &\mathrm{argmin}_\Phi\;\sum_{i=1}^k \;H(\mathrm{Pop}(y_i),Q_\Phi(y_i)\;|\;\vec{y}/i) \end{array}$

where $H(P(y),Q(y)\;|\;x)$ denotes cross-entropy conditioned on $x$.  Each cross entropy term is individually minimized when $Q_\Phi(y_i|\vec{y}/i) = \mathrm{Pop}(y_i|\vec{y}/i)$.  Our universality assumption is that there exists a $\Phi$ satisfying all of these conditional distributions simultaneously.  Under this assumption we have

$Q_{\Phi^*}(y_i|\vec{y}/i) = \mathrm{Pop}(y_i|\vec{y}/i)$

for all $i$ and $\vec{y}$.

I must now define the language model (full sentence distribution) defined by $Q_{\Phi^*}$.  For this I consider Gibbs sampling — the stochastic process defined on $\vec{y}$ by randomly selecting $i$ and replacing $y_i$ by a sample from $Q_\Phi(y_i|\vec{y}/i)$.  The language model is now defined to be the stationary distribution of this Gibbs sampling process. But this Gibbs process is the same process as Gibbs sampling using the population conditionals.  Therefore the stationary distribution must be $\mathrm{Pop}$.  Q.E.D.

## The Inevitability of Vector Quantization in Deep Architectures

I just read  VQ-VAE-2  (Vector-Quantized – Variational AutoEncoders – 2) by Razavi et al. This paper gives very nice results on modeling image distributions with vector quantization. It dominates BigGANs under classification accuracy score (CAS)  for class-conditional ImgeNet image generation.   For reasons listed below I believe that vector quantization is inevitable in deep architectures.  This paper convinces me that the time of VQ-VAEs has arrived.

Hinton’s grand vision of AI has always been that there are simple general principles of learning, analogous to the Navier-Stokes equations of fluid flow, from which complex general intelligence emerges.  I think Hinton under-estimates the complexity required for a general learning mechanism, but I agree that we are searching for some general (i.e., minimal-bias) architecture.  For the following reasons I believe that vector quantization is an inevitable component of the architecture we seek.

A better learning bias. Do the objects of reality fall into categories? If so, shouldn’t a learning architecture be designed to categorize?  A standard theory of language learning is that the child learns to recognize certain things, like mommy and doggies, and then later attaches these learned categories to the words of language.  It seems natural to assume that categorization precedes language in both development and evolution. The objects of reality do fall into categories and every animal must identify potential mates, edible objects, and dangerous predators.

It is not clear that the vector quanta used in VQ-VAE-2 correspond to meaningful categories.  It is true, however, that the only meaningful distribution models of ImageNet images are class-conditional.  A VQ-VAE with a vector quanta for the image as a whole at least has the potential to allow class-conditioning to emerge from the data.

Interpretability.  Vector quantization shifts the interpretability question from that of interpreting linear threshold units to that of interpreting emergent symbols — the embedded tokens that are the emergent vector quanta.  A VQ-VAE with a vector-quantized full image representation would cry out for a class interpretation for the emergent image symbol.

A fundamental issue is whether the vectors being quantized actually fall into natural discrete clusters.  This form of interpretation is often done with t-SNE.  But if vectors naturally fall into clusters then it seems that our models should seek and utilize that clustering.  Interpretation can then focus on the meaning of the emergent symbols.

The rate-distortion training objective.  VQ-VAEs support rate-distortion training as discussed in my previous blog post on rate-distortion metrics for GANs.  I have always been skeptical of GANS because of the lack of meaningful performance metrics for generative models lacking an encoder.  While the new CAS metric does seem more meaningful than previous metrics, I still feel that training on cross-entropy loss (negative log likelihood) should ultimately be more effective than adversarial training. Rate-distortion metrics assume a discrete compressed representation defining a “rate” (a size of a compressed image file) and some measure of distortion between the original image and its reconstruction from the compressed file.  The rate is measured by negative log-likelihood (a kind of cross-entropy loss).   Rate-distortion training is also different from differential cross-entropy training as used in Flow networks (invertible generative networks such as GLOW). Differential cross-entropy can be unboundedly negative. To avoid minimizing an unboundedly negative quantity, when training on differential cross-entropy one must introduce a scale parameter for the real numbers in the output (the parameter “a” under equation (2) in the GLOW paper).  This scale parameter effectively models the output numbers as discrete integers such as the bytes in the color channels of an image.  The unboundedly negative differential cross-entropy then becomes a non-negative discrete cross-entropy.  However, this treatment of differential cross-entropy still fails to support a rate-distortion tradeoff parameter.

Unifying vision and language architectures.  A fundamental difference between language models and image models involves word vectors. Language models use embedded symbols where the symbols are manifest in the data.  The vast majority of image models do not involve embedded symbols. Vector quantization is a way for embedded symbols to emerge from continuous signal data.

Preserving parameter values under retraining.  When we learn to ski we do not forget how to ride a bicycle.  However, when a deep model is trained on a first task (riding a bicycle) and then on a second task (skiing), optimizing the parameters for the second task degrades the performance on the first.   But note that when a language model is training on a new topic, the word embeddings of words not used in new topic will not change.  Similarly a model based on vector quanta will not change the vectors for the bicycle control symbols not invoked when training on skiing.

Improved transfer learning. Transfer learning and few-shot learning (meta-learning) may be better supported with embedded symbols for the same reason that embedded symbols reduce forgetting — embedded symbols can be class or task specific.  The adaptation to a new domain can be restricted to the symbols that arise (under some non-parametric nearest neighbor scheme) in the new domain, class or task.

Emergent symbolic representations. The historical shift from symbolic logic-based representations to distributed vector representations is typically viewed as one of the cornerstones of the deep learning revolution.  The dramatic success of distributed (vector) representations in a wide variety of applications cannot be disputed.  But it also seems true that mathematics is essential to science.  I personally believe that logical symbolic reasoning is necessary for AGI.  Vector quantization seems to be a minimal-bias way for symbols to enter into deep models.

Posted in Uncategorized | 1 Comment

## Thoughts from TTIC31230: Rate-Distortion Metrics for GANs.

Throughout my career I have believed that progress in AI arises from objective evaluation metrics.  When I first learned of GANs I was immediately skeptical because of the apparent lack of meaningful metrics.  While GANs have made enormous progress in generating realistic images, the problem of metrics remains (see Borji 2018).  I suspect that the lack of meaningful metrics is related to a failure of GANs to play an important role in unsupervised pre-training (feature learning) for discriminative applications.  This is in stark contrast to language models (ELMO, BERT and GPT-2) where simple cross-entropy loss has proved extremely effective for pre-training.

Cross-entropy loss is typically disregarded for GANs in spite of the fact that it is the de-facto metric for modeling distributions and in spite of its success in pre-training for NLP tasks. In this post I argue that rate-distortion metrics — a close relative of cross entropy loss — should be a major component of GAN evaluation (in addition to discrimination loss).  Furthermore, evaluating GANs by rate-distortion metrics leads to a conceptual unification of GANs, VAEs and signal compression.  This unification is already emerging from image compression applications of GANS such as in the work of Agustsson et al 2018. The following figure by Julien Despois can be interpreted in terms of VAEs, signal compression, or GANs.

$y \hspace{13em} z = z_\Psi(y) + \epsilon \hspace{10em}||y - \hat{y}_\Theta(z)||^2$

$p_\Phi(z)$

The VAE interpretation is defined by

$\Phi^*, \Psi^*,\Theta^* = \mathrm{argmin}_{\Phi,\Psi,\Theta}\;E_{y\sim \mathrm{Pop}}\;\left(\;KL(p_\Psi(z|y),p_\Phi(z)) + E_{z \sim P_\Psi(z|y)}\; -\ln p_\Theta(y|z)\right)\;\;\;\;\;\;(1)$

Now define $\Psi^*(\Phi,\Theta)$ as the minimum of (1) over $\Psi$ while holding $\Phi$ and $\Theta$  fixed. Using this to express the objective as a function of $\Phi$ and $\Theta$ , and assume universal expressiveness of $p_\Psi(z|y)$, the standard ELBO analysis shows that (1) reduces to minimizing cross-entropy loss of $p_{\Phi,\Theta}(y) =\int dz\;\;p_\Phi(z)p_\Theta(y|z)$.

$\Phi^*,\Theta^* = \mathrm{argmin}_{\Phi,\Theta}\; E_{y \sim \mathrm{Pop}}\;-\ln p_{\Phi,\Theta}(y)\;\;\;\;\;\;(2)$

It should be noted, however, that differential entropies and cross-entropies suffer from the following conceptual difficulties.

1. The numerical value of entropy and cross entropy depends on an arbitrary choice of units. For a distribution on lengths, probability per inch is numerically very different from probability per mile.
2. Shannon’s source coding theorem fails for continuous densities — it takes an infinite number of bits to specify a single real number.
3. The data processing inequality fails for differential entropy — $x/2$ has a different differential entropy than $x$.
4. Differential entropies can be negative.

For continuous data we can replace the differential cross-entropy objective with a more conceptually meaningful rate-distortion objective. Independent of conceptual objections to differential entropy, a rate-distortion objective allows for greater control of the model through a rate-distortion tradeoff parameter $\beta$ as is done in $\beta$-VAEs (Higgens et al. 2017Alemi et al 2017).  A special case of a $\beta$-VAE  is defined by

$\Phi^*, \Psi^*,\Theta^* = \mathrm{argmin}_{\Phi,\Psi}\;E_{y\sim \mathrm{Pop}}\;\left(\beta \;KL(p_\Psi(z|y),p_\Phi(z)) + E_{z \sim P_\Psi(z|y)}\;||y - \hat{y}_\Theta(z)||^2\right)\;\;\;\;(3)$

The VAE optimization (1) can be transformed into the rate-distortion equation (3) by taking

$p_\Theta(y|z) = \frac{1}{Z(\sigma)} \exp\left(\frac{-||y-z||^2}{2\sigma^2}\right)$

and taking $\sigma$ to be a fixed constant.   In this case (1) transforms into (3) with $\beta = 2\sigma^2$.  Distortion measures such as L1 and L2 preserve the units of the signal and are more conceptually meaningful than differential cross-entropy.  But see the comments below on other obvious issues with L1 and L2 distortion measures. KL-divergence is defined in terms of a ratio of probability densities and, unlike differential entropy, is conceptually well-formed.

Equation (3) leads to the signal compression interpretation of the figure above. It turns out that the KL term in (3) can be interpreted as a compression rate.  Let $\Phi^*(\Psi,\Theta)$ be the optimum $\Phi$ in (3) for a fixed value of $\Psi$ and $\Theta$. Assuming universality of $p_\Phi(z)$, the resulting optimization of $\Psi$ and $\Theta$ becomes the following where $p_\Psi(z) = \int dy \;\mathrm{Pop}(y)p_\Psi(z|y)$

$\Psi^*,\Theta^* = \mathrm{argmin}_{\Psi,\Theta}\;\beta E_y\; KL(p_\Psi(z|y), \;p_\Psi(z)) + E_y\;||y - \hat{y}_\Theta(z)||^2\;\;\;\;\;\;(4)$

The KL term can now be written as a mutual information between $y$ and $z$.

$E_{y \sim \mathrm{Pop}} KL(p_\Psi(z|y), \;p_\Psi(z)) = H(z) - H(z|y) = I(y,z)$

Hence (4) can be rewritten as

$\Psi^*,\Theta^* = \mathrm{argmin}_{\Psi,\Theta}\;\beta I(z,y) + E_{y,z}\;||y - \hat{y}_\Theta(z)||^2\;\;\;\;\;\;(5)$

A more explicit derivation can be found in slides 17 through 21 in my lecture slides on rate-distortion auto-encoders.

By Shannon’s channel capacity theorem, the mutual information $I(y,z)$ is the number of bits transmitted through a noisy channel from $y$ to $z$ — it is the number of bits from $y$ than reach the decoder $\hat{y}_\Theta(z)$.  In the figure $P_\Psi(z|y)$ is defined by the equation $z = z_\Psi(y) + \epsilon$ for some fixed noise distribution on $\epsilon$.  Adding noise can be viewed as limiting precision. For standard data compression, where $z$ must be a compressed file with a definite number of bits, the equation $z = z_\Psi(y) + \epsilon$ can be interpreted as a rounding operation that rounds $z_\Psi(y)$ to integer coordinates.  See Agustsson et al 2018.

We have now unified VAEs with data-compression rate-distortion models.  To unify these with GANs we can take $\Phi$ and $\Theta$ to be the generator of a GAN.  We can train the GAN generator $(\Phi,\Theta)$ in the traditional way using only adversarial discrimination loss and then measure a rate-distortion metric by training $\Psi$ to minimize (3) while holding $\Phi$ and $\Theta$ fixed.  Alternatively, we can add a discrimination loss to (3) based on the discrimination between $y$ and $\hat{y}_\Theta(z)$ and train all the parameters together. It seems intuitively clear that a low rate-distortion value on test data indicates an absence of mode collapse — it indicates that the model can efficiently represent novel images drawn from the population.  Ideally, the rate-distortion metric should not increase much as we add weight to a discrimination loss.

A standard objection to L1 or L2 distortion measures is that they do not represent “perceptual distortion” — the degree of difference between two images as perceived by a human observer.  One interpretation of perceptual distortion is that two images are perceptually similar if the are both “natural” and carry “the same information”.  In defining what we mean by the same information we might invoke predictive coding or the information bottleneck method. The basic idea is to find an image representation that achieves compression while preserving mutual information with other (perhaps future) images.  This can be viewed as an information theoretic separation of “signal” from “noise”.  When we define the information in an image we should be disregarding noise.  So while it is nice to have a unification of GANs, VAEs and signal compression, it would seem better to have a theoretical framework providing a distinction between signal and noise. Ultimately we would like a rate-utility metric for perceptual representations.

## Thoughts from TTIC31230: Rethinking Generalization.

Instigated by Zhang et al 2016, there has been considerable consternation in the deep learning community about why models capable of fitting random labels generalize well. It should be immediately noted that simple linear regression with more features than data points will also generalize well when appropriately regularized (ridge regression).  What is different about deep models is that the regularization seems to be implicit in SGD as described earlier by Neyshebur et al. 2014. Recent theoretical work on the regularization implicit in SGD, such as Arora et al. 2019, focuses on quenching shallow networks. In my previous post I argued that annealing seems more powerful than quenching. In this post I will give a PAC-Bayesian bound (a formal generalization guarantee) appropriate for annealed SGD.  I’m calling it the residual entropy bound.

First some comments on PAC-Bayesian theory. I coined the term “PAC-Bayes” in the late 90’s to describe a class of theorems giving PAC generalization guarantees in terms of arbitrarily chosen prior distributions.  Some such theorems (Occam bounds) pre-date my work.  Over last twenty years there has been significant refinement of these bounds by various authors. A concise general presentation of PAC-Bayesian theory can be found in my PAC-Bayes tutorial.

After about 2005, PAC-Bayesian analysis largely fell out of usage in the learning theory community in favor of more “sophisticated” concepts.  However, PAC-Bayesian bounds are now having a resurgence — their conceptual simplicity is paying off in the analysis of deep networks.  Attempts to apply VC dimension or Rademacher complexity to deep networks yield extremely vacuous guarantees — guarantees on binary error rates tens of orders of magnitude larger than 1.  PAC-Bayesian theorems, on the other hand, can produce non-vacuous guarantees — guarantees less than 1. Non-vacuous PAC-Bayesian guarantees for deep networks were first computed for MNIST by Dziugaite et al. and recently for ImageNet by Zhou et al.

In annealed SGD the learning rate acts as a temperature parameter which is set high initially and then cooled.  As we cool the temperature we can think of the model parameters as a glass that is cooled to some finite temperature at which it becomes solid — becomes committed to a particular local basin of the energy landscape. For a given parameter initialization $\Phi_0$, annealed SGD defines a probability density $p(\Phi|\;S,\Phi_0)$ over the final model $\Phi$ given initialization $\Phi_0$ and training data $S$.  Under Langevin dynamics we have that $p(\Phi\;|\;S,\Phi_0)$ is a smooth density.  If we are concerned that Langevin dynamics is only an approximation, we can add some small Gaussian noise to SGD to ensure that $p(\Phi\;|\;S,\Phi_0)$ is smooth.

The entropy of the distribution $p(\Phi\;|\;S,\Phi_0)$ is the residual entropy of the parameter vector. Note that if there was one final global optimum which was always found by SGD (quartz crystal) then the residual entropy would be zero. The distribution $p(\Phi\;|S,\Phi_0)$, and hence the residual entropy, includes all the possible local basins (all the possible solid structures of the glass).

To state the residual entropy bound we also need to define a parameter distribution in terms of the population independent of any particular sample. Let $N$ be the number of data points in the sample. Let $p(\Phi\;|\;\Phi_0,N)$ be the distribution defined by first drawing an IID sample $S$ of  size $N$ and then sampling $\Phi$ from $p(\Phi\;|\;S,\Phi_0)$.  The entropy of the distribution $p(\Phi\;|\;\Phi_0,N)$ is the residual entropy of annealing as defined by $\Phi_0$ and $N$ independent of the draw of any particular sample.  The residual entropy bound is governed by $E_S\;\mathrm{KL}(p(\Phi\;|\;S,\Phi_0),p(\Phi\;|\;\Phi_0,N))$.  This is an expected KL-divergence between two residual entropy distributions.  It is also equal to the mutual information between $\Phi$ and $S$ under the distribution $p(\Phi,S\;|\;\Phi_0,N)$.

To formally state the bound we assume that for a data point $z$ we have a loss ${\cal L}(\Phi,z) \in [0,L_{\mathrm{max}}]$.  I will write ${\cal L}(\Phi)$ for expectation of ${\cal L}(\Phi,z)$ when $z$ is drawn from the population, and write $\hat{\cal L}(\Phi)$ for the average of ${\cal L}(\Phi,z)$ over $z$ drawn from the training sample. The following residual entropy bound is a corollary of theorem 4 in my PAC-Bayes tutorial.

$\forall\Phi_0,N\;\;\;E_{S,\;\Phi \sim p(\Phi\;|\;S,\Phi_0)} \;{\cal L}(\Phi) \leq E_{S,\;\Phi\sim p(\Phi\;|\;S,\Phi_0)}\;\frac{10}{9}\left(\hat{\cal L}(\Phi) + \frac{5L_{\mathrm{max}}}{N}\mathrm{KL}(p(\Phi\;|\;S,\Phi_0),p(\Phi\;|\;\Phi_0,N))\right)\;\;\;\;(1)$

There are good reasons to believe that (1) is extremely tight. This is a PAC-Bayesian bound where the “prior” is $p(\Phi\;|\;\Phi_0,N)$.   PAC-Bayesian bounds yield non-vacuous values under Gaussian priors.  It can be shown that $p(\Phi\;|\;\Phi_0,N)$ is the optimal prior for posteriors of the form $p(\Phi\;|\;S,\Phi_0)$.

Unfortunately  analysis of (1) seems intractable. But difficulty of theoretical analysis does not imply that a conceptual framework is wrong.  The clear case that (1) is extremely tight would seem to cast doubt on analyses selected for tractability at the expense of realism.

Posted in Uncategorized | 4 Comments