## Superintelligence and The Truth

A superintelligence would presumably be capable of judging truth.  But what are the human consequences of superintelligent truth judgements? Should a friendly AI avoid taking sides in religion and politics?  Should a friendly AI avoid hiding the truth?  Are these two objectives compatible?

I have talked in previous posts about the “servant mission” — the AI mission of serving or advocating for a particular human.  In a email to Eric Horovitz I later suggested that such agents be called “advobots”.  An advobot  advocates in a variety of professional and personal ways for a particular individual human — its “client”.  I find the idea of a superintelligent personal advocate quite appealing.  If every AI is an advobot for a particular client, and every person has an advobot, then power is distributed and the AI dangers seem greatly reduced.  But there are subtleties here involving the relationship between advocacy and truth.

I will call an advobot strongly truthful if it judges truth independent of  its advocacy.  Strongly truthful advobots face some dilemmas:

Religion. A strongly truthful advobot will likely disagree with its client on various religious beliefs.  For example evolution or the authorship of the Bible.

Politics. A strongly truthful advobot will likely disagree with its client concerning politically charged statements.  Does immigration lead to increased crime? What will the level of sea rise due to CO2 emissions be over the next 50 years?   Does socialized health care lead to longer life expectancy than privatized health care?

Competence.  A strongly truthful advobot would almost certainly disagree with its client over the client’s level of competence. Should an advobot be required to be strongly truthful when speaking to an employer about the appropriateness of their client for a particular job opening?

These examples demonstrate the strong interaction between truth and advocacy.  Much of human speaking or writing involves advocacy.  Freshman are taught rhetoric — the art of writing persuasively.  Advocating for a cause inevitably involves advocating for a belief.

It just does not seem workable to require advobots to be strongly truthful. But if advobot statements must be biased,  it might be nice to have some other AI agents as a source of unbiased judgements.  We could call these “judgebots”.  A judgebot’s mission is simply to judge truth as accurately as possible independent of any external mission or advocacy.  I do believe that the truth, or degree of truth, or truth under different interpretations, can be judged objectively.  This is certainly true of the statements of mathematics. Presumably this is true of most scientific hypotheses.  I think that it is also true of many of the statements argued over in politics and religion.  Of course judgebots need not have any legal authority — defendants could still be tried by a jury of human peers or have cases decided by a human judge.  But the judgements of superintelligent judgebots would still presumably influence people.

In addition to judging truth, judgebots could directly judge decisions relative to missions. Consider a corporation with a mission statement and a choice — say opening a plant in either city A or city B or hiring either A or B as the new CEO.  We could ask a judgebot which of A or B is most faithful to the mission — which choice is best for the corporation as judged by its stated mission.  This kind of judgement is admittedly difficult.  But choices have to be made in any case.  Who is better able to judge choices than a superintelligent judgebot?  A human corporate CEO or board of directors could retain legal control of the corporation. The board could also control and change the mission statement, or refuse to publish a mission at all. But the judgements of superintelligent judgebots relative to various mission statements (published or not) would be available to the public and the stockholders.  Judgebots would likely have influence precisely because they themselves have no agenda other than the truth.

It is possible that different judgebots would have access to different data and computational resources.  Advobots would undoubtedly try to influence the judgebots by controlling the data and computation resources.  But it would also be possible to require that the resources underlying every judgebot judgement be public information.  A judgebot with all available data and large computational resources would intuitively seem most reliable — a good judge listens to all sides and thinks hard.

But as Pilot said to Jesus, what is truth?  What, at an operational level, is the mission of a judgebot?  That is a hard one.  But if we are going to build a superintelligence I believe we will need an answer.

## Predictive Coding and Mutual Information

Many people believe that unsupervised and predictive learning are currently the most important problems in deep learning.  From an AGI perspective we have to find some way of getting the machines to learn NLP semantics (and knowledge representation and reasoning) from unlabeled data. We do not know how to label semantics.

This post is based on two papers, my own note from February, Information-Theoretic Co-Training, and a paper from July,  Representation Learning with Contrastive Predictive Coding by Aaron van den Oord, Yazhe Li and Oriol Vinyals.  These two papers both focus on mutual information for predictive coding.   van den Oord et al. give empirical results which include what appears to be the best ImageNet pre-training results to date by a large margin.  Speech and NLP results are also reported. My paper gives a different model and is less developed.  Each approach has pros and cons.

Why Information Theory?

For me, the fundamental equation of deep learning is

$\Phi^* = \mathrm{argmin}_\Phi \;E_{(x,y)\sim{\cal P}}\; -\log P_\Phi(y|x)$

This is the classical log-loss where ${\cal P}$ is a population distribution and $P_\Phi(y|x)$ is the conditional probability of a model with parameters $\Phi$.  Here we think of $y$ as a label for input $x$.  For example, $x$ can be an image and $y$ some kind of image label. These days log loss goes by the name of cross-entropy loss.

$E_{(x,y)\sim {\cal P}} \;- \log P_\Phi(y|x) = E_{x \sim {\cal P}} \;H({\cal P}(y|x),\;P_\Phi(y|x))$

$H(P,Q) \equiv E_{y \sim P} -\log Q(y)$

$= H(P) + KL(P,Q)$

Here $H(P,Q)$ is the cross-entropy from $P$ to $Q$. If information theoretic training objectives are empirically effective in supervised learning, they should also be effective in unsupervised learning.

Why Not Direct Cross-Entropy Density Estimation?

An obvious approach to unsupervised learning is density estimation by direct cross-entropy minimization.

$\Phi^* = \mathrm{argmin}_\Phi \;E_{x \sim {\cal P}}\; -\log P_{\Phi}(x)$

$= \mathrm{argmin}_\Phi\;H({\cal P},P_\Phi)$

This is the perplexity objective of language models.  Language modeling remains extremely relevant to NLP applications.  The hidden states of LSTM language models yield “contextual word vectors” (see ELMO).  Contextual word vectors seem to be  replacing more conventional word vectors in a variety of NLP applications including leaders on the SQUAD question answering leader board.  Contextual word embeddings based on the hidden states of a transformer network language model seem to be an improvement on ELMO and lend further support to the transformer network architecture. (Deep learning is advancing at an incredible rate.)

The success of language modeling for pre-training would seem to support direct density estimation by cross-entropy minimization.  So what is the problem?

The problem is noise.  An image contains a large quantity of information that we do not care about.  We do not care about each blade of grass or each raindrop or the least significant bit of each color value.  Under the notion of signal and noise discussed below, sentences have higher signal-to-noise ratios than do images.  But sentences still have some noise. A sentence can be worded in different ways and we typically do not remember exact phrasing or word choice. From a data-compression viewpoint we want to store only the important information.  Density estimation by direct cross-entropy minimization models noise.  It seems intuitively clear that the modeling of the signal can get lost in the modeling of the noise, especially in low signal-to-noise settings like images and audio signals.

Mutual Information: Separating Signal from Noise

The basic idea is that we define “signal” to be a function of sensory input that carries mutual information with future sensation. This is apparently a very old idea.  van den Oord et al. cite a paper on “predictive coding” by Peter Elias from 1955.  This was a time shortly after Shannon’s seminal paper when information theory research was extremely active.  van den Oord et al. give the following figure to visualize their formulation of predictive coding.

Here $x_t$ is some kind of perceptual input like a frame of video, or a frame of speech, or a sentence in a document, or an article in a thread of related articles.  Here $c_t$ is intended to be  a high level semantic representation of the past perception and $z_{t+k}$ is intended to be high level semantic representation of a future perception.   My paper and van den Oord et al. both define training objectives centered on maximizing the mutual information

$I(c_t,z_{t+k}|k) = H(z_{t+k}) - H(z_{t+k}|c_t,k)\;\;\;\;\;\;\mathrm{(1)}$

Intuitively, we want to establish predictive power — demonstrate that observations at the present time provide information (predictive power) about the future and hence reduces the uncertainty (entropy) of the future observations.  The difference between the entropy before the observation and the entropy after the observation is one way of defining mutual information.  If we think of $I(x,y)$ as the amount of “signal” in $x$ about $y$ then we might define a purely information theoretic signal-to-noise ratio as $I(x,y)/H(x)$.  But I am on shaky ground here — there should be a reference for this.

There is an immediate problem with the mutual information training objective (1). The objective $I(c_t,z_{t+k})$ can be maximized simply by taking $c_t$ to be the input sequence $x_0,\;\ldots,\;x_t$ and taking $z_{t+k}$ to be the input $x_{t+k}$.  So we need additional constraints that force $c_t$ and $z_t$ to be higher level semantic representations. The two papers take different approaches to this problem.

My paper suggests bounding the entropy of  $z_t$ by restricting it to a finite set, such as a finite length sequence of tokens from a finite vocabulary.  ven den Oord et al. assume that both $c_t$ and $z_{t+k}$ are vectors and require that a certain discriminative task be solved by a linear classifier.  Requiring that $z_t$ be used in linear classifiers should force it to expose semantic properties.

A Naive Approach

My paper takes a naive approach.  I work directly with equation (1).  The model includes a network defining a coding probability $P_\Psi(z_t|x_t)$ where $z_t$ is restricted to a (possibly combinatorially large) finite set.  Equation (1) can then be written as follows where the probability distribution on the sequence $z_1,\ldots,z_{t+k}$ is determined by the coding distribution parameter $\Psi$.

$I((z_1,\ldots,z_t),\;z_{t+k}) = H_\Psi(z) - H_{\Psi}(z_{t+k}\;|\;z_1,\ldots,z_t)$

We now introduce two more models used to estimate the above entropies.  A model $P_\Theta(z)$ for estimating the first entropy  and and a model $P_\Phi(z_{t+k}\;|\;(z_1,\ldots,z_t))$ for the second.  We then define a nested optimization problem

$\Psi^* = \mathrm{argmax}_\Psi\;\hat{H}_{\Psi,\Theta^*(\Psi)} - \hat{H}_{\Psi,\Phi^*(\Psi)}$

$\Theta^*(\Psi) = \mathrm{argmin}_\Theta\; \hat{H}_{\Psi,\Theta}$

$\Phi^*(\Psi) = \mathrm{argmin}_\Phi\; \hat{H}_{\Psi,\Phi}$

$\hat{H}_{\Psi,\Theta} = E_{z\sim P_\Psi(z|x)}\;-\log P_\Theta(z)$

$\hat{H}_{\Psi,\Phi} = E_{(z_1,\ldots,z_{t+k}) \sim P_\Psi} \;-\log P_\Phi(z_{t+k}\;|\;z_1,\ldots z_t)$

Holding $\Psi$ fixed, the optimization objectives for $\Theta^*(\Psi)$ and $\Phi^*(\Psi)$ are just classical cross-entropy loss functions (log loss).  If $\Theta$ and $\Phi$ have been fully optimized for $\Psi$ fixed, then the gradients with respect to $\Theta$ and $\Phi$ must be zero and $\Psi$ can be updated ignoring the effect of that update on $\Theta$ and $\Phi$.  If the models are universally expressive, and the nested optimization can be done exactly, then we maximize the desired mutual information.

This naive approach has some issues.  First, the optimization problem is adversarial.  Note that the optimizations of $\Psi$ and $\Theta$ are pulling $\hat{H}_{\Psi,\Theta}$ in opposite directions. This raises all the issues in adversarial optimization.  Second, $P_\Psi(z|x)$ defines a discrete distribution raising the specter of REINFORCE (although see VQ-VAE for a more sane treatment of discrete latent variables).   Third, the model entropy estimates do not provide any guarantee on the true mutual information.  This is because model estimates of entropy are larger than the true entropy and the model estimate of the first entropy can be large due to modeling error.

The Discriminator Trick

The van den Oord et al. paper is based on a “discriminator trick” — a novel form of contrastive estimation (see also this) where a discriminator must distinguish in-context items from out-of-context items.  The formal relationship between contrastive estimation and mutual information appears to be novel in this paper. Suppose that we want to measure the mutual information $I(x,y)$ between two arbitrary random variables $x$ and $y$.  We draw $N$ pairs $(x_1,y_1),\ldots,\;(x_N,y_N)$ from the joint distribution.  We then randomly shuffle the $y$‘s and present a “discriminator” with $x_1$ and the shuffled $y$ values.  The discriminator’s job is to find $y_1$ (the value paired with $x_1$) among the shuffled values of $y$.  ven den Oord et al. show the following lower bound on $I(x,y)$ where $Y$ is the shuffled set of $y$‘s; the index $i$ is the position of $y_1$ in $Y$; and $\Phi$ is the model parameters of the discriminator.

$I(x,y) \geq \log N -L\;\;\;\;\;\mathrm{(2)}$

$L = E_{x_1,Y\sim{\cal P}}\;-\log P_\Phi(i\;|\;x_1,Y)$

I refer readers to the paper for the derivation. The discriminator is simply trained to minimize the above log loss (cross-entropy loss).  In the paper the discriminator is forced to be linear in $z$ which should force $z$ to be high level.

$P_\Phi(i\;|\;c_t,(z_1,\ldots,z_N)) = \mathrm{softmax}_i \;\; (c_t)^\top M_k z_i$

Here the networks producing $c_t$ and $z_t$ are part of the discriminator so training the discriminator trains the network for $z_t$.  The vector $z_t$ is then used as a pre-trained feature vector for the input $x_t$.

The main issue I see with the discriminator trick is that the discrimination task might be too easy when the mutual information is large.  If the discrimination task is easy the loss will be close to zero.  But by equation (2), the best possible guarantee on mutual information is $\log N$. This requires $N$ to be exponential in the size of the guarantee.

Summary

Predictive coding by maximizing mutual information feels like a very principled and promising approach to unsupervised and predictive learning.  A naive approach to maximizing mutual information involves adversarial objectives, discrete latent variables, and a lack of any formal guarantee. The discriminator trick avoids these issues but seems problematic in the case where the true mutual information is larger than a few bits.

Deep learning is progressing at an unprecedented rate.  I expect the next few years to yield considerable progress, one way or another, in unsupervised and predictive learning.

## RL and the Game of Mathematics

I recently gave a lecture on Alphago in a deep learning class.  I had of course heard of the recent accomplishments — learning in a matter of days, or even hours, and entirely from self-play, to become superhuman in go, chess and shogi.  I had not, however, appreciated the magnitude of this advance until I read the papers (Alphgo Zero and AlphaZero). Alphago Zero is about 1400 ELO points above Alphago Lee — the program that defeated Lee Sedol. The ELO scale is the one used to rate chess players — 200 ELO points corresponds to 75% chance of victory by the stronger player.  Measuring a huge gap like 1400 points requires using a sequence of versions of the program of increasing strength.  Also, Alphago Zero runs on a single machine with four GPUs.  Alphago Lee runs on a supercomputer.  Also, the algorithm underlying Alphago Lee is new and simpler with no rollouts or hand designed features.  Also, the same algorithm does the same trick in chess and shogi — quickly learning to be far superior to any previous program for these games.  Also, the introduction of a fundamentally new algorithm seems likely to open up yet more progress in RL. The new algorithm can be viewed as tree-search bootstrapping  applied to policy gradient.

Ten years ago Jonathan Schaeffer at the University of Alberta proved that checkers is a draw.  His group computed and stored full drawing strategies for each player.  This was listed by Science magazine as one of the ten greatest breakthroughs of 2007. The game of chess is also believed to be a draw, although computing full drawing strategies is completely infeasible.   Empirically, as the level of play increases draws become more common.  Computers have been super-human in chess for the last 20 years.  The prevalence of draws at super-human levels causes the ELO scale to break down. (In practice go games are never drawn.) The strongest computer program, stockfish, was thought to be unbeatable — equivalent to perfect play.  AlphaZero defeated Stockfish 25/50 games playing from white (and lost none) and defeated Stockfish 3/50 games from black (and lost none).

DeepMind’s success in RL is based on perfect simulation.  Perfect simulation is possible in Atari games and in board games such as go and chess.  Perfect simulation allows large-scale on-policy learning.  Large-scale on-policy learning is not feasible in the “game” of human dialogue (the Turing test).  We don’t know how to accurately simulate human dialogue.  Mathematics, on the other hand, can be formulated as a game with rules defined by a formal foundational logic.  This immediately raises the question of whether the recent dramatic results in RL for games might soon lead to super-human open-domain mathematicians.

The foundations of mathematics has been largely ignored by AI researchers.  I have always felt that the foundations — the rules of “correct thought” — reflect an innate cognitive architecture.  But independent of such beliefs, it does seem possible to formulate inference rules defining the general notion of proof accepted by the mathematics community.  For this we have ZFC or type theory.  More on this later.  But in addition to rules defining moves, we need an objective— a reward function.  What is the objective of mathematics?

I believe that a meaningful quantitative objective for mathematics must involve an appropriate formal notion of a mathematical ontology — a set of concepts and statement made in terms of them. It is easy to list off  basic mathematical concepts — groups, rings, fields, vector spaces, Banach spaces, Hilbert spaces, manifolds, Lie algebras, and many more.  For some concepts, such as the finite simple groups or the compact two manifolds, complete classification is possible — it is possible to give a systematic enumeration of the instances of the concept up to isomorphism. Formulating an objective in terms of a concept ontology requires, of course, being clear (formal) about what we mean by a “concept”.  This brings us to the importance of type theory.

Since the 1920’s (or certainly the 1950s) the mathematics community has generally accepted Zermello-Fraenkel set theory with the axiom of choice (ZFC) as the formal foundation.  However, the ontology of mathematics is clearly organized around the apparent fact that the instances of a concept are only classified up to the notion of isomorphism associated with that concept. The classification of finite simple groups or compact two manifolds is done up to isomorphism. The fact that each concept comes with an associated notion of isomorphism has never been formalized in the framework of ZFC.  To formalize concepts and their associated notion of isomorphism we need type theory.

I have been working for years on a set-theoretic type theory — a type theory fully compatible with the ZFC foundations.  Alternatively there is constructive type theory and its extension to homotopy type theory (HoTT).  I will not discuss the pros and cons of different type theories here except to say that I expect mathematicians will be more accepting of a type theory fully compatible with the grammar of informal mathematical language as well as being compatible with the currently accepted content of mathematics.

Type theory gives a set of formal rules defining the “well-formed” concepts and statements.  I believe that a meaningful objective for mathematics can then be defined by an a-prior probability distribution over type-theoretically well-formed mathematical questions.  The well-formedness constraint ensures that each question is meaningful. We do not want to ask if 2 is a member of 3. The objective is to be able to answer the largest possible fraction of these well-formed questions.

Of course the questions under investigation in human mathematics evolve over time and this evolution is clearly governed by a social process.  But I claim that this is consistent with the existence of an objective notion of mathematical significance.  A socially determined distribution over questions can be importance-corrected (in the statistical sense) in judging objective significance. The use of a sampling distribution different from the prior, which is then importance corrected, seems likely to be useful in any case.

Of course I am not the only one to be considering deep learning for mathematical inference.  There have now been three meetings of the Conference on AI and Theorem Proving.  Many of the papers in this conference propose neural architectures for “premise selection” — the problem of selecting facts from a mathematical library that are likely to be useful for a given problem.  Christian Szegedy, a co-author of batch-normalization and a principal developer of Google’s inception network for image classification, has been attending this conference and working on deep learning for theorem proving [1,2,3].  I myself have attended the last two meetings and have found them useful in framing my own thinking on deep architectures for inference.

In summary, DeepMind’s dramatic recent success in RL raises the question of whether RL can produce a super-human player of the game of mathematics.  I hereby throw this out as a challenge to DeepMind and to the AI community generally.  A super-human open-domain mathematician would seem a major step toward AGI.

Posted in Uncategorized | 4 Comments

## The Role of Theory in Deep Learning

This blog post is inspired by the recent NIPS talk by Ali Rahimi and the response by Yann LeCun.  The issue is fundamentally the role of theory in deep learning.   I will start with some quotes from Rahimi’s talk.

Rahimi:

Machine learning has become alchemy.

Alchemy worked [for many things].

But scientists had to dismantle two thousand years worth of alchemical theories.

I would like to live in a society whose systems are built on verifiable rigorous knowledge and not on alchemy.

LeCun:

Understanding (theoretical or otherwise) is a good thing.

[However, Rahimi’s statements are dangerous because] it’s exactly this kind of attitude that lead the ML community to abandon neural nets for over 10 years, despite ample empirical evidence that they worked very well in many situations.

I fundamentally agree with Yann that a drive for rigor can mislead a field.  Perhaps most dangerous is the drive to impress colleagues with one’s mathematical sophistication rather than to genuinely seek real progress.

But I would like to add my own spin to this debate.  I will start by again quoting Rahini:

Rahini:

[When a deep network doesn’t work] I think it is gradient descent’s fault.

Gradient descent is the cornerstone of deep learning. Gradient descent is a form of local search.  Here are some other examples of local search:

The evolution of the internal combustion engine from the 1790s through the twentieth century.

The evolution of semiconductor processes over the last fifty years of Moore’s law.

Biological evolution including the evolution of the human brain.

The evolution of mathematics from the ancient Greeks to the present.

The hours of training alphago(zero) takes to become the world’s strongest chess program through self play.

Local search is indeed mysterious.  But can we really expect a rigorous theory of local search that predicts or explains the evolution of the human brain or the historical evolution of mathematic knowledge?  Can we really expect to predict by some sort of second order analysis of gradient descent what mathematical theories will emerge in the next twenty years?  My position is that local search (gradient descent) is extremely powerful and fundamentally forever beyond any fully rigorous understanding.

Computing power has reached the level where gradient descent on a strong architecture on a strong GPU can only be understood as some form of very powerful general non-convex local search similar in nature to the above examples.  Yes, the failure of a particular neural network training run is a failure of gradient descent (local search).  But that observation provides very little insight or understanding.

A related issue is one’s position on the time frame for artificial general intelligence (AGI).  Will rigor help achieve AGI?  Perhaps even Rahini would find it implausible that a rigorous treatment of AGI is possible.  A common response by rigor-seekers is that AGI is too far away to talk about.  But I find it much more exciting to think we are close. I have written a blog post on the plausibility of near-term machine sentience.

I do believe that insight into architecture is possible and that such insight can fruitfully guide design.  LSTMs appeared in 1997 because of a “theoretical insight” about a way of overcoming vanishing gradients.  The understanding of batch normalization as a method of overcoming internal covariate shift is something I do feel that I understand at an intuitive level (I would be happy to explain it).  Intuitive non-rigorous understanding is the bread and butter of theoretical physics.

Fernando Pereira (who may have been quoting someone else) told me 20 years ago about the “explorers” and the “settlers”.  The explorers see the terrain first (without rigor) and the settlers clean it up (with rigor).  Consider calculus or Fourier analysis. But in the case of local search I don’t think the theorems (the settlers) will ever arrive.

Progress in general local search (AGI) will come, in my opinion, from finding the right models of computation — the right general purpose architectures — for defining the structure of “strong” local search spaces.  I have written a previous blog post on the search for general cognitive architectures.  Win or lose, I personally am going to continue to pursue AGI.

Posted in Uncategorized | 7 Comments

## Choice as a Natural Kind Term

This is a sequel to my previous post on determinism, free will and the existence of choice.  Here I want to consider the semantics of the word “choice” from the perspective of lexical semantics generally.  I will focus on the concept of a natural kind as formulated by Kripke, Putnam and Donnellan and which I encountered long ago in the introduction to the volume naming necessity and natural kinds.

The basic idea is that a word like “gold” or “lemon” is used meaningfully without any precise definition.  There is a substance that we call gold and the existence of gold was known, and gold was correctly referred to, long before the acceptance of the atomic theory.  Similarly one can meaningfully refer to “lemons” or “dogs” without knowing all the scientific facts and criteria that might be used in attempting to give a precise definition.  A term which is used meaningfully, and for practical purposes does have a well defined referent, but where the detailed nature of the referent is unknown, is called a natural kind term.

Of course we can scientifically investigate the nature of gold or lemons.  This does not mean that the referent can be changed.  The term gold had a well defined referent in roman times and it would be scientific folly to investigate the nature of gold by changing the meaning (or referent) of the term.

Now the issue here is whether the term “choice”, like the term “gold”, has a meaning by virtue of its colloquial usage.  To quote Wittgenstein, the meaning is the use.  It would be scientific folly to study the nature of “choice” by redefining the word to mean something else.  In particular, consider the claim that a deterministic computer does not have choice.  One should recognize that when we talk about choices we are referring to something, just as the word “gold” refers to something.  I claim that statements about the nature of choice can be true or false in the same way that statements about the nature of gold can be true or false.  Furthermore, I claim that the statement that a deterministic computer cannot have choice is scientifically false in the same way that the statement that gold is a compound (like salt) is scientifically false.  Does alphago make choices?

But if “choice”, like “gold”, is a natural kind term, what are choices really?  The notion of choice seems intimately related to the computational process of decision making.  Choices are the options considered in decision making.  The notion of choice, or option, seems central to decision theory and central to the understanding of any decision making process whether or not that process is truly stochastic or is classically deterministic.  The sentence “I decided to cancel the interview” seems to have a clear (if colloquial) meaning concerning some internal cognitive decision making process.  The statement that an option existed and a decision was made seems perfectly compatible with a deterministic decision making process (as in a chess program).

Furthermore, whatever choices are, it seems that a sentence of the form “x was an option” has truth conditions — it can be true in some situations and not in others.  The truth conditions may be complex and subtle, but the truth conditions, whatever they are, seem orthogonal to the issue of whether the decision making computation is deterministic.

It seems to me that “being free to choose” cannot be distinguished from “having options”.  The deterministic chess program does have options — the set of legal moves exists and defines the options.  We humans largely understand when we “have options” which we appropriately take to be synonymous with “being free to choose”.  These words refer to something real just as “gold” refers to something real.

The truth conditions of “x was an option” seem related to the truth conditions of counterfactuals such as “if I had chosen to do x then y would have happen”.  Such statements have clear meanings in the case of formal games. Such statements presumably also have truth conditions in normal human circumstances.

Choices are, by definition, the things people refer to when they talk about choices and options.  We can investigate what these things (choices) are, but we should avoid trying to redefine the referent of this natural kind term.  The meaning is the use.

## CTC and the EG Algotithm: Discrete Latent Choices without Reinforcement Learning

Section 20.9.1 of Goodfellow, Bengio and Courville is titled “Back-Propagating through Discrete Stochastic Operations”.  It discusses the REINFORCE algorithm which requires sampling sequences of discrete latent choices.  It is well known that sampling choices yields a high variance gradient estimate.  While such sampling seems necessary for reinforcement learning applications (such as games) there are other settings where one can train discrete latent choices using a closed form gradient and avoid high variance choice sampling.  This is done in the CTC algorithm (Connectionist Temporal Classification, Graves et al. 2006).  CTC is arguably the currently dominant approach to speech recognition.  It is mentioned in Goodfellow et al. in the section on speech recognition but not in the general context of training discrete latent choices.

This post is also a sequel to the previous post on variational auto-encoders and EM.  In CTC, as in various other models, we are interested in a marginal distribution over latent variables.  In this case we have training pairs $(x, y)$ and are interested in maximizing the marginal distribution.

$\Theta^* = \mathrm{argmax}_\Theta\; E_{(x,y)}\;\ln P_\Theta(y|x)$

$= \mathrm{argmax}_\Theta \; E_{(x,y)}\;\ln\; \sum_z \;P_\Theta(y,z|x)$

In CTC $x$ is a speech signal, $y$ is a (time-compressed) phonetic transcription, and $z$ is a sequence of latent “emission decisions”.  This is described in more detail below.

The previous post noted that both VAEs and EM can be phrased in terms of a unified objective function involving an “encoder distribution” $P_\Psi(z|x,y)$.  Here we have made everything conditional on an input $x$.

$\Psi^*,\Theta^* = \mathrm{argmax}_{\Psi,\Theta}\;E_{(x,y)}\;{\cal L}(\Psi,\Theta,x,y)$

${\cal L}(\Psi,\Theta,x,y) = E_{z \sim P_\Psi(z|x,y)}\; \ln P_\Theta(z,y|x) + H(P_\Psi(z|x,y))\;\;\;(1)$

$= \ln\;P_{\Theta}(y|x) - KL(P_\Psi(z|x,y),P_{\Theta}(z|x,y))\;\;\;(2)$

The equivalence between (1) and (2) is derived in the previous post.  In EM one performs alternating optimization where (2) is used to optimize the encoder distribution $P_\Psi(z|x,y)$ while holding the model distribution $P_\Theta(y,z|x)$ fixed, and (1) is used to optimize model distribution $P_\Theta(z,y)$ while holding the encoder distribution $P_\Psi(z|x,y)$ fixed.  In EM each of these optimizations is done exactly in closed form yielding an exact alternating maximization algorithm.  In CTC the E step optimization is computed exactly in closed form but the M step is replaced by an exact closed form calculation of the gradient of (1) with respect to $\Theta$.  I will call this the EG algorithm (expected gradient).

Example: A Deep HMM Language Model: A deep language model is typically an auto-regressive RNN1.

[Andrej Karpathy]

Here the model stochastically generates a word and then uses the generated word as input in computing the next hidden state.  In contrast to a classical finite-state HMM, the hidden state is determined by the observable word sequence.  In a classical HMM many different hidden state sequences can generate the same word sequence — the hidden states are latent.  We can introduce discrete latent hidden states $H_0, \ldots, H_T$ such that the full hidden state is a pair $(h_t, H_t)$ where $h_t$ is an RNN state vector determined by observable words and $H_t$ is a latent discrete state not determined by the observable sequence.  We allow the discrete hidden state transition probabilities to depend on the hidden vector of an RNN.  In equations we have

$h_{t+1} = \mathrm{RNNcell}_\Theta(h_t, e(w_t))$

$P_\Theta(H_{t+1} \;|\;w_1,\ldots,\;w_t,\;H_1,\ldots,\;H_t) = P_\Theta(H_{t+1}\;|\;h_{t+1},e(H_t))$

$P_\Theta(w_{t+1} \;|w_1,\ldots,w_t,\;H_1,\ldots,\;H_t) = P_\Theta(w_{t+1}\;|\;h_{t+1},e(H_{t+1})).$

Here $\Theta$ is the system of network parameters and $e(w)$ and $e(H)$ are vector embeddings of the word $w$ and discrete state $H$. This model allows $P_\Theta(w_1,\ldots,w_T)$ to be computed by dynamic programming over the possible hidden state sequences using the forward-backward procedure.  The gradient of the probability can also be computed in closed form and we can do SGD without sampling the discrete hidden states.  For dynamic programming to work it is important that the state vector $h_t$ is determined by $h_0$ and the observed words $w_1,\ldots,w_{t-1}$ and is not influenced by the hidden state sequence $H_1,\ldots,H_{t-1}$.

The general case: The general EG algorithm alternates an E step, which is a exact optimization as in EM, and a G step in which we take a gradient step in $\Theta$ defined by computing the gradient of (1):

$\nabla_\Theta {\cal L}(\Psi,\Theta,y) = E_{(x,y)}\;E_{z \sim P_\Psi(z|x,y)}\; \nabla_\Theta\;\ln\;P_\Theta(y,z|x)\;\;\;\;\;(3)$

By (2), and the fact that that the gradient of the KL divergence is zero when $P_\Psi(z|y) = P_\Theta(z|y)$, for the G step we also have

$\nabla_\Theta {\cal L}(\Psi, \Theta,y) = \nabla_\Theta \ln P_\Theta(y|x)\;\;\;(4)$

Combining (3) and (4) we see that that the gradient defined by (3) in the G step equals the gradient of the top level objective.

We now consider the case where $P_\Theta(z,y|x)$ is defined by a generative process where the pair $(z,y)$ is generated stochastically by a series of discrete decisions.  Each decision is assumed to be a draw from a certain probability distribution. I will write $Q \leadsto j$ for the decision (event) of drawing $j$ from distribution $Q$ .  Every decision made in the above HMM has the form

$P_\Theta(H|h_{t+1},e(H^i)) \leadsto H^j$

where $H^i$ and $H^j$ are particular discrete hidden states.  For a sequence of length $T$ and for $n$ discrete hidden states we have a total of $Tn^2$ possible choice events and these choice events form the edges of an HMM trellis.  Note that it is important that $h_t$ does not depend on the patch through the the trellis.

But getting back to the general case let ${\cal C}(z,y|x)$ be the choices made in generating $(z,y)$ under the model $P_\Theta(z,y|x)$.  We then have

$\ln P_\Theta(z,y|x) = \sum_{(Q \leadsto j) \in {\cal C}(z,y|x)}\;\ln P(j\;|\;Q).$

We now have

$\nabla_\Theta \ln P_\Theta (y|x) = E_{z \sim P_\Theta(z|x,y)}\;\nabla_\Theta \;\sum_{(Q \leadsto j) \in {\cal C}(y,z|x)} \;\; \ln P(j\;|\;Q)\;\;\;\;(5)$

Now let ${\cal C}(y|x) = \bigcup_z \;{\cal C}(y,z|x)$.  In the HMM example ${\cal C}(y|x)$ is the set of all edges in the HMM trellis while ${\cal C}(z,y|x)$ is one particular path through the trellis.  Also let $1((Q\leadsto j) \in {\cal C}(z,y|x))$ be the indicator function that the choice $Q \leadsto j$ occurs when generating $(z,y)$ from $x$. We can now rewrite (5) as

$\nabla_\Theta \ln P_\Theta(y|x) = \sum_{(Q \leadsto j) \in{\cal C}(y|x)} \left(E_{z \sim P_\Theta(z|x,y)} \; 1((Q \rightarrow j)\in{\cal C}(y,z|x))\right)\; \nabla_\Theta\;\ln P_\Theta(j|Q)$

The expectation $E_z 1((Q \leadsto j) \in {\cal C}(y,z|x))$ are assumed to be computable by dynamic programming.  In the HMM example these are the edge probabilities in the HMM lattice.

CTC: In CTC we are given a sequence $x_1, \ldots, x_T$ (typically acoustic feature vectors sampled at 100 frames per second) and we must output a sequence of symbols (typically letters or phonetic symbols) $y_1, \ldots, y_N$ with $N \leq T$ and typically $N << T$.  More precisely, for network parameters $\Theta$ the model defines $P_\Theta(y_1,\ldots,y_N|x_1,\ldots,x_T)$.  However, in the model there is a latent sequence $\hat{y}_1,\ldots,\hat{y}_T$ where each $\hat{y}_T$ is either an output symbol or the special symbol $\bot$.  The sequence $y_1,\ldots,y_N$ is derived from $\hat{y}_1,\ldots,\hat{y}_T$ by removing the occurrences of $\bot$. The model can be written as

$h_{t+1} = \mathrm{RNNcell}_\Theta(h_t, x_t)$

$P\Theta(\hat{y}_{t+1}|x_1,\ldots,x_t) = P_\Theta(\hat{y}_{t+1}|h_{t+1})$

We can use the EG algorithm to compute the gradient of $\ln P_\Theta(y_1,\ldots,y_N|x_1,\ldots,x_T)$ where EG sums over all possible sequences $\hat{y}_1,\ldots,\hat{y}_T$ that reduce to $y_1,\ldots,y_N$.  I leave it to the reader to work out the details of the dynamic programming table. For the dynamic programming to work it is important that $h_t$ does not depend on the latent sequence $\hat{y}_1,\ldots,\hat{y}_T$.

Footnotes:

1. A week after this post was originally published Gu et al. published a non-autoregressive language model.
Posted in Uncategorized | 1 Comment

## VAE = EM

I recently realized the connection between the expectation maximization algorithm (EM) and variational autoencoders (VAE).  Both optimize the same objective function where VAE performs gradient descent based on a sampling estimate of the gradient while EM performs exact alternating maximization in models where this is possible.  It turns out that the alternating optimization view of EM is described in section 9.4 of Bishop’s 2006 text on machine learning and also in section 19.2 of the text by Goodfellow, Bengio and Courville.   I had overlooked this connection between EM and VAEs, and it seems important, so here is a blog post that helps me, at least, organize my thinking.

First some background.

The expectation maximization (EM) algorithm is a time-honored cornerstone of machine learning.  Its general formulation was given in a classic 1977 paper by Dempster, Laird and Rubin, although many special cases had been developed prior to that.  The algorithm is used for estimating unknown parameters of a probability distribution involving latent variables.  The algorithm is usually introduced by looking at the special case of modeling a collection of points by a mixture of Gaussians.  Here we are given points where the model assumes that each point came from an unlabeled component of the mixture (the component associated with the point is a latent variable). Given the points one must estimate a component weight, a mean and a covariance matrix for each component of the mixture.

This is a non-convex optimization problem. EM iteratively re-estimates the parameters with a guarantee that for each iteration, unless one is already at a parameter setting with zero probability gradients (a stationary point), the probability of the points under the model improves.  There are numerous important special cases such as learning the parameters of a hidden Markov model or a probabilistic context free grammar.

Variational autoencoders (VAEs) (tutorial) were introduced in 2014 by Kingma and Welling in the context of deep learning.  Autoencoding is related to compression.  JPEG compresses an image into an encoded form that uses less memory but can then be decompressed to get the image back with some loss of resolution.

$x \hspace{1.0in}P_\Psi(z|x) \hspace{1.0in} z \hspace{1.0in} P_\Theta(z,x)\hspace{1.0in}x'$

[Kevin Franz]

Information theoretically, compression is closely relate to distribution modeling. Shannon’s source coding theorem states that for any probability distribution, such as the distribution on “natural” images, one can (block) code the elements using a number of bits per element equal to the information-theoretic entropy of the distribution.  In the above figure the average number of bits in the compressed form in the middle should be equal to the entropy of the probability distribution over images.

VAEs, however, do not directly address the compression problem.  Instead they focus on fitting the parameters $\Theta$ of the distribution $P_\Theta(z,x)$ so as to maximize the marginal probability $P_\Theta(x)$ of a given set $x$ of samples (such as a set of points or a set of images). The encoder is just a tool to make this parameter fitting more efficient. The problem being solved by a VAE is the same as the problem being solved by EM — fitting the parameters of a probability distribution to given data where the model includes latent variables not specified in the data.

VAE = EM. Both EM and VAEs are formulated in terms of the following objective function on the two distributions $P_\psi(z|y)$ (the encoder) and $P_\Theta(y,z)$ (the model defining $P(z)$ and $P(y|z)$).

$\Psi^*,\Theta^* = \mathrm{argmax}_{\Psi,\Theta}\;\sum_y\;{\cal L}(\Psi,\Theta,y)$

${\cal L}(\Psi,\Theta,y) = E_{z \sim P_\Psi(z|y)}\; \ln P_\Theta(z,y) + H(P_\Psi(z|y))\;\;\;(1)$

$= \ln\;P_{\Theta}(y) - KL(P_\Psi(z|y),P_{\Theta}(z|y))\;\;\;(2)$

Here $y$ is summed over training data. The equivalence of  (1) and (2) is important as they provide different outlooks on the objective.   The equivalence is implied by the following observation.

$E_{z \sim P_\Psi(z|y)} \;\ln P_\Theta(z,y)$
$= E_{z \sim P_\Psi(z|y)} \;\ln P_\Theta(y)P_\Theta(z|y)$
$= E_{z \sim P_\Psi(z|y)}\;\ln P_\Theta(y) + \;E_{z \sim P_\Psi(z|y)}\;\ln P_\Theta(z|y)$
$= \ln P_\Theta(y) + \;E_{z \sim P_\Psi(z|y)}\;\ln P_\Theta(z|y)$

EM can be defined as alternating optimization of $\sum_y {\cal L}(\Psi,\Phi,y)$ where (2) supports the optimization of $\Psi$ (the E step) and (1) supports the optimization of $\Theta$ (the M step).  The EM algorithm applies to models where both the E step and the M step can be solved efficiently exactly and for such models each M step is guaranteed to improve $\sum _y \ln P_\Theta(y)$.  For these models this alternating optimization is typically far superior to any form of gradient descent on $\sum_y \ln P_\Theta(y)$.

VAEs are used in models where the alternating optimization cannot be done efficiently in closed form.  A VAE performs gradient descent on (1) where the gradient is estimated by sampling $z$ from the encoder $P_\Psi(z|y)$.  This typically involves a “reparameterization trick”.  One represents $P_\Psi(z|y)$ by taking $z = g_\Psi(y,\eta)$ where $\eta$ is a random variable of a fixed distribution. We can then rewrite (1) as

${\cal L}(\Psi,\Theta,y) = E_{\eta}\; \ln P_\Theta(g_\Psi(y,\eta),y) + H(P_\Psi(g_\Psi(y,\eta)|y))\;\;\;(3)$

It is common to take $P(\eta)$$P_\Psi(z|y)$, $P_\Theta(z)$ and $P_\Theta (y|z)$ to all be high dimensional Gaussians each represented by a mean and a (diagonal) covariance . The noise variable $\eta$ is taken to be a fixed zero mean isotropic Gaussian while $P_\Psi(z|y)$, $P_\Theta(z)$ and $P_\Theta (y|z)$ have means and covariances computed by deep networks.  We can now calculate the gradient of (3) for any given sample of $\eta$.

It is not clear whether the term “variational auto encoder” should imply the use of Gaussians.  Goodfellow et al. seem to take the position that VAEs are defined by gradient descent on (1) where the gradient is estimated by sampling $z$ from $P_\Psi(z|y)$.  This does not necessarily involve Gaussians.  A version of HMM-VAEs has recently been formulated which includes discrete latent HMM states.  However the observed variable $y$ is still continuous and Gaussians are still involved.  Presumably one can formulate VAEs where both $y$ and $z$ are structured discrete variables, such a language model with discrete latent states.  These VAEs would presumably not involve Gaussians.

Other approaches to optimizing ${\cal L}(\Psi,\Theta,y)$ are possible. The speech recognition algorithm CTC uses an exact E step but a gradient M step and achieves gradient descent on deep networks with discrete latent variables but without latent variable sampling.  See the following blog post for a discussion of the CTC approach which we might call the EG (expected gradient) algorithm.