A Consistency Theorem for BERT

Last week I saw Noah Smith at a meeting in Japan where he mentioned that BERT seemed related to pseudo-likelihood.  After some thought I convinced myself that this observation should lead to a consistency theorem for BERT.  With an appropriate Google query I found 1902.04904 by Wang and Cho which also points out the connection between BERT and pseudo-likelihood. Wang and Cho view BERT as a Markov random field (MRF) and use Gibbs sampling to sample sentences. But Wang and Cho do not mention consistency.   This post explicitly observes that BERT is consistent as a language model — as a probability distribution over full sentences.

The classical proof of consistency for pseudo-likelihood assumes that the actual population distribution is defined by some setting of the MRF weights.  For BERT I will replace this assumption with the assumption that the deep model is capable of exactly modeling the various conditional distributions.  Because deep models are intuitively much more expressive than linear MRFs over hand-designed features,  this deep expressivity assumption seems much weaker than the classical assumption.

In addition to assuming universal expressivity,  I will assume that training finds a global optimum.  Assumptions of complete optimization currently underlie much of our intuitive understanding of deep learning.  Consider the GAN consistency theorem. This theorem assumes both universal expressivity and complete optimization of both the generator and the discriminator.  While these assumptions seem outrageous, the GAN consistency theorem is the source of the design of the GAN architecture.  The value of such outrageous assumptions in architecture design should not be underestimated.

For training BERT we assume a population distribution over blocks (or sentences) of $k$ words $\vec{y}=y_1,\;\ldots,y_k$.  I will assume that BERT is trained by blanking a single word in each block.  This single-blank assumption is needed for the proof but seems unlikely to matter in practice.  Also, I believe that the proof can be modified to handle XLNet which predicts a single held-out subsequence per block rather than multiple independently modeled blanks.

Let $\Phi$ be the BERT parameters, let $\vec{y}/i$ denote the block $\vec{y}$ with the $i$th word blanked, and let $Q_\Phi(y_i|\vec{y}/i)$ be the distribution on words that BERT assigns to the $i$th word when the $i$th word is blanked.  The training objective for BERT is

$\begin{array}{rcl} \Phi^* & = & \mathrm{argmin}_\Phi\;\;E_{\vec{y} \sim \mathrm{Pop},\;i \sim 1,\ldots,k}\;\;-\ln\;Q_\Phi(y_i|\vec{y}/i) \\ \\ & = & \mathrm{argmin}_\Phi \;\frac{1}{k} \sum_{i=1}^k\;E_{\vec{y}\sim\mathrm{Pop}}\;-\ln\;Q_\Phi(y_i|\vec{y}/i) \\ \\ & = &\mathrm{argmin}_\Phi\;\sum_{i=1}^k \;H(\mathrm{Pop}(y_i),Q_\Phi(y_i)\;|\;\vec{y}/i) \end{array}$

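where $H(P(y),Q(y)\;|\;x)$ denotes cross-entropy conditioned on $x$.  The constant factor $\frac{1}{k}$ is dropped in the last line because it does not change the argmin.  Each cross-entropy term is individually minimized when $Q_\Phi(y_i|\vec{y}/i) = \mathrm{Pop}(y_i|\vec{y}/i)$.  Our universality assumption is that there exists a single $\Phi$ that matches all of these population conditionals simultaneously.  Such a $\Phi$ minimizes every term of the sum at once, so it is a global optimum of the objective.  Under this assumption, and because training is assumed to find a global optimum, we have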

$Q_{\Phi^*}(y_i|\vec{y}/i) = \mathrm{Pop}(y_i|\vec{y}/i)$

for all $i$ and $\vec{y}$.
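
The claim that each cross-entropy term is minimized by the population conditional can be checked with the standard decomposition of cross-entropy into entropy plus KL divergence.  The following is a sketch in the notation above; the symbol $KL$ and the conditional entropy $H(\mathrm{Pop}(y_i)\;|\;\vec{y}/i)$ are not used elsewhere in this post and are introduced here only for illustration.

$\begin{array}{rcl} H(\mathrm{Pop}(y_i),Q_\Phi(y_i)\;|\;\vec{y}/i) & = & E_{\vec{y}\sim\mathrm{Pop}}\;-\ln\;Q_\Phi(y_i|\vec{y}/i) \\ \\ & = & H(\mathrm{Pop}(y_i)\;|\;\vec{y}/i) + E_{\vec{y}\sim\mathrm{Pop}}\;\ln\frac{\mathrm{Pop}(y_i|\vec{y}/i)}{Q_\Phi(y_i|\vec{y}/i)} \\ \\ & = & H(\mathrm{Pop}(y_i)\;|\;\vec{y}/i) + E_{\vec{y}\sim\mathrm{Pop}}\;KL(\mathrm{Pop}(y_i|\vec{y}/i),\;Q_\Phi(y_i|\vec{y}/i)) \end{array}$

The first term does not depend on $\Phi$, and the KL term is nonnegative and equals zero exactly when the two conditionals agree.  Strictly speaking this pins down $Q_{\Phi^*}(y_i|\vec{y}/i)$ only on contexts $\vec{y}/i$ that have nonzero population probability, so "for all $\vec{y}$" is cleanest if one additionally assumes that $\mathrm{Pop}$ gives every block nonzero probability.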

I must now define the language model (full sentence distribution) defined by $Q_{\Phi^*}$.  For this I consider Gibbs sampling: the stochastic process defined on $\vec{y}$ by randomly selecting $i$ and replacing $y_i$ by a sample from $Q_{\Phi^*}(y_i|\vec{y}/i)$.  The language model is now defined to be the stationary distribution of this Gibbs sampling process. But this Gibbs process is the same process as Gibbs sampling using the population conditionals, and $\mathrm{Pop}$ is a stationary distribution of that process.  The stationary distribution is unique when the process is ergodic, which holds for example when $\mathrm{Pop}$ assigns nonzero probability to every block.  Therefore the stationary distribution must be $\mathrm{Pop}$.  Q.E.D.
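
The Gibbs argument can be checked numerically on a toy example.  The sketch below is not BERT: it assumes a made-up population over blocks of $k=3$ words from a two-word vocabulary, uses the exact population conditionals in place of $Q_{\Phi^*}$ (which is what the theorem's assumptions give), and builds the exact transition matrix of the random-position Gibbs process.  All names (`gibbs_transition`, `step`, `total_variation`, `pop`) are hypothetical and chosen for illustration.

```python
import itertools


def gibbs_transition(pop, vocab, k):
    """Exact transition probabilities of Gibbs sampling that picks a position
    uniformly at random and resamples it from the conditional of `pop`."""
    states = list(pop.keys())
    T = {s: {t: 0.0 for t in states} for s in states}
    for s in states:
        for i in range(k):
            # All blocks that agree with s everywhere except position i.
            variants = [s[:i] + (w,) + s[i + 1:] for w in vocab]
            z = sum(pop[v] for v in variants)
            for v in variants:
                # Position i is chosen with probability 1/k, then y_i is
                # drawn from Pop(y_i | y/i) = pop[v] / z.
                T[s][v] += (1.0 / k) * pop[v] / z
    return states, T


def step(dist, states, T):
    """Apply one Gibbs step to a distribution over blocks."""
    return {t: sum(dist[s] * T[s][t] for s in states) for t in states}


def total_variation(p, q):
    """Total variation distance between two distributions on the same states."""
    return 0.5 * sum(abs(p[s] - q[s]) for s in p)


# Toy population (an assumption for illustration): every block has nonzero
# probability, so the Gibbs chain is ergodic.
vocab = ("a", "b")
k = 3
blocks = list(itertools.product(vocab, repeat=k))
weights = [1.0 + j for j in range(len(blocks))]
total = sum(weights)
pop = {s: w / total for s, w in zip(blocks, weights)}

states, T = gibbs_transition(pop, vocab, k)

# Pop is left unchanged by one Gibbs step, i.e. it is stationary.
stationarity_gap = total_variation(step(pop, states, T), pop)

# Starting from a point mass on one block, repeated steps approach Pop.
dist = {s: 0.0 for s in states}
dist[states[0]] = 1.0
for _ in range(1000):
    dist = step(dist, states, T)
convergence_gap = total_variation(dist, pop)

print(stationarity_gap, convergence_gap)
```

Both printed gaps should be numerically close to zero.  The full-support choice of `pop` matters: if some blocks had probability zero, the chain could fail to be ergodic and the stationary distribution need not be unique.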