A Consistency Theorem for BERT

Last week I saw Noah Smith at a meeting in Japan where he mentioned that BERT seemed related to pseudo-likelihood.  After some thought I convinced myself that this observation should lead to a consistency theorem for BERT.  With an appropriate Google query I found arXiv 1902.04904 by Wang and Cho, which also points out the connection between BERT and pseudo-likelihood.  Wang and Cho view BERT as a Markov random field (MRF) and use Gibbs sampling to sample sentences.  But Wang and Cho do not mention consistency.  This post explicitly observes that BERT is consistent as a language model, that is, as a probability distribution over full sentences.


The classical proof of consistency for pseudo-likelihood assumes that the actual population distribution is defined by some setting of the MRF weights.  For BERT I will replace this assumption with the assumption that the deep model is capable of exactly modeling the various conditional distributions.  Because deep models are intuitively much more expressive than linear MRFs over hand-designed features,  this deep expressivity assumption seems much weaker than the classical assumption.

In addition to assuming universal expressivity, I will assume that training finds a global optimum.  Assumptions of complete optimization currently underlie much of our intuitive understanding of deep learning.  Consider the GAN consistency theorem.  This theorem assumes both universal expressivity and complete optimization of both the generator and the discriminator.  While these assumptions seem outrageous, the GAN consistency theorem is the source of the design of the GAN architecture.  The value of such outrageous assumptions in architecture design should not be underestimated.

For training BERT we assume a population distribution over blocks (or sentences) of k words \vec{y}=y_1,\;\ldots,y_k.  I will assume that BERT is trained by blanking a single word in each block.  This single-blank assumption is needed for the proof but seems unlikely to matter in practice.  Also, I believe that the proof can be modified to handle XLNet, which predicts a single held-out subsequence per block rather than multiple independently modeled blanks.

Let \Phi be the BERT parameters and let Q_\Phi(y_i|\vec{y}/i) be the distribution on words that BERT assigns to the ith word when the ith word is blanked.  The training objective for BERT is

\begin{array}{rcl} \Phi^* & = & \mathrm{argmin}_\Phi\;\;E_{\vec{y} \sim \mathrm{Pop},\;i \sim 1,\ldots,k}\;\;-\ln\;Q_\Phi(y_i|\vec{y}/i) \\ \\ & = & \mathrm{argmin}_\Phi \;\frac{1}{k} \sum_{i=1}^k\;E_{\vec{y}\sim\mathrm{Pop}}\;-\ln\;Q_\Phi(y_i|\vec{y}/i) \\ \\ & = &\mathrm{argmin}_\Phi\;\sum_{i=1}^k \;H(\mathrm{Pop}(y_i),Q_\Phi(y_i)\;|\;\vec{y}/i) \end{array}

where H(P(y),Q(y)\;|\;x) denotes cross-entropy conditioned on x.  Each cross-entropy term is individually minimized when Q_\Phi(y_i|\vec{y}/i) = \mathrm{Pop}(y_i|\vec{y}/i).  Our universality assumption is that there exists a \Phi matching all of these conditional distributions simultaneously.  Under this assumption we have

Q_{\Phi^*}(y_i|\vec{y}/i) = \mathrm{Pop}(y_i|\vec{y}/i)

for all i and \vec{y}.
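For concreteness, here is a minimal sketch of the single-blank training objective in PyTorch-style Python.  The names model, MASK_ID and blocks are hypothetical stand-ins (a masked-word predictor returning logits of shape (batch, k, vocab), a blank-token id, and a batch of k-word blocks); this is an illustration of the objective above, not the actual BERT training code.

```python
import torch
import torch.nn.functional as F

def single_blank_loss(model, blocks, MASK_ID):
    """blocks: LongTensor of shape (batch, k) holding the word ids y_1,...,y_k."""
    batch, k = blocks.shape
    rows = torch.arange(batch)
    i = torch.randint(k, (batch,))      # pick one blank position per block, uniformly at random
    targets = blocks[rows, i]           # the blanked words y_i
    masked = blocks.clone()
    masked[rows, i] = MASK_ID           # blank out position i
    logits = model(masked)              # assumed shape (batch, k, vocab)
    logits_i = logits[rows, i]          # the model's logits at the blanked position
    # the batch average of -ln Q_Phi(y_i | y/i)
    return F.cross_entropy(logits_i, targets)
```

Minimizing this loss over the population of blocks is exactly the objective defining \Phi^* above.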

I must now specify the language model (full-sentence distribution) defined by Q_{\Phi^*}.  For this I consider Gibbs sampling: the stochastic process on \vec{y} defined by randomly selecting i and replacing y_i with a sample from Q_{\Phi^*}(y_i|\vec{y}/i).  The language model is now defined to be the stationary distribution of this Gibbs sampling process.  But by the equation above this is the same process as Gibbs sampling using the population conditionals, and \mathrm{Pop} is stationary for that process: if \vec{y} \sim \mathrm{Pop} and we resample y_i from \mathrm{Pop}(y_i|\vec{y}/i), the result is again distributed as \mathrm{Pop}.  Assuming the conditionals are strictly positive, the chain is ergodic and its stationary distribution is unique.  Therefore the stationary distribution must be \mathrm{Pop}.  Q.E.D.
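Here is a correspondingly minimal sketch of this Gibbs sampler, reusing the hypothetical model and MASK_ID from the sketch above.  Wang and Cho sample sentences from BERT with essentially this process.

```python
import torch

def gibbs_sample(model, init_block, MASK_ID, num_steps=1000):
    """init_block: LongTensor of shape (k,) giving an initial block y_1,...,y_k."""
    y = init_block.clone()
    k = y.shape[0]
    for _ in range(num_steps):
        i = torch.randint(k, (1,)).item()          # randomly select a position i
        masked = y.clone()
        masked[i] = MASK_ID                        # blank out position i
        logits = model(masked.unsqueeze(0))[0, i]  # logits for Q_{\Phi^*}(y_i | y/i)
        probs = torch.softmax(logits, dim=-1)
        y[i] = torch.multinomial(probs, 1).item()  # replace y_i with a sample
    return y
```

Running the chain for many steps gives (approximately) a sample from the language model defined by Q_{\Phi^*}.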
