One of the holy grails of the modern deep learning community is to develop effective methods for unsupervised learning. I have always held out hope that the semantics of English could be learned from raw unlabeled text. The plausibility of a statement should affect the probability of that statement. It would seem that a perfect language model (one approaching the true perplexity or entropy of English) must incorporate semantics. For this reason I read with great interest about the recent dramatic improvement in language modeling achieved by Jozefowicz et al. at Google Brain. They report a perplexity of about 30 on Google’s One Billion Word Benchmark. Previously, perplexities below 60 had been very difficult to achieve.
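To make the size of that improvement concrete: perplexity is the exponential of the average per-word entropy, so it can be converted to bits per word. The snippet below is only a small illustration of that conversion; the function name is my own and nothing here is taken from the paper.

```python
import math

def bits_per_word(perplexity: float) -> float:
    """Convert a perplexity into the equivalent cross-entropy in bits per word."""
    return math.log2(perplexity)

# Halving the perplexity (60 -> 30) corresponds to saving one bit per word.
for ppl in (60, 30):
    print(f"perplexity {ppl}: {bits_per_word(ppl):.2f} bits per word")
```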
At TTIC we have been working on machine comprehension as another direction for the acquisition of semantics. We created a “Who-did-What” benchmark for reading comprehension based on the LDC Gigaword Corpus. I will discuss reading comprehension a bit more below, but I first want to focus on the nature of news corpora. We found that the Gigaword corpus contained multiple articles on the same events, and we exploited this fact in the construction of our Who-did-What dataset. I was curious how the Billion Word Benchmark handled this multiplicity. Presumably the training data contained information about the same entities and events that occur in the test sentences. What is the entropy of a statement under the listener’s probability distribution when the listener knows (the semantic content of) what the speaker is going to say? To make this issue clear, here are some examples of training and test sentences from Google’s One Billion Word Corpus:
Train: Saudi Arabia on Wednesday decided to drop the widely used West Texas Intermediate oil contract as the benchmark for pricing its oil.
Test: Saudi Arabia , the world ’s largest oil exporter , in 2009 dropped the widely used WTI oil contract as the benchmark for pricing its oil to customers in the US.
Train: Bobby Salcedo was re-elected in November to a second term on the board of the El Monte City School District.
Test: Bobby Salcedo was first elected to the school board in 2004 and was re-elected in November.
These examples were found by running the Lucene information retrieval system on the training data using the test sentence as the query (thanks to Hai Wang). These examples suggest that language modeling is related to reading comprehension. Reading comprehension is the task of answering a multiple-choice question about a passage involving entities and events not present in general structured knowledge sources. Recently several large-scale reading comprehension benchmarks have been created with cloze-style questions, that is, questions formed by deleting a word or phrase from a corpus sentence or article summary. Cloze-style reading comprehension benchmarks include the CNN/Daily Mail benchmark, the Children’s Book Test, and our Who-did-What benchmark mentioned above. There is also the recently constructed LAMBADA dataset, which is intended as a language modeling benchmark but which can be approached most effectively as a reading comprehension task (Chu et al.). I predict a convergence of reading comprehension and language modeling: comprehension-based language modeling.
For comprehension-based language modeling to make sense, one should have training data describing the same entities and events as those occurring in the test data. It might be good to measure the entropy (or perplexity) of just the content words, or even just the named entities. But one should of course avoid including the test sentences in the training data. It turns out that Google’s Billion Word Benchmark has a problem in this regard. We found the following examples:
Train: Al Qaeda deputy leader Ayman al-Zawahri called on Muslims in a new audiotape released Monday to strike Jewish and American targets in revenge for Israel ’s offensive in the Gaza Strip earlier this month.
Test: Al-Qaida deputy leader Ayman al-Zawahri called on Muslims in a new audiotape released Monday to strike Jewish and American targets in revenge for Israel ’s recent offensive in the Gaza Strip.
Train: RBS shares have only recently risen past the average level paid by the Government , but Lloyds shares are still low despite a recent banking sector rally.
Test: RBS shares have only recently risen past the average level paid by the government , but Lloyds shares are still languishing despite the recent banking sector rally .
Train: WASHINGTON – AIDS patients should have a genetic test before treatment with GlaxoSmithKline Plc ’s drug Ziagen to see whether they face a higher risk of a potentially fatal reaction , U.S. regulators said on Thursday .
Test: WASHINGTON ( Reuters ) – AIDS patients should be given a genetic test before treatment with GlaxoSmithKline Plc ’s drug , Ziagen , to see if they face a higher risk of a potentially fatal reaction , U.S. regulators said on Thursday .
These examples occur when slightly edited versions of an article appear on different newswires or in article revisions. Although we have not done a careful study, it appears that something like half of the test sentences in Google’s Billion Word Benchmark are essentially duplicates of training sentences.
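A more careful study could start from something like the sketch below. It is only an illustration and not the procedure behind the rough estimate above: it scores each test sentence against the training sentences with a token-level similarity from Python's standard `difflib.SequenceMatcher` and counts the test sentences that have a near match. The function names and the 0.8 threshold are my own assumptions, and the brute-force comparison is only practical on a small sample; on the full benchmark one would first retrieve candidate matches with an index such as Lucene.

```python
import re
from difflib import SequenceMatcher

def tokens(sentence):
    """Lowercase and keep only alphanumeric runs, so case and punctuation are ignored."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def similarity(a, b):
    """Token-level similarity in [0, 1]; 1.0 means identical token sequences."""
    return SequenceMatcher(None, tokens(a), tokens(b)).ratio()

def duplicate_fraction(test_sentences, train_sentences, threshold=0.8):
    """Fraction of test sentences whose best training match reaches the threshold.

    The threshold is an assumed cutoff for "essentially a duplicate".
    """
    if not test_sentences:
        return 0.0
    hits = 0
    for test in test_sentences:
        best = max((similarity(test, train) for train in train_sentences), default=0.0)
        if best >= threshold:
            hits += 1
    return hits / len(test_sentences)

# Two train/test pairs quoted in this post: one near duplicate, one same-event pair.
train = [
    "RBS shares have only recently risen past the average level paid by the "
    "Government , but Lloyds shares are still low despite a recent banking sector rally.",
    "Bobby Salcedo was re-elected in November to a second term on the board of "
    "the El Monte City School District.",
]
test = [
    "RBS shares have only recently risen past the average level paid by the "
    "government , but Lloyds shares are still languishing despite the recent "
    "banking sector rally .",
    "Bobby Salcedo was first elected to the school board in 2004 and was "
    "re-elected in November.",
]

print(duplicate_fraction(test, train))
```

On these four sentences the RBS pair scores well above the threshold, while the Bobby Salcedo pair, which describes the same event in different words, scores well below it. That is the distinction we want: shared entities and events are fine, near-verbatim copies are not.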
In spite of the problems with the Billion Word Benchmark, an appropriate benchmark for comprehension-based language modeling should be easy to construct. For example, we could require that the test sentences be separated by, say, a week from any training sentence. Or we could simply de-duplicate the articles using a soft-match criterion for duplication. The training data should consist of complete articles rather than shuffled sentences. Also note that new test data is continuously available: it is hard to overfit next week’s news.
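The time-separation idea can be sketched in a few lines. This is a minimal illustration under my own assumptions: each article carries a publication date, and the `Article` class, the `split_by_date` function and the example dates are all hypothetical names and placeholders, not part of any existing benchmark.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Article:
    published: date
    text: str  # the complete article, not shuffled sentences

def split_by_date(articles, test_start, gap_days=7):
    """Return (train, test) with at least `gap_days` between the two sets.

    Articles published on or after `test_start` go to the test set. Articles
    published at least `gap_days` before `test_start` go to the training set.
    Articles that fall inside the gap are discarded.
    """
    train_end = test_start - timedelta(days=gap_days)
    train = [a for a in articles if a.published <= train_end]
    test = [a for a in articles if a.published >= test_start]
    return train, test

# Placeholder articles, only to show the behaviour of the split.
articles = [
    Article(date(2016, 1, 1), "older article"),
    Article(date(2016, 1, 5), "article inside the gap"),
    Article(date(2016, 1, 10), "new article"),
]
train, test = split_by_date(articles, test_start=date(2016, 1, 10))
print(len(train), len(test))
```

Discarding the articles inside the gap costs some data, but it keeps revisions and newswire copies of a test article out of the training set without needing any text matching at all.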
These are exciting times. Can we obtain a command of English in a machine simply by training on very large corpora of text? I find it fun to be optimistic.