Modeling human commonsense reasoning has long been a goal of AI. From a 1998 interview with Marvin Minsky we have:
Marvin Minsky: We need common-sense knowledge – and programs that can use it. Common sense computing needs several ways of representing knowledge. It is harder to make a computer housekeeper than a computer chess-player … There are [currently] very few people working with common sense problems. … [one such person is] John McCarthy, at Stanford University, who was the first to formalize common sense using logics. …
The desire to understand human reasoning was the philosophical motivation for mathematical logic in the 19th and early 20th centuries. Indeed, the situation calculus of McCarthy (1963) was a seminal point in the development of logical approaches to commonsense. But today the logicist approach to AI is generally viewed as a failure and, in spite of the recent advances with deep learning, understanding commonsense remains a daunting roadblock on the path to AGI.
Today we look to BERT and GPT and their descendants for some kind of implicit understanding of semantics. Do these huge models, trained merely on raw text, contain some implicit knowledge of commonsense? I have always believed that truly minimizing the entropy of English text requires an understanding of whatever the text is about — that minimizing the entropy of text about everyday events requires an understanding of the events themselves (whatever that means). Before the BERT era I generally got resistance to the idea that language modeling could accomplish much of interest. Indeed it remains unclear to what extent current language models actually embody semantics and to what extent semantics can actually be extracted from raw text.
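To make “the entropy of text” concrete: the quantity a language model is trained to minimize is its per-token cross-entropy on a corpus, and perplexity is just its exponential. The definitions below are the standard ones, in my own notation; nothing here is specific to any particular model.

```latex
% Per-token cross-entropy of a language model P_theta on a corpus w_1, ..., w_N,
% and the corresponding perplexity.
H(\theta) = -\frac{1}{N} \sum_{t=1}^{N} \log P_\theta\!\left(w_t \mid w_1, \dots, w_{t-1}\right),
\qquad
\mathrm{PPL}(\theta) = \exp\bigl(H(\theta)\bigr)
```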
In recent years the Winograd schema challenge (WSC) has been the most strenuous test of commonsense reasoning. The currently most prestigious version is the SuperGLUE-WSC. In this version a sentence is presented with both a pronoun and a noun tagged, and the task is to determine whether the tagged pronoun refers to the tagged noun. The sentences are selected with the intention of requiring commonsense understanding. For example, we have:
The trophy would not fit in the suitcase because it was too big.
The task is to determine whether the tagged pronoun “it” refers to the tagged noun “suitcase” — here “it” actually refers to “trophy”. If we replace “big” with “small” the referent flips from “trophy” to “suitcase”. It is perhaps shocking (disturbing?) that the state of the art (SOTA) for this problem is 94% — close to human performance. This is achieved by fine-tuning a huge language model on a very modest number of training sentences from WSC.
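To make the task format concrete, here is a small sketch of a WSC-style instance for the trophy example (my own construction, not the official SuperGLUE loader). The field names mirror my reading of the SuperGLUE layout, with a tagged noun span, a tagged pronoun span, and a yes/no label, and should be treated as illustrative.

```python
# Illustrative WSC-style instance: does the tagged pronoun refer to the tagged noun?
# Field names follow my reading of the SuperGLUE-WSC format and are illustrative only.
example = {
    "text": "The trophy would not fit in the suitcase because it was too big.",
    "span1_text": "suitcase",  # the tagged noun (candidate antecedent)
    "span2_text": "it",        # the tagged pronoun
    "label": 0,                # 0 = not coreferent; "it" actually refers to "trophy"
}

def flip_adjective(instance: dict) -> dict:
    """Swapping 'big' for 'small' flips the intended referent to 'suitcase'."""
    flipped = dict(instance)
    flipped["text"] = instance["text"].replace("too big", "too small")
    flipped["label"] = 1  # now "it" does refer to "suitcase"
    return flipped

print(example)
print(flip_adjective(example))
```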
But do the language models really embody common sense? A paper from Jackie Cheung’s group at McGill found that a fraction of the WSC problems can be answered by simple n-gram statistics. For example, consider:
I’m sure that my map will show this building; it is very famous.
A language model easily determines that the phrase “my map is very famous” is less likely than “this building is very famous”, so the system can guess that the referent is “building” and not “map”. But they report that this phenomenon accounts for only 13.5% of the test problems. Presumably there are other “tells” that allow correct guessing without understanding. But hidden tells are hard to identify and we have no way of measuring what fraction of answers are generated legitimately.
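As an illustration of this kind of statistical shortcut, an off-the-shelf language model can simply be asked which substituted phrase it finds more likely. The sketch below uses GPT-2 from the Hugging Face transformers library purely for convenience; it shows the tell, not the methodology of the cited paper.

```python
# Sketch: score the two candidate substitutions with an off-the-shelf language model.
# GPT-2 is used only for convenience; this is not the setup of the cited paper.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def avg_log_likelihood(sentence: str) -> float:
    """Average per-token log-probability of the sentence under the model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == input_ids the model returns the mean cross-entropy loss.
        loss = model(input_ids=ids, labels=ids).loss
    return -loss.item()

candidates = {
    "map": "My map is very famous.",
    "building": "This building is very famous.",
}
scores = {referent: avg_log_likelihood(s) for referent, s in candidates.items()}
print(scores)
print("guessed referent:", max(scores, key=scores.get))
```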
A very fun, and also shocking, example of language model commonsense is the COMET system from Yejin Choi’s group at the University of Washington. In COMET one presents the system with unstructured text describing an event and the system fills in answers to nine standard questions about events, such as “before, the agent needed …”. This can be viewed as asking for the action prerequisites of McCarthy’s situation calculus. COMET gives its answers as unstructured text. There is a web interface where one can play with it. As a test I took the first sentence from the jacket cover of the novel “Machines Like Me”.
Charlie, drifting through life, is in love with Miranda, a brilliant student with a dark secret.
I tried “Charlie is drifting through life” and was shocked to see that the system suggested that Charlie had lost his job, which is in fact true in the novel. I recommend playing with COMET. I find it kind of creepy.
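For reference, the nine standard questions correspond to the nine if-then relation types of the ATOMIC knowledge graph that COMET is trained on. The English glosses below are my own paraphrases of those relations, not COMET’s exact prompt strings, and the query formatting is purely illustrative.

```python
# The nine ATOMIC if-then relation types that COMET is trained to complete.
# The English glosses are my own paraphrases, not COMET's exact prompt strings.
ATOMIC_RELATIONS = {
    "xIntent": "because the agent wanted ...",
    "xNeed":   "before, the agent needed ...",
    "xAttr":   "the agent is seen as ...",
    "xEffect": "as a result, the agent ...",
    "xReact":  "as a result, the agent feels ...",
    "xWant":   "afterwards, the agent wants ...",
    "oEffect": "as a result, others ...",
    "oReact":  "as a result, others feel ...",
    "oWant":   "afterwards, others want ...",
}

# Illustrative query formatting for a free-text event description.
event = "Charlie is drifting through life"
for relation, gloss in ATOMIC_RELATIONS.items():
    print(f"[{relation}] {event} -> {gloss}")
```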
But, as Turing suggested, the real test of understanding is dialogue. This brings me to Meena, a chatbot derived from a very large language model — 2.6 billion parameters trained on 341 gigabytes of text. The authors devised various human evaluations and settled on an average of a human-judged sensibleness score and a human-judged specificity score — the Sensibleness and Specificity Average, or SSA. They show that they perform significantly better than previous chatbots by this metric. Perhaps most interestingly, they give a figure relating the human-judged SSA performance to the perplexity achieved by the underlying language model.
They suggest that a much more coherent chatbot could be achieved if the perplexity could be reduced further. This supports the belief that minimal perplexity requires understanding.
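To make the metric concrete, here is a small sketch of how an SSA score might be computed from per-response human judgments, assuming each response carries a binary sensibleness label and a binary specificity label. The aggregation (the sensibleness rate averaged with the specificity rate) is my reading of how the metric is described, not code from the Meena paper.

```python
# Sketch: Sensibleness and Specificity Average (SSA) from per-response human labels.
# The aggregation follows my reading of the metric's description, not the authors' code.
from dataclasses import dataclass

@dataclass
class Judgment:
    sensible: bool  # did the human rater find the response sensible?
    specific: bool  # was it also specific to the context rather than vague?

def ssa(judgments: list[Judgment]) -> float:
    """Average of the sensibleness rate and the specificity rate."""
    sensibleness = sum(j.sensible for j in judgments) / len(judgments)
    specificity = sum(j.specific for j in judgments) / len(judgments)
    return (sensibleness + specificity) / 2

labels = [Judgment(True, True), Judgment(True, False), Judgment(False, False)]
print(f"SSA = {ssa(labels):.2f}")  # (2/3 + 1/3) / 2 = 0.50
```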
But in spite of the success of difficult-to-interpret language models, I still believe that interpretable entities and relations are important for commonsense. It also seems likely that we will find ways to greatly reduce the data requirements of our language models. A good place to look for such gains is a better understanding of the relationship, assuming there is one, between language modeling and interpretable entities and relations.