A recent NOEMA essay by Jacob Browning and Yann LeCun put forward the proposition that “an artificial intelligence system trained on words and sentences alone will never approximate human understanding”. I will refer to this claim as the grounding hypothesis — the claim that understanding requires grounding in direct experience of the physical world with, say, vision or manipulation, or perhaps direct experience of emotions or feelings. I have long argued that language modeling — modeling the probability distribution over texts — should in principle be adequate for language understanding. I therefore feel compelled to write a rebuttal to Browning and LeCun.
I will start by clarifying what I mean by “understanding.” For the purposes of this essay I will define understanding (or maybe even AGI) as the ability to perform language-based tasks as well as humans. I think it is fair to say that this includes the tasks of lawyers, judges, CEOs, and university presidents. It also includes counselors and therapists, whose jobs require a strong understanding of the human condition. One might object that a judge, say, might in some cases want to examine physical evidence for themselves. However, the task of being a judge or therapist remains meaningful even when the interaction is limited to text and speech. In short, I am defining understanding as the ability to do language-in/language-out tasks as well as humans.
A language-in/language-out conception of understanding seems required to make the grounding hypothesis meaningful. Of course we expect that learning to see requires training on images, or that learning robot manipulation requires training on a robot arm. So the grounding hypothesis seems trivially true unless we restrict attention to language-in/language-out tasks, such as being a competent therapist or judge who interacts only through language.
The grounding hypothesis, as stated by Browning and LeCun, is not about how children learn language. It seems clear that non-linguistic experience plays an important role in the early acquisition of language by toddlers. But, as stated, the grounding hypothesis says that no learning algorithm, no matter how advanced, can learn to understand using only a corpus of text. This is a claim about the limitations of (deep) learning.
It is also worth pointing out that the grounding hypothesis is about what training data is needed, not about what computation takes place in the end task. Performing any language-in/language-out task is, by definition, language processing independent of what kind of computation is done. Transformer models such as GPT-3 use non-symbolic deep neural networks. However, these models are clearly processing language.
Browning and LeCun argue that the knowledge underlying language understanding can only be acquired non-linguistically. For example, the meaning of the phrase “wiggly line” might only be learnable from image data. The inference that “wiggly lines are not straight” could then be a linguistically observable consequence of image-acquired understanding. Similar arguments can be made for sounds such as “whistle,” or for “major chord” versus “minor chord.”
On the surface this position seems reasonable. However, a first counterargument is simply the fact that most language involves concepts that cannot even be represented in images or sounds. As of this writing, the first sentence of the first article on Google News is:
FBI agents have already finished their examination of possibly privileged documents seized in an Aug. 8 search of Donald Trump’s Mar-a-Lago home, according to a Justice Department court filing Monday that could undercut the former president’s efforts to have a special master appointed to review the files.
I do not see a single word in this sentence, with the possible exception of the names Donald Trump and Mar-a-Lago, whose meaning is plausibly acquired through images. Even for the names, the sentence is completely understandable without having seen images of the person or place.
A second counterargument is that people who are blind from birth generally show no significant linguistic impairment, apart from perhaps some minor artifacts in their language.
Browning and LeCun discuss how people use various forms of visual input. For example IKEA assembly instructions have no words. No one is arguing that vision is useless. However, it is very limited. The old silent movies would be meaningless without the subtitles. A book, on the other hand, can give the reader a vivid human experience with no visual input whatsoever. The above sentence about Trump cannot be represented by a non-linguistic image.
Another argument given for the grounding hypothesis is that language seems more suitable for communication than for understanding. The idea is that language is good for communication because it is concise, but bad for understanding because understanding requires painstaking decompression. Browning and LeCun go so far as to suggest that we need literature classes to learn to “decompress” (deconstruct?) language. But of course preliterate societies have well-developed language. Furthermore, in my opinion at least, it is exactly the abstract (concise) representations provided by language that make understanding possible at all. Consider again the above sentence about Trump.
The essay also notes that intelligence seems to be present in nonhuman animals such as corvids, octopuses and primates. Browning and LeCun seem to implicitly assume that nonhuman intelligence is non-symbolic. I find this speculative. Symbolic intelligence seems completely compatible with a lack of symbolic communication. Memories of discrete events and individuals seem fundamental to intelligence. Discrete representation may be a prerequisite for, rather than a consequence of, external linguistic communication.
Much of their essay focuses on the details of current language models. But we have no idea what models will appear in the future. The real question is whether large language corpora contain the information needed for understanding. If the information is there then we may someday (maybe someday soon) discover novel architectures that allow it to be extracted.
There is, for me, a clear argument that language modeling alone should ultimately be sufficient for extracting understanding. Presumably understanding is required to generate meaningful long texts such as novels. The objective of a language model is to determine the distribution of texts (such as novels). If defining the true distribution of novels requires understanding, then fully solving the language modeling problem requires the ability to understand.
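To make the language-modeling objective concrete, here is a minimal sketch of the idea that a language model assigns a probability to an entire text by factoring it into next-token conditionals via the chain rule. The toy corpus and the bigram counting estimator are illustrative assumptions of mine, standing in for the neural next-token predictors the essay discusses:

```python
from collections import Counter, defaultdict
import math

# Toy corpus (an illustrative assumption); a real language model
# would train on billions of tokens.
corpus = "the judge read the filing . the judge ruled .".split()

# Estimate the conditional P(w_t | w_{t-1}) by counting bigrams --
# a stand-in for a learned next-token distribution.
pair_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    pair_counts[prev][nxt] += 1

def next_token_prob(prev, nxt):
    """Conditional probability of token `nxt` given preceding token `prev`."""
    counts = pair_counts[prev]
    return counts[nxt] / sum(counts.values())

def log_prob(tokens):
    """Chain rule: log P(text) = sum over t of log P(w_t | w_{t-1})."""
    return sum(math.log(next_token_prob(p, n))
               for p, n in zip(tokens, tokens[1:]))

print(log_prob("the judge ruled .".split()))  # log(2/3 * 1/2 * 1) = log(1/3)
```

The point of the sketch is only the factorization: fully solving the modeling problem means getting every such conditional right, and for long texts like novels that plausibly requires whatever we mean by understanding.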
Even if extracting understanding from static text proves too challenging, it could be that meaning can be extracted through pure language-in/language-out interaction between people and machines. Browning and LeCun do not discuss interaction, but I presume they would argue that something other than language, even interactive language, is required for learning meaning.
I strongly expect that the grounding hypothesis will turn out to be false and that deep learning will prevail. However, I can accept the grounding hypothesis as an open empirical question — a question that seems very hard to settle. Constructing a competent artificial judge using image data would not prove that image data is required. Ideally we would like to find some meaningful language-in/language-out evaluation of understanding. We could then track the state-of-the-art performance of various systems and see if image data is needed. All the evidence thus far indicates that language data alone suffices for language-based measures of understanding.
A final note: I think we should consider speech corpora to be “language data.” The emotional content of text should be learnable from text alone, but learning to express emotion in speech will require some amount of speech data.