The Case Against Grounding

A recent NOEMA essay by Jacob Browning and Yann LeCun put forward the proposition that “an artificial intelligence system trained on words and sentences alone will never approximate human understanding”.  I will refer to this claim as the grounding hypothesis — the claim that understanding requires grounding in direct experience of the physical world with, say, vision or manipulation, or perhaps direct experience of emotions or feelings.  I have long argued that language modeling — modeling the probability distribution over texts — should in principle be adequate for language understanding. I therefore feel compelled to write a rebuttal to Browning and LeCun.

I will start with a clarification of what I mean by “understanding.” For the purposes of this essay I will define understanding (or maybe even AGI) as the ability to perform language-based tasks as well as humans. I think it is fair to say that this includes the tasks of lawyers, judges, CEOs, and university presidents. It also includes counselors and therapists whose job requires a strong understanding of the human condition. One might object that a judge, say, might in some cases want to examine some physical evidence for themselves. However, the task of being a judge or therapist remains meaningful even when the interaction is limited to text and speech.[1]  In short, I am defining understanding to be the ability to do language-in/language-out tasks as well as humans.

A language-in/language-out conception of understanding seems required to make the grounding hypothesis meaningful.  Of course we expect that learning to see requires training on images or that learning robot manipulation requires training on a robot arm.  So the grounding hypothesis seems trivially true unless we are talking about language-in/language-out tasks such as being a competent language-in/language-out therapist or judge.

The grounding hypothesis, as stated by Browning and LeCun, is not about how children learn language.  It seems clear that non-linguistic experience plays an important role in the early acquisition of language by toddlers.  But, as stated, the grounding hypothesis says that no learning algorithm, no matter how advanced, can learn to understand using only a corpus of text.  This is a claim about the limitations of (deep) learning.

It is also worth pointing out that the grounding hypothesis is about what training data is needed, not about what computation takes place in the end task. Performing any language-in/language-out task is, by definition, language processing independent of what kind of computation is done. Transformer models such as GPT-3 use non-symbolic deep neural networks. However, these models are clearly processing language.

Browning and LeCun argue that the knowledge underlying language understanding can only be acquired non-linguistically.  For example, the meaning of the phrase “wiggly line” might only be learnable from image data. The inference that “wiggly lines are not straight” could be a linguistically observable consequence of image-acquired understanding. Similar arguments can be made for sounds such as “whistle” or “major chord” versus “minor chord”.

On the surface this position seems reasonable.  However, a first counterargument is simply the fact that most language involves concepts that cannot even be represented in images or sounds. As of this writing the first sentence of the first article on Google News is:

FBI agents have already finished their examination of possibly privileged documents seized in an Aug. 8 search of Donald Trump’s Mar-a-Lago home, according to a Justice Department court filing Monday that could undercut the former president’s efforts to have a special master appointed to review the files.

I do not see a single word in this sentence, with the possible exception of the names Donald Trump and Mar-a-Lago, whose meaning is plausibly acquired through images. Even for the names, the sentence is completely understandable without having seen images of the person or place.

A second counterargument is that, while there may be some minor artifacts in the language of the congenitally blind[2], people who are blind from birth generally do not have any significant linguistic impairment.

Browning and LeCun discuss how people use various forms of visual input. For example, IKEA assembly instructions have no words.  No one is arguing that vision is useless.  However, it is very limited. Old silent movies would be meaningless without their intertitles.  A book, on the other hand, can give the reader a vivid human experience with no visual input whatsoever. The above sentence about Trump cannot be represented by a non-linguistic image.

Another argument given for the grounding hypothesis is that language seems more suitable for communication than for understanding.  The idea is that language is good for communication because it is concise but bad for understanding because understanding requires painstaking decompression. They go so far as to suggest that we need literature classes to learn to “decompress” (deconstruct?) language.  But of course preliterate societies have well-developed language.  Furthermore, in my opinion at least, it is exactly the abstract (concise) representations provided by language that make understanding possible in the first place.  Consider again the above sentence about Trump.

The essay also notes that intelligence seems to be present in nonhuman animals such as corvids, octopuses and primates.  Browning and LeCun seem to implicitly assume that nonhuman intelligence is non-symbolic. I find this to be speculative. Symbolic intelligence seems completely compatible with a lack of symbolic communication. Memories of discrete events and individuals seem fundamental to intelligence.  Discrete representation may be a prerequisite for, rather than a consequence of, external linguistic communication.

Much of their essay focuses on the details of current language models.  But we have no idea what models will appear in the future. The real question is whether large language corpora contain the information needed for understanding.  If the information is there then we may someday (maybe someday soon) discover novel architectures that allow it to be extracted.

There is, for me, a clear argument that language modeling alone should ultimately be sufficient for extracting understanding.  Presumably understanding is required to generate meaningful long texts such as novels. The objective of a language model is to determine the distribution of texts (such as novels).  If defining the true distribution of novels requires understanding, then fully solving the language modeling problem requires the ability to understand.
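The language-modeling objective invoked here can be made concrete with a toy sketch. The following is an illustrative stand-in, not a real language model: a smoothed character-bigram model (all names hypothetical) whose objective is cross-entropy, i.e. average negative log-probability per predicted token. Fully solving language modeling means driving this quantity down to the entropy of the true distribution over texts, which is the sense in which the objective would force a model to capture whatever understanding that distribution encodes.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Estimate P(next_char | char) from bigram counts with add-one
    smoothing. A deliberately tiny stand-in for a language model."""
    vocab = sorted(set(corpus))
    counts = Counter(zip(corpus, corpus[1:]))  # bigram counts
    totals = Counter(corpus[:-1])              # unigram (context) counts
    def prob(prev, nxt):
        return (counts[(prev, nxt)] + 1) / (totals[prev] + len(vocab))
    return prob

def cross_entropy(prob, text):
    """Average bits per predicted character under the model: the
    quantity the language-modeling objective minimizes."""
    nll = -sum(math.log2(prob(p, n)) for p, n in zip(text, text[1:]))
    return nll / (len(text) - 1)

prob = train_bigram("the cat sat on the mat. the cat sat.")
# Text resembling the training distribution scores fewer bits per
# character than text that does not.
print(cross_entropy(prob, "the cat sat"))
print(cross_entropy(prob, "xqzvxqzv"))
```

The argument in the paragraph above is about this objective in the limit: if assigning the right probabilities to long texts such as novels requires understanding, then any model that fully minimizes cross-entropy must have acquired that understanding.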

Even if extracting understanding from static text proves too challenging, it could be the case that meaning can be extracted by pure language-in/language-out interaction between people and machines.  Interaction is not discussed by Browning and LeCun but I would presume that they would argue that something other than language, even interactive language, is required for learning meaning.

I strongly expect that the grounding hypothesis will turn out to be false and that deep learning will prevail. However, I can accept the grounding hypothesis as an open empirical question — a question that seems very hard to settle. Constructing a competent artificial judge using image data would not prove that image data is required.  Ideally we would like to find some meaningful language-in/language-out evaluation of understanding.  We could then track the state-of-the-art performance of various systems and see if image data is needed. All the evidence thus far indicates that language data alone suffices for language-based measures of understanding.


[1] I think we should consider speech corpora to be “language data”. The emotional content of text should be learnable from text alone but to learn to express emotion in speech will require some amount of speech data.

[2]  It has been shown that the congenitally blind make similarity judgements between different kinds of fruit that take color into account less than sighted people’s judgements do.


4 Responses to The Case Against Grounding

  1. Mark Johnson says:

    There’s a sense in which Browning & Le Cun’s point is kind of obvious: appropriately diverse training data often leads a model to learn more quickly and stably than less diverse data, and it’s quite likely that multi-modal data would highlight useful generalisations for language understanding. They could well be correct that multi-modal training data is the most practical path to building a machine that behaves as if it understands language. I’m sure you remember the “naive physics” discussions from decades ago.

    But I think some people are attracted to grounding because it seems to be a way to make language “meaningful” to the machine. However, I suspect multi-modal training data on its own can’t imbue meaning to linguistic or sensory data: you can do “uninterpreted bit-pushing” of visual, haptic, etc., data just as easily as with language data. Searle’s Chinese Room argument isn’t mitigated by giving the room multi-modal information.

  2. McAllester says:

    I am framing this as an empirical question. I will be surprised if someone convincingly demonstrates a boost on some language-based metric of understanding from the addition of non-linguistic training data. It is certainly not obvious to me that non-linguistic data will help even if it does make the training data more diverse.

    • Mark Johnson says:

      And my point is that if you define “language understanding” to include common-sense world knowledge, as our field generally does (see e.g., the Winograd Schema) then it is quite plausible that multi-modal training data will improve both the sample efficiency and the stability of training.

      My own opinion is that we should distinguish knowledge of language from world knowledge, even though there’s no question that world knowledge is essential for understanding language. If someone believes that pizzas are for playing frisbee with (rather than eating), then I would say they are confused about pizzas, not that they don’t know English. Of course such a person would find it hard to understand language about pizzas, even if their English was perfect. My rule of thumb is: if an example translates into other languages (like many of the Winograd Schema do), then it probably relies on world knowledge, rather than knowledge of language.

      My second point is that I suspect Browning & Le Cun are proposing multi-modal data as a way of solving the Symbol Grounding problem. But I think multi-modal data on its own cannot solve the Symbol Grounding problem — you can do ungrounded computation with multi-modal data.

  3. Sam Wolfstone says:

    I used to doubt that a machine could get good understanding (accurate internal modelling) of the world through language learning alone, until I read an interesting point: language, having been made to facilitate (amongst a few other things) describing the world, is isomorphic to the world itself, to some extent. So being able to model language should enable you to model the world, to whatever extent the relationships that exist between words mirror the relationships between the corresponding things and concepts in physical reality.
