I ended my last post by saying that I might write a follow-up post on current work that seems to exhibit progress toward natural language understanding. I am going to discuss a few sampled papers, but of course these are not the only promising ones. I just want to give some examples to argue that significant progress is being made and that there is no obvious limit. I will start with the following quotation, which has gained considerable currency lately.
“You can’t cram the meaning of a whole %&!$ing sentence into a single $&!*ing vector!”
This quotation seems to be becoming a legacy similar to that of Jelinek’s quotation that “every time I fire a linguist our speech recognition error rate improves”. However, while the underlying phenomenon pointed to by Jelinek (the dominance of learning over design) has been controversial over the years, I suspect that there is already a large consensus that Mooney is correct: we need more than “thought vectors”.
Meaning as a sequence of vectors: Attention
The first step beyond representing a sentence as a single vector is to represent a sentence as a sequence of vectors. The attention mechanism of machine translation systems essentially does this. It takes the meaning of the input sentence to be the sequence of hidden state vectors of an RNN (LSTM or GRU). As the translation is generated, the attention mechanism extracts information from the interior of the input sentence. There is no clear understanding of what is being represented, but the attention mechanism is now central to high-performance translation systems such as that recently introduced by Google. Attention is old news so I will say no more.
Meaning as a sequence of vectors: Graph-LSTMs
Here I will mention two very recent papers. The first is a paper that appeared at ACL this year by Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova and Wen-tau Yih, entitled “Cross-Sentence N-ary Relation Extraction with Graph LSTMs”. This paper is on relation extraction, specifically extracting gene-mutation-drug triples from the literature on cancer mutations. However, here I am focused on the representation used for the meaning of text. As in translation, text is converted to a sequence of vectors with a vector for each position in the text. However, this sequence is not computed by a normal LSTM or GRU. Rather, the links of a dependency parse are added as edges between the positions in the sentence, resulting in a graph that includes both the dependency edges and the “next-word” edges as in the following figure from the paper.
The edges are divided into those that go left-to-right and those that go right-to-left as shown in the second part of the figure. Each of these two edge sets forms a DAG (a directed acyclic graph). Tree LSTMs immediately generalize to DAGs and I might prefer the term DAG-LSTM to the term Graph-LSTM used in the title. They then run a DAG-LSTM from left to right over the left-to-right edges and a DAG-LSTM from right to left over the right-to-left edges and concatenate the two vectors at each position. The dependency parse is provided by an external resource. Importantly, different transition parameters are used for different edge types: the “next-word” edge parameters are different from the “subject-of” edge parameters, which are different from the “object-of” edge parameters.
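To make the bidirectional pass concrete, here is a minimal sketch in plain Python. It is not the model from the paper: I replace the LSTM cell with a bare tanh recurrence, share parameters between the two directions, and use made-up inputs, edges and weights (the names dag_rnn, w_in and w_edge are mine). What it does show is a separate transition matrix for each edge type and the concatenation of the two directional vectors at each position.

```python
import math

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def dag_rnn(inputs, edges, w_in, w_edge, reverse=False):
    """Simplified DAG-RNN (a tanh recurrence, not a full LSTM cell).

    edges: list of (source, target, edge_type) with source < target.
    With reverse=True every edge is traversed from target to source
    and positions are visited from right to left.
    """
    n = len(inputs)
    incoming = {i: [] for i in range(n)}
    for src, tgt, etype in edges:
        if reverse:
            incoming[src].append((tgt, etype))
        else:
            incoming[tgt].append((src, etype))
    order = range(n - 1, -1, -1) if reverse else range(n)
    hidden = [None] * n
    for i in order:
        total = matvec(w_in, inputs[i])
        for j, etype in incoming[i]:
            # Each edge type has its own transition matrix.
            message = matvec(w_edge[etype], hidden[j])
            total = [t + m for t, m in zip(total, message)]
        hidden[i] = [math.tanh(t) for t in total]
    return hidden

# Toy sentence "dogs often chase cats" with made-up word vectors.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
edges = [
    (0, 1, "next-word"), (1, 2, "next-word"), (2, 3, "next-word"),
    (0, 2, "subject-of"), (2, 3, "object-of"),
]
# Hypothetical parameters; a real model would learn these.
w_in = [[0.5, 0.0], [0.0, 0.5]]
w_edge = {
    "next-word": [[0.5, 0.0], [0.0, 0.5]],
    "subject-of": [[0.0, 0.3], [0.3, 0.0]],
    "object-of": [[0.2, 0.0], [0.0, -0.2]],
}

forward = dag_rnn(inputs, edges, w_in, w_edge)
backward = dag_rnn(inputs, edges, w_in, w_edge, reverse=True)
# One vector per position: forward and backward states concatenated.
sentence_vectors = [f + b for f, b in zip(forward, backward)]
```

Because every edge points from a lower position to a higher one, visiting positions in order guarantees that each predecessor state has already been computed. That is the only property of a DAG the sketch relies on.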
A startlingly similar but independent paper was posted on arXiv in March by Bhuwan Dhingra, Zhilin Yang, William W. Cohen and Ruslan Salakhutdinov, entitled “Linguistic Knowledge as Memory for Recurrent Neural Networks”. This paper is on reading comprehension but here I will again focus on the representation of the meaning of text. They also add edges between the positions in the text. Rather than add the edges of a dependency parse they add coreference edges and edges for hyponym-hypernym relations as in the following figure from the paper.
As in the previous paper, they then run a DAG-RNN (a DAG-GRU) from left-to-right and also right-to-left. But they use every edge in both directions with different parameters for different types of edges and different parameters for the left-to-right and right-to-left directions of the same edge type. Again the coreference edges and hypernym/hyponym edges are provided by an external resource.
Meaning as a sequence of vectors: Self-Attention
As a third pass I will consider self-attention as developed in “A Structured Self-attentive Sentence Embedding” by Lin et al. (IBM Watson and the University of Montreal) and “Attention Is All You Need” by Vaswani et al. (Google Brain, Google Research and U. of Toronto). Although only a few months old, these papers are already well known. Unlike the Graph-LSTM papers above, these papers learn graph structure on the text without the use of external resources. They also do not use any form of RNN (LSTM or GRU) but rely entirely on learned graph structure. The graph structure is created through self-attention. I will take the liberty of some simplification: I will ignore the residual skip connections and various other details. We start with the sequence of input vectors. For each position we apply three matrices to get three different vectors: a key vector, a query vector, and a value vector. For each position we take the inner product of the query vector at that position with the key vector at every position (including itself) to get an attention weighting (or set of weighted edges) from that position to the positions in the sequence. We then weight the value vectors by that weighting and pass the result through a feed-forward network to get a vector at that position at the next layer. This is repeated for some number of layers. The sentence representation is the sequence of vectors in the last layer.
The above description ignores the multi-headed feature described in both papers. Rather than just compute one attention graph, they compute several different attention graphs, each of which is defined by different parameters. Each of these attention graphs might be interpreted as a different type of edge, such as the agent semantic role relation between event verbs and their agents. But the interpretation of deep models is always difficult. The following figure from Vaswani et al. shows the edges from the word “making” where different colors represent different edge types. Although the sentence is shown twice, the edges are actually between the words of a single sequence at a single layer of the model.
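The simplified layer just described can be sketched in a few lines of plain Python. This is an illustration under my own assumptions, not the implementation from either paper: the matrices are hypothetical identity matrices rather than learned parameters, the feed-forward network is a bare ReLU, and the function names are mine. I normalize the inner products with a softmax and scale them by the square root of the key dimension, as Vaswani et al. do.

```python
import math

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_layer(xs, w_q, w_k, w_v, feed_forward):
    """One simplified layer: no residual connections, a single head.

    Returns the next-layer vectors and the attention graph, where
    graph[i][j] is the weight of the edge from position i to position j.
    """
    queries = [matvec(w_q, x) for x in xs]
    keys = [matvec(w_k, x) for x in xs]
    values = [matvec(w_v, x) for x in xs]
    scale = math.sqrt(len(keys[0]))  # scaling used by Vaswani et al.
    outputs, graph = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[d] for w, v in zip(weights, values))
                 for d in range(len(values[0]))]
        outputs.append(feed_forward(mixed))
        graph.append(weights)
    return outputs, graph

def feed_forward(v):
    """Placeholder position-wise network: a ReLU with no learned weights."""
    return [max(0.0, t) for t in v]

# Made-up input vectors and hypothetical (identity) parameters.
xs = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

layer = xs
for _ in range(2):  # repeat for some number of layers
    layer, graph = self_attention_layer(layer, identity, identity,
                                        identity, feed_forward)
```

Each row of graph sums to one, so it can be read as a set of weighted edges leaving one position. The final value of layer plays the role of the sentence representation.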
The following figure shows one particular edge type which is possibly interpretable as a coreference relation.
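The reading of heads as edge types can be illustrated with a small sketch. Everything here is a toy assumption on my part: two heads with hand-picked query and key matrices over made-up input vectors, and helper names that are mine. Each head yields its own attention graph over the same positions, and the strongest outgoing edge at a position can differ from head to head.

```python
import math

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_graph(xs, w_q, w_k):
    """Row i holds the attention weights from position i to every position."""
    queries = [matvec(w_q, x) for x in xs]
    keys = [matvec(w_k, x) for x in xs]
    return [softmax([sum(a * b for a, b in zip(q, k)) for k in keys])
            for q in queries]

# Hypothetical parameters for two heads; a trained model would learn these.
heads = {
    "head_0": ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]),
    "head_1": ([[0.0, 1.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]),
}
xs = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

# One attention graph ("edge type") per head.
graphs = {name: attention_graph(xs, w_q, w_k)
          for name, (w_q, w_k) in heads.items()}

# For each head, the position that each position attends to most strongly.
strongest = {name: [max(range(len(row)), key=row.__getitem__) for row in g]
             for name, g in graphs.items()}
```

With these toy parameters the first position attends mostly to itself under head_0 and mostly to the second position under head_1, which is the sense in which different heads define different edge sets.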
We might call these networks self-attention networks (SANs). Note that they are sans LSTMs and sans CNNs :).
Meaning as an embedded knowledge graph
I want to complain at this point that you can’t cram the meaning of a bleeping sentence into a bleeping sequence of vectors. The graph structures on the positions in the sentence used in the above models should be exposed to the user of the semantic representation. I would take the position that the meaning should be an embedded knowledge graph — a graph on embedded entity nodes and typed relations (edges) between them. A node representing an event can be connected through edges to entities that fill the semantic roles of the event type (a neo-Davidsonian representation of events in a graph structure).
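As a rough picture of what I mean by an embedded knowledge graph, here is a minimal data-structure sketch. The class names, the example sentence and the two-dimensional embeddings are all placeholders of mine, not taken from any of the papers discussed here. An event node is connected by typed edges to the entities filling its semantic roles.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    embedding: list  # placeholder vector; a real system would learn this

@dataclass
class KnowledgeGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (source, relation, target)

    def add_node(self, name, embedding):
        self.nodes[name] = Node(name, embedding)

    def add_edge(self, source, relation, target):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append((source, relation, target))

    def roles(self, event):
        """Map each semantic role of an event node to its filler."""
        return {rel: tgt for src, rel, tgt in self.edges if src == event}

# "Mary gave John a book" as a neo-Davidsonian event node e1.
graph = KnowledgeGraph()
for name, vec in [("e1", [0.1, 0.9]), ("mary", [0.7, 0.2]),
                  ("john", [0.6, 0.3]), ("book", [0.2, 0.5])]:
    graph.add_node(name, vec)
graph.add_edge("e1", "agent", "mary")
graph.add_edge("e1", "recipient", "john")
graph.add_edge("e1", "theme", "book")

print(graph.roles("e1"))
```

The point of the sketch is only that the nodes and typed edges are explicit objects that a user of the representation can inspect, rather than structure hidden inside a sequence of vectors.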
One paper in this direction is “Dynamic Entity Representations in Neural Language Models” by Yangfeng Ji, Chenhao Tan, Sebastian Martschat, Yejin Choi and Noah A. Smith (EMNLP 2017). This model jointly learns to identify mentions, coreference, and entity embeddings (a vector representing the object referred to).
Another related paper is “Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision” by Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus and Ni Lao, which appeared at ACL this year. This paper addresses the problem of natural language question answering using Freebase. But again I will focus on the representation of the input sentence (a question in this case). They convert the input question to a kind of graph structure where each node in the graph denotes a set of Freebase entities. An initial set of nodes is constructed by an external linker that links mentions in the question, such as “city in the United States”, to a set of Freebase entities. A sequence-to-sequence model is then used to introduce additional nodes, each one of which is defined by a relation to previously introduced nodes. This “graph” can be viewed as a program with instructions of the form n_i = op(R, n_j) where n_i is a newly defined node, n_j is a previously defined node, R is a database relation such as “born-in” and op is one of “hop”, “argmax”, “argmin”, or “filter”. For example
n_1 = US-city; n_2 = hop(born-in, n_1); n_3 = argmax(population-of, n_1)
defines n_1 as the set of US cities, n_2 as the set of people born in a US city and n_3 as the largest US city. Answering the input question involves running the generated program on Freebase. The sequence-to-sequence model is trained from question-answer pairs without any gold semantic graph output sequences. They use some fancy tricks to get REINFORCE to work. But the point here is that semantic parsing of this type converts input text to a kind of graph structure with embedded nodes (embedded by the sequence-to-sequence model). Another important point is that constructing the semantics involves background knowledge, Freebase in this case. Language is highly referential and “meaning” must ultimately involve reference resolution: linking to a knowledge base.
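Running such a program is straightforward, and a toy interpreter may make the semantics clearer. The sketch below is mine and does not use Freebase or the paper’s code: the knowledge base is a handful of made-up facts, I implement only “hop”, “argmax” and “argmin” (the semantics of “filter” are not spelled out above, so I leave it out), and I read hop(R, n_j) as the set of entities x such that R(x, y) holds for some y in n_j, which matches the “born-in” example.

```python
def hop(kb, relation, nodes):
    """Entities x such that relation(x, y) holds for some y in nodes."""
    return {x for x, y in kb[relation] if y in nodes}

def argmax(kb, relation, nodes):
    """The member of nodes with the largest value under relation."""
    values = dict(kb[relation])
    return {max(nodes, key=lambda e: values[e])}

def argmin(kb, relation, nodes):
    """The member of nodes with the smallest value under relation."""
    values = dict(kb[relation])
    return {min(nodes, key=lambda e: values[e])}

OPS = {"hop": hop, "argmax": argmax, "argmin": argmin}

def run_program(kb, env, program):
    """Execute instructions of the form (n_i, op, R, n_j) in order."""
    env = dict(env)
    for target, op, relation, source in program:
        env[target] = OPS[op](kb, relation, env[source])
    return env

# Made-up knowledge base; the entities and populations are placeholders.
kb = {
    "born-in": {("alice", "city_a"), ("bob", "city_b"), ("carol", "city_c")},
    "population-of": {("city_a", 100), ("city_b", 250), ("city_c", 400)},
}

# n_1 comes from the entity linker; here city_c stands for a non-US city.
env = run_program(
    kb,
    {"n_1": {"city_a", "city_b"}},
    [("n_2", "hop", "born-in", "n_1"),
     ("n_3", "argmax", "population-of", "n_1")],
)
print(env["n_2"], env["n_3"])
```

Each instruction refers only to nodes defined earlier, so the program can be run in a single pass while the sequence-to-sequence model emits it.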
A nice survey of semantic formalisms is given in “The State of the Art in Semantic Representation” by Omri Abend and Ari Rappoport, which appeared at ACL this year. A fundamental issue here is the increasing dominance of learning over design. I believe that in the near term it will not be possible to separate semantic formalisms from learning architectures. (In the long term there are the foundations of mathematics …) If the target semantic formalism is a neo-Davidsonian embedded knowledge graph, then this formalism must be unified with some learning architecture. The learning should be done in the presence of background knowledge, both semantic and episodic long-term memory. Background knowledge could itself be an embedded knowledge graph. The paper “node2vec: Scalable Feature Learning for Networks” by Grover and Leskovec gives a method of embedding a given (unembedded) knowledge graph. But it ultimately seems better to generate the embedding jointly with acquiring the graph.
I have long believed that the most promising cognitive (learning) architecture is bottom-up logic programming. Bottom-up (forward chaining) logic programming has a long history in the form of production systems, Datalog, and Jason Eisner’s Dyna programming language. There are strong theoretical arguments for the centrality of bottom-up logic programming. I have heard rumors that there is currently a significant effort at DeepMind based on deep inductive logic programming (Deep ILP) for bottom-up programs and that a paper will appear in the next few months. I have high hopes …
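For readers who have not seen bottom-up evaluation, here is a minimal sketch of naive forward chaining over Datalog-style rules. It is a toy of my own, with no learning in it and no connection to Dyna or to any DeepMind system. Facts are tuples, variables are strings beginning with “?”, and rules are applied repeatedly until no new facts can be derived.

```python
def match(pattern, fact, bindings):
    """Extend bindings so that pattern equals fact, or return None."""
    if len(pattern) != len(fact):
        return None
    bindings = dict(bindings)
    for p, f in zip(pattern, fact):
        if p.startswith("?"):
            # Bind the variable, or check it against an earlier binding.
            if bindings.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return bindings

def forward_chain(facts, rules):
    """Naive bottom-up evaluation: apply every rule until a fixed point."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            envs = [{}]
            for atom in body:
                envs = [b2 for b in envs for fact in facts
                        for b2 in [match(atom, fact, b)] if b2 is not None]
            for b in envs:
                derived = tuple(b.get(term, term) for term in head)
                if derived not in facts:
                    facts.add(derived)
                    changed = True
    return facts

facts = {("parent", "a", "b"), ("parent", "b", "c")}
rules = [
    # ancestor(X, Y) :- parent(X, Y).
    (("ancestor", "?x", "?y"), [("parent", "?x", "?y")]),
    # ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z).
    (("ancestor", "?x", "?z"),
     [("parent", "?x", "?y"), ("ancestor", "?y", "?z")]),
]

closure = forward_chain(facts, rules)
print(sorted(closure))
```

Inference here runs from known facts toward conclusions, with no goal directing it, which is what distinguishes bottom-up evaluation from Prolog-style backward chaining.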