Deep Meaning Beyond Thought Vectors

I ended my last post by saying that I might write a follow-up post on current work that seems to exhibit progress toward natural language understanding.  I am going to discuss a couple of sampled papers, but of course these are not the only promising papers.  I just want to give some examples to argue that significant progress is being made and that there is no obvious limit. I will start with the following quotation, which has gained considerable currency lately.

“You can’t cram the meaning of a whole %&!$ing sentence into a single $&!*ing vector!”

Ray Mooney

This quotation seems to be becoming a legacy similar to that of Jelinek’s quotation that “every time I fire a linguist our speech recognition error rate improves”. However, while the underlying phenomenon pointed to by Jelinek (the dominance of learning over design) has been controversial over the years, I suspect that there is already a large consensus that Mooney is correct: we need more than “thought vectors”.

Meaning as a sequence of vectors: Attention

The first step beyond representing a sentence as a single vector is to represent a sentence as a sequence of vectors. The attention mechanism of machine translation systems essentially does this. It takes the meaning of the input sentence to be the sequence of hidden state vectors of an RNN (LSTM or GRU).  As the translation is generated the attention mechanism extracts information from the interior of the input sentence.  There is no clear understanding of what is being represented, but the attention mechanism is now central to high performance translation systems such as that recently introduced by Google.  Attention is old news so I will say no more.

Meaning as a sequence of vectors: Graph-LSTMs

Here I will mention two very recent papers.  The first is a paper that appeared at ACL this year by Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova and Wen-tau Yih, entitled “Cross-Sentence N-ary Relation Extraction with Graph LSTMs”.  This paper is on relation extraction, specifically extracting gene-mutation-drug triples from the literature on cancer mutations.  However, here I am focused on the representation used for the meaning of text.  As in translation, text is converted to a sequence of vectors with a vector for each position in the text.  However, this sequence is not computed by a normal LSTM or GRU.  Rather the links of a dependency parse are added as edges between the positions in the sentence, resulting in a graph that includes both the dependency edges and the “next-word” edges as in the following figure from the paper.

The edges are divided into those that go left-to-right and those that go right-to-left as shown in the second part of the figure.  Each of these two edge sets forms a DAG (a directed acyclic graph).  Tree LSTMs immediately generalize to DAGs and I might prefer the term DAG-LSTM to the term Graph-LSTM used in the title.  They then run a DAG-LSTM from left to right over the left-to-right edges and a DAG-LSTM from right to left over the right-to-left edges and concatenate the two vectors at each position.  The dependency parse is provided by an external resource.  Importantly, different transition parameters are used for different edge types: the “next-word” edge parameters are different from the “subject-of” edge parameters, which are different from the “object-of” edge parameters.
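As a rough illustration of the left-to-right pass with edge-type-specific parameters, here is a minimal sketch. It is not the model from the paper: it swaps the LSTM cell for a plain tanh recurrence, and the function name `dag_rnn_pass`, the toy sentence, the edge labels and the weight matrices are all assumptions made up for this example.

```python
import math

def dag_rnn_pass(inputs, edges, weights):
    """One left-to-right pass of a simplified typed-edge DAG-RNN.

    inputs:  list of input vectors, one per text position.
    edges:   list of (src, dst, edge_type) with src < dst, so visiting
             positions in increasing order is a topological order.
    weights: dict mapping edge_type to a square matrix (list of rows).
    """
    dim = len(inputs[0])
    incoming = [[] for _ in inputs]
    for src, dst, edge_type in edges:
        if not src < dst:
            raise ValueError("left-to-right pass needs src < dst")
        incoming[dst].append((src, edge_type))

    hidden = []
    for position, x in enumerate(inputs):
        total = list(x)  # start from the input at this position
        for src, edge_type in incoming[position]:
            w = weights[edge_type]  # parameters depend on the edge type
            for r in range(dim):
                total[r] += sum(w[r][c] * hidden[src][c] for c in range(dim))
        hidden.append([math.tanh(v) for v in total])
    return hidden

# Hypothetical three-word sentence with 2-dimensional inputs.
inputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
edges = [
    (0, 1, "next-word"),
    (1, 2, "next-word"),
    (0, 2, "subject-of"),  # made-up dependency edge
]
weights = {
    "next-word": [[0.5, 0.0], [0.0, 0.5]],
    "subject-of": [[0.0, 1.0], [1.0, 0.0]],
}
hidden = dag_rnn_pass(inputs, edges, weights)
```

A right-to-left pass would be the mirror image over the reversed edges, and the two hidden vectors at each position would then be concatenated.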

A startlingly similar but independent paper was posted on arXiv in March by Bhuwan Dhingra, Zhilin Yang, William W. Cohen and Ruslan Salakhutdinov, entitled “Linguistic Knowledge as Memory for Recurrent Neural Networks”.  This paper is on reading comprehension but here I will again focus on the representation of the meaning of text.  They also add edges between the positions in the text.  Rather than add the edges of a dependency parse they add coreference edges and edges for hyponym-hypernym relations as in the following figure from the paper.

As in the previous paper, they then run a DAG-RNN (a DAG-GRU) from left-to-right and also right-to-left.  But they use every edge in both directions with different parameters for different types of edges and different parameters for the left-to-right and right-to-left directions of the same edge type.  Again the coreference edges and hypernym/hyponym edges are provided by an external resource.

Meaning as a sequence of vectors: Self-Attention

As a third pass I will consider self-attention as developed in “A Structured Self-attentive Sentence Embedding” by Lin et al. (IBM Watson and the University of Montreal) and “Attention is all you need” by Vaswani et al. (Google Brain, Google Research and U. of Toronto).  Although only a few months old these papers are already well known.  Unlike the Graph-LSTM papers above, these papers learn graph structure on the text without the use of external resources.  The model of Vaswani et al. also does not use any form of RNN (LSTM or GRU) but relies entirely on learned graph structure (Lin et al. apply their self-attention on top of a bidirectional LSTM). The graph structure is created through self-attention.  I will take the liberty of some simplification: I will ignore the residual skip connections and various other details.  We start with the sequence of input vectors.  For each position we apply three matrices to get three different vectors: a key vector, a query vector, and a value vector.  For each position we take the inner product of the query vector at that position with the key vector at every position (including itself) to get an attention weighting (or set of weighted edges) from that position to the positions in the sequence.  We then weight the value vectors by that weighting and pass the result through a feed-forward network to get a vector at that position at the next layer.  This is repeated for some number of layers.  The sentence representation is the sequence of vectors in the last layer.
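The single-head computation described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the softmax normalization and the scaling by the square root of the key dimension follow the scaled dot-product attention of Vaswani et al. and are not spelled out in the description above, the feed-forward network and residual connections are omitted, and the function names and toy matrices are made up for this example.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Turn raw scores into positive weights that sum to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_layer(xs, w_query, w_key, w_value):
    """Single-head self-attention over a sequence of input vectors.

    Returns the attended vectors and the attention weights; row i of the
    weights is the set of weighted edges from position i to all positions.
    """
    queries = [matvec(w_query, x) for x in xs]
    keys = [matvec(w_key, x) for x in xs]
    values = [matvec(w_value, x) for x in xs]
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        attn = softmax(scores)
        weights.append(attn)
        outputs.append([
            sum(a * v[d] for a, v in zip(attn, values))
            for d in range(len(values[0]))
        ])
    return outputs, weights

# Toy inputs; identity matrices stand in for learned parameters.
identity = [[1.0, 0.0], [0.0, 1.0]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention_layer(xs, identity, identity, identity)
```

A multi-headed version would run several such layers with different parameter matrices and combine their outputs at each position.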

The above description ignores the multi-headed feature described in both papers.  Rather than just compute one attention graph they compute several different attention graphs, each of which is defined by different parameters.  Each of these attention graphs might be interpreted as a different type of edge, such as the agent semantic role relation between event verbs and their agents. But the interpretation of deep models is always difficult.  The following figure from Vaswani et al. shows the edges from the word “making” where different colors represent different edge types.  Although the sentence is shown twice, the edges are actually between the words of a single sequence at a single layer of the model.

The following figure shows one particular edge type which is possibly interpretable as a coreference relation.

We might call these networks self-attention networks (SANs). Note that they are sans LSTMs and sans CNNs :).

Meaning as an embedded knowledge graph

I want to complain at this point that you can’t cram the meaning of a bleeping sentence into a bleeping sequence of vectors.  The graph structures on the positions in the sentence used in the above models should be exposed to the user of the semantic representation.  I would take the position that the meaning should be an embedded knowledge graph — a graph on embedded entity nodes and typed relations (edges) between them.  A node representing an event can be connected through edges to entities that fill the semantic roles of the event type (a neo-Davidsonian representation of events in a graph structure).

One paper in this direction is “Dynamic Entity Representations in Neural Language Models” by Yangfeng Ji, Chenhao Tan, Sebastian Martschat, Yejin Choi and Noah A. Smith (EMNLP 2017). This model jointly learns to identify mentions, coreference, and entity embeddings (a vector representing the object referred to).

Another related paper is “Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision” by Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus and Ni Lao, which appeared at ACL this year.  This paper addresses the problem of natural language question answering using Freebase.  But again I will focus on the representation of the input sentence (question in this case). They convert the input question to a kind of graph structure where each node in the graph denotes a set of Freebase entities.  An initial set of nodes is constructed by an external linker that links mentions in the question, such as “city in the United States”, to a set of Freebase entities. A sequence-to-sequence model is then used to introduce additional nodes, each one of which is defined by a relation to previously introduced nodes. This “graph” can be viewed as a program with instructions of the form n_i = op(R, n_j) where n_i is a newly defined node, n_j is a previously defined node, R is a database relation such as “born-in” and op is one of “hop”, “argmax”, “argmin”, or “filter”.  For example

n_1 = US-city;  n_2 = hop(born-in, n_1);  n_3 = argmax(population-of, n_1)

defines n_1 as the set of US cities, n_2 as the set of people born in a US city and n_3 as the largest US city.  Answering the input question involves running the generated program on Freebase.  The sequence-to-sequence model is trained from question-answer pairs without any gold semantic graph output sequences. They use some fancy tricks to get REINFORCE to work. But the point here is that semantic parsing of this type converts input text to a kind of graph structure with embedded nodes (embedded by the sequence-to-sequence model). Another important point is that constructing the semantics involves background knowledge, Freebase in this case.  Language is highly referential and “meaning” must ultimately involve reference resolution, that is, linking to a knowledge base.
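To make the “graph as program” view concrete, here is a small sketch of an interpreter for the example program above. It is not the Neural Symbolic Machines implementation: the tiny knowledge base, the entity names, the orientation chosen for the “born-in” relation (city to people born there) and the function names are all assumptions made for this illustration, and only “hop” and “argmax” are implemented.

```python
# Toy knowledge base with made-up entities (a stand-in for Freebase).
KB = {
    # Assumed orientation: each city maps to the people born there.
    "born-in": {"CityA": {"Ann"}, "CityB": {"Bob", "Cy"}},
    "population-of": {"CityA": 100, "CityB": 250},
}

def hop(relation, nodes):
    """All entities reachable from the node set through the relation."""
    result = set()
    for node in nodes:
        result |= KB[relation].get(node, set())
    return result

def argmax(relation, nodes):
    """The singleton set holding the node with the largest relation value."""
    return {max(nodes, key=lambda node: KB[relation][node])}

OPS = {"hop": hop, "argmax": argmax}

def run(program, env):
    """Execute instructions of the form n_i = op(R, n_j) in order."""
    for target, op, relation, source in program:
        env[target] = OPS[op](relation, env[source])
    return env

# n_1 = US-city;  n_2 = hop(born-in, n_1);  n_3 = argmax(population-of, n_1)
env = run(
    [("n_2", "hop", "born-in", "n_1"),
     ("n_3", "argmax", "population-of", "n_1")],
    {"n_1": {"CityA", "CityB"}},  # stands in for the linker's US-city node
)
```

Each instruction only refers to previously defined nodes, so a single pass in program order is enough to evaluate the whole graph.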

What’s next?

A nice survey of semantic formalisms is given in “The State of the Art in Semantic Representation” by Omri Abend and Ari Rappoport, which appeared at this year’s ACL.  A fundamental issue here is the increasing dominance of learning over design.  I believe that in the near term it will not be possible to separate semantic formalisms from learning architectures.  (In the long term there is the foundations of mathematics …) If the target semantic formalism is a neo-Davidsonian embedded knowledge graph, then this formalism must be unified with some learning architecture. The learning should be done in the presence of background knowledge, both semantic and episodic long term memory.  Background knowledge could itself be an embedded knowledge graph.  The paper “node2vec: Scalable Feature Learning for Networks” by Grover and Leskovec gives a method of embedding a given (unembedded) knowledge graph. But it ultimately seems better to generate the embedding jointly with acquiring the graph.

I have long believed that the most promising cognitive (learning) architecture is bottom-up logic programming.  Bottom-up (forward chaining) logic programming has a long history in the form of production systems, datalog, and Jason Eisner’s Dyna programming language.  There are strong theoretical arguments for the centrality of bottom-up logic programming. I have heard rumors that there is currently a significant effort at DeepMind based on deep inductive logic programming (Deep ILP) for bottom-up programs and that a paper will appear in the next few months.  I have high hopes …


The Plausibility of Near-Term Machine Sentience.

When should we expect “operational sentience”, the point where the most effective way to interact with a machine is to assume it is sentient, that is, to assume that it understands what we tell it?  I want to make an argument that near-term machine sentience in this sense is plausible, where near-term means, say, ten years.

Deep learning has already dramatically improved natural language processing. Translation, question answering, parsing, and linking have all improved. A fundamental question is whether the recent advances in translation and linking provide evidence that we are getting close to “understanding”. I do not want to argue that we are getting close, but rather just that we don’t know and that near-term sentience is “plausible”.

My case is based on the idea that paraphrase, entailment, linking, and structuring may be all that is needed. Paraphrase is closely related to translation: saying the same thing a different way or in a different language.  There has been great progress here. Entailment is determining if one statement implies another.  Classical logic was developed as a model of entailment. But natural language entailment is far too slippery to be modeled by any direct application of formal logic. I interpret the progress in deep learning applied to NLP question answering as evidence for progress in entailment. Entailment is also closely related to paraphrase (no paraphrase is precisely meaning preserving) and the progress in translation seems related to potential progress in entailment. “Linking” means tagging natural language phrases with the database entries that the phrase is referring to.  The database can be taken to be Freebase or Wikidata or any knowledge graph. This is related to the philosophical problem of “reference”, the question of what our language is referring to. But simply tagging phrases with database entries, such as “Facebook” or “the US constitution”, does not seem philosophically mysterious. I am using the term “structuring” to mean knowledge base population (KBP), that is, populating a knowledge base from unstructured text. Extracting entities and relationships (building a database) from unstructured text does not seem philosophically mysterious. It seems plausible to me that paraphrase, entailment, linking, and KBP will all see very significant near-term advances based on deep learning. The notions of “database” and “inference rule” (as in datalog) presumably have to be blended with the distributed representations of deep learning and integrated into “linking” and “structuring”. This seems related to memory networks with large memories. But I will save that for another post.

The plausibility of near-term machine sentience is supported by the plausibility that language understanding is essentially unrelated to, and independent of, perception and action other than inputting and outputting language itself. There is a lot of language data out there. I have written previous blog posts against the idea of “grounding” the semantics of natural language in non-linguistic perception and action or in the physical position of material in space.

Average level human natural language understanding may prove to be easier than, say, average level human vision or physical coordination. There has been evolutionary pressure on vision and coordination much longer than there has been evolutionary pressure on NLP understanding. For me the main question is how close we are to effective paraphrase, entailment, linking and structuring.  NLP researchers are perhaps the best AI practitioners to comment on this blog post. I believe that Gene Charniak, a pioneer of machine parsing, believes that machine NLP understanding is at least a hundred years off. But I am more interested in laying out concrete arguments, one way or the other, than in taking opinion polls. Deep learning may have the ability to automatically handle the hundreds of thousands of linguistic phenomena that seem to exist in, say, English. People learn language. Is there some reasoned argument that this cannot work within a decade?

Maybe at some point I will write a longer post outlining what current work seems to me to be on the path to machine sentience.

Formalism, Platonism and Mentalese

This is a sequel to an earlier post on Tarski and Mentalese.  I am writing this sequel for two reasons. First, I just posted a new version of my treatment of type theory which focuses on “naive semantics”. I want to explain here what that means. Second, after many attempts to get formalists to accept Platonic proofs I have decided that it is best to approach the issue from the formalist point of view. In this new approach I present Platonism as just a translation from a formal language (such as a programming language or mathematical logic) to some symbolic internal mentalese. Human mathematical thought can presumably be modeled, to some extent at least, as symbolic computation involving expressions of mentalese.  Formalists will accept the idea of a symbolic language of thought but some deep learning people resist the idea that symbolic computation is relevant to any form of AI.  I will avoid that latter discussion here and assume that a mentalese of mathematics exists.

To understand the issues of formalism vs. Platonism and the relationship of both to semantics it is useful to consider an example. In type theory we have dependent pair types. These are written fairly obscurely as $\Sigma_{x:\sigma} \;\tau[x]$.  This expression denotes a type (think class). An instance of this type (or class) is a pair $(x,y)$ where $x$ is an instance of the type $\sigma$ and $y$ is an instance of the type $\tau[x]$ where $\tau[x]$ is a type (class) whose definition involves the value $x$.  To make this more concrete consider the following example.

$\mathbf{DiGraph} = \Sigma_{s:\mathrm{set}}\;(s \times s \rightarrow \mathbf{Bool})$

This is the type (or class) of directed graphs where we are taking a directed graph to be a pair $(N,P)$ where $N$ is a set of “nodes” and $P$ is an “edge predicate” on the nodes which is true of two nodes if there is a directed edge between them.

The above paragraph defines the notation $\Sigma_{x:\sigma}\;\tau[x]$ in a way that would be sufficient when introducing notation in a mathematics class.   Mathematical notation is generally defined before it is used.  Above we define the notation $\Sigma_{x:\sigma}\;\tau[x]$ in much the same way that any mathematical notation is defined in the practice of mathematics. A simpler example would be to introduce the symbol $\cup$ by saying that for sets $s$ and $u$ write $s \cup u$ to denote the union of the sets $s$ and $u$.  This style of definition provides a translation of formal notation into the natural language (such as English) which is the language of mathematical discourse in practice.

But logicians are not satisfied with this style of definition.  They prefer a more formal treatment known as Tarskian compositional semantics.  My previous post introduces this by treating the semantics of arithmetic expressions.  We introduce a value function $V[e]\rho$ where $e$ is an arithmetic expression and $\rho$ is a variable interpretation assigning a value to the variables occurring in $e$.  Now consider how notation is defined in mathematics.  We might say “we write $n!$ for the product of the natural numbers from 1 to $n$”.  We can write a compositional semantics specification of the meaning of $n!$ as follows.

$V[e!]\rho \equiv \prod_{i=1}^{V[e]\rho}\;i$

Here $V[e]\rho$ is the notation for the value of $e$ under variable interpretation $\rho$.  This looks more explicitly like a translation between formal notations. However, this notation assumes that the product notation is already understood, that is, that it can be reasoned about in internal mentalese.  This notation can still be interpreted as a translation from an external notation to mentalese.
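The compositional specification above can be mirrored by a small recursive evaluator. The sketch below is only an illustration: the tuple encoding of expressions and the tags "num", "var", "add", "mul" and "fact" are assumptions made for this example, and Python's built-in product plays the role of the already-understood product notation.

```python
import math

def V(e, rho):
    """Value of arithmetic expression e under variable interpretation rho.

    Expressions are nested tuples such as ("fact", ("var", "n")).
    """
    kind = e[0]
    if kind == "num":
        return e[1]
    if kind == "var":
        return rho[e[1]]
    if kind == "add":
        return V(e[1], rho) + V(e[2], rho)
    if kind == "mul":
        return V(e[1], rho) * V(e[2], rho)
    if kind == "fact":
        # V[e!]rho is the product of i for i from 1 to V[e]rho.
        return math.prod(range(1, V(e[1], rho) + 1))
    raise ValueError(f"unknown expression kind: {kind!r}")

# (n + 1)! under the interpretation n = 3
expr = ("fact", ("add", ("var", "n"), ("num", 1)))
value = V(expr, {"n": 3})
```

The meaning of the compound expression is computed from the meanings of its parts, which is the sense in which the semantics is compositional.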

In type theory, as in most programming languages, variables are typed.  A math class might start with the sentence “Let $B$ be a Banach space.”  In type theory this is a declaration of the variable $B$ as being of type “Banach space”.  A set of variable declarations and assumptions constraining the variable values is called a “context”. Contexts are often denoted by $\Gamma$.  In my work on type theory I subscript the value function with a context declaring the types of the free variables that can occur in the expression. For example we might have

$V_{G:\mathbf{DiGraph}}\;[e[G]]\rho$

Here $\rho$ must assign a value to the variable $G$ and that value must be a directed graph.  The type of directed graphs can be expressed as a dependent pair type as described above.

The phrase “naive semantics”, or more specifically “naive type theory”, involves assigning the expressions  their naive meaning.  For example we have

$V_\Gamma[\Sigma_{x:\sigma}\;\tau[x]]\rho =\{(a,b) \mid a \in V_\Gamma[\sigma]\rho,\;b \in V_{\Gamma;\;x:\sigma}[\tau[x]]\rho[x \leftarrow a]\}$

This is just the first definition of the dependent pair type notation given above expressed within a Tarskian compositional semantics.  The above definition can be viewed as simply providing a translation of a formal expression into the notation of practical mathematical discourse and hence as a translation of formal expressions into the language of mentalese.
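For finite types the naive meaning of a dependent pair type can be enumerated directly. The sketch below is an illustration under strong assumptions: “set” is restricted to the subsets of a two-element universe (in the real type theory it ranges over all sets), an edge predicate is represented by the set of node pairs on which it is true, and the helper names are made up for this example.

```python
from itertools import combinations

def powerset(items):
    """All subsets of a finite collection, as frozensets."""
    items = list(items)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

def sigma(sigma_values, tau):
    """The set of pairs (a, b) with a in the value of sigma and b in the
    value of tau[x] under the interpretation extended with x bound to a."""
    return {(a, b) for a in sigma_values for b in tau(a)}

def edge_predicates(s):
    """All predicates on s x s, each given by the pairs it is true of."""
    return powerset((a, b) for a in s for b in s)

# DiGraph restricted to node sets drawn from the universe {0, 1}.
finite_sets = powerset({0, 1})
digraphs = sigma(finite_sets, edge_predicates)
```

Under these assumptions there are 1 + 2 + 2 + 16 = 21 directed graphs: the second component ranges over a different collection for each choice of the first, which is exactly what makes the pair type dependent.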

Of course when push comes to shove mathematics grounds out in sets. The notion of “set” is taken to be primitive and is not defined in terms of more basic notions.  We have learned, however, to be precise about the properties of sets using the axioms of ZFC.  In the modern understanding of sets (since about 1925) we distinguish sets from classes where classes are collections too large to be sets such as the collection of all groups or the collection of all topological spaces.  Naive type theory starts from this (sophisticated) notion of sets and classes and then gives all the other expressions of type theory their naive meanings.  This is no different from the introduction of notation in any particular area of mathematics except that type theory is about all the areas of mathematics — it is part of the foundations of mathematics.

Naive type theory is almost trivial.  However, considerable effort is required to handle the notion of isomorphism within naive type theory.  Handling isomorphism within naive type theory is the main point of my work in this area.

Comprehension Based Language Modeling

One of the holy grails of the modern deep learning community is to develop effective methods for unsupervised learning.  I have always held out hope that the semantics of English could be learned from raw unlabeled text.  The plausibility of a statement should affect the probability of that statement. It would seem that a perfect language model, one approaching the true perplexity or entropy of English, must incorporate semantics.  For this reason I read with great interest the recent dramatic improvement in language modeling achieved by Jozefowicz et al. at Google Brain.  They report a perplexity of about 30 on Google’s One Billion Word Benchmark.  Perplexities below 60 have been very difficult to achieve.
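For reference, perplexity is the exponential of the average negative log probability that the model assigns to each word, so a perplexity of 30 corresponds to about 4.9 bits per word. A minimal sketch, where the function names and the made-up uniform probabilities are assumptions for illustration only:

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log probability per token."""
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))

def bits_per_token(token_probs):
    """Cross-entropy in bits, the base-2 logarithm of the perplexity."""
    return math.log2(perplexity(token_probs))

# A model that gives every word probability 1/30 has perplexity 30.
uniform_probs = [1.0 / 30.0] * 10
ppl = perplexity(uniform_probs)
bits = bits_per_token(uniform_probs)
```

Intuitively, a perplexity of 30 means the model is on average as uncertain as if it were choosing uniformly among 30 words at each position.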

At TTIC we have been working on machine comprehension as another direction for the acquisition of semantics. We created a “Who Did What” benchmark for reading comprehension based on the LDC Gigaword Corpus.  I will discuss reading comprehension a bit more below, but I first want to focus on the nature of news corpora.  We found that the Gigaword corpus contained multiple articles on the same events.  We exploited this fact in the construction of our Who-did-What dataset.  I was curious how the Billion Word Benchmark handled this multiplicity.  Presumably the training data contained information about the same entities and events that occur in the test sentences. What is the entropy of a statement under the listener’s probability distribution when the listener knows (the semantic content of) what the speaker is going to say?  To make this issue clear, here are some examples of training and test sentences from Google’s One Billion Word Corpus:

Train: Saudi Arabia on Wednesday decided to drop the widely used West Texas Intermediate oil contract as the benchmark for pricing its oil.

Test: Saudi Arabia , the world ‘s largest oil exporter , in 2009 dropped the widely used WTI oil contract as the benchmark for pricing its oil to customers in the US.

Train: Bobby Salcedo was re-elected in November to a second term on the board of the El Monte City School District.

Test: Bobby Salcedo was first elected to the school board in 2004 and was re-elected in November.

For comprehension based language modeling to make sense one should have training data describing the same entities and events as those occurring in the test data.  It might be good to measure the entropy (or perplexity) of just the content words, or even just named entities.  But one should of course avoid including the test sentences in the training data. It turns out that Google’s Billion Word Benchmark has a problem in this regard.  We found the following examples:

Train: Al Qaeda deputy leader Ayman al-Zawahri called on Muslims in a new audiotape released Monday to strike Jewish and American targets in revenge for Israel ‘s offensive in the Gaza Strip earlier this month.

Test: Al-Qaida deputy leader Ayman al-Zawahri called on Muslims in a new audiotape released Monday to strike Jewish and American targets in revenge for Israel ‘s recent offensive in the Gaza Strip.

Train: RBS shares have only recently risen past the average level paid by the Government , but Lloyds shares are still low despite a recent banking sector rally.

Test: RBS shares have only recently risen past the average level paid by the government , but Lloyds shares are still languishing despite the recent banking sector rally .

Train: WASHINGTON – AIDS patients should have a genetic test before treatment with GlaxoSmithKline Plc ‘s drug Ziagen to see whether they face a higher risk of a potentially fatal reaction , U.S. regulators said on Thursday .

Test: WASHINGTON ( Reuters ) – AIDS patients should be given a genetic test before treatment with GlaxoSmithKline Plc ‘s drug , Ziagen , to see if they face a higher risk of a potentially fatal reaction , U.S. regulators said on Thursday .

These examples occur when slightly edited versions of an article appear on different newswires or in article revisions. Although we have not done a careful study, it appears that something like half of the test sentences in Google’s Billion Word Benchmark are essentially duplicates of training sentences.

In spite of the problems with the Billion Word Benchmark, an appropriate benchmark for comprehension based language modeling should be easy to construct.  For example, we could require that the test sentences be separated by, say, a week from any training sentence.  Or we could simply de-duplicate the articles using a soft-match criterion for duplication.  The training data should consist of complete articles rather than shuffled sentences.  Also note that new test data is continuously available — it is hard to overfit next week’s news.
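One possible soft-match criterion is Jaccard similarity over word sets. The sketch below is just one option, not a method proposed in this post: the tokenization, the 0.8 threshold and the function names are assumptions, and the sentences are taken from the examples quoted earlier.

```python
import re

def tokens(sentence):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))

def jaccard(a, b):
    """Size of the shared vocabulary divided by the combined vocabulary."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def is_near_duplicate(a, b, threshold=0.8):
    return jaccard(a, b) >= threshold

def deduplicate(test_sentences, train_sentences, threshold=0.8):
    """Keep only test sentences with no near-duplicate in training."""
    return [t for t in test_sentences
            if not any(is_near_duplicate(t, s, threshold)
                       for s in train_sentences)]

rbs_train = ("RBS shares have only recently risen past the average level "
             "paid by the Government , but Lloyds shares are still low "
             "despite a recent banking sector rally.")
rbs_test = ("RBS shares have only recently risen past the average level "
            "paid by the government , but Lloyds shares are still "
            "languishing despite the recent banking sector rally .")
salcedo_train = ("Bobby Salcedo was re-elected in November to a second term "
                 "on the board of the El Monte City School District.")
salcedo_test = ("Bobby Salcedo was first elected to the school board in "
                "2004 and was re-elected in November.")

kept = deduplicate([rbs_test, salcedo_test], [rbs_train, salcedo_train])
```

With this threshold the lightly edited RBS sentence is dropped as a duplicate, while the Salcedo test sentence, which describes the same entity and event in different words, is kept. A word-set measure is crude, so a real benchmark would probably want something stronger, such as matching at the article level.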

These are exciting times.  Can we obtain a command of English in a machine simply by training on very large corpora of text?  I find it fun to be optimistic.

Cognitive Architectures

Within the deep learning community there is considerable interest in neural architecture. There are convolutional networks, recurrent networks, LSTMs, GRUs, attention mechanisms, highway networks, inception networks, residual networks, fractal networks and many more.  Most of these architectures can be viewed as certain feed-forward circuit topologies.  Circuits are a universal model of computation.  However, human programmers find it more productive to specify algorithms in high level programming languages.  Presumably this applies to learning as well: learning should be easier with a higher level architecture.

Of course the deep learning community is aware of the relationship between neural architecture and models of computation. General “high level” architectures have been proposed. We have neural Turing machines (actually random access machines), parsing architectures with stacks, and neural architectures for functional programming.  There was a nice NIPS workshop on reasoning, attention and memory (RAM) addressing such fundamental architectural issues.

It seems reasonable to use classical models of computation as inspiration for neural architectures.  But it is important to be aware of the large variety of classical architectural ideas.  Various twentieth century discrete architectures may provide a rich source of inspiration for twenty-first century differentiable architectures. Here is a list of my favorite classical architectural ideas.

Mathematical Logic:   Starting with the ancient Greeks, logic has been developed directly as a model of knowledge representation and thought. Mathematical logic organizes knowledge around entities and relations. Databases are closely related to predicate calculus. Logic is capable of representing knowledge in any domain of discourse. While entities and relations are central to logic, logic involves a variety of additional features such as function application, quantification, and types. Logic also provides the intellectual framework underlying mathematics. Achieving the singularity will presumably require machines to be capable of programming computers. Computer programming seems to require sound analytical (mathematical) reasoning.

Production Systems and Logic Programming:  This style of architecture was championed by Herb Simon and Allen Newell.  It is a way of making logical rules compute efficiently. I will interpret production systems fairly broadly to include various rule-based languages such as SOAR, OPS5, Prolog and Datalog. The cleanest of this family of architectures is bottom-up logic programming, which has a nice relationship to general dynamic programming and is the foundation of the Dyna programming language. Dynamic programming algorithms can be viewed as feed-forward networks where each entry in a dynamic programming table can be viewed as a structured-output unit which computes its value from earlier units and provides its value to later units.

Inductive Logic Programming:  This is a classical unification of machine learning and logic programming championed by Stephen Muggleton. The basic idea is to take a set of assertions in predicate calculus (observed data) and generalize them to a “theory” (a logic program) that is consistent and that implies the data.

Frames, Scripts, and Object-Oriented Programming:  Frames and scripts were championed as a general framework for knowledge representation by Marvin Minsky, Roger Schank, and Charles Fillmore. Frames are related to object-oriented programming in the sense that an instance of the “room frame” (or room class) has fillers for fields such as “ceiling”, “windows” and “furniture”.  Frames also seem related to the ontology of mathematics.  For example, a mathematical field consists of a set together with two operations (addition and multiplication) satisfying certain properties.  The term “structure” has a well defined meaning in model theory (a branch of logic) which is closely related to the notion of a class instance in object-oriented programming.  A specific mathematical field is a structure (in the technical sense) and is an instance of the general mathematical class of fields.

The Situation Calculus and Modal Logic:  In the situation calculus statements take meaning in “situations”.  A “fluent” is a mapping from situations to truth values.  Actions change one situation into another.  This leads to the STRIPS model of actions and planning. Situations are closely related to the possible worlds of modal logic.
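To give a feel for the bottom-up logic programming entry in the list above, here is a minimal sketch of naive forward chaining to a fixed point, using graph reachability as the rule set. It is an illustration only and not how Dyna or any Datalog engine is implemented; the fact encoding, the rule functions and the toy graph are assumptions made for this example.

```python
def forward_chain(facts, rules):
    """Apply every rule to the known facts until nothing new is derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            # Each rule returns a freshly built set, so adding to
            # `known` inside the loop below is safe.
            for new_fact in rule(known):
                if new_fact not in known:
                    known.add(new_fact)
                    changed = True
    return known

def path_from_edge(known):
    """path(X, Y) :- edge(X, Y)."""
    return {("path", f[1], f[2]) for f in known if f[0] == "edge"}

def path_transitive(known):
    """path(X, Z) :- path(X, Y), edge(Y, Z)."""
    paths = [f for f in known if f[0] == "path"]
    edges = [f for f in known if f[0] == "edge"]
    return {("path", p[1], e[2]) for p in paths for e in edges if p[2] == e[1]}

facts = {("edge", "a", "b"), ("edge", "b", "c"), ("edge", "c", "d")}
closure = forward_chain(facts, [path_from_edge, path_transitive])
paths = {f for f in closure if f[0] == "path"}
```

Each derived path fact is computed from earlier facts and feeds later ones, which is the sense in which such a table of derived facts resembles a feed-forward computation.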

Conclusion

I believe that human learning is based on a differentiable universal learning architecture and that domain specific priors are not required. But it is unclear how elaborate the general architecture is.  It seems worth considering the above list of classical architectural ideas and the possibility that these discrete architectures can be made differentiable.


Architectures and Language Instincts

This post is inspired by a recent book and article by Vyvyan Evans declaring the death of the Pinker-Chomsky notion of a language instinct.  I have written previous posts on what I call the Chomsky-Hinton axis of innate-knowledge vs. general learning and I have declared that Hinton (general learning) is winning.  I still believe that.  However, I also believe that the learning architecture matters and that finding really powerful general purpose architectures is nontrivial.  It seems harder than just designing neural architectures emulating Turing machines (or random access machines or functional programs).  We should not ignore the possibility that the structure of language is related to the structure of thought and that the structure of thought is based on a nontrivial innate learning architecture.

A basic issue discussed in Evans’s article is the fact that sophisticated language and thought seem to be restricted to humans and seem to have taken at least 500 million years to appear after the development of “simple” functions like vision. Presumably vision has been based on learning all along.  I put no weight on the claim that other species have similar functions; humans are clearly qualitatively smarter than starlings (or any other animal). Evans seems to accept that this is a mystery and does not seem to have a good solution. He attributes it speculatively to socialization and the need to coordinate behaviors. However, there are many social species and only humans have this qualitatively different level of intelligence. Although we have only a sample of one species, this level of intelligence has perfect correlation with the existence of language.

At least one interpretation of the very late and sudden appearance of language is that it arose as a kind of phase transition in gradual improvements in learning.  A phase transition could occur when learning becomes adequate for the cultural transmission of software (language).  In this scenario thought precedes linguistic communication and is based on the internal data structures of a sophisticated neural architecture.  As learning becomes strong enough, it becomes capable of learning to interpret sophisticated communication signals.  This leads to the cultural transmission of “ideas” — concepts denoted by words such as “have” and “get”.  A phase transition then occurs as a culturally transmitted language (system of concepts) takes hold.  A deeper understanding of the world then arises from the cultural evolution of conceptual systems.  The value of absorbing cultural ideas then places  increased selective pressure on the learning system itself.

But the question remains of what general learning architecture is required. After all, it took 500 million years for the phase transition to occur.  I believe there are clues to the structure of the underlying architecture in language itself.  To the extent that this is true, it may not be meaningful to distinguish a “language instinct” from an innate “learning architecture”.  Maybe the innate architecture involves entities and relations, a kind of database architecture,  providing a general (universal) substrate of representation, computation, and learning.  Maybe such an architecture would provide a unification of Chomsky and Hinton …


Why we need a new foundation of mathematics

I just finished what I am confident is the final revision of my paper on the foundations of mathematics.  I want to explain here what this is all about. Why do we need a new foundation of mathematics?

ZFC set theory has stood as a fixed eight-axiom foundation of mathematics since the early 1920s. ZFC does an excellent job of capturing the notion of provability in mathematics.  Things not provable in ZFC, such as the continuum hypothesis, the existence of inaccessible cardinals, or the consistency of ZFC, are forever out of the reach of mathematics.  As a circumscription of the provable, ZFC suffices.

The problem is that ZFC is untyped. It is a kind of raw assembly language. Mathematics is about “concepts” — groups, fields, vector spaces, manifolds and so on. We want a type system where the types are in correspondence with the naturally occurring concepts. In mathematics every concept (every type) is associated with a notion of isomorphism. We immediately understand what it means for two graphs to be isomorphic. Similarly for groups, fields and topological spaces, and in fact, all mathematical concepts. We need to define what we mean by a concept (a type) in general and what we mean by “isomorphism” at each type.  Furthermore,  we must prove an abstraction theorem that isomorphic objects are inter-substitutable in well-typed contexts.
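For the graph case, isomorphism and inter-substitutability can be illustrated with a brute-force Python sketch. The representation of a graph as a pair of node set and edge set, and the names `make_graph`, `find_isomorphism` and `triangle_count`, are assumptions made for this example; the sketch only checks one invariant on small graphs and is not a statement of the abstraction theorem itself.

```python
from itertools import combinations, permutations


def make_graph(nodes, edges):
    """An undirected graph as (node set, set of two-element edge sets)."""
    return frozenset(nodes), frozenset(frozenset(e) for e in edges)


def find_isomorphism(g, h):
    """Return a node bijection carrying g's edges onto h's edges, or None."""
    (g_nodes, g_edges), (h_nodes, h_edges) = g, h
    if len(g_nodes) != len(h_nodes) or len(g_edges) != len(h_edges):
        return None
    source = sorted(g_nodes, key=repr)
    for image in permutations(sorted(h_nodes, key=repr)):
        f = dict(zip(source, image))
        if {frozenset(f[x] for x in e) for e in g_edges} == h_edges:
            return f
    return None


def triangle_count(g):
    """A property defined without reference to node names."""
    nodes, edges = g
    return sum(
        1
        for a, b, c in combinations(sorted(nodes, key=repr), 3)
        if {frozenset((a, b)), frozenset((b, c)), frozenset((a, c))} <= edges
    )


# A triangle with one pendant node, written with two different node sets.
g = make_graph("abcd", [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])
h = make_graph([1, 2, 3, 4], [(1, 2), (2, 3), (3, 1), (1, 4)])
# A four-cycle: same number of nodes and edges, but not isomorphic to g.
k = make_graph([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4), (4, 1)])
```

The graphs `g` and `h` are isomorphic, and `triangle_count`, which never mentions a particular node, gives the same answer on both. The graph `k` is not isomorphic to `g`, and the same property tells them apart.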

We also want to prove “Voldemort’s theorem”. Voldemort’s theorem states that certain objects exist but cannot be named.  For example, it is impossible to name any particular point on a circle (unless coordinates are introduced) or any particular node in an abstract complete graph, any particular basis (coordinate system) for an abstract vector space, or any particular isomorphism between a finite dimensional vector space and its dual.
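The symmetry intuition behind the complete-graph example can be sketched in Python. Any isomorphism-invariant way of picking out a node must pick one that every automorphism leaves in place, so the sketch enumerates automorphisms by brute force and lists the nodes they all fix. The function names are hypothetical, and this illustrates only the intuition, not the theorem as formulated in the paper.

```python
from itertools import combinations, permutations


def complete_graph_edges(nodes):
    """All unordered pairs of distinct nodes."""
    return {frozenset(pair) for pair in combinations(nodes, 2)}


def automorphisms(nodes, edges):
    """Every bijection on the nodes that maps the edge set onto itself."""
    found = []
    for image in permutations(nodes):
        f = dict(zip(nodes, image))
        if {frozenset(f[x] for x in e) for e in edges} == edges:
            found.append(f)
    return found


def nodes_fixed_by_all_automorphisms(nodes, edges):
    """Nodes that no symmetry of the graph can move."""
    autos = automorphisms(nodes, edges)
    return [n for n in nodes if all(f[n] == n for f in autos)]


# Complete graph on four nodes: every permutation is an automorphism.
k4_nodes = ["u", "v", "w", "x"]
k4_edges = complete_graph_edges(k4_nodes)

# Contrast: a three-node path, whose middle node is distinguished.
path_nodes = ["u", "v", "w"]
path_edges = {frozenset({"u", "v"}), frozenset({"v", "w"})}
```

For the complete graph no node survives, since every permutation is a symmetry. For the path, the middle node `v` is fixed by every automorphism, which is consistent with it being describable as “the node with two neighbors”.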

Consider Wigner’s famous statement on the unreasonable effectiveness of mathematics in physics. I feel comfortable paraphrasing this as stating the unreasonable effectiveness of mathematical concepts in physics – the effectiveness of concepts such as manifolds and group representations. Phrased this way, it can be seen as stating the unreasonable effectiveness of type theory in physics. Strange.

The abstraction theorem and Voldemort’s theorem are fundamentally about types and well-typed expressions.  These theorems cannot be formulated in classical (untyped) set theory.  The notion of canonicality or naturality (Voldemort’s theorem) is traditionally formulated in category theory.  But category theory does not provide a grammar of types defining the mathematical concepts. Nor does it give a general definition of the category associated with an arbitrary mathematical concept.  It reduces “objects” to “points” where the structure of the points is supposed to be somehow represented by the morphisms.  I have always had a distinct dislike for category theory.

I started working on this project about six years ago.  I found that defining the semantics of types in a way that supports the abstraction theorem and Voldemort’s theorem was shockingly difficult (for me at least).  About three years ago I became aware of homotopy type theory (HoTT) and Voevodsky’s univalence axiom which establishes a notion of isomorphism and an abstraction theorem within the framework of Martin-Löf type theory. Vladimir Voevodsky is a Fields medalist and full professor at IAS who organized a special year at IAS in 2012 on HoTT.

After becoming aware of HoTT I continued to work on Morphoid type theory (MorTT) without studying HoTT.  This was perhaps reckless but it is my nature to want to do it myself and then compare my results with other work.  In this case it worked out, at least in the sense that MorTT is quite different from HoTT. I quite likely could not have constructed MorTT had I been distracted by HoTT.  Those interested in details of MorTT and its relationship to HoTT should see my manuscript.
