## Comprehension Based Language Modeling

One of the holy grails of the modern deep learning community is to develop effective methods for unsupervised learning. I have always held out hope that the semantics of English could be learned from raw unlabeled text. The plausibility of a statement should affect the probability of that statement. It would seem that a perfect language model, one approaching the true perplexity or entropy of English, must incorporate semantics. For this reason I read with great interest about the recent dramatic improvement in language modeling achieved by Jozefowicz et al. at Google Brain. They report a perplexity of about 30 on Google’s One Billion Word Benchmark. Perplexities below 60 have been very difficult to achieve.

At TTIC we have been working on machine comprehension as another direction for the acquisition of semantics. We created a “Who Did What” benchmark for reading comprehension based on the LDC Gigaword Corpus.  I will discuss reading comprehension a bit more below, but I first want to focus on the nature of news corpora.  We found that the Gigaword corpus contained multiple articles on the same events.  We exploited this fact in the construction of our Who-did-What dataset.  I was curious how the Billion Word Benchmark handled this multiplicity.  Presumably the training data contained information about the same entities and events that occur in the test sentences. What is the entropy of a statement under the listener’s probability distribution when the listener knows (the semantic content of) what the speaker is going to say?  To make this issue clear, here are some examples of training and test sentences from Google’s One Billion Word Corpus:

Train: Saudi Arabia on Wednesday decided to drop the widely used West Texas Intermediate oil contract as the benchmark for pricing its oil.

Test: Saudi Arabia , the world ‘s largest oil exporter , in 2009 dropped the widely used WTI oil contract as the benchmark for pricing its oil to customers in the US.

Train: Bobby Salcedo was re-elected in November to a second term on the board of the El Monte City School District.

Test: Bobby Salcedo was first elected to the school board in 2004 and was re-elected in November.

For comprehension based language modeling to make sense one should have training data describing the same entities and events as those occurring in the test data.  It might be good to measure the entropy (or perplexity) of just the content words, or even just named entities.  But one should of course avoid including the test sentences in the training data. It turns out that Google’s Billion Word Benchmark has a problem in this regard.  We found the following examples:

Train: Al Qaeda deputy leader Ayman al-Zawahri called on Muslims in a new audiotape released Monday to strike Jewish and American targets in revenge for Israel ‘s offensive in the Gaza Strip earlier this month.

Test: Al-Qaida deputy leader Ayman al-Zawahri called on Muslims in a new audiotape released Monday to strike Jewish and American targets in revenge for Israel ‘s recent offensive in the Gaza Strip.

Train: RBS shares have only recently risen past the average level paid by the Government , but Lloyds shares are still low despite a recent banking sector rally.

Test: RBS shares have only recently risen past the average level paid by the government , but Lloyds shares are still languishing despite the recent banking sector rally .

Train: WASHINGTON – AIDS patients should have a genetic test before treatment with GlaxoSmithKline Plc ‘s drug Ziagen to see whether they face a higher risk of a potentially fatal reaction , U.S. regulators said on Thursday .

Test: WASHINGTON ( Reuters ) – AIDS patients should be given a genetic test before treatment with GlaxoSmithKline Plc ‘s drug , Ziagen , to see if they face a higher risk of a potentially fatal reaction , U.S. regulators said on Thursday .

These examples occur when slightly edited versions of an article appear on different newswires or in article revisions. Although we have not done a careful study, it appears that something like half of the test sentences in Google’s Billion Word Benchmark are essentially duplicates of training sentences.

In spite of the problems with the Billion Word Benchmark, an appropriate benchmark for comprehension based language modeling should be easy to construct.  For example, we could require that the test sentences be separated by, say, a week from any training sentence.  Or we could simply de-duplicate the articles using a soft-match criterion for duplication.  The training data should consist of complete articles rather than shuffled sentences.  Also note that new test data is continuously available — it is hard to overfit next week’s news.
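To make the soft-match idea concrete, here is a minimal sketch of one possible criterion: Jaccard similarity over word trigrams. The function names, the trigram size and the 0.5 threshold are all my own illustrative assumptions, not something we have tuned or used on the benchmark.

```python
def shingles(sentence, n=3):
    """Return the set of word n-grams (shingles) of a lowercased sentence."""
    words = sentence.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_near_duplicate(test_sentence, train_sentences, threshold=0.5, n=3):
    """True if some training sentence soft-matches the test sentence.

    The threshold is an assumed value chosen for illustration only.
    """
    test_set = shingles(test_sentence, n)
    return any(jaccard(test_set, shingles(s, n)) >= threshold
               for s in train_sentences)


train = ["the cat sat on the mat today"]
print(is_near_duplicate("the cat sat on the mat yesterday", train))
print(is_near_duplicate("stocks fell sharply in early trading", train))
```

A real de-duplication pass over a billion words would need something like hashing to avoid comparing every pair, but the criterion itself can be this simple.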

These are exciting times.  Can we obtain a command of English in a machine simply by training on very large corpora of text?  I find it fun to be optimistic.

## Cognitive Architectures

Within the deep learning community there is considerable interest in neural architecture. There are convolutional networks, recurrent networks, LSTMs, GRUs, attention mechanisms, highway networks, inception networks, residual networks, fractal networks and many more. Most of these architectures can be viewed as certain feed-forward circuit topologies. Circuits are a universal model of computation. However, human programmers find it more productive to specify algorithms in high level programming languages. Presumably this applies to learning as well: learning should be easier with a higher level architecture.

Of course the deep learning community is aware of the relationship between neural architecture and models of computation. General “high level” architectures have been proposed. We have neural Turing machines (actually random access machines), parsing architectures with stacks, and neural architectures for functional programming. There was a nice NIPS workshop on reasoning, attention and memory (RAM) addressing such fundamental architectural issues.

It seems reasonable to use classical models of computation as inspiration for neural architectures. But it is important to be aware of the large variety of classical architectural ideas. Various twentieth century discrete architectures may provide a rich source of inspiration for twenty-first century differentiable architectures. Here is a list of my favorite classical architectural ideas.

Mathematical Logic:   Starting with the ancient Greeks, logic has been developed directly as a model of knowledge representation and thought. Mathematical logic organizes knowledge around entities and relations. Databases are closely related to predicate calculus. Logic is capable of representing knowledge in any domain of discourse. While entities and relations are central to logic, logic involves a variety of additional features such as function application, quantification, and types. Logic also provides the intellectual framework underlying mathematics. Achieving the singularity will presumably require machines to be capable of programming computers. Computer programming seems to require sound analytical (mathematical) reasoning.

Production Systems and Logic Programming:  This style of architecture was championed by Herb Simon and Allen Newell. It is a way of making logical rules compute efficiently. I will interpret production systems fairly broadly to include various rule-based languages such as SOAR, OPS5, Prolog and Datalog. The cleanest of this family of architectures is bottom-up logic programming, which has a nice relationship to general dynamic programming and is the foundation of the Dyna programming language. Dynamic programming algorithms can be viewed as feed-forward networks where each entry in a dynamic programming table can be viewed as a structured-output unit which computes its value from earlier units and provides its value to later units.
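As a rough illustration of bottom-up logic programming, the sketch below runs two Datalog-style rules to a fixpoint to compute graph reachability. The representation (facts as tuples, rules as Python functions) and all names are my own assumptions for the example; it is not how Dyna or any production system is actually implemented.

```python
def forward_chain(facts, rules):
    """Apply every rule bottom-up until no new facts are derived (a fixpoint)."""
    derived = set(facts)
    while True:
        new = set()
        for rule in rules:
            new |= rule(derived) - derived
        if not new:
            return derived
        derived |= new


def path_from_edge(facts):
    """Rule: path(x, y) :- edge(x, y)."""
    return {("path", x, y) for (p, x, y) in facts if p == "edge"}


def path_transitive(facts):
    """Rule: path(x, z) :- path(x, y), edge(y, z)."""
    paths = [(x, y) for (p, x, y) in facts if p == "path"]
    edges = [(y, z) for (p, y, z) in facts if p == "edge"]
    return {("path", x, z)
            for (x, y) in paths
            for (y2, z) in edges
            if y == y2}


edges = {("edge", "a", "b"), ("edge", "b", "c"), ("edge", "c", "d")}
closure = forward_chain(edges, [path_from_edge, path_transitive])
print(("path", "a", "d") in closure)
```

Each derived `path` fact plays the role of a dynamic programming table entry: it is computed from earlier entries and feeds later ones.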

Inductive Logic Programming:  This is a classical unification of machine learning and logic programming championed by Stephen Muggleton. The basic idea is to take a set of assertions in predicate calculus (observed data) and generalize them to a “theory” (a logic program) that is consistent and that implies the data.

Frames, Scripts, and Object-Oriented Programming:  Frames and scripts were championed as a general framework for knowledge representation by Marvin Minsky, Roger Schank, and Charles Fillmore. Frames are related to object-oriented programming in the sense that an instance of the “room frame” (or room class) has fillers for fields such as “ceiling”, “windows” and “furniture”. Frames also seem related to the ontology of mathematics. For example, a mathematical field consists of a set together with two operations (addition and multiplication) satisfying certain properties. The term “structure” has a well defined meaning in model theory (a branch of logic) which is closely related to the notion of a class instance in object-oriented programming. A specific mathematical field is a structure (in the technical sense) and is an instance of the general mathematical class of fields.

The Situation Calculus and Modal Logic:  In the situation calculus statements take meaning in “situations”. A “fluent” is a mapping from situations to truth values. Actions change one situation into another. This leads to the STRIPS model of actions and planning. Situations are closely related to the possible worlds of modal logic.
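Here is a minimal sketch of the STRIPS view, assuming a situation is represented as the set of fluents that are true in it and an action is a triple of preconditions, add effects and delete effects. The class, function and fluent names are hypothetical, chosen only for the example.

```python
from typing import FrozenSet, NamedTuple


class Action(NamedTuple):
    """A STRIPS-style action: preconditions, an add list and a delete list."""
    name: str
    preconditions: FrozenSet[str]
    add_effects: FrozenSet[str]
    delete_effects: FrozenSet[str]


def holds(fluent, situation):
    """A fluent maps a situation to a truth value; here, set membership."""
    return fluent in situation


def apply_action(situation, action):
    """Return the new situation produced by the action; the old one is unchanged."""
    if not action.preconditions <= situation:
        raise ValueError(f"{action.name} is not applicable in this situation")
    return (situation - action.delete_effects) | action.add_effects


start = frozenset({"at_door", "door_closed"})
open_door = Action(
    name="open_door",
    preconditions=frozenset({"at_door", "door_closed"}),
    add_effects=frozenset({"door_open"}),
    delete_effects=frozenset({"door_closed"}),
)
after = apply_action(start, open_door)
print(holds("door_open", after), holds("door_closed", after))
```

Because situations are immutable values, several hypothetical futures can be held side by side, which is roughly what planning over possible worlds requires.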

## Conclusion

I believe that human learning is based on a differentiable universal learning architecture and that domain specific priors are not required. But it is unclear how elaborate the general architecture is.  It seems worth considering the above list of classical architectural ideas and the possibility that these discrete architectures can be made differentiable.

Posted in Uncategorized | 3 Comments

## Architectures and Language Instincts

This post is inspired by a recent book and article by Vyvyan Evans declaring the death of the Pinker-Chomsky notion of a language instinct. I have written previous posts on what I call the Chomsky-Hinton axis of innate-knowledge vs. general learning and I have declared that Hinton (general learning) is winning. I still believe that. However, I also believe that the learning architecture matters and that finding really powerful general purpose architectures is nontrivial. It seems harder than just designing neural architectures emulating Turing machines (or random access machines or functional programs). We should not ignore the possibility that the structure of language is related to the structure of thought and that the structure of thought is based on a nontrivial innate learning architecture.

A basic issue discussed in Evans’s article is the fact that sophisticated language and thought seem to be restricted to humans and seem to have taken at least 500 million years since the development of “simple” functions like vision. Presumably vision has been based on learning all along. I put no weight on the claim that other species have similar functions: humans are clearly qualitatively smarter than starlings (or any other animal). Evans seems to accept that this is a mystery and does not seem to have a good solution. He attributes it speculatively to socialization and the need to coordinate behaviors. However, there are many social species and only humans have this qualitatively different level of intelligence. Although we have only a sample of one species, this level of intelligence has perfect correlation with the existence of language.

At least one interpretation of the very late and sudden appearance of language is that it arose as a kind of phase transition in gradual improvements in learning.  A phase transition could occur when learning becomes adequate for the cultural transmission of software (language).  In this scenario thought precedes linguistic communication and is based on the internal data structures of a sophisticated neural architecture.  As learning becomes strong enough, it becomes capable of learning to interpret sophisticated communication signals.  This leads to the cultural transmission of “ideas” — concepts denoted by words such as “have” and “get”.  A phase transition then occurs as a culturally transmitted language (system of concepts) takes hold.  A deeper understanding of the world then arises from the cultural evolution of conceptual systems.  The value of absorbing cultural ideas then places  increased selective pressure on the learning system itself.

But the question remains of what general learning architecture is required. After all, it took 500 million years for the phase transition to occur.  I believe there are clues to the structure of the underlying architecture in language itself.  To the extent that this is true, it may not be meaningful to distinguish a “language instinct” from an innate “learning architecture”.  Maybe the innate architecture involves entities and relations, a kind of database architecture,  providing a general (universal) substrate of representation, computation, and learning.  Maybe such an architecture would provide a unification of Chomsky and Hinton …


## Why we need a new foundation of mathematics

I just finished what I am confident is the final revision of my paper on the foundations of mathematics.  I want to explain here what this is all about. Why do we need a new foundation of mathematics?

ZFC set theory has stood as a fixed eight-axiom foundation of mathematics since the early 1920s. ZFC does an excellent job of capturing the notion of provability in mathematics. Things not provable in ZFC, such as the continuum hypothesis, the existence of inaccessible cardinals, or the consistency of ZFC, are forever out of the reach of mathematics. As a circumscription of the provable, ZFC suffices.

The problem is that ZFC is untyped. It is a kind of raw assembly language. Mathematics is about “concepts” — groups, fields, vector spaces, manifolds and so on. We want a type system where the types are in correspondence with the naturally occurring concepts. In mathematics every concept (every type) is associated with a notion of isomorphism. We immediately understand what it means for two graphs to be isomorphic. Similarly for groups, fields and topological spaces, and in fact, all mathematical concepts. We need to define what we mean by a concept (a type) in general and what we mean by “isomorphism” at each type.  Furthermore,  we must prove an abstraction theorem that isomorphic objects are inter-substitutable in well-typed contexts.

We also want to prove “Voldemort’s theorem”. Voldemort’s theorem states that certain objects exist but cannot be named.  For example, it is impossible to name any particular point on a circle (unless coordinates are introduced) or any particular node in an abstract complete graph, any particular basis (coordinate system) for an abstract vector space, or any particular isomorphism between a finite dimensional vector space and its dual.

Consider Wigner’s famous statement on the unreasonable effectiveness of mathematics in physics. I feel comfortable paraphrasing this as stating the unreasonable effectiveness of mathematical concepts in physics – the effectiveness of concepts such as manifolds and group representations. Phrased this way, it can be seen as stating the unreasonable effectiveness of type theory in physics. Strange.

The abstraction theorem and Voldemort’s theorem are fundamentally about types and well-typed expressions.  These theorems cannot be formulated in classical (untyped) set theory.  The notion of canonicality or naturality (Voldemort’s theorem) is traditionally formulated in category theory.  But category theory does not provide a grammar of types defining the mathematical concepts. Nor does it give a general definition of the category associated with an arbitrary mathematical concept.  It reduces “objects” to “points” where the structure of the points is supposed to be somehow represented by the morphisms.  I have always had a distinct dislike for category theory.

I started working on this project about six years ago. I found that defining the semantics of types in a way that supports the abstraction theorem and Voldemort’s theorem was shockingly difficult (for me at least). About three years ago I became aware of homotopy type theory (HoTT) and Voevodsky’s univalence axiom, which establishes a notion of isomorphism and an abstraction theorem within the framework of Martin-Löf type theory. Vladimir Voevodsky is a Fields medalist and full professor at IAS who organized a special year at IAS in 2012 on HoTT.

After becoming aware of HoTT I continued to work on Morphoid type theory (MorTT) without studying HoTT.  This was perhaps reckless but it is my nature to want to do it myself and then compare my results with other work.  In this case it worked out, at least in the sense that MorTT is quite different from HoTT. I quite likely could not have constructed MorTT had I been distracted by HoTT.  Those interested in details of MorTT and its relationship to HoTT should see my manuscript.


## The Foundations of Mathematics

It seems clear that most everyday human language, and presumably most everyday human thought, is highly metaphorical.  Most human statements do not have definite Boolean truth values.  Everyday statements have a degree of truth and appropriateness as descriptions of actual events. But while most everyday language is metaphorical, it is also true that some people are capable of doing mathematics — they are capable of working with statements that do have precise truth values.  Software engineers in particular must deal with the precise concepts and constructs of formal programming languages.  If we are ever going to achieve the singularity it seems likely that we will have to build machines capable of software engineering and hence capable of precise mathematical thought.

How well do we understand mathematical thought? There are many talented mathematicians in the world. But the ability to do mathematics does not imply the ability to create a machine that can do mathematics. Logic is metamathematics: the study of the computational process of doing mathematics. Metamathematics seems more a part of AI than a part of mathematics.

Mathematical thought exhibits various surprising and apparently magical abilities.  A software engineer understands the distinction between the implementation and the behavior implemented.  Consider a priority queue.  It is obvious that if we have two correct implementations of priority queues then we can (or at least should be able to) swap in one for the other in a program that uses a priority queue.  This notion that behaviorally identical modules are (or should be) interchangeable in large software systems is one of the magical properties of mathematical thought.  How is “identical behavior” understood?
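The priority queue example can be made concrete with a small sketch. Below are two implementations with the same interface, one built on the standard `heapq` module and one on a plain list, and a client that cannot tell them apart. The class and method names (`push`, `pop_min`, `sort_with`) are my own choices for illustration.

```python
import heapq


class HeapPriorityQueue:
    """Priority queue backed by a binary heap."""

    def __init__(self):
        self._heap = []

    def push(self, item):
        heapq.heappush(self._heap, item)

    def pop_min(self):
        return heapq.heappop(self._heap)

    def __len__(self):
        return len(self._heap)


class ListPriorityQueue:
    """Priority queue backed by an unsorted list (slower, same behavior)."""

    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop_min(self):
        smallest = min(self._items)
        self._items.remove(smallest)
        return smallest

    def __len__(self):
        return len(self._items)


def sort_with(queue_class, values):
    """A client that uses only the interface: push, pop_min and len."""
    queue = queue_class()
    for value in values:
        queue.push(value)
    return [queue.pop_min() for _ in range(len(queue))]


data = [5, 1, 4, 1, 3]
print(sort_with(HeapPriorityQueue, data) == sort_with(ListPriorityQueue, data))
```

The internal states of the two queues differ at every step, yet `sort_with` sees only the interface. Saying precisely why that licenses the swap is the hard part.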

The notion of identical behavior is closely related to the phenomenon of abstraction. Perhaps the most fundamental abstraction is the concept of a cardinal number such as the number seven. I can say I have seven apples and you have two apples and together we have nine apples. The idea that seven plus two is nine abstracts away from the fact that we are talking about apples and seems to involve thought about sets at a more abstract level. We are taking the union of two disjoint sets. The union “behavior” is defined at the level of cardinal numbers, where a set is abstracted to its number of elements. Another example of common sense abstraction is the concept of a sequence: the ninth item will be the second item after the seventh item. Yet another example is the concept of a bag of items such as the collection of coins in my pocket which might be three nickels and two quarters. We machine learning researchers have no trouble understanding that a document can be abstracted to a bag of words. What mental computation underlies this immediate understanding of abstraction? What is abstraction?

It seems clear that abstraction involves the specification of the interface to an object.  For example the interface to a priority queue is the set of operations it supports.  A bag (multiset) is generally defined as a mapping from a set (such as the set of all words) to a number — the number of times the word occurs in the bag.  But this definition misses the point (in my opinion) that any function $f:\sigma \rightarrow \tau$ defines a bag of $\tau$.  For example, a document is a function from ordinal numbers “nth” to words.  Any function to words can be abstracted to a bag of words.  The amazing thing is that we do not have to be taught this.  We just know already.  We also know that in passing from a sequence to a bag we are losing information.
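The claim that any function into $\tau$ defines a bag of $\tau$ is easy to sketch for finite domains. In the snippet below, `bag_of` is a hypothetical helper name, and a document is treated as a function from positions to words; `collections.Counter` serves as the multiset.

```python
from collections import Counter


def bag_of(f, domain):
    """Abstract a function f on a finite domain to the bag of its values."""
    return Counter(f(x) for x in domain)


# A document viewed as a function from positions ("nth") to words.
doc1 = ["the", "dog", "saw", "the", "cat"]
doc2 = ["the", "cat", "saw", "the", "dog"]

bag1 = bag_of(doc1.__getitem__, range(len(doc1)))
bag2 = bag_of(doc2.__getitem__, range(len(doc2)))

print(bag1["the"])                 # number of occurrences of "the"
print(doc1 == doc2, bag1 == bag2)  # different sequences, same bag
```

The two documents differ as sequences but agree as bags, which is exactly the information lost in passing from one to the other.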

Another example is the keys on my key loop. I understand that my two car keys are next to each other on the loop even though there is no “first” key among the eight keys on the loop. Somehow I understand that “next to” is part of the “behavior” provided by my key loop while “first” is not.

Lakoff and Núñez argue in the book “Where Mathematics Comes From” that we come to understand numbers through a process of metaphor to physical collections. But I think this is backwards. I think we are able to understand physical collections because we have an innate cognitive architecture which first parses the world into “things” (entities) and assigns them types and groupings. A properly formulated system of logic should include, as a fundamental primitive, a notion of abstraction. The ability to abstract a group to a cardinality, or a sequence to a bag, seems to arise from a general innate capacity for abstraction. Presumably there is some innate cognitive architecture capable of supporting “abstract thought” that distinguishes humans from other animals. A proper formulation of logic (of the foundations of mathematics) will presumably provide insight into the nature of this cognitive architecture.

I will end by shamelessly promoting my paper, Morphoid Type Theory,  which I just put on arXiv.  The paper describes a new type theoretic foundation of mathematics focusing on abstraction and isomorphism (identity of behavior).  Unfortunately the paper is highly technical and is mostly written for professional type theorists and professional mathematicians.   Section 2, “inference rules”, should be widely accessible.

## Friendly AI and the Servant Mission

Most computer science academics dismiss any talk of real success in artificial intelligence. I think that a more rational position is that no one can really predict when human level AI will be achieved. John McCarthy once told me that when people ask him when human level AI will be achieved he says between five and five hundred years from now. McCarthy was a smart man.

Given the uncertainties surrounding AI, it seems prudent to consider the issue of friendly AI. I think that the departure point for any discussion of friendly AI should be the concept of rationality. In the classical formulation, a rational agent acts so as to maximize expected utility. The important word here is “utility”. In the reinforcement learning literature this gets mapped to the word “reward”: an agent is taken to act so as to maximize expected future reward. In game theory “utility” is often mapped to “payout”: a best response (strategy) is one that maximizes expected payout holding the policies of other players fixed.
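For readers who prefer code to definitions, the classical formulation can be sketched in a few lines. Everything here is an illustrative assumption: each action is given as a probability distribution over named outcomes, and the utility is a table over those outcomes.

```python
def expected_utility(outcome_probs, utility):
    """Expected utility of an action given its outcome distribution."""
    return sum(p * utility[outcome] for outcome, p in outcome_probs.items())


def rational_choice(actions, utility):
    """The action with maximal expected utility."""
    return max(actions, key=lambda a: expected_utility(actions[a], utility))


# Hypothetical toy problem: two actions, two outcomes.
utility = {"win": 1.0, "lose": 0.0}
actions = {
    "safe": {"win": 0.6, "lose": 0.4},
    "risky": {"win": 0.3, "lose": 0.7},
}
print(rational_choice(actions, utility))
```

Note that the utility table is an input here. The rest of this post is about why writing down that table for the real world is the problematic step.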

The basic idea of friendly AI is that we can design the AI to want to be nice to us (friendly). We will give the AI a purpose or mission — a meaning of life — in correspondence with our purpose in building the machine.

The conceptual framework of rationality is central here. When presented with choices an agent is assumed to do its best in maximizing its subjective utility. In the case of an agent designed to serve a purpose, rational behavior should aim to fulfill that purpose. The critical point is that there is no rational basis for altering one’s purpose. Adopting a particular strategy or goal in the pursuit of a purpose is simply to make a particular choice, such as a career choice. Also, choosing to profess a purpose different from one’s actual purpose is again making a choice in the service of the actual purpose. Choosing to actually change one’s purpose is fundamentally irrational. So an AI with an appropriately designed purpose should be safe in the sense that the purpose will not change.

But how do we specify or “build-in” a life-purpose for an AI and what should that purpose be? First I want to argue that a direct application of the formal frameworks of rationality, reinforcement learning and game theory is problematic and even dangerous in the context of the singularity. More specifically, consider specifying a “utility”, “reward signal” or “payout” as a function of “world state”. The problem here is in formulating any conception of world state. I think that for the properties we care about, such as respect for human values, it would be a huge mistake to try to give a physical formulation of world states. But any non-physical conception of world state, including things like who is married to whom and who insulted whom, is bound to be controversial, incomplete, and problematic. This is especially true if we think about defining an appropriate utility function for an AI. Defining a function on world states just seems unworkable to me.

An alternative to specifying a utility function is to state a purpose in English (or any natural language). This occurs in mission statements for nonprofit institutions or in a donor’s specification of the purpose of a donated fund. Asimov’s laws are written in English but specify constraints rather than objectives. My favorite mission statement is what I call the servant mission.

Servant Mission: Within the law, fulfill the requests of David McAllester.

Under the servant mission the agent is obligated to obey both the law and its master (me in the above statement). The agent can be controlled by society simply by passing new laws and by its master when the master makes requests. The servant mission transfers moral responsibility from the servant to its master. It also allows a very large number of distinct AI agents, perhaps one for each human, each with a different master and hence a different mission. The hope would be for a balance of power with no single AI (no single master) in control. The servant mission seems clearer and more easily interpreted than other proposals such as Asimov’s laws. This makes the mission less open to unintended consequences. Of course the agent must be able to interpret requests (more on this below). The servant mission also preserves human free will, which does not seem guaranteed in other approaches, such as Yudkowsky’s Coherent Extrapolated Volition (CEV) model, which seem to allow for a “friendly” dictator making all decisions for us. I believe that humans (certainly myself) will want to preserve their free will in any post-singularity society.

It is important to emphasize that no agent has a rational basis for altering its purpose. There is no rational basis for an agent with the servant mission to decide not to be a servant (not to follow its mission).

Of course natural language mission statements rely on the semantics of English. Even if the relationship between language and reality is mysterious, we can still judge in many (most?) cases when natural language statements are true. We have a useful conception of “lie” — the making of a false statement. So truth, while mysterious, does exist to some extent. An AI with an English mission, such as the servant mission, should have a first unstated mission of  understanding the intent of the author of the mission. Understanding the actual intent of the mission statement should be the first priority (the first mission) of the agent and should be within the capacity of any super-intelligent AI. For example, the AI should understand that “fulfilling requests” means that a later request can override an earlier request. A deep command of English should allow a faithful and authentic execution of the servant mission.

I personally believe that it is likely that within a decade agents will be capable of compelling conversation about the everyday events that are the topics of non-technical dinner conversations. I think this will happen long before machines can program themselves leading to an intelligence explosion. The early stages of artificial general intelligence (AGI) will be safe. However, the early stages of AGI will provide an excellent test bed for the servant mission or other approaches to friendly AI. An experimental approach has also been promoted by Ben Goertzel in a nice blog post on friendly AI. If there is a coming era of safe (not too intelligent) AGI then we will have time to think further about later more dangerous eras.


## AI and Free Will: When do choices exist?

This post was originally made on July 15, 2013.

This is a philosophical post instigated by Scott Aaronson’s recent paper and blog post regarding free will. A lot hinges, I think, on how one phrases the question. I like the question “When do choices exist?” as opposed to “Do people have free will?”. I will take two passes at this question. The first is a discussion of game theory. The second is a discussion of colloquial language regarding choice. My conclusion is that choices exist even when the decision making process is deterministic.

Game Theory. Game theory postulates the existence of choices.  A bimatrix game is defined by two matrices each of which is indexed by two choices — a choice for player A and a choice for player B.  Given a choice for each player the first matrix specifies a payout for player A and the second matrix specifies a payout for player B.  Here choices exist by definition.
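A bimatrix game is small enough to write down directly. The sketch below uses the prisoner's dilemma as an assumed example, with choice 0 meaning cooperate and choice 1 meaning defect; the helper names are hypothetical. Rows index player A's choice and columns index player B's choice.

```python
# payout_a[i][j] is A's payout and payout_b[i][j] is B's payout
# when A makes choice i and B makes choice j.
payout_a = [[-1, -3],
            [0, -2]]
payout_b = [[-1, 0],
            [-3, -2]]


def best_response_for_a(payout_a, choice_b):
    """A's payout-maximizing choice when B's choice is held fixed."""
    return max(range(len(payout_a)), key=lambda i: payout_a[i][choice_b])


def best_response_for_b(payout_b, choice_a):
    """B's payout-maximizing choice when A's choice is held fixed."""
    row = payout_b[choice_a]
    return max(range(len(row)), key=lambda j: row[j])


print(best_response_for_a(payout_a, 0), best_response_for_a(payout_a, 1))
print(best_response_for_b(payout_b, 0), best_response_for_b(payout_b, 1))
```

The indices of the matrices are the choices. They are there before anyone computes anything, which is the sense in which choices exist by definition.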

We write programs that play games.  A computer chess program has choices — playing chess involves selecting moves.  Furthermore, it seems completely appropriate to describe the computation taking place in a min-max search as “considering” the choices and “selecting” a choice with desirable outcomes.  Note that most chess programs use only deterministic computation.  Here the choices exist by virtue of the rules of chess.
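To see how little is needed for “considering” and “selecting”, here is a toy min-max search over an explicit game tree. The tree encoding (leaves are numbers, internal nodes are lists of children) and the function names are assumptions made for this sketch; a real chess program adds move generation, depth limits and evaluation heuristics.

```python
def minimax(node, maximizing=True):
    """Value of a game tree: leaves are payouts, internal nodes are lists."""
    if not isinstance(node, list):
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)


def select_move(children):
    """Consider every available move and select one with the best value.

    After our move the opponent moves, so each child is evaluated
    with the opponent minimizing.
    """
    return max(range(len(children)),
               key=lambda i: minimax(children[i], maximizing=False))


# Three available moves; the opponent then picks the worst leaf for us.
tree = [[3, 5], [2, 9], [4, 6]]
print(minimax(tree), select_move(tree))
```

The computation is entirely deterministic: the same tree always yields the same selection. The three moves are nonetheless there to be chosen among.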

It seems perfectly consistent to me to assume that my own consideration of choices, like the considerations of a chess program, is based on deterministic computation. Even if I am determined and predictable, the world presents me with choices and I must still choose. Furthermore, I would argue that, even if I am determined, the choices still exist: for a given chess position there is actually a set of legal moves. The choices are real.

Colloquial Language. Consider a sentence of the form “she had a choice”. Under what conditions do we colloquially take such a sentence to be true? For example, we might say she had a choice between attending Princeton or attending Harvard. The typical condition under which this is true is when she was accepted to both. The fact that she was accepted to both says nothing about determinism vs. nondeterminism. It does, however, imply colloquially that the choice exists.

The issues of the semantics of natural language are difficult.  I plan various blog posts on semantics. The central semantic phenomenon, in my opinion, is paraphrase and entailment — what are the different ways of saying the same or similar things and what conclusions can we draw from given statements.  I believe that a careful investigation of paraphrase and entailment for statements of the form “she had a choice” would show that the existence of choices is taken to be a property of the world, and perhaps the abilities of the agent to perform certain actions, but not a property of the fundamental nature of the computation that makes the selection.

Summary. It seems to me that “free will” cannot be subjectively distinguished from having choices.  And we do have choices — like the chess program we must still choose, even if we are determined and predictable.
