## Why we need a new foundation of mathematics

I just finished what I am confident is the final revision of my paper on the foundations of mathematics.  I want to explain here what this is all about. Why do we need a new foundation of mathematics?

ZFC set theory has stood as a fixed eight-axiom foundation of mathematics since the early 1920s. ZFC does does an excellent job of capturing the notion of provability in mathematics.  Things not provable in ZFC, such as the continuum hypothesis,the existence of inaccessible cardinals, or the consistency of ZFC, are forever out of the reach of mathematics.  As a circumscription of the provable, ZFC suffices.

The problem is that ZFC is untyped. It is a kind of raw assembly language. Mathematics is about “concepts” — groups, fields, vector spaces, manifolds and so on. We want a type system where the types are in correspondence with the naturally occurring concepts. In mathematics every concept (every type) is associated with a notion of isomorphism. We immediately understand what it means for two graphs to be isomorphic. Similarly for groups, fields and topological spaces, and in fact, all mathematical concepts. We need to define what we mean by a concept (a type) in general and what we mean by “isomorphism” at each type.  Furthermore,  we must prove an abstraction theorem that isomorphic objects are inter-substitutable in well-typed contexts.

We also want to prove “Voldemort’s theorem”. Voldemort’s theorem states that certain objects exist but cannot be named.  For example, it is impossible to name any particular point on a circle (unless coordinates are introduced) or any particular node in an abstract complete graph, any particular basis (coordinate system) for an abstract vector space, or any particular isomorphism between a finite dimensional vector space and its dual.

Consider Wigner’s famous statement on the unreasonable effectiveness of mathematics in physics. I feel comfortable paraphrasing this as stating the unreasonable effectiveness of mathematical concepts in physics – the effectiveness of concepts such as manifolds and group representations. Phrased this way, it can be seen as stating the unreasonable effectiveness of type theory in physics. Strange.

The abstraction theorem and Voldemort’s theorem are fundamentally about types and well-typed expressions.  These theorems cannot be formulated in classical (untyped) set theory.  The notion of canonicality or naturality (Voldemort’s theorem) is traditionally formulated in category theory.  But category theory does not provide a grammar of types defining the mathematical concepts. Nor does it give a general definition of the category associated with an arbitrary mathematical concept.  It reduces “objects” to “points” where the structure of the points is supposed to be somehow represented by the morphisms.  I have always had a distinct dislike for category theory.

I started working on this project about six years ago.  I found that defining the semantics of types in a way that the supports the abstraction theorem and Voldemort’s theorem was shockingly difficult (for me at least).  About three years ago I became aware of homotopy type theory (HoTT) and Voevodsky’s univalence axiom which establishes a notion of isomorphism and an abstraction theorem within the framework of Martin-Löf type theory. Vladimir Voevodsky is a fields medalist and full professor at IAS who organized a special year at IAS in 2012 on HoTT.

After becoming aware of HoTT I continued to work on Morphoid type theory (MorTT) without studying HoTT.  This was perhaps reckless but it is my nature to want to do it myself and then compare my results with other work.  In this case it worked out, at least in the sense that MorTT is quite different from HoTT. I quite likely could not have constructed MorTT had I been distracted by HoTT.  Those interested in details of MorTT and its relationship to HoTT should see my manuscript.

Posted in Uncategorized | 1 Comment

## The Foundations of Mathematics.

It seems clear that most everyday human language, and presumably most everyday human thought, is highly metaphorical.  Most human statements do not have definite Boolean truth values.  Everyday statements have a degree of truth and appropriateness as descriptions of actual events. But while most everyday language is metaphorical, it is also true that some people are capable of doing mathematics — they are capable of working with statements that do have precise truth values.  Software engineers in particular must deal with the precise concepts and constructs of formal programming languages.  If we are ever going to achieve the singularity it seems likely that we will have to build machines capable of software engineering and hence capable of precise mathematical thought.

How well do we understand mathematical thought?  There are many talented mathematicians in the world.  But the ability to do mathematics does not imply the ability to create a machine that can do mathematics.  Logic is metamathematics — the study of the computational process of dong mathematics.  Metamathematics seems more a part of AI than a part of mathematics.

Mathematical thought exhibits various surprising and apparently magical abilities.  A software engineer understands the distinction between the implementation and the behavior implemented.  Consider a priority queue.  It is obvious that if we have two correct implementations of priority queues then we can (or at least should be able to) swap in one for the other in a program that uses a priority queue.  This notion that behaviorally identical modules are (or should be) interchangeable in large software systems is one of the magical properties of mathematical thought.  How is “identical behavior” understood?

The notion of identical behavior is closely related to the phenomenon of abstraction.  Perhaps the most fundamental abstraction is the concept of a cardinal number such as the number seven.  I can say I have seven apples and you have two apples and together we have nine apples.  The idea that seven plus two is nine abstracts away from the fact that we are talking about apples and seems to involve thought about sets at a more abstract level.  We are taking the union of two disjoint sets.  The union “behavior” is defined at level of cardinal numbers — the abstraction of a set to its number of elements.  Another example of common sense abstraction is the concept of a sequence — the ninth item will be the second item after the seventh item.  Yet another example is the concept of a bag of items such as the collection of coins in my pocket which might be three nickels and two quarters.  We machine leaning researchers have no trouble understanding that a document can be abstracted to a bag of words.  What mental computation underlies this immediate understanding of abstraction?  What is abstraction?

It seems clear that abstraction involves the specification of the interface to an object.  For example the interface to a priority queue is the set of operations it supports.  A bag (multiset) is generally defined as a mapping from a set (such as the set of all words) to a number — the number of times the word occurs in the bag.  But this definition misses the point (in my opinion) that any function $f:\sigma \rightarrow \tau$ defines a bag of $\tau$.  For example, a document is a function from ordinal numbers “nth” to words.  Any function to words can be abstracted to a bag of words.  The amazing thing is that we do not have to be taught this.  We just know already.  We also know that in passing from a sequence to a bag we are losing information.

Another example is the keys on my key loop.  I understand that my two car keys are next to each other on loop even though there is no “first” key among the eight keys on the loop.  Somehow I understand that “next to” is part of the “behavior” provided by my key loop while “first” is not.

Lakoff and Nunez argue in the book “Where does Mathematics Come From” that we come to understand numbers through a process of metaphor to physical collections.  But I think this is backwards.  I think we are able to understand physical collections because we have an innate cognitive architecture which first parses the world into “things” (entities) and assigns them types and groupings.  A properly formulated system of logic should include, as a fundamental primitive,  a notion of abstraction.  The ability to abstract a group to a cardinality, or a sequence to a bag, seems to arise from a general innate capacity for abstraction.  Presumably there is some innate cognitive architecture capable of supporting “abstract thought” that distinguishes humans from other animals. A proper formulation of logic — of the foundations of mathematics — will presumably provide insight into the nature of this cognitive architecture.

I will end by shamelessly promoting my paper, Morphoid Type Theory,  which I just put on arXiv.  The paper describes a new type theoretic foundation of mathematics focusing on abstraction and isomorphism (identity of behavior).  Unfortunately the paper is highly technical and is mostly written for professional type theorists and professional mathematicians.   Section 2, “inference rules”, should be widely accessible.

## Friendly AI and the Servant Mission

Most computer science academics dismiss any talk of real success in artificial intelligence. I think that a more rational position is that no one can really predict when human level AI will be achieved. John McCarthy once told me that when people ask him when human level AI will be achieved he says between five and five hundred years from now. McCarthy was a smart man.

Given the uncertainties surrounding AI, it seems prudent to consider the issue of friendly AI. I think that the departure point for any discussion of friendly AI should be the concept of rationality. In the classical formulation, a rational agent acts so as to maximize expected utility. The important word here is “utility”. In the reinforcement literature this gets mapped to the word “reward” — an agent is taken to act so as to maximize expected future reward. In game theory “utility” is often mapped to “payout” — a best response (strategy) is one that maximizes expected payout holding the policies of other players fixed.

The basic idea of friendly AI is that we can design the AI to want to be nice to us (friendly). We will give the AI a purpose or mission — a meaning of life — in correspondence with our purpose in building the machine.

The conceptual framework of rationality is central here. When presented with choices an agent is assumed to do its best in maximizing its subjective utility. In the case of an agent designed to serve a purpose, rational behavior should aim to fulfill that purpose. The critical point is that there is no rational basis for altering one’s purpose. Adopting a particular strategy or goal in the pursuit of a purpose is simply to make a particular choice, such as a career choice. Also, choosing to profess a purpose different from one’s actual purpose is again making a choice in the service of the actual purpose. Choosing to actually change one’s purpose is fundamentally irrational. So an AI with an appropriately designed purpose should be safe in the sense that the purpose will not change.

But how do we specify or “build-in” a life-purpose for an AI and what should that purpose be? First I want to argue that a direct application of the formal frameworks of rationality, reinforcement learning and game theory is problematic and even dangerous in the context of the singularity. More specifically, consider specifying a “utility”, “reward signal” or “payout” as a function of “world state”. The problem here is in formulating any conception of world state. I think that for the properties we care about, such as respect for human values, it would be a huge mistake to try to give a physical formulation of world states. But any non-physical conception of world state, including things like who is married to whom and who insulted whom, is bound to be controversial, incomplete, and problematic. This is especially true if we think about defining an appropriate utility function for an AI. Defining a function on world states just seems unworkable to me.

An alternative to specifying a utility function is to state a purpose in English (or any natural language). This occurs in mission statements for nonprofit institutions or in a donor’s specification of the purpose of a donated fund. Asimov’s laws are written in English but specify constraints rather than objectives. My favorite mission statement is what I call the servant mission.

Servant Mission: Within the law, fulfill the requests of David McAllester.

Under the servant mission the agent is obligated to obey both the law its master (me in the above statement). The agent can be controlled by society simply by passing new laws and by its master when the master makes requests. The servant mission transfers moral responsibility from the servant to its master. It also allows a very large number of distinct AI agents — perhaps one for each human — each with a different master and hence a different mission. The hope would be for a balance of power with no single AI (no single master) in control. The servant mission seems clearer and more easily interpreted than other proposals such as Asimov’s laws. This makes the mission less open to unintended consequences. Of course the agent must be able to interpret requests — more on this below. The servant mission also preserves human free will which does not seem guaranteed in other approaches, such as Yudkowsky’s Coherent Extrapolated Volition (CEV) model, which seem to allow for a “friendly” dictator making all decisions for us.  I believe that humans (certainly myself) will want to preserve their free will in any post-singularity society.

It is important to emphasize that no agent has a rational basis for altering its purpose. There is no rational basis for an agent with the servant mission to decide not to be a servant (not to follow its mission).

Of course natural language mission statements rely on the semantics of English. Even if the relationship between language and reality is mysterious, we can still judge in many (most?) cases when natural language statements are true. We have a useful conception of “lie” — the making of a false statement. So truth, while mysterious, does exist to some extent. An AI with an English mission, such as the servant mission, should have a first unstated mission of  understanding the intent of the author of the mission. Understanding the actual intent of the mission statement should be the first priority (the first mission) of the agent and should be within the capacity of any super-intelligent AI. For example, the AI should understand that “fulfilling requests” means that a later request can override an earlier request. A deep command of English should allow a faithful and authentic execution of the servant mission.

I personally believe that it is likely that within a decade agents will be capable of compelling conversation about the everyday events that are the topics of non-technical dinner conversations. I think this will happen long before machines can program themselves leading to an intelligence explosion. The early stages of artificial general intelligence (AGI) will be safe. However, the early stages of AGI will provide an excellent test bed for the servant mission or other approaches to friendly AI. An experimental approach has also been promoted by Ben Goertzel in a nice blog post on friendly AI. If there is a coming era of safe (not too intelligent) AGI then we will have time to think further about later more dangerous eras.

Posted in Uncategorized | 7 Comments

## AI and Free Will: When do choices exist?

This post was originally made on July 15, 2013.

This is a philosophical post instigated by Scott Aaronson’s recent paper and blog post regarding free will.  A lot hinges, I think, on how one phrases the question.  I like the question “When do choices exist?” as opposed to “Do people have free will?”.  I will take two passes at this question.  The first is a discussion of game theory.  The second is a discussion of coloquial language regarding choice.  My conclusion is that choices exist even when the decision making process is deterministic.

Game Theory. Game theory postulates the existence of choices.  A bimatrix game is defined by two matrices each of which is indexed by two choices — a choice for player A and a choice for player B.  Given a choice for each player the first matrix specifies a payout for player A and the second matrix specifies a payout for player B.  Here choices exist by definition.

We write programs that play games.  A computer chess program has choices — playing chess involves selecting moves.  Furthermore, it seems completely appropriate to describe the computation taking place in a min-max search as “considering” the choices and “selecting” a choice with desirable outcomes.  Note that most chess programs use only deterministic computation.  Here the choices exist by virtue of the rules of chess.

It seems perfectly consistent to me to assume that my own consideration of choices, like the considerations of a chess program, are based on deterministic computation.  Even if I am determined and predictable, the world presents me with choices and I must still choose.  Furthermore, I would argue that, even if I am determined, the choices still exist — for a given chess position there is actually a set of legal moves.  The choices are real.

Coloquial Language.  Consider a sentence of the form “she had a choice”.  Under what conditions do we colloquially take such a sentence to be true?  For example, we might say she had a choice between attending Princeton or attending Harvard.  The typical condition under which this is true is when she was accepted to both.  The fact that she was accepted to both says nothing about determinism vs. nondeterminism.  It does, however, imply colloquially that the choice exists.

The issues of the semantics of natural language are difficult.  I plan various blog posts on semantics. The central semantic phenomenon, in my opinion, is paraphrase and entailment — what are the different ways of saying the same or similar things and what conclusions can we draw from given statements.  I believe that a careful investigation of paraphrase and entailment for statements of the form “she had a choice” would show that the existence of choices is taken to be a property of the world, and perhaps the abilities of the agent to perform certain actions, but not a property of the fundamental nature of the computation that makes the selection.

Summary. It seems to me that “free will” cannot be subjectively distinguished from having choices.  And we do have choices — like the chess program we must still choose, even if we are determined and predictable.

Posted in Uncategorized | 4 Comments

## Metaphor and Mathematics

When I was in graduate school there was considerable interest in a then-new book, Metaphors We Live By, Lakoff and Johnson (1980). More recently I encountered Where Mathematics Comes From, Lakoff and Nunez (2001).  Like Lakoff and Nunez, I believe that metaphor is central to both language and thought. I believe that mentalese is close to surface language and surface language is highly metaphorical. I have also long believed that the mathematics itself is a manifestation of an innate human cognitive architecture. I have recently posted a manuscript giving a new type-theoretic foundation of mathematics that is, I believe, a better formulation of a cognitive architecture than, say, Zermelo-Fraenkel set theory.

But how is it possible that thought is highly metaphorical and, at the same time, based on a cognitive architecture reflected in the precise logical structure of mathematics? The answer is, I believe, that the structure of metaphor is essentially the same as the general structure of mathematical thought. The central idea in both, I would argue, is roles and assertions.

Roles and Assertions

The concept of frame or script, in the sense of Fillmore, Minsky or Schank, seems central to the lexicon of natural language. Frames and scripts have roles. Words such as “buy” and “sell” invoke semantic roles such as the buyer, the seller, the thing being sold and the money transferred. The semantic roles associated with English verbs have been catalogued in Martha Palmer’s verbnet project and semantic role labeling has become one of the central tasks of computational linguistics.

In addition to roles, frames and scripts invoke assertions. For example, we have

before x sold y to z, x owned y

after x sold y to z, z owned y

But what is the relationship between the roles and assertions of lexical frames and metaphor? As an example, the lead story on news.google.com as of this writing starts with

The Israeli military continued its substantial military attacks around Rafah in southern Gaza on Sunday.

Consider the word “continue”. This is a very abstract word. It seems to have a role for an agent (Israel) and the activity that is continued (the attacks). But the assertions associated with this word make minimal assumptions about the agent and the activity. Some assertions might be

If during s, x continued to do y then before s, x was doing y

If during s, x continued to do y then during s, x was doing y

If we consider any two particular continuation events the relationship between them seems metaphorical. For example, consider “he continued to travel west” and “the pump continued to fail”. What these two continuations have in common is the above general assertions about contiuation. But we can think of this as forming a metaphor between traveling and the “activity” of failing. The real phenomenon, however, is that in both cases the assertions of the continuation frame apply. The assertions are polymorphic — they can be applied to instances of different types such as traveling and failing.

Object Oriented Programming

Roles and assertions also arise in object oriented programming. In a graphical user interface one writes code for displaying objects without making any commitment to what those objects are. A displayable object has to support certain methods (roles) that satisfy certain conditions (assertions) assumed by the GUI program. The same code can display avatars and pi-charts by treating both an appropriate abstract level. The GUI code (metaphorically?) treats an avatar as an image.

Mathematics

Roles and assertions arise in almost all mathematical concepts.   A tree is a directed graph with nodes and edges (the roles) satisfying the tree conditions. The nodes can be anything — the concept of “tree” is polymorphic. The concept of tree is very general and can be used in many different situations. All we have to do is to find some way to view a given object as a tree. For example, a chess game can be (metaphorically?) viewed as a tree. My recently posted type-theoretic foundation of mathematics is organized around the notion of structure types (frames) each of which involves roles and assertions.

Summary

In this post I wanted to simply introduce the idea that metaphor and mathematics are related through the common structure of roles and assertions. There are many issues not discussed here such as the distinction between logical (hard) and statistical (soft) assertions and the role of embodiment (grounding). But I will leave these for later posts.

## This was originally posted on Saturday, September 28, 2013

This is a continuation of my “free lunch” post.  I want to provide more context.  I also feel bad about phrasing the discussion in the context of a particular textbook.  The no free lunch theorem is widely quoted and seems to be a standard fixture of the machine learning community.

I would like to rephrase the issue here as a debate between Chomsky and Hinton.  I will start with caricatures of their positions:

Chomsky:  Generalizing to unseen inputs — for example judging grammatically on sentences never seen before — is impossible without some a-priori knowledge of the set of allowed predictors (the set of possible human languages).  Hence we must be born with knowledge of the space of possible human languages — some form of universal grammar must exist.

Hinton:  Neural networks are a Turing-complete model of computation.  Furthermore, there exists some (perhaps yet to be discovered) universal algorithm for learning networks which can account for leaning in arbitrary domains, including language.

This debate centers on an empirical question — what algorithms or knowledge is provided by the human genome?

It is important here to distinguish information-theoretic issues from computational complexity issues.

Information-Theoretic Issues: Chomsky takes the position that on purely information theoretic grounds universal grammar must exist. Chomsky’s argument is essentially an invocation of the no free lunch theorem.  But the free lunch theorem shows, I think, that there exist information-theoretically adequate universal priors on the set of all computable functions.  On information-theoretic grounds I think Hinton’s position is much more defendable than Chomsky’s.

Computational Issues: It is clearly silly to consider enumerating all C++ programs as part of a learning algorithm.  But training deep neural networks is not silly — it has lead to a 30% reduction in word error rate in speech recognition.  Chris Manning is working on deep neural network approaches to learning grammar.

Computational issues do provide good motivations for certain learning algorithms such as the convex optimization used in training an SVM.  But a computational efficiency motivation for a restricted predictor class is different from claiming that for information-theoretic reasons we must be given some restriction to a proper subclass of the computable functions.

There are many Turing complete models of computation and the choice of model does seem to matter. We seem to gain efficiency in programming when we become familiar with the C++ standard library. Somehow the library provides useful general purpose constructs.

We seem to be capable of general purpose thought.  Thought seems related to language. Chomsky might be right that some form of linguistic/cognitive architecture is provided by the human genome. But the adaptive advantage provided by the details of an architecture for general-purpose thought seem likely to be related to computational issues rather than an information-theoretic requirement for a learning bias.

Posted in Uncategorized | 3 Comments

## This was originally posted on Saturday, September 21, 2013

Different learning theorists have different approaches to teaching learning theory. Shai Shalev-Shwartz and Shai Ben-David have written a book,  Understanding Machine Learning: From Theory to Algorithms, which gives a significant amount of space to the no free lunch theorem (preliminary notes). Here I will state a “free lunch theorem” and argue that the free lunch theorem is a good (better) departure point for learning theory.

I will state the free lunch theorem in terms of C++ code.  One can specify a predictor as a C++ program.  The predictor can be a linear classifier, a polynomial, a decision tree, or whatever.  I will assume access to an arbitrarily large (but finite) standard library so that programs can be written extremely compactly if you know the right library function to call.  The library can include every form of function used as a predictor by machine learning practitioners and be designed in such a way as to allow predictors (of all kinds) to be written as compactly as possible.  The library can also include all of the programming abstractions in use by programmers today so that new programs can be written as compactly as possible.  For any code library L, and program f written using L, I will write | f |_L for the length of the code for f (in bits).  Here we are not counting the length of the code for the library functions, just the number of bits in the name of the library function when the function is called.  The theorem states, in essence, that, as long as the library L is written prior to seeing the training data, any predictor f written in C++ using L will perform well provided that it performs well on the training data and that | f |_L is small compared to the size of the training data.  For example, a polynomial predictor with n numerical parameters and which performs well on the training data will perform well on new data provided that the number of training points is large compared with bn where b is the number of bits used for each numerical parameter.  (Note the lack of any reference to VC dimension.)

To state the free lunch theorem precisely we assume a data distribution D on input-output pairs and a sample S of m such pairs drawn IID from this distribution.  For simplicity we will assume binary classification so that the output is always either 1 or -1. For any predictor f mapping input to outputs we define the generalization error rate of f, denoted err(f,D), to be the fraction of time that f makes a mistake when drawing input-output pairs form the data distribution D.  We write err(f,S) for the fraction of times that f makes a mistake on the sample.  The theorem states that with probability at least 1-delta over the draw of the sample S (of m training pairs) the following holds simultaneously for all C++ programs f that can be written with library L and all values of the parameter lambda.

(1)   err(f,D) <= (1 – 1/(2lambda)) [ err(f,S)  +  [lambda(ln 2)/m] | f |_L  + [lambda/m] ln(1/delta) ]

This is a special case of theorem 1 of the PAC-Bayesian tutorial referenced in my post on a generalization bound for dropouts. For fixed lambda, this bound expresses a linear trade-off between the accuracy on the training data err(f,S) and the simplicity of the rule as measured by | f |_L.  It is a form of Occam’s razor. An important point is that we can use an arbitrarily large code library when writing f. Another important point is that it is meaningless to make lambda very large. The only reason for making lambda large is that this reduces the leading factor in the bound. However, the leading factor rapidly approaches 1 as lambda becomes large.  Hence the weight on | f |_L fundamentally goes as 1/m rather than 1/sqrt{m}. (Some people will argue with this interpretation, but in any case the theorem is true as stated.)

The free lunch theorem says that we do not have to fix any particular form of predictor before we see the training data — we can use an enormous library of essentially all classes of predictors known to learning theory, and in fact all predictors that can be written in C++.  This includes any predictor that could possibly be shipped to a customer.

So what about the no free lunch theorem.  As stated by Shai Shalev-Shwartz in his notes chapter 6, the theorem states, in essence,  that for any learning algorithm there exists a data distribution for which a low error rate prediction rule exists but where the learning algorithm fails to find such a predictor. When applied in the setting of the above free lunch theorem (1), the no free lunch theorem states that there exists a data distribution such that the only good predictors f are such that | f |_L > m.   In this case the true input-output relation has no compact representation.  Structureless functions exist.  We are not going to learn structureless functions.

The pedagogical role of the no free lunch theorem is to justify the introduction of restricted classes of predictors — one assumes-away the possibility that the true input-output relation is structureless.  I will admit that one does often want to consider restricted classes, such as linear predictors over a fixed (but very high dimensional) feature map.  This is an important class of predictors and it is useful to try to understand it.  But the no free lunch theorem is often used to justify restrictions to finite capacity classes.  This leads to bad theory in my opinion.  But that is a subject for another post.

In summary, the free lunch theorem (1) seems like a better departure point for the study of learning theory than the observation that structureless functions exist.