Within the deep learning community there is considerable interest in neural architecture. There are convolutional networks, recurrent networks, LSTMs, GRUs, attention mechanisms, highway networks, inception networks, residual networks, fractal networks and many more. Most of these architectures can be viewed as certain feed-forward circuit topologies. Circuits are a universal model of computation. However, human programmers find it more productive to specify algorithms in high-level programming languages. Presumably this applies to learning as well: learning should be easier with a higher-level architecture.

Of course the deep learning community is aware of the relationship between neural architecture and models of computation. General “high level” architectures have been proposed. We have neural Turing machines (actually random access machines), parsing architectures with stacks, and neural architectures for functional programming. There was a nice NIPS workshop on reasoning, attention and memory (RAM) addressing such fundamental architectural issues.

It seems reasonable to use classical models of computation as inspiration for neural architectures. But it is important to be aware of the large variety of classical architectural ideas. Various twentieth-century discrete architectures may provide a rich source of inspiration for twenty-first-century differentiable architectures. Here is a list of my favorite classical architectural ideas.

**Mathematical Logic:** Starting with the ancient Greeks, logic has been developed directly as a model of knowledge representation and thought. Mathematical logic organizes knowledge around entities and relations. Databases are closely related to predicate calculus. Logic is capable of representing knowledge in any domain of discourse. While entities and relations are central to logic, logic involves a variety of additional features such as function application, quantification, and types. Logic also provides the intellectual framework underlying mathematics. Achieving the singularity will presumably require machines to be capable of programming computers. Computer programming seems to require sound analytical (mathematical) reasoning.

**Production Systems and Logic Programming:** This style of architecture was championed by Herb Simon and Allen Newell. It is a way of making logical rules compute efficiently. I will interpret production systems fairly broadly to include various rule-based languages such as SOAR, OPS5, Prolog and Datalog. The cleanest of this family of architectures is bottom-up logic programming, which has a nice relationship to general dynamic programming and is the foundation of the Dyna programming language. Dynamic programming algorithms can be viewed as feed-forward networks where each entry in a dynamic programming table is a structured-output unit that computes its value from earlier units and provides its value to later units.
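As a rough illustration of bottom-up evaluation, here is a minimal Python sketch that runs a two-rule Datalog-style program for graph reachability to a fixed point. The function name `transitive_closure` and the `edge`/`path` relations are illustrative choices, not taken from Dyna or any other system named above, and the naive loop is only one of several ways to evaluate such rules.

```python
from typing import Set, Tuple

Fact = Tuple[str, str]

def transitive_closure(edges: Set[Fact]) -> Set[Fact]:
    """Naive bottom-up evaluation of the (hypothetical) program
         path(X, Y) :- edge(X, Y).
         path(X, Z) :- path(X, Y), edge(Y, Z).
    """
    path = set(edges)  # first rule: every edge is a path
    changed = True
    while changed:
        # second rule: extend every known path by one edge
        derived = {(x, z) for (x, y) in path for (y2, z) in edges if y == y2}
        changed = not derived <= path
        path |= derived
    return path

edges = {("a", "b"), ("b", "c"), ("c", "d")}
paths = transitive_closure(edges)
```

Each derived `path` fact plays the role of a table entry: it is computed from facts derived earlier and is then available to later rule applications, which is the dynamic programming reading described above.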

**Inductive Logic Programming:** This is a classical unification of machine learning and logic programming championed by Stephen Muggleton. The basic idea is to take a set of assertions in predicate calculus (observed data) and generalize them to a “theory” (a logic program) that is consistent and that implies the data.
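One small ingredient of generalizing from assertions is anti-unification, which computes a least general generalization of two atoms. The sketch below is only an illustration of that single step, not a full inductive logic programming system. The term encoding (tuples for compound terms, strings for constants) and the variable names `X0`, `X1`, ... are assumptions made for this example.

```python
def lgg(s, t, table=None):
    """Least general generalization of two terms.

    Compound terms are tuples ("functor", arg1, ...); constants are strings.
    Each distinct pair of mismatched subterms is replaced by its own variable,
    and the same pair always maps to the same variable.
    """
    if table is None:
        table = {}
    if s == t:
        return s
    same_shape = (
        isinstance(s, tuple) and isinstance(t, tuple)
        and s[0] == t[0] and len(s) == len(t)
    )
    if same_shape:
        args = tuple(lgg(a, b, table) for a, b in zip(s[1:], t[1:]))
        return (s[0],) + args
    if (s, t) not in table:
        table[(s, t)] = f"X{len(table)}"
    return table[(s, t)]

general = lgg(("parent", "ann", "bob"), ("parent", "ann", "carl"))
reflexive = lgg(("likes", "ann", "ann"), ("likes", "bob", "bob"))
```

Here `general` keeps the shared constant `ann` and abstracts the differing argument into a variable, while `reflexive` reuses one variable in both positions because the same pair of constants appears twice.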

**Frames, Scripts, and Object-Oriented Programming:** Frames and scripts were championed as a general framework for knowledge representation by Marvin Minsky, Roger Schank, and Charles Fillmore. Frames are related to object-oriented programming in the sense that an instance of the “room frame” (or room class) has fillers for fields such as “ceiling”, “windows” and “furniture”. Frames also seem related to the ontology of mathematics. For example, a mathematical field consists of a set together with two operations (addition and multiplication) satisfying certain properties. The term “structure” has a well-defined meaning in model theory (a branch of logic) which is closely related to the notion of a class instance in object-oriented programming. A specific mathematical field is a structure (in the technical sense) and is an instance of the general mathematical class of fields.
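The analogy between frames, class instances and mathematical structures can be sketched in Python. The `Room` class mirrors the room frame from the text. The `Structure` class, the `integers_mod` helper and the brute-force `is_field` check are hypothetical names introduced for this illustration, and they only handle finite carrier sets.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, FrozenSet, List

@dataclass
class Room:
    """A frame: the fields are the slots to be filled."""
    ceiling: str
    windows: int
    furniture: List[str]

@dataclass(frozen=True)
class Structure:
    """A finite set with two operations and two distinguished elements."""
    elements: FrozenSet[int]
    add: Callable[[int, int], int]
    mul: Callable[[int, int], int]
    zero: int
    one: int

def integers_mod(n: int) -> Structure:
    """The integers modulo n as a Structure instance."""
    return Structure(
        elements=frozenset(range(n)),
        add=lambda a, b: (a + b) % n,
        mul=lambda a, b: (a * b) % n,
        zero=0,
        one=1 % n,
    )

def is_field(f: Structure) -> bool:
    """Check the field axioms by exhaustive enumeration."""
    E = f.elements
    pairs = list(product(E, repeat=2))
    triples = list(product(E, repeat=3))
    closed = all(f.add(a, b) in E and f.mul(a, b) in E for a, b in pairs)
    assoc = all(
        f.add(f.add(a, b), c) == f.add(a, f.add(b, c))
        and f.mul(f.mul(a, b), c) == f.mul(a, f.mul(b, c))
        for a, b, c in triples
    )
    comm = all(
        f.add(a, b) == f.add(b, a) and f.mul(a, b) == f.mul(b, a)
        for a, b in pairs
    )
    dist = all(
        f.mul(a, f.add(b, c)) == f.add(f.mul(a, b), f.mul(a, c))
        for a, b, c in triples
    )
    ident = f.zero != f.one and all(
        f.add(a, f.zero) == a and f.mul(a, f.one) == a for a in E
    )
    add_inv = all(any(f.add(a, b) == f.zero for b in E) for a in E)
    mul_inv = all(
        any(f.mul(a, b) == f.one for b in E) for a in E if a != f.zero
    )
    return closed and assoc and comm and dist and ident and add_inv and mul_inv

kitchen = Room(ceiling="plaster", windows=2, furniture=["table", "chair"])
z5 = integers_mod(5)  # a specific structure that is an instance of "field"
z4 = integers_mod(4)  # a structure of the same shape that is not a field
```

In this reading `z5` is a specific structure that satisfies the field axioms, while `z4` has the same slots but fails them because 2 has no multiplicative inverse modulo 4.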

**The Situation Calculus and Modal Logic:** In the situation calculus statements take meaning in “situations”. A “fluent” is a mapping from situations to truth values. Actions change one situation into another. This leads to the STRIPS model of actions and planning. Situations are closely related to the possible worlds of modal logic.
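A minimal sketch of STRIPS-style actions, assuming a situation is represented by the set of fluents that are true in it. The class and function names, the fluent strings and the tiny blocks-world example are all hypothetical, and the breadth-first planner is just the simplest search that fits in a few lines.

```python
from collections import deque
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional

State = FrozenSet[str]  # a situation: the set of fluents true in it

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: FrozenSet[str]
    add_list: FrozenSet[str]
    delete_list: FrozenSet[str]

    def applicable(self, s: State) -> bool:
        return self.preconditions <= s

    def apply(self, s: State) -> State:
        """Map one situation to the next: delete, then add."""
        if not self.applicable(s):
            raise ValueError(f"{self.name} is not applicable")
        return (s - self.delete_list) | self.add_list

def plan(start: State, goal: FrozenSet[str],
         actions: Iterable[Action]) -> Optional[List[str]]:
    """Breadth-first search for a shortest action sequence reaching the goal."""
    actions = list(actions)
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        s, steps = frontier.popleft()
        if goal <= s:
            return steps
        for a in actions:
            if a.applicable(s):
                t = a.apply(s)
                if t not in seen:
                    seen.add(t)
                    frontier.append((t, steps + [a.name]))
    return None

pickup_a = Action(
    name="pickup_A",
    preconditions=frozenset({"on_table(A)", "hand_empty"}),
    add_list=frozenset({"holding(A)"}),
    delete_list=frozenset({"on_table(A)", "hand_empty"}),
)
stack_a_on_b = Action(
    name="stack_A_on_B",
    preconditions=frozenset({"holding(A)", "clear(B)"}),
    add_list=frozenset({"on(A,B)", "hand_empty"}),
    delete_list=frozenset({"holding(A)", "clear(B)"}),
)
start = frozenset({"on_table(A)", "on_table(B)", "clear(B)", "hand_empty"})
steps = plan(start, frozenset({"on(A,B)"}), [pickup_a, stack_a_on_b])
```

A fluent such as `holding(A)` is then the membership test `"holding(A)" in s`, which maps each situation to a truth value.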

**Monads:** Monads generalize the relationship between pure (stateless) functional programming, as in the programming language Haskell, and the more familiar effect-based programming as in C or C++ where assignment statements change the state of the computation. The mapping (or compilation) from an effect-driven program to a pure (stateless) program defines the state monad. There are different monads. The state monad treats each action as a mapping from an input state to an output state. The power set monad (or non-determinism monad) treats each action as a mapping from a set of states to a set of states. The probability monad treats each action as a mapping from a probability distribution to a probability distribution. The probability monad gives rise to probabilistic programming languages. There are also more esoteric monads such as the CPS monad which treats each action as mapping a state of the stack to a state of the stack thereby converting recursion to iteration. In a pure language such as Haskell the use of a monad to suppress a state argument typically makes code more readable. There also seems to be a relationship between the states of the common monads and the situations or possible worlds of the situation calculus and modal logic.
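The three monads described above can be sketched in Python as a `unit` and a `bind` function each. This is an illustrative encoding with made-up names (`unit_state`, `bind_set`, and so on), not the Haskell library definitions: a stateful computation is a function from a state to a (value, new state) pair, a non-deterministic value is a Python set, and a distribution is a dictionary from outcomes to exact probabilities.

```python
from fractions import Fraction

# State monad: a computation is a function  state -> (value, new_state).
def unit_state(x):
    return lambda s: (x, s)

def bind_state(m, f):
    def run(s):
        x, s1 = m(s)        # run the first computation
        return f(x)(s1)     # feed its value and state to the second
    return run

# Power set (non-determinism) monad: a computation is a set of outcomes.
def unit_set(x):
    return {x}

def bind_set(m, f):
    return {y for x in m for y in f(x)}

# Probability monad: a computation is a dict  outcome -> probability.
def unit_dist(x):
    return {x: Fraction(1)}

def bind_dist(m, f):
    out = {}
    for x, p in m.items():
        for y, q in f(x).items():
            out[y] = out.get(y, Fraction(0)) + p * q
    return out

# A counter that returns its current value and increments the hidden state.
tick = lambda s: (s, s + 1)
program = bind_state(tick, lambda a: bind_state(tick, lambda b: unit_state(a + b)))

# One step of a walk that moves left or right, and a fair coin that may add one.
walk = lambda n: {n - 1, n + 1}
flip = lambda n: {n: Fraction(1, 2), n + 1: Fraction(1, 2)}

two_steps = bind_set(bind_set(unit_set(0), walk), walk)
two_flips = bind_dist(bind_dist(unit_dist(0), flip), flip)
```

Note that `program` never mentions the counter state explicitly; `bind_state` threads it through, which is the sense in which a monad suppresses a state argument. The same two-step shape gives a set of reachable positions under `bind_set` and a distribution over head counts under `bind_dist`.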

## Conclusion

I believe that human learning is based on a differentiable universal learning architecture and that domain specific priors are not required. But it is unclear how elaborate the general architecture is. It seems worth considering the above list of classical architectural ideas and the possibility that these discrete architectures can be made differentiable.

You might find relevant recent work with some Berkeley colleagues on using natural language parses to assemble neural networks on the fly from composable modules (http://arxiv.org/abs/1601.01705). Replace “parser” with “theorem prover” and you get something quite similar to a later paper by Tim Rocktäschel (http://www.akbc.ws/2016/papers/14_Paper.pdf). The combination of these two things (e.g. an architecture in which natural language makes available fragments of network structures for a neural reasoning engine to use, perhaps implemented using one of the differentiable stack machines you cite above) seems like it might be able to solve a pretty diverse range of tasks in question answering, planning and problem solving.

I agree that it is useful to characterise human learning as an instantiation of a differentiable universal learning architecture, but it is also extremely relevant that evolution built this mechanism by opportunistic re-use and adjustment of bits and pieces that may have served other purposes in other vertebrates. For example, it shouldn’t surprise anyone if the mechanisms that let us use language have a lot in common with the mechanisms that support vision. You might see a preference for topographic mappings in language, with the reason for that “design” feature being that these are pervasive for vision, proprioception and so on, therefore available for re-use. So yes to the universal learning architecture, but expect random peculiar biases carried over from things “designed” for other tasks.

It is a classic error to suppose that the way we think we solve (or would prefer to solve) problems actually is how we do it. Herb Simon understood this, but while Schank and Minsky may have understood it, they didn’t emphasize it so much, which gave opportunities for error.

I agree that introspection cannot be trusted. That is a good reason to take an engineering approach and just do what works. The architectures I mentioned all seem plausible to me as relevant to the engineering of intelligence. Many of the ideas are clearly relevant to software engineering. Differentiable versions of these architectures may behave very differently from the classical discrete incarnations. Only time will tell what really works.