## This was originally posted on Saturday, September 28, 2013

This is a continuation of my “free lunch” post. I want to provide more context. I also feel bad about phrasing the discussion in the context of a particular textbook. The no free lunch theorem is widely quoted and seems to be a standard fixture of the machine learning community.

I would like to rephrase the issue here as a debate between Chomsky and Hinton. I will start with caricatures of their positions:

**Chomsky**: Generalizing to unseen inputs — for example judging grammatically on sentences never seen before — is impossible without some a-priori knowledge of the set of allowed predictors (the set of possible human languages). Hence we must be born with knowledge of the space of possible human languages — some form of universal grammar must exist.

**Hinton**: Neural networks are a Turing-complete model of computation. Furthermore, there exists some (perhaps yet to be discovered) universal algorithm for learning networks which can account for leaning in arbitrary domains, including language.

This debate centers on an empirical question — what algorithms or knowledge is provided by the human genome?

It is important here to distinguish information-theoretic issues from computational complexity issues.

**Information-Theoretic Issues**: Chomsky takes the position that on purely information theoretic grounds universal grammar must exist. Chomsky’s argument is essentially an invocation of the no free lunch theorem. But the *free lunch* theorem shows, I think, that there exist information-theoretically adequate universal priors on the set of all computable functions. On information-theoretic grounds I think Hinton’s position is much more defendable than Chomsky’s.

**Computational Issues**: It is clearly silly to consider enumerating all C++ programs as part of a learning algorithm. But training deep neural networks is not silly — it has lead to a 30% reduction in word error rate in speech recognition. Chris Manning is working on deep neural network approaches to learning grammar.

Computational issues do provide good motivations for certain learning algorithms such as the convex optimization used in training an SVM. But a computational efficiency motivation for a restricted predictor class is different from claiming that for information-theoretic reasons we must be given some restriction to a proper subclass of the computable functions.

There are many Turing complete models of computation and the choice of model does seem to matter. We seem to gain efficiency in programming when we become familiar with the C++ standard library. Somehow the library provides useful general purpose constructs.

We seem to be capable of general purpose thought. Thought seems related to language. Chomsky might be right that some form of linguistic/cognitive architecture is provided by the human genome. But the adaptive advantage provided by the details of an architecture for general-purpose thought seem likely to be related to computational issues rather than an information-theoretic requirement for a learning bias.

I’m intrigued by your claim that “there exist information-theoretically adequate universal priors on the set of all computable functions”. Really? I’ve been talking to people interested in determining whether animal languages are finite state or context-free (why these should be the only two hypotheses they look at beats me). But consider trying to distinguish between the following two hypotheses: H1 the corpus is generated by some (unknown) PCFG, and H2 the corpus is generated by some PCFG with a centre-embedding depth bound d (i.e., the corpus is actually drawn from a stochastic finite-state language, but parameterised via PCFG rules plus a centre-embedding depth bound). Suppose H1 really is correct. Then the extra MDL “cost” for encoding the corpus under H2 is at most the size of encoding the maximum centre embedding depth d’ for the corpus, which is logarithmic in d’. That is, it would take a very large amount of data to distinguish d from d’.

You seem to be observing that we can have hypotheses H1 and H2 where |H1| (the complexity of H1) is very near|H2|. But this just makes them easier to distinguish — they have essentially the same prior probability so any difference in explanatory power immediately shifts the posterior to the correct one.

Am I confused about your point?

But their likelihoods are going to be essentially the same also, so H1 and H2 will be very difficult to distinguish, even though they represent very important computational differences (is the language finite state or context free?)