This was originally posted on Saturday, September 28, 2013
This is a continuation of my “free lunch” post. I want to provide more context. I also feel bad about phrasing the discussion in the context of a particular textbook. The no free lunch theorem is widely quoted and seems to be a standard fixture of the machine learning community.
I would like to rephrase the issue here as a debate between Chomsky and Hinton. I will start with caricatures of their positions:
Chomsky: Generalizing to unseen inputs — for example judging grammatically on sentences never seen before — is impossible without some a-priori knowledge of the set of allowed predictors (the set of possible human languages). Hence we must be born with knowledge of the space of possible human languages — some form of universal grammar must exist.
Hinton: Neural networks are a Turing-complete model of computation. Furthermore, there exists some (perhaps yet to be discovered) universal algorithm for learning networks which can account for leaning in arbitrary domains, including language.
This debate centers on an empirical question — what algorithms or knowledge is provided by the human genome?
It is important here to distinguish information-theoretic issues from computational complexity issues.
Information-Theoretic Issues: Chomsky takes the position that on purely information theoretic grounds universal grammar must exist. Chomsky’s argument is essentially an invocation of the no free lunch theorem. But the free lunch theorem shows, I think, that there exist information-theoretically adequate universal priors on the set of all computable functions. On information-theoretic grounds I think Hinton’s position is much more defendable than Chomsky’s.
Computational Issues: It is clearly silly to consider enumerating all C++ programs as part of a learning algorithm. But training deep neural networks is not silly — it has lead to a 30% reduction in word error rate in speech recognition. Chris Manning is working on deep neural network approaches to learning grammar.
Computational issues do provide good motivations for certain learning algorithms such as the convex optimization used in training an SVM. But a computational efficiency motivation for a restricted predictor class is different from claiming that for information-theoretic reasons we must be given some restriction to a proper subclass of the computable functions.
There are many Turing complete models of computation and the choice of model does seem to matter. We seem to gain efficiency in programming when we become familiar with the C++ standard library. Somehow the library provides useful general purpose constructs.
We seem to be capable of general purpose thought. Thought seems related to language. Chomsky might be right that some form of linguistic/cognitive architecture is provided by the human genome. But the adaptive advantage provided by the details of an architecture for general-purpose thought seem likely to be related to computational issues rather than an information-theoretic requirement for a learning bias.