The Free Lunch Theorem

This was originally posted on Saturday, September 21, 2013

Different learning theorists have different approaches to teaching learning theory. Shai Shalev-Shwartz and Shai Ben-David have written a book, Understanding Machine Learning: From Theory to Algorithms, which gives a significant amount of space to the no free lunch theorem (preliminary notes). Here I will state a “free lunch theorem” and argue that the free lunch theorem is a good (better) departure point for learning theory.

I will state the free lunch theorem in terms of C++ code. One can specify a predictor as a C++ program. The predictor can be a linear classifier, a polynomial, a decision tree, or whatever. I will assume access to an arbitrarily large (but finite) standard library so that programs can be written extremely compactly if you know the right library function to call. The library can include every form of function used as a predictor by machine learning practitioners and be designed in such a way as to allow predictors (of all kinds) to be written as compactly as possible. The library can also include all of the programming abstractions in use by programmers today so that new programs can be written as compactly as possible. For any code library L, and program f written using L, I will write | f |_L for the length of the code for f (in bits). Here we are not counting the length of the code for the library functions, just the number of bits in the name of the library function when the function is called. The theorem states, in essence, that, as long as the library L is written prior to seeing the training data, any predictor f written in C++ using L will perform well provided that it performs well on the training data and that | f |_L is small compared to the size of the training data. For example, a polynomial predictor with n numerical parameters and which performs well on the training data will perform well on new data provided that the number of training points is large compared with bn where b is the number of bits used for each numerical parameter. (Note the lack of any reference to VC dimension.)

To state the free lunch theorem precisely we assume a data distribution D on input-output pairs and a sample S of m such pairs drawn IID from this distribution. For simplicity we will assume binary classification so that the output is always either 1 or -1. For any predictor f mapping input to outputs we define the generalization error rate of f, denoted err(f,D), to be the fraction of time that f makes a mistake when drawing input-output pairs form the data distribution D. We write err(f,S) for the fraction of times that f makes a mistake on the sample. The theorem states that with probability at least 1-delta over the draw of the sample S (of m training pairs) the following holds simultaneously for all C++ programs f that can be written with library L and all values of the parameter lambda.

(1) err(f,D) <= (1 – 1/(2lambda)) [ err(f,S) + [lambda(ln 2)/m] | f |_L + [lambda/m] ln(1/delta) ]

This is a special case of theorem 1 of the PAC-Bayesian tutorial referenced in my post on a generalization bound for dropouts. For fixed lambda, this bound expresses a linear trade-off between the accuracy on the training data err(f,S) and the simplicity of the rule as measured by | f |_L. It is a form of Occam’s razor. An important point is that we can use an arbitrarily large code library when writing f. Another important point is that it is meaningless to make lambda very large. The only reason for making lambda large is that this reduces the leading factor in the bound. However, the leading factor rapidly approaches 1 as lambda becomes large. Hence the weight on | f |_L fundamentally goes as 1/m rather than 1/sqrt{m}. (Some people will argue with this interpretation, but in any case the theorem is true as stated.)

The free lunch theorem says that we do not have to fix any particular form of predictor before we see the training data — we can use an enormous library of essentially all classes of predictors known to learning theory, and in fact all predictors that can be written in C++. This includes any predictor that could possibly be shipped to a customer.

So what about the no free lunch theorem. As stated by Shai Shalev-Shwartz in his notes chapter 6, the theorem states, in essence, that for any learning algorithm there exists a data distribution for which a low error rate prediction rule exists but where the learning algorithm fails to find such a predictor. When applied in the setting of the above free lunch theorem (1), the no free lunch theorem states that there exists a data distribution such that the only good predictors f are such that | f |_L > m. In this case the true input-output relation has no compact representation. Structureless functions exist. We are not going to learn structureless functions.

The pedagogical role of the no free lunch theorem is to justify the introduction of restricted classes of predictors — one assumes-away the possibility that the true input-output relation is structureless. I will admit that one does often want to consider restricted classes, such as linear predictors over a fixed (but very high dimensional) feature map. This is an important class of predictors and it is useful to try to understand it. But the no free lunch theorem is often used to justify restrictions to finite capacity classes. This leads to bad theory in my opinion. But that is a subject for another post.

In summary, the free lunch theorem (1) seems like a better departure point for the study of learning theory than the observation that structureless functions exist.