This blog post is inspired by the recent NIPS talk by Ali Rahimi and the response by Yann LeCun. The issue is fundamentally the role of theory in deep learning. I will start with some quotes from Rahimi’s talk.

Rahimi:

Machine learning has become alchemy.

Alchemy worked [for many things].

But scientists had to dismantle two thousand years' worth of alchemical theories.

I would like to live in a society whose systems are built on verifiable rigorous knowledge and not on alchemy.

LeCun:

Understanding (theoretical or otherwise) is a good thing.

[However, Rahimi’s statements are dangerous because] it’s exactly this kind of attitude that led the ML community to abandon neural nets for over 10 years, despite ample empirical evidence that they worked very well in many situations.

I fundamentally agree with Yann that a drive for rigor can mislead a field. Perhaps most dangerous is the drive to impress colleagues with one’s mathematical sophistication rather than to genuinely seek real progress.

But I would like to add my own spin to this debate. I will start by again quoting Rahimi:

Rahimi:

[When a deep network doesn’t work] I think it is gradient descent’s fault.

Gradient descent is the cornerstone of deep learning. Gradient descent is a form of local search. Here are some other examples of local search:

The evolution of the internal combustion engine from the 1790s through the twentieth century.

The evolution of semiconductor processes over the last fifty years of Moore’s law.

Biological evolution including the evolution of the human brain.

The evolution of mathematics from the ancient Greeks to the present.

The hours of self-play training that AlphaZero takes to become the world’s strongest chess program.

Local search is indeed mysterious. But can we really expect a rigorous theory of local search that predicts or explains the evolution of the human brain or the historical evolution of mathematical knowledge? Can we really expect to predict, by some sort of second-order analysis of gradient descent, what mathematical theories will emerge in the next twenty years? My position is that local search (gradient descent) is extremely powerful and fundamentally forever beyond any fully rigorous understanding.

Computing power has reached the level where gradient descent on a strong architecture on a strong GPU can only be understood as some form of very powerful general non-convex local search similar in nature to the above examples. Yes, the failure of a particular neural network training run is a failure of gradient descent (local search). But that observation provides very little insight or understanding.
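
To make "gradient descent is local search" concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than anything from the talks: the one-dimensional objective `f`, the learning rate, and the step count were chosen only so that the function has two basins. The same update rule lands in a different minimum depending on where it starts.

```python
def f(x):
    # Hypothetical non-convex objective with two local minima
    # (near x = -1.30 and x = 1.13); chosen only for illustration.
    return x**4 - 3 * x**2 + x


def grad_f(x):
    # Analytic derivative of f.
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x0, lr=0.01, steps=2000):
    # Plain gradient descent: repeatedly take a small step downhill
    # from the current point. It only ever looks at the local slope.
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


left = gradient_descent(-2.0)   # settles in the deeper basin
right = gradient_descent(2.0)   # settles in the shallower basin

print(f"start -2.0 -> x = {left:.3f}, f(x) = {f(left):.3f}")
print(f"start  2.0 -> x = {right:.3f}, f(x) = {f(right):.3f}")
```

The run started at 2.0 "fails" in the sense that it stops at the worse of the two minima, and the only explanation the procedure itself offers is that it started in the wrong basin. That is the sense in which blaming gradient descent is correct but not very informative.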

A related issue is one’s position on the time frame for artificial general intelligence (AGI). Will rigor help achieve AGI? Perhaps even Rahimi would find it implausible that a rigorous treatment of AGI is possible. A common response by rigor-seekers is that AGI is too far away to talk about. But I find it much more exciting to think we are close. I have written a blog post on the plausibility of near-term machine sentience.

I do believe that insight into architecture is possible and that such insight can fruitfully guide design. LSTMs appeared in 1997 because of a “theoretical insight” about a way of overcoming vanishing gradients. Batch normalization, viewed as a method of overcoming internal covariate shift, is something I do feel I understand at an intuitive level (I would be happy to explain it). Intuitive non-rigorous understanding is the bread and butter of theoretical physics.
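
The vanishing-gradient intuition can be sketched numerically. The snippet below is a toy under stated assumptions, not the LSTM itself: it uses a scalar state, a hypothetical recurrent weight of 0.9, and 50 time steps, and it compares the gradient through a squashing recurrence with the gradient through a purely additive state update whose self-connection has weight 1.0 (the idea behind the LSTM memory cell).

```python
import math


def tanh_rnn_gradient(w, h0, steps):
    """Return d h_T / d h_0 for the scalar recurrence h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: d tanh(w * h_prev) / d h_prev = w * (1 - tanh(...)^2)
        grad *= w * (1.0 - h * h)
    return grad


def additive_cell_gradient(steps, carry=1.0):
    """Return d c_T / d c_0 for the additive update c_t = carry * c_{t-1} + input_t."""
    grad = 1.0
    for _ in range(steps):
        # The input term does not depend on c_{t-1}, so each step
        # contributes only the factor `carry`.
        grad *= carry
    return grad


squashed = tanh_rnn_gradient(w=0.9, h0=0.5, steps=50)
carried = additive_cell_gradient(steps=50)

print(f"tanh recurrence gradient over 50 steps: {squashed:.2e}")
print(f"additive cell gradient over 50 steps:   {carried:.2e}")
```

Each step of the squashing recurrence multiplies the gradient by a factor below 0.9, so the product shrinks geometrically, while the additive path passes the gradient through unchanged. This is an intuition in the non-rigorous sense used above: it motivates the design without predicting what a trained network will actually learn.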

Fernando Pereira (who may have been quoting someone else) told me 20 years ago about the “explorers” and the “settlers”. The explorers see the terrain first (without rigor) and the settlers clean it up (with rigor). Consider calculus or Fourier analysis. But in the case of local search I don’t think the theorems (the settlers) will ever arrive.

Progress in general local search (AGI) will come, in my opinion, from finding the right models of computation (the right general-purpose architectures) for defining the structure of “strong” local search spaces. I have written a previous blog post on the search for general cognitive architectures. Win or lose, I personally am going to continue to pursue AGI.