This blog post is inspired by the recent NIPS talk by Ali Rahimi and the response by Yann LeCun. The issue is fundamentally the role of theory in deep learning. I will start with some quotes from Rahimi’s talk.

Rahimi:

Machine learning has become alchemy.

Alchemy worked [for many things].

But scientists had to dismantle two thousand years worth of alchemical theories.

I would like to live in a society whose systems are built on verifiable rigorous knowledge and not on alchemy.

LeCun:

Understanding (theoretical or otherwise) is a good thing.

[However, Rahimi’s statements are dangerous because] it’s exactly this kind of attitude that led the ML community to abandon neural nets for over 10 years, despite ample empirical evidence that they worked very well in many situations.

I fundamentally agree with Yann that a drive for rigor can mislead a field. Perhaps most dangerous is the drive to impress colleagues with one’s mathematical sophistication rather than to genuinely seek real progress.

But I would like to add my own spin to this debate. I will start by again quoting Rahimi:

Rahimi:

[When a deep network doesn’t work] I think it is gradient descent’s fault.

Gradient descent is the cornerstone of deep learning. Gradient descent is a form of local search. Here are some other examples of local search:

The evolution of the internal combustion engine from the 1790s through the twentieth century.

The evolution of semiconductor processes over the last fifty years of Moore’s law.

Biological evolution including the evolution of the human brain.

The evolution of mathematics from the ancient Greeks to the present.

The hours of self-play training it takes AlphaZero to become the world’s strongest chess program.

Local search is indeed mysterious. But can we really expect a rigorous theory of local search that predicts or explains the evolution of the human brain or the historical evolution of mathematical knowledge? Can we really expect to predict, by some sort of second-order analysis of gradient descent, what mathematical theories will emerge in the next twenty years? My position is that local search (gradient descent) is extremely powerful and fundamentally, forever, beyond any fully rigorous understanding.

Computing power has reached the level where gradient descent on a strong architecture on a strong GPU can only be understood as some form of very powerful general non-convex local search similar in nature to the above examples. Yes, the failure of a particular neural network training run is a failure of gradient descent (local search). But that observation provides very little insight or understanding.
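As a reminder of how simple the local move itself is, gradient descent can be sketched in a few lines. This is an illustrative toy, not anything from the post: the objective, learning rate, and step count are all arbitrary choices.

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Local search: repeatedly take a small step downhill from the current point."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)  # move in the direction of steepest local descent
    return x

# Minimize the toy objective f(x) = (x - 3)^2, whose gradient is 2(x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # converges near 3.0
```

The algorithm only ever looks at a local neighborhood of the current point, which is exactly what makes its global behavior on non-convex landscapes so hard to characterize rigorously.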

A related issue is one’s position on the time frame for artificial general intelligence (AGI). Will rigor help achieve AGI? Perhaps even Rahimi would find it implausible that a rigorous treatment of AGI is possible. A common response by rigor-seekers is that AGI is too far away to talk about. But I find it much more exciting to think we are close. I have written a blog post on the plausibility of near-term machine sentience.

I do believe that insight into architecture is possible and that such insight can fruitfully guide design. LSTMs appeared in 1997 because of a “theoretical insight” about a way of overcoming vanishing gradients. The understanding of batch normalization as a method of overcoming internal covariate shift is something I do feel that I understand at an intuitive level (I would be happy to explain it). Intuitive non-rigorous understanding is the bread and butter of theoretical physics.

Fernando Pereira (who may have been quoting someone else) told me 20 years ago about the “explorers” and the “settlers”. The explorers see the terrain first (without rigor) and the settlers clean it up (with rigor). Consider calculus or Fourier analysis. But in the case of local search I don’t think the theorems (the settlers) will ever arrive.

Progress in general local search (AGI) will come, in my opinion, from finding the right models of computation — the right general purpose architectures — for defining the structure of “strong” local search spaces. I have written a previous blog post on the search for general cognitive architectures. Win or lose, I personally am going to continue to pursue AGI.

I think the observation that a failed training run is a failure of gradient descent makes sense. In most experiments, the failure comes from an inappropriate initialization. We need something more than SGD: a search algorithm that explores a larger space, as in meta-learning, makes sense in this light.

Could you please write that intuitive explanation? “The understanding of batch normalization as a method of overcoming internal covariate shift is something I do feel that I understand at an intuitive level (I would be happy to explain it).”

OK, you called me on it. The batch normalization paper says:

We define Internal Covariate Shift as the change in the distribution of network activations due to the change in network parameters during training.

The idea is that if we cut the network in the middle, the top part takes its inputs from the bottom part. As the parameters of the bottom part change, the inputs to the top part change. The inputs to the top part are the “covariates” (the features from which the prediction is to be made) for the top part of the network. Batch normalization does “diagonal whitening” of the covariates coming into that layer. I believe that input whitening is often viewed as helpful for domain adaptation (covariate shift).

I should add that while I think this explanation is plausible, I am not at all sure that it really explains what is really going on. It may also simply be a matter of better conditioning of the gradient descent.
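The per-feature “diagonal whitening” described above can be sketched in a few lines: each activation is standardized independently over the batch (no full covariance matrix), then given a learned scale and shift. The names gamma, beta, and eps follow the batch normalization paper; the input values here are made up for illustration.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Diagonal whitening: normalize each feature (column) independently
    over the batch, then apply a learned scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # standardize each feature separately
    return gamma * x_hat + beta

# Activations with an arbitrary shift and scale, batch of 64, 10 features.
x = np.random.randn(64, 10) * 5.0 + 2.0
y = batch_norm(x)
print(y.mean(axis=0), y.std(axis=0))  # approximately 0 and 1 per feature
```

Full whitening would also decorrelate the features using the covariance matrix; batch normalization deliberately skips that, which is why “diagonal” is the right word.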

One concern I have is that current progress in deep learning is going to stall unless we get a better understanding of why deep networks do and do not work. For example, observing the huge proliferation of GANs gives me the sinking feeling that we are searching the space of Turing machines hoping to find a good one. I suspect our intuitions are too weak and our theoretical understanding too meager to give us good guidance for this search. After all, even local search requires an objective function.
