Why does deep and cheap learning work so well? Lin & Tegmark, 2016
Deep learning works remarkably well, and has helped dramatically improve the state-of-the-art in areas ranging from speech recognition, translation, and visual object recognition to drug discovery, genomics, and automatic game playing. However, it is still not fully understood why deep learning works so well.
So begins a fascinating paper looking at connections between machine learning and the laws of physics – showing us how properties of the real world help to make many machine learning tasks much more tractable than they otherwise would be, and giving us insights into why depth is important in networks. It’s a paper I enjoyed reading, but my abilities stop at appreciating the form and outline of the authors’ arguments – for the proofs and finer details I refer you to the full paper.
How do neural networks with comparatively small numbers of neurons manage to approximate functions from a universe that is exponentially larger? Take every possible configuration of the neural network, and you still don't get near the number of possibilities for the functions you are trying to learn.
Consider a mega-pixel greyscale image, where each pixel has one of 256 values. Our task is to classify the image as a cat or a dog. There are 256^1,000,000 possible input images (the domain of the function we are trying to learn). Yet networks with just thousands or millions of parameters learn to perform this classification quite well!
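To get a feel for the size of that domain, here's a quick back-of-the-envelope check (my own illustration, not from the paper) of how many decimal digits 256^1,000,000 actually has:

```python
import math

# Number of possible mega-pixel greyscale images: 256 ** 1_000_000.
# Rather than compute the number itself, count its decimal digits:
# digits = floor(n * log10(v)) + 1.
v, n = 256, 1_000_000
digits = math.floor(n * math.log10(v)) + 1
print(f"256^1,000,000 has {digits:,} decimal digits")
# -> about 2.4 million digits, versus ~7 digits for a million-parameter network
```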
In the next section we’ll see that the laws of physics are such that for many of the data sets we care about (natural images, sounds, drawings, text, and so on) we can perform a “combinatorial swindle”, replacing exponentiation by multiplication. Given n inputs with v values each, instead of needing v^n parameters, we only need v × n parameters.
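For the mega-pixel example above, the swindle trades an astronomically large count for a very manageable one:

$$ v^n = 256^{10^6} \quad\longrightarrow\quad v \times n = 256 \times 10^6 = 2.56 \times 10^8 $$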
We will show that the success of the swindle depends fundamentally on physics…
The Hamiltonian connection between physics and machine learning
Neural networks search for patterns in data that can be used to model probability distributions. For example, classification looks at a given input vector x and produces a probability distribution over categories y. We can express this as p(y|x). For example, y could range over animals, with ‘cat’ one of the possible values.
We can rewrite p(y|x) using Bayes’ theorem:
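$$ p(y|\mathbf{x}) \;=\; \frac{p(\mathbf{x}|y)\,p(y)}{p(\mathbf{x})} \;=\; \frac{p(\mathbf{x}|y)\,p(y)}{\sum_{y'} p(\mathbf{x}|y')\,p(y')} $$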
And recast this equation using the Hamiltonian H_y(x) = −ln p(x|y) (i.e., a simple substitution of forms). In physics, the Hamiltonian is used to quantify the energy of x given the parameter y.
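Writing μ_y ≡ −ln p(y) for the corresponding prior term, Bayes’ theorem above becomes:

$$ p(y|\mathbf{x}) \;=\; \frac{e^{-\left[H_y(\mathbf{x}) + \mu_y\right]}}{\sum_{y'} e^{-\left[H_{y'}(\mathbf{x}) + \mu_{y'}\right]}} $$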
This recasting is useful because the Hamiltonian tends to have properties that make it simple to evaluate.
In neural networks, the popular softmax layer exponentiates all vector elements and normalises them so that they sum to unity. It is defined by:
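$$ \sigma(\mathbf{z})_i \;\equiv\; \frac{e^{z_i}}{\sum_j e^{z_j}} $$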
Using this operator, we end up with a formula for the desired classification probability vector p(x) in this form:
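$$ \mathbf{p}(\mathbf{x}) \;=\; \sigma\!\left(-\mathbf{H}(\mathbf{x}) - \boldsymbol{\mu}\right) $$

where H(x) is the vector of Hamiltonians H_y(x) and μ is the vector with components μ_y = −ln p(y).

As a sanity check, here is a small numerical sketch (my own, not from the paper) confirming that applying softmax to −H(x) − μ reproduces the Bayes posterior for a toy three-category problem:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 3 categories y, with likelihoods p(x|y) and priors p(y) for one fixed input x.
p_x_given_y = rng.random(3)
p_y = rng.random(3)
p_y /= p_y.sum()

# Direct Bayes: p(y|x) = p(x|y) p(y) / sum_y' p(x|y') p(y')
bayes = p_x_given_y * p_y
bayes /= bayes.sum()

# Hamiltonian route: H_y(x) = -ln p(x|y), mu_y = -ln p(y), then softmax(-H - mu)
H = -np.log(p_x_given_y)
mu = -np.log(p_y)
logits = -H - mu
probs = np.exp(logits - logits.max())   # subtract the max for numerical stability
probs /= probs.sum()

print(np.allclose(bayes, probs))        # True: both routes give the same posterior
```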