The first day I tried out Shazam, I was blown away. Next to GPS and surviving the fall down a flight of stairs, being able to recognize a song from a vast corpus of audio was the most incredible thing I'd ever seen my phone do. This recognition works though a process called audio fingerprinting . Examples include:
SoundHound / Midomi
After a few weekends of puzzling through academic papers and writing code, I came up with the Dejavu Project, an open-source audio fingerprinting project in Python. You can see it here on Github:
On my testing dataset, Dejavu exhibits 100% recall when reading an unknown wave file from disk or listening to a recording for at least 5 seconds.
Following is all the knowledge you need to understand audio fingerprinting and recognition, starting from the basics. Those with signals experience should skip to "Peak Finding".
Music as a signal
As a computer scientist, my familiarity with the Fast Fourier Transform (FFT) was only that it was a cool way to mutliply polynomials in O(nlog(n)) time. Luckily it is much cooler for doing signal processing, its canonical usage.
Music, it turns out, is digitally encoded as just a long list of numbers. In an uncompressed .wav file, there are a lot of these numbers - 44100 per second per channel. This means a 3 minute long song has almost 16 million samples.
3 min * 60 sec * 44100 samples per sec * 2 channels = 15,876,000 samples
A channel is a separate sequence of samples that a speaker can play. Think of having two earbuds - this is a "stereo", or two channel, setup. A single channel is called "mono". Today, modern surround sound systems can support many more channels. But unless the sound is recorded or mixed with the same number of channels, the extra speakers are redundant and some speakers will just play the same stream of samples as other speakers.
Why 44100 samples per second? The mysterious choice of 44100 samples per second seems quite arbitrary, but it relates to the Nyquist-Shannon Sampling Theorum . This is a long, mathematical way to say that there is a theoretical limit on the maximum frequency we can capture accurately when recording. This maximum frequency is based on how fast we sample the signal.
If this doesn't make sense, think about watching a fan blade that makes a full revolution at a rate of exactly once per second (1 Hz). Now imagine keeping your eyes closed, but opening them briefly once per second. If the fan still happens to be making exactly a full revolution every 1 second as well, it will appear as though the fan blade hasn't moved! Each time you open your eyes, the blade happens to be in the same spot. But there's a problem. In fact, as far as you know, the fan blade could be making 0, 1, 2, 3, 10, 100, or even 1 million spins per second and you would never know - it would still appear stationary! Thus in order to be assured you are correctly sampling (or "seeing") higher frequencies (or "spins"), you need to sample (or "open your eyes") more frequently. To be exact, we need to sample twice as frequently as the frequency we want to see to make sure we're detecting it.
In the case of recording audio, the accepted rule is that we're OK missing out on frequencies above 22050 Hz since humans can't even hear frequencies above 20,000 Hz. Thus by Nyquist, we have to sample twice that:
Samples per sec needed = Highest-Frequency * 2 = 22050 * 2 = 44100
The MP3 format compresses this in order to 1) save space on your hard drive, and 2) irritate audiophiles, but a pure .wav formatted file on your computer is just a list of 16 bit integers (with a small header).
Since these samples are a signal of sorts, we can repeatedly use an FFT over small windows of time in the song's samples to create a spectrogram of the song. Here's a spectrogram of the first few seconds of "Blurred Lines" by Robin Thicke.
As you can see, it's just a 2D array with amplitude as a function of time and frequency. The FFT shows us the strength (amplitude) of the signal at that particular frequency, giving us a column. If we do this enough times with our sliding window of FFT, we put them together and get a 2D array spectrogram.
It's important to note that the frequency and time values are discretized, each representing a "bin", while the amplitudes are real valued. The color shows the real value (red -> higher, green -> lower) of the amplitude at the discretized (time, frequency) coordinate.
As a thought experiment, if we were to record and create a spectrogram of a single tone, we'd get a straight horizontal line at the frequency of the tone. This is because the frequency does not vary from window to window.
Great. So how does this help us recognize audio? Well, we'd like to use this spectrogram to identify this song uniquely. The problem is that if you have your phone in your car and you try to recognize the song on the radio, you'll get noise - someone is talking in the background, another car honking its horn, etc. We have to find a robust way to capture unique "fingerprints" from the audio signal.
Now that we've got a specrogram of our audio signal, we can start by finding "peaks" in amplitude. We define a peak as a (time, frequency) pair corresponding to an amplitude value which is the greatest in a local "neighborhood" around it. Other (time, frequency) pairs around it are lower in amplitude, and thus less likely to survive noise.
Finding peaks is an entire problem itself. I ended up treating the spectrogram as an image and using the image processing toolkit and techniques from scipy to find peaks. A combination of a high pass filter (accentuating high amplitudes) and scipy local maxima structs did the trick.
Once we've extracted these noise-resistant peaks, we have found points of interest in a song that identify it. We are effectively "squashing" the spectrogram down once we've found the peaks. The amplitudes have served their purpose, and are no longer needed.
Let's plot them to see what it looks like:
You'll notice there are a lot of these. Tens of thousands per song, actually. The beauty is that since we've done away with amplitude, we only have two things, time and frequency, which we've conviently made into discrete, integer values. We've binned them, essentially.
We have a somewhat schizophrenic situation: on one hand, we have a system that will bin peaks from a signal into discrete (time, frequency) pairs, giving us some leeway to survive noise. On the other hand, since we've discretized, we've reduced the information of the peaks from infinite to finite, meaning that peaks found in one song could (hint: will!) collide, emitting the pairs as peaks extracted from other songs. Different songs can and most likely will emit the same peaks! So what now?
So we might have similar peaks. No problem, let's combine peaks into fingerprints! We'll do this by using a hash function.
A hash function takes an integer input and returns another integer as output. The beauty is that a good hash function will not only return the same output integer each time the input is the same, but also that very few different inputs will have the same output.
By looking at our spectrogram peaks and combining peak frequencies along with their time difference between them, we can create a hash, representing a unique fingerprint for this song.
[code]hash(frequencies of peaks, time difference between peaks) = fingerprint hash value[/code] There are lots of different ways to do this, Shazam has their own, SoundHound another, and so on. You can peruse my source to see how I do it, but the point is that by taking into account more than a single peak's values you create fingerprints that have more entropy and therefore contain more information. Thus they are more powerful identifiers of songs since they will collide less.
You can visualize what is going on with the zoomed in annotated spectrogram snipped below: