There have been countless discussions on Stackoverflow and beyond about FFT and pitch detection.
It's generally accepted that FFT, while fast, is not very accurate for a lot of applications but it's not often explained why.
I'd like to explain my understanding of why this is the case, in the hope that someone smarter than me can correct me and fill in the blanks where I cannot.
FFT converts input data from the time domain to the frequency domain.
Initially, we start with a series of data which, if we were to plot on a graph, would have the amplitude of the sound at a given point in time on the Y axis, and time along the X axis. This is in the time domain.
FFT converts these values of amplitudes at points in time, to amplitudes at different frequencies.
The number of data output from the FFT is the SAME as the number of data input
If we input the amplitudes for 10 points in time (10 samples) FFT will output the amplitudes for 10 different frequencies within those samples (after multiplying the sqrt of the imaginary and real numbers).
Which frequencies is determined by the following:
We call the output from an FFT a bin, the width of each bin is calculated by dividing the sample rate by the number of samples in the FFT:
bin width = Sample Rate(Hz)/FFT Length (n samples)
With some real values, that could be:
bin_width = 44100 / 512 = 86.132
So we have 512 bins from our FFT (remember it's the same no. of data in as out), and each one spans 86.132 Hz.
So for a given bin, we can calculate which frequency it represents by:
Bin Freq (Hz) = Bin number (n) * bin width (Hz)
Using the values from above, the 3rd bin in the FFT output would represent the amplitude at 258.398Hz:
Bin Freq (Hz) = 3 * 86.132 = 258.396Hz
This means that with the given sample rate and buffer size, the FFT output cannot be any more accurate than ± 86.132Hz.
If you require greater accuracy (say 1Hz) you'd have to either reduce the sample rate, or increase the buffer size (or both).
desired bin width: 1Hz = 44100 / 44100 # A buffer size of 44100 would work in this instance
As the buffer size gets closer to the sample rate, the issue of latency becomes more of a problem.
FFT Results per second = Sample Rate / Buffer Size = 44100/44100 = 1 FFT per second
(44100 samples per second, to fill a 44100 sample buffer = 1 full buffer per second).
I realise that there's more to FFT than merely calculating the fundamental frequency (the bin with the highest amplitude), but is my understanding of FFT in Pitch Detection correct so far?
Are there any ways of improving the accuracy of an FFT without sacrificing latency?
In addition to the nice answer by @HartmutPfitzinger recommending zero padding and interpolation, it's worth pointing out that there are important fundamental limits on the information you can get from the Fourier transform of a time-limited signal extract.
Think about the limiting case of zero padding -- e.g., taking a single sample, then padding it out to 1 sec duration in order to take a Fourier transform with 1 Hz resolution. It's clear that a very short fragment of signal simply doesn't contain the information about periodicity. Intuitively, we need a fragment longer than the period in question to be able to say anything about whether the signal really repeats at that period.
We can do better if we have constraints on the shape of our periodic signal. For instance, if we are only looking for single sinusoids (i.e., we know our signal is s(t) = A*cos(w*t + phi)), then we can solve for unknown amplitude A, frequency w, and phase phi using as few as three samples of s(t). However, it's pretty rare that we're looking at signals that exactly fit that formulation. At the very least we expect added noise, but most often we have lots of harmonics, i.e., an unknown, non-sinusoidal periodic waveform.
If you try implement interpolated peak picking and/or zero-padding suggested above, and then look at the results you get as you make your signal excerpt shorter (while keeping the FFT length the same), you'll see the uncertainty (error) growing as the fragment gets shorter - to the point where you'll likely get useless results when the fragment is shorter than around twice the cycle length of the periodicity you're trying to measure.
This illustrates a somewhat counter-intuitive but very fundamental limit: It's hard to decide the frequency of a signal to better than 1/T Hz based an observation shorter than T seconds. This is sometimes called the uncertainty principle, and mathematically it's the same as the Heisenberg uncertainty principle from quantum mechanics.
Finally, one other technique I've used to improve the resolution of discrete Fourier transforms is Instantaneous Frequency, as described in:
Toshihiko Abe, Takao Kobayashi, Satoshi Imai: Robust pitch estimation with harmonics enhancement in noisy environments based on instantaneous frequency. ICSLP 1996 (you can find the PDF online, I've run out of my link allowance).
Frequency is just the derivative of phase with respect to time; it turns out you can use "derivative by parts" to directly calculate the instantaneous frequency in each FFT bin by combining the real and imaginary parts of two FFTs using different window functions (one the derivative of the other). For a Matlab implementation, see
http://labrosa.ee.columbia.edu/matlab/chroma-ansyn/ifgram.m
or in Python:
https://github.com/bmcfee/librosa/blob/master/librosa/core.py#L343