I know there are tons of topics on finding pitch from the FFT, and I've gained a decent understanding of the whole process of turning sample data from the time domain to the frequency domain, but there are still some (probably more advanced) areas that I'm a little stuck on.
I'm going to walk step by step through my current process, and hopefully someone can help me understand where I'm going wrong!
Before I start, the example I'm using here is a Wav file that I created in Logic: it's a piano preset playing up the A major scale, starting at A4 and moving up one note (A4, B4, C#5, D5...) every half bar, for a total of 4 seconds at 120 bpm. Here's a link to the wav if it helps: https://www.dropbox.com/s/zq1u9aylh5cwlmm/PianoA4_120.wav?dl=0
Step 1:
I parse out the metadata and the actual sample data.
Metadata:
channels => 2,
sample_rate => 44100,
byte_rate => 176400,
bits_per_sample => 16,
data_chunk_size => 705600,
data => ...
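For reference, these fields can be read with Python's standard wave module. This is just a sketch; the in-memory silent WAV below stands in for the real file on disk, so the frame count differs from the 4-second example.

```python
import io
import wave

# Build a tiny 2-channel, 16-bit, 44.1 kHz WAV in memory so the
# example is self-contained (substitute your real file path here).
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(2)
    w.setsampwidth(2)                      # 2 bytes = 16 bits per sample
    w.setframerate(44100)
    w.writeframes(b"\x00\x00" * 2 * 4410)  # 0.1 s of stereo silence

buf.seek(0)
with wave.open(buf, "rb") as w:
    channels = w.getnchannels()
    sample_rate = w.getframerate()
    bits_per_sample = 8 * w.getsampwidth()
    # byte_rate = bytes per second of audio
    byte_rate = sample_rate * channels * w.getsampwidth()
    # data chunk size = total frames * bytes per frame
    data_chunk_size = w.getnframes() * channels * w.getsampwidth()

print(channels, sample_rate, byte_rate, bits_per_sample)
```

Note how byte_rate is derived: 44100 * 2 channels * 2 bytes = 176400, matching your metadata.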
Step 2: Since there are 2 channels, I have a left and a right array full of the corresponding sample data, and I put each of them through its own FFT. Each FFT gives me a magnitude and a phase for every frequency bin.
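In case the channel split itself is a point of doubt: a stereo WAV's data chunk interleaves 16-bit little-endian samples frame by frame (L0 R0 L1 R1 ...). A minimal sketch of de-interleaving, with made-up byte values standing in for the real data chunk:

```python
import struct

# Hypothetical raw bytes standing in for the WAV's data chunk:
# four stereo frames of signed 16-bit little-endian samples.
raw = struct.pack("<8h", 100, -100, 200, -200, 300, -300, 400, -400)

samples = struct.unpack("<" + "h" * (len(raw) // 2), raw)
left = samples[0::2]   # channel 1: every even-indexed sample
right = samples[1::2]  # channel 2: every odd-indexed sample

print(left)   # (100, 200, 300, 400)
print(right)  # (-100, -200, -300, -400)
```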
Step 3:
I now need to find the max magnitude of each FFT. I do this by taking the magnitudes of the complex results and then finding the max value. I'm using MATLAB to help me, so I run max(abs(fft(data))). The max values I got from the two FFTs were 1275.6 and 1084.0.
Step 4: Find the index of each max value in its respective FFT, then look up the frequency that that index maps to in the frequency domain. This gave me 1177.0 Hz and 1177.5 Hz.
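The bin-to-frequency mapping in steps 3-4 is frequency = index * sample_rate / N. One thing worth checking: MATLAB indexing is 1-based, so there it's (index - 1) * fs / N, which is one classic source of the off-by-one the answer below suspects. A self-contained sketch (pure Python with a naive 0-based DFT standing in for MATLAB's fft, and a small fs/N so it runs quickly):

```python
import cmath
import math

fs = 8000   # sample rate (smaller than 44100 so the naive DFT stays fast)
N = 1000    # transform length
f0 = 440.0  # test tone: A4, chosen to land exactly on a bin (440*N/fs = 55)

x = [math.sin(2 * math.pi * f0 * n / fs) for n in range(N)]

def dft_mag(x):
    # Naive O(N^2) DFT magnitudes, bins below Nyquist only.
    N = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N)
                    for n in range(N)))
            for k in range(N // 2)]

mags = dft_mag(x)
k_max = max(range(len(mags)), key=mags.__getitem__)  # 0-based index!
freq = k_max * fs / N  # bin index -> frequency in Hz

print(k_max, freq)  # 55 440.0
```

Because the tone here sits exactly on a bin centre, the recovered frequency is exact; your piano notes generally won't, which is what the answer's step 3/4 observations are about.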
This is where I'm confused! I've plotted the time-domain graph and seen how the pitch is found to be A4 simply by looking at the period and knowing what the period of A4 is, but I'm trying to understand how I can come to the same conclusion via the FFT. Any help / places to point me to would be greatly appreciated!
A4 is usually 440Hz. My guess is that you've detected the 3rd harmonic of 440Hz and have an off-by-one error.
Here are some observations on the steps you're using:
Step 2:
There's likely nothing to be gained from performing the analysis on both channels. Convert to a mono signal by summing the two together.
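The downmix is a one-liner: average the channels (summing and halving avoids clipping in fixed-point formats). Illustrative sample values, not data from the actual file:

```python
# Hypothetical per-channel sample arrays standing in for the WAV data.
left = [100, 200, 300, 400]
right = [-100, -200, 300, 400]

# Average the two channels sample by sample; dividing by 2 keeps the
# result within the 16-bit range even when both channels peak together.
mono = [(l + r) / 2 for l, r in zip(left, right)]
print(mono)  # [0.0, 0.0, 300.0, 400.0]
```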
Step 3:
This doesn't work reliably for polyphonic signals (or, for that matter, real-world monophonic instrument signals). Furthermore, even with monophonic signals there are cases where the power in two adjacent bins is identical: each bin is effectively a band-pass filter with a gradual roll-off in its frequency response, so a signal sitting precisely in the middle of two bins contributes equally to both. And with real signals, neither of those bins may have the highest energy in the spectrum despite holding the predominant frequency: remember that harmonics will be present and may be large. Also be aware that with some real-world instrument sounds, the fundamental might not even have the highest energy of the partials.
The phase component of the FFT gives plenty of clues that signals straddle bands.
Step 4:
You're finding the centre frequency of the FFT bin with the highest energy. As the musical scale is logarithmic (base 2), this is a reasonable approximation at higher frequencies, but at low frequencies it won't do the job, even if you use large FFTs (in which case you burn a lot of CPU cycles and lose temporal resolution).
To do better than this, you can use the Short-Time Fourier Transform and make use of i) the phase (phi) from successive windows of FFT data, and ii) the fact that f = dphi/dt.
From this you can get pretty accurate results.
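A minimal sketch of that phase-difference idea (a basic phase-vocoder frequency estimate, not this answerer's exact method): take two overlapping windows a hop apart, measure the phase advance in the peak bin, subtract the advance a tone exactly on that bin would produce, wrap the remainder, and convert it to a frequency correction. Parameters and the 430 Hz test tone are illustrative; a naive per-bin DFT stands in for a real FFT.

```python
import cmath
import math

fs = 8000   # sample rate
N = 256     # window length (bin width = fs/N = 31.25 Hz)
H = 64      # hop between the two windows
f0 = 430.0  # true tone frequency, deliberately between bin centres

def frame(start):
    # Hann-windowed frame to limit spectral leakage.
    return [math.sin(2 * math.pi * f0 * (start + n) / fs) *
            0.5 * (1 - math.cos(2 * math.pi * n / N))
            for n in range(N)]

def dft_bin(x, k):
    # Single DFT bin (stands in for taking bin k of a full FFT).
    return sum(x[n] * cmath.exp(-2j * math.pi * k * n / len(x))
               for n in range(len(x)))

x0, x1 = frame(0), frame(H)

# Peak bin from the first frame's magnitude spectrum (bins below Nyquist).
k = max(range(N // 2), key=lambda b: abs(dft_bin(x0, b)))

# Measured phase advance between frames, minus the advance a tone sitting
# exactly on bin k would produce, wrapped into [-pi, pi).
dphi = cmath.phase(dft_bin(x1, k)) - cmath.phase(dft_bin(x0, k))
dphi -= 2 * math.pi * k * H / N
dphi = (dphi + math.pi) % (2 * math.pi) - math.pi

# The wrapped phase deviation converts to a fractional-bin correction.
freq_est = (k + dphi * N / (2 * math.pi * H)) * fs / N
print(freq_est)
```

Here the peak bin alone would report 14 * 31.25 = 437.5 Hz, while the phase correction recovers roughly the true 430 Hz, which is the accuracy gain the answer is pointing at.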