I have an application that is attempting to detect sirens from audio data. However, my understanding of audio concepts and terminology is elementary.
The first step of my application is to detect pitched sounds. The algorithm I implemented for this is as follows:
- Split audio data into windows
- Transform the data in each window to frequency domain using an FFT
- Extract the magnitude of the dominant frequency (ignore bucket 0). Let this be maxMag
- Extract the mean magnitude over all the FFT buckets (ignore bucket 0). Let this be meanMag
- If maxMag / meanMag > some threshold, then the window contains pitched sound
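
As a rough sketch, the steps above might look like the following in Python. The function names, the window size of 1024 and the threshold of 10.0 are placeholders for illustration, not tuned values. The small radix-2 FFT is only there to keep the example self-contained; a library routine such as `numpy.fft.rfft` would normally be used instead.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT. len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def is_pitched(window, threshold=10.0):
    """Return True if maxMag / meanMag exceeds the threshold (assumed value)."""
    spectrum = fft(window)
    # Keep the positive-frequency buckets only and skip bucket 0 (DC).
    mags = [abs(c) for c in spectrum[1:len(window) // 2]]
    max_mag = max(mags)
    mean_mag = sum(mags) / len(mags)
    if mean_mag == 0:
        return False  # silent window: nothing to detect
    return max_mag / mean_mag > threshold


def pitched_windows(samples, window_size=1024, threshold=10.0):
    """Split samples into non-overlapping windows and test each one.

    A trailing partial window is dropped. window_size must be a power of two.
    """
    flags = []
    for start in range(0, len(samples) - window_size + 1, window_size):
        flags.append(is_pitched(samples[start:start + window_size], threshold))
    return flags
```

A pure sine wave concentrates its energy in one bucket and gives a large ratio, while a single click (an impulse) spreads energy evenly over all buckets and gives a ratio of about 1.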
Does this algorithm make sense? Is my terminology correct?
Thank you.
If you only need to detect a single tone (or a small set of tones), you don't need a full FFT. You can use the Goertzel algorithm, which measures the energy at one specific frequency. You probably don't care about the level of the tone you are looking for relative to everything else, so you should be able to skip the "dominant frequency" test, unless you have some reason to detect the tone only when it is the loudest one in the environment.
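
For illustration, here is a minimal sketch of the Goertzel algorithm in Python. The function name `goertzel_power` and its parameters are made up for this example, and the target frequency is rounded to the nearest DFT bin for the given window length, which is a simplifying assumption.

```python
import math


def goertzel_power(samples, sample_rate, target_freq):
    """Return the squared DFT magnitude at the bin nearest target_freq."""
    n = len(samples)
    k = round(n * target_freq / sample_rate)  # nearest bin (assumption)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev = 0.0   # s[n-1]
    s_prev2 = 0.0  # s[n-2]
    for x in samples:
        # Second-order recurrence: s[n] = x[n] + coeff * s[n-1] - s[n-2]
        s = x + coeff * s_prev - s_prev2
        s_prev2 = s_prev
        s_prev = s
    # Squared magnitude of the selected bin, computed from the final state.
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2
```

You would then compare the returned power against a threshold chosen for your signal levels. For a small set of tones, call the function once per target frequency on the same window.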