Is it feasible to use AVAudioEngine to detect pitch in real time?


I'm trying to write a music app where pitch detection is the core of it all. I've seen solutions to this problem as well as apps on the App Store, but most of them are pretty dated, and I'd like to do this in Swift. I've been looking at AVAudioEngine as a way to do this, but I find the documentation lacking, or maybe I just haven't been looking hard enough.

What I have found is that I can tap the inputNode bus like this:

self.audioEngine = AVAudioEngine()
self.audioInputNode = self.audioEngine.inputNode
self.audioInputNode.installTap(onBus: 0, bufferSize: 256, format: audioInputNode.outputFormat(forBus: 0)) { buffer, time in
      self.analyzeBuffer(buffer)
}

The bus is tapped 2-3 times per second and the buffer contains more than 16000 floats for each tap. Are these amplitude samples from the microphone?

The docs at least claim it's output from the node: "The buffer parameter is a buffer of audio captured from the output of an AVAudioNode."
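
For reference, pulling the samples out of the buffer looks like this (a sketch assuming the non-interleaved float format the input node delivers by default; process(_:) is a hypothetical downstream method, not real API):

func analyzeBuffer(_ buffer: AVAudioPCMBuffer) {
  // floatChannelData is non-nil for non-interleaved float PCM formats.
  guard let channelData = buffer.floatChannelData else { return }
  // Channel 0 is the (mono) microphone signal; frameLength is the valid sample count.
  let samples = Array(UnsafeBufferPointer(start: channelData[0], count: Int(buffer.frameLength)))
  // Each element is one amplitude sample, nominally in -1.0 ... 1.0.
  process(samples) // hypothetical analysis step
}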

Is it possible to use AVAudioEngine to detect pitch in real time or should I go about this another way?


There are 2 answers

MdaG (accepted answer):

I realize that Hellium3's answer is really giving me information about what pitch is and whether it's a good idea to do these things in Swift.

My question was originally about whether tapping the PCM bus is the way to obtain the input signal from the microphone.

Since asking this question I've done exactly that: used the data obtained by tapping the PCM bus and analysed the buffer windows.

It works really well, and it was my lack of understanding of what a PCM bus, a buffer and the sampling frequency are that made me ask the question in the first place.

Knowing those three makes it easier to see that this is right on.
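
For context, the wiring ends up looking roughly like this (a sketch, not my exact app code; the 0.8 harmonic constant and the 4096-sample window are illustrative values, not ones from the original):

import AVFoundation

let engine = AVAudioEngine()
let input = engine.inputNode
let format = input.outputFormat(forBus: 0)
// 0.8 is a typical peak-picking threshold for McLeod's method.
let detector = PitchDetector(harmonicConstant: 0.8, samplingFrequency: Float(format.sampleRate))

input.installTap(onBus: 0, bufferSize: 4096, format: format) { buffer, _ in
  guard let channelData = buffer.floatChannelData else { return }
  let window = Array(UnsafeBufferPointer(start: channelData[0], count: Int(buffer.frameLength)))
  if let pitch = detector.detectPitch(window) {
    // The tap block runs on an audio thread; hop to the main queue before touching UI.
    print(pitch)
  }
}
try engine.start() // error handling omitted in this sketch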

Edit: By request, here is my (now deprecated) implementation of the PitchDetector.

import Accelerate

class PitchDetector {
  var samplingFrequency: Float
  var harmonicConstant: Float

  init(harmonicConstant: Float, samplingFrequency: Float) {
    self.harmonicConstant = harmonicConstant
    self.samplingFrequency = samplingFrequency
  }

  //------------------------------------------------------------------------------
  // MARK: Signal processing
  //------------------------------------------------------------------------------

  // Estimates the pitch of one window of samples: SNAC, key-maxima picking,
  // then the best peak gives the period (lag) and hence the frequency.
  func detectPitch(_ samples: [Float]) -> Pitch? {
    let snac = self.snac(samples)
    let (lags, peaks) = self.findKeyMaxima(snac)
    let (τBest, clarity) = self.findBestPeak(lags, peaks: peaks)
    if τBest > 0 {
      let frequency = self.samplingFrequency / τBest
      if PitchManager.sharedManager.inManageableRange(frequency) {
        return Pitch(measuredFrequency: frequency, clarity: clarity)
      }
    }

    return nil
  }

  // Returns the special normalisation of the autocorrelation function (SNAC) for each lag, with values between -1 and 1
  private func snac(_ samples: [Float]) -> [Float] {
    let τMax = Int(self.samplingFrequency / PitchManager.sharedManager.noteFrequencies.first!) + 1
    var snac = [Float](repeating: 0.0, count: samples.count)
    let acf = self.acf(samples)
    let norm = self.m(samples)
    for τ in 1 ..< τMax {
      snac[τ] = 2 * acf[τ + acf.count / 2] / norm[τ]
    }

    return snac
  }

  // Autocorrelation function, computed by correlating the zero-padded signal with itself via vDSP
  private func acf(_ x: [Float]) -> [Float] {
    let resultSize = 2 * x.count - 1
    var result = [Float](repeating: 0, count: resultSize)
    let xPad = repeatElement(Float(0.0), count: x.count - 1)
    let xPadded = xPad + x + xPad
    vDSP_conv(xPadded, 1, x, 1, &result, 1, vDSP_Length(resultSize), vDSP_Length(x.count))

    return result
  }

  // m'(τ) from McLeod's paper: the sum of the squared samples in the two
  // windows being compared at each lag, computed incrementally.
  private func m(_ samples: [Float]) -> [Float] {
    var sum: Float = 0.0
    for i in 0 ..< samples.count {
      sum += 2.0 * samples[i] * samples[i]
    }
    var m = [Float](repeating: 0.0, count: samples.count)
    m[0] = sum
    for i in 1 ..< samples.count {
      // At lag i the overlap loses x[i-1] at the front and x[count-i] at the back.
      m[i] = m[i - 1] - samples[i - 1] * samples[i - 1] - samples[samples.count - i] * samples[samples.count - i]
    }
    return m
  }

  /**
   * Finds the indices of all key maximum points in data
   */
  private func findKeyMaxima(_ data: [Float]) -> (lags: [Float], peaks: [Float]) {
    var keyMaximaLags: [Float] = []
    var keyMaximaPeaks: [Float] = []
    var newPeakIncoming = false
    var currentBestPeak: Float = 0.0
    var currentBestτ = -1
    // Stop one short of the end: the loop looks ahead at data[τ + 1].
    for τ in 0 ..< data.count - 1 {
      newPeakIncoming = newPeakIncoming || ((data[τ] < 0) && (data[τ + 1] > 0))
      if newPeakIncoming {
        if data[τ] > currentBestPeak {
          currentBestPeak = data[τ]
          currentBestτ = τ
        }
        let zeroCrossing = (data[τ] > 0) && (data[τ + 1] < 0)
        if zeroCrossing {
          let (τEst, peakEst) = self.approximateTruePeak(currentBestτ, data: data)
          keyMaximaLags.append(τEst)
          keyMaximaPeaks.append(peakEst)
          newPeakIncoming = false
          currentBestPeak = 0.0
          currentBestτ = -1
        }
      }
    }

    if keyMaximaLags.count <= 1 {
      let unwantedPeakOfLowPitchTone = (keyMaximaLags.count == 1 && data[Int(keyMaximaLags[0])] < data.max()!)
      if unwantedPeakOfLowPitchTone {
        keyMaximaLags.removeAll()
        keyMaximaPeaks.removeAll()
      }
      let (τEst, peakEst) = self.approximateTruePeak(data.firstIndex(of: data.max()!)!, data: data)
      keyMaximaLags.append(τEst)
      keyMaximaPeaks.append(peakEst)
    }

    return (lags: keyMaximaLags, peaks: keyMaximaPeaks)
  }

  /**
   * Approximates the true peak according to https://www.dsprelated.com/freebooks/sasp/Quadratic_Interpolation_Spectral_Peaks.html
   */
  private func approximateTruePeak(_ τ: Int, data: [Float]) -> (τEst: Float, peakEst: Float) {
    // The parabolic fit needs a neighbour on each side; fall back to the raw sample at the edges.
    guard τ > 0 && τ < data.count - 1 else { return (Float(max(τ, 0)), data[max(τ, 0)]) }
    let α = data[τ - 1]
    let β = data[τ]
    let γ = data[τ + 1]
    let p = 0.5 * ((α - γ) / (α - 2.0 * β + γ))
    let peakEst = min(1.0, β - 0.25 * (α - γ) * p)
    let τEst = Float(τ) + p

    return (τEst, peakEst)
  }

  private func findBestPeak(_ lags: [Float], peaks: [Float]) -> (τBest: Float, clarity: Float) {
    let threshold: Float = self.harmonicConstant * peaks.max()!
    for i in 0 ..< peaks.count {
      if peaks[i] > threshold {
        return (τBest: lags[i], clarity: peaks[i])
      }
    }

    return (τBest: lags[0], clarity: peaks[0])
  }
}

All credit to Philip McLeod, whose research is used in the implementation above: http://www.cs.otago.ac.nz/research/publications/oucs-2008-03.pdf
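
For completeness: the class references a Pitch type and a PitchManager singleton that I didn't include above. Hypothetical minimal versions (assumptions, not the originals) could look like:

struct Pitch {
  let measuredFrequency: Float
  let clarity: Float
}

class PitchManager {
  static let sharedManager = PitchManager()
  // Assumed: the fundamental frequencies of interest, lowest first (here the open guitar strings).
  let noteFrequencies: [Float] = [82.41, 110.0, 146.83, 196.0, 246.94, 329.63]

  func inManageableRange(_ frequency: Float) -> Bool {
    return frequency >= noteFrequencies.first! && frequency <= noteFrequencies.last!
  }
}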

some_id:

There are a few different concepts at play here. AVAudioEngine is just the engine that gets you the raw PCM data; you could also use Novocaine, Core Audio directly, or other options.

The PCM data is the floating-point samples from the microphone.

As far as pitch tracking goes, there are various techniques. One thing to note is that frequency detection is not the same as pitch detection.

The FFT is good, but it will not be able to detect the pitch of signals with a missing fundamental. You would need to run the signal through a low-pass filter to reduce possible aliasing of frequencies above the Nyquist frequency, and then window it before passing it to the FFT to reduce spectral leakage. The FFT outputs spectral content into a series of bins; the bin with the highest value is said to be the strongest frequency in the signal.
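
To make that concrete, here's a sketch of the FFT route using Accelerate's vDSP (my illustration, assuming the sample count is a power of two; remember it returns the strongest frequency, which, as said above, is not necessarily the pitch):

import Accelerate
import Foundation

func strongestFrequency(_ samples: [Float], sampleRate: Float) -> Float? {
  let n = samples.count
  let log2n = vDSP_Length(log2(Float(n)))
  guard let setup = vDSP_create_fftsetup(log2n, FFTRadix(kFFTRadix2)) else { return nil }
  defer { vDSP_destroy_fftsetup(setup) }

  // Hann window to reduce spectral leakage.
  var window = [Float](repeating: 0, count: n)
  vDSP_hann_window(&window, vDSP_Length(n), Int32(vDSP_HANN_NORM))
  var windowed = [Float](repeating: 0, count: n)
  vDSP_vmul(samples, 1, window, 1, &windowed, 1, vDSP_Length(n))

  // Pack the real signal into the split-complex layout vDSP_fft_zrip expects.
  var real = [Float](repeating: 0, count: n / 2)
  var imag = [Float](repeating: 0, count: n / 2)
  var magnitudes = [Float](repeating: 0, count: n / 2)
  real.withUnsafeMutableBufferPointer { realPtr in
    imag.withUnsafeMutableBufferPointer { imagPtr in
      var split = DSPSplitComplex(realp: realPtr.baseAddress!, imagp: imagPtr.baseAddress!)
      windowed.withUnsafeBytes { raw in
        vDSP_ctoz(raw.bindMemory(to: DSPComplex.self).baseAddress!, 2, &split, 1, vDSP_Length(n / 2))
      }
      vDSP_fft_zrip(setup, &split, 1, log2n, FFTDirection(FFT_FORWARD))
      vDSP_zvmags(&split, 1, &magnitudes, 1, vDSP_Length(n / 2))
    }
  }

  // Skip bin 0 (DC); bin i corresponds to i * sampleRate / n Hz.
  let peakBin = (1 ..< n / 2).max { magnitudes[$0] < magnitudes[$1] }!
  return Float(peakBin) * sampleRate / Float(n)
}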

Autocorrelation can give better results. It is basically the signal correlated with itself.

In the end it comes down to what you would like to detect, and there are a few considerations to take into account. Things like the male voice and certain instruments can give incorrect results with a plain FFT run on buffers that haven't been preprocessed.

Check this PITCH DETECTION METHODS REVIEW

As far as Swift goes, it's not well suited to real-time, performance-focused systems. You can check the old benchmarks of Swift vs. C++:

[benchmark chart: the C++ FFT implementation is over 24x faster]