How to find the offset between two audio files, one noisy and one clear?


I have a scenario in which the user captures a concert scene with the real-time audio of the performer, while at the same time the device is downloading the live stream from the audio broadcaster's device. Later I replace the real-time noisy audio (captured while recording) with the good-quality audio I have streamed and saved on my phone. Right now I set the audio offset manually, on a trial-and-error basis, while merging, so that I can sync the audio and video at the exact position.

Now what I want to do is automate the audio synchronisation. Instead of merging the video with the clear audio at a manually supplied offset, I want to merge the video with the clear audio automatically, with proper sync.

For that I need to find the offset at which I should replace the noisy audio with the clear audio. For example, when the user starts and then stops the recording, I will take that sample of real-time audio, compare it with the live-streamed audio, take the exact matching part from it, and sync it at the perfect time.

Does anyone have any idea how to find the offset by comparing the two audio files and then sync the result with the video?


There are 3 answers

Jordan Smith (Best Answer)

Here's a concise, clear answer.

• It's not easy - it will involve signal processing and math.
• A quick Google gives me this solution, code included.
• There is more info on the above technique here.
• I'd suggest gaining at least a basic understanding before you try and port this to iOS.
• I would suggest you use the Accelerate framework on iOS for fast Fourier transforms etc. (a minimal sketch of the correlation step follows this list).
• I don't agree with the other answer about doing it on a server - devices are plenty powerful these days. A user wouldn't mind a few seconds of processing for something seemingly magic to happen.
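
Since the linked code is not reproduced here, below is a minimal sketch of what the correlation step could look like with Accelerate's vDSP. The function name and the assumption of mono Float buffers at a shared sample rate are mine, and vDSP_conv does the direct (time-domain) computation; for long signals the FFT-based route the bullet alludes to would be faster.

```swift
import Accelerate

// Minimal sketch (not the linked solution): time-domain cross-correlation
// with vDSP. Assumes mono Float samples at the same sample rate in both
// buffers, and that `noisy` is at least as long as `clean`.
func bestLag(noisy: [Float], clean: [Float]) -> Int {
    let lagCount = noisy.count - clean.count + 1
    precondition(lagCount > 0, "noisy must be at least as long as clean")

    // With a positive filter stride, vDSP_conv computes correlation:
    // correlation[n] = sum over p of noisy[n + p] * clean[p]
    var correlation = [Float](repeating: 0, count: lagCount)
    vDSP_conv(noisy, 1, clean, 1, &correlation, 1,
              vDSP_Length(lagCount), vDSP_Length(clean.count))

    // The lag with the largest correlation value is the best alignment.
    var peak: Float = 0
    var peakIndex: vDSP_Length = 0
    vDSP_maxvi(correlation, 1, &peak, &peakIndex, vDSP_Length(lagCount))
    return Int(peakIndex)
}
```

Dividing the returned lag by the sample rate gives the offset in seconds.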

Edit

As an aside, I think it's worth taking a step back for a second. While math and fancy signal processing like this can give great results, and do some pretty magical stuff, there can be outlying cases where the algorithm falls apart (hopefully not often).

What if, instead of getting complicated with signal processing, there's another way? After some thought, there might be. If you meet all the following conditions:

• You are in control of the server component (audio broadcaster device)
• The broadcaster is aware of the 'real audio' recording latency
• The broadcaster and receiver are communicating in a way that allows accurate time synchronisation

...then the task of calculating audio offset becomes reasonably trivial. You could use NTP or some other more accurate time synchronisation method so that there is a global point of reference for time. Then, it is as simple as calculating the difference between audio stream time codes, where the time codes are based on the global reference time.
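
To make that concrete, here is a trivial sketch; every name in it is hypothetical, and it assumes both devices stamp their audio against the same NTP-synchronised clock and that the broadcaster reports its own latency:

```swift
import Foundation

// Hypothetical sketch: both start times are seconds on a shared,
// NTP-synchronised clock. `broadcastLatency` is the broadcaster's known
// delay between the live sound and the stream's time code.
func cleanAudioOffset(recordingStart: TimeInterval,
                      streamStart: TimeInterval,
                      broadcastLatency: TimeInterval) -> TimeInterval {
    // Position within the clean stream that corresponds to the first
    // sample of the phone recording.
    return recordingStart - (streamStart - broadcastLatency)
}
```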

previous_developer

I don't know a lot about the subject, but I think you are looking for "audio fingerprinting". Similar question here.

An alternative (and more error-prone) way is to run both sounds through a speech-to-text library (or an API) and match the relevant parts. This would of course not be very reliable: sentences frequently repeat in songs, and the concert may be instrumental.
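
For what it's worth, iOS's Speech framework can produce word-level timestamps, which is the piece this matching idea would need. A rough sketch, leaving out the actual matching step (and note that recognition of noisy concert audio may simply fail):

```swift
import Speech

// Sketch: transcribe one audio file and print word-level timestamps.
// Requires permission via SFSpeechRecognizer.requestAuthorization first.
// Matching words between the noisy and clean transcripts (not shown)
// would then give a rough offset estimate.
func printWordTimestamps(for url: URL) {
    let request = SFSpeechURLRecognitionRequest(url: url)
    SFSpeechRecognizer()?.recognitionTask(with: request) { result, _ in
        guard let result = result, result.isFinal else { return }
        for segment in result.bestTranscription.segments {
            // `timestamp` is the word's start time in seconds.
            print(segment.substring, segment.timestamp)
        }
    }
}
```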

Also, doing audio processing on a mobile device may not play well (because of low performance, high battery drain, or both). I suggest you use a server if you go that way.

Good luck.

Rampartisan

This could prove to be a difficult problem: even though the two signals capture the same event, the presence of noise makes a comparison harder. You could consider running some post-processing to reduce the noise, but noise reduction is itself an extensive, non-trivial topic.

Another problem could be that the signals captured by the two devices actually differ a lot. For example, the good-quality audio (I guess the output from the live mixing console?) will be fairly different from the live version (which I guess is coming out of the on-stage monitors/FOH system and captured by a phone mic?).

Perhaps the simplest approach to start with would be to use cross-correlation to do the time-delay analysis.

A peak in the cross-correlation function indicates the relative time delay (in samples) between the two signals, so you can apply the shift accordingly.
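
As an illustration (my own sketch, not the answerer's code), here is a brute-force normalised cross-correlation in Swift. Normalising by the energy of both windows makes the score less sensitive to level differences between the phone mic and the console feed. It is O(n × lags), so it only suits short search windows; a vDSP- or FFT-based version (as in the accepted answer) is the faster route. The function names and the 44.1 kHz default are my own assumptions.

```swift
import Foundation

// Brute-force normalised cross-correlation at a single lag (a sketch).
// Searching over lags and picking the peak gives the relative delay.
func normalisedCorrelation(_ noisy: [Float], _ clean: [Float], lag: Int) -> Float {
    let n = min(noisy.count - lag, clean.count)
    guard n > 0 else { return 0 }
    var dot: Float = 0, energyA: Float = 0, energyB: Float = 0
    for i in 0..<n {
        let a = noisy[i + lag]
        let b = clean[i]
        dot += a * b
        energyA += a * a
        energyB += b * b
    }
    let denominator = (energyA * energyB).squareRoot()
    return denominator > 0 ? dot / denominator : 0
}

// Peak picking: the best lag, divided by the sample rate (44.1 kHz here),
// is the offset in seconds at which to splice in the clean audio.
func bestOffsetSeconds(noisy: [Float], clean: [Float], sampleRate: Double = 44_100) -> Double {
    let lags = 0...(max(0, noisy.count - clean.count))
    let scores = lags.map { normalisedCorrelation(noisy, clean, lag: $0) }
    let best = scores.firstIndex(of: scores.max() ?? 0) ?? 0
    return Double(best) / sampleRate
}
```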