Fuzzy matching/chunking algorithm

1.7k views Asked by At

Background: I have video clips and audio tracks that I want to sync with said videos.

From the video clips, I'll extract a reference audio track. I also have another track that I want to synchronize with the reference track. The desync comes from editing, which altered the intervals for each cutscene.

I need to manipulate the target track to look like (sound like, in this case) the ref track. This amounts to adding or removing silence at the correct locations. This could be done manually, but it'd be extremely tedious. So I want to be able to determine these locations programatically.

Example:

     0         1         2         
     012345678901234567890123
ref: --part1------part2------
syn: -----part1----part2-----
# (let `-` denote silence)

Output:

[(2,6), (5,9) # part1
 (13, 17), (14, 18)] # part2 

My idea is, starting from the beginning:

Fingerprint 2 large chunks* of audio and see if they match:
    If yes: move on to the next chunk
    If not:
        Go down both tracks looking for the first non-silent portion of each
        Offset the target to match the original
        Go back to the beginning of the loop

# * chunk size determined by heuristics and modifiable

The main problem here is sound matching and fingerprinting are fuzzy and relatively expensive operations.

Ideally I want to them as few times as possible. Ideas?

2

There are 2 answers

1
Kaganar On

Sounds like you're not looking to spend a lot of time delving into audio processing/engineering, and hence you want something you can quickly understand and just works. If you're willing to go with something more complex see here for a very good reference.

That being the case, I'd expect simple loudness and zero crossing measures would be sufficient to identify portions of sound. This is great because you can use techniques similar to rsync.

Choose some number of samples as a chunk size and march through your reference audio data at a regular interval. (Let's call it 'chunk size'.) Calculate the zero-crossing measure (you likely want a logarithm (or a fast approximation) of a simple zero-crossing count). Store the chunks in a 2D spatial structure based on time and the zero-crossing measure.

Then march through your actual audio data a much finer step at a time. (Probably doesn't need to be as small as one sample.) Note that you don't have to recompute the measures for the entire chunk size -- just subtract out the zero-crossings no longer in the chunk and add in the new ones that are. (You'll still need to compute the logarithm or approximation thereof.)

Look for the 'next' chunk with a close enough frequency. Note that since what you're looking for is in order from start to finish, there's no reason to look at -all- chunks. In fact, we don't want to since we're far more likely to get false positives.

If the chunk matches well enough, see if it matches all the way out to silence.

The only concerning point is the 2D spatial structure, but honestly this can be made much easier if you're willing to forgive a strict window of approximation. Then you can just have overlapping bins. That way all you need to do is check two bins for all the values after a certain time -- essentially two binary searches through a search structure.

The disadvantage to all of this is it may require some tweaking to get right and isn't a proven method.

0
xan On

If you can reliably distinguish silence from non-silence as you suggest and if the only differences are insertions of silence, then it seems the only non-trivial case is where silence is inserted where there was none before:

ref: --part1part2--
syn: ---part1---part2----

If you can make your chunk size adaptive to the silence, your algorithm should be fine. That is, if your chunk size is equivalent to two characters in the above example, your algorithm would recognize "pa" matches "pa" and "rt" matches "rt" but for the third chunk it must recognize the silence in syn and adapt the chunk size to compare "1" to "1" instead of "1p" to "1-".

For more complicated edits, you might be able to adapt a weighted Shortest Edit Distance algorithm with removing silence have 0 cost.