C: Split WAV file by silence gap

I have a bunch of WAV files, each a recording of a person reading a simple sentence (e.g. "hello world"). How can I split each WAV file into two WAV files, one per word ("hello" and "world"), by automatically recognizing the gap between the words? Unfortunately I was unable to find a tool that does this for me, so I will write C code to do it. As I understand it, the gaps should show up as low numeric values in the WAV file; is that correct? I know how to split the files; I would be glad to get an approach to the gap-recognition problem. Thank you!

There are 3 answers

6
Russell Borogove On BEST ANSWER

The way I approach this kind of task is by breaking the WAV file into blocks of, say, 0.05 seconds each, computing the RMS amplitude of each block, and comparing that RMS amplitude to a threshold. If the recording is done under carefully controlled conditions, and the volume of speech is relatively well normalized, the threshold may be a static value. Another way to do it is dynamically, checking for a block that is substantially louder than the previous block. You then consider the over-threshold block to be the start of a word.

However, in casual speech, there may not be much of a pause between words. If I say "helloworld" to you without a pause, you can understand me easily.

RMS amplitude is defined as the square root of the average-over-time of the squares of the individual samples.
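As a rough illustration of the block-and-threshold idea, here is a minimal C sketch of the static-threshold variant. It assumes the audio has already been decoded into a buffer of mono 16-bit PCM samples (WAV header parsing is not shown), and the names `block_mean_square` and `find_word_starts` are made up for this example. It compares the mean of the squared samples against the squared threshold, which gives the same result as comparing the RMS amplitude against the threshold but skips the square root.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mean of the squared samples in one block (the RMS amplitude squared).
 * Assumes mono 16-bit PCM; hypothetical helper for this sketch. */
double block_mean_square(const int16_t *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    double sum_sq = 0.0;
    for (size_t i = 0; i < n; i++) {
        double s = (double)samples[i];
        sum_sq += s * s;
    }
    return sum_sq / (double)n;
}

/* Scan the buffer in fixed-size blocks and record the sample offset of
 * every block where the signal goes from quiet to loud, i.e. a candidate
 * word start. A block is "loud" when its RMS exceeds rms_threshold.
 * A trailing partial block is ignored.
 * Returns the number of starts found; only the first max_starts offsets
 * are written to starts[]. */
size_t find_word_starts(const int16_t *samples, size_t n_samples,
                        size_t block_len, double rms_threshold,
                        size_t *starts, size_t max_starts)
{
    assert(block_len > 0 && rms_threshold >= 0.0);
    const double ms_threshold = rms_threshold * rms_threshold;
    size_t count = 0;
    int was_loud = 0; /* treat the start of the file as silence */

    for (size_t off = 0; off + block_len <= n_samples; off += block_len) {
        int is_loud = block_mean_square(samples + off, block_len) > ms_threshold;
        if (is_loud && !was_loud) {
            if (count < max_starts)
                starts[count] = off;
            count++;
        }
        was_loud = is_loud;
    }
    return count;
}
```

For a 44.1 kHz recording (an assumed sample rate), a 0.05-second block is `block_len = 2205` samples. The threshold value is something you would have to tune against your own recordings. The dynamic variant described above could be sketched the same way by comparing each block's value with the previous block's instead of with a fixed threshold.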

1
Yash On

http://digitalcardboard.com/blog/2009/08/25/the-sox-of-silence/

I am sure this is the link you need.

 sox in.wav out.wav silence 1 0.5 1% 1 5.0 1% : newfile : restart

SoX will split audio when it detects 5 or more seconds of silence. You’ll end up with output files named out001.wav, out002.wav, and so on.

0
MusiGenesis On

See this answer about note onset detection (detecting the start and end of musical notes in a WAV file is exactly the same problem as detecting the start and end of spoken words in a WAV file).

Please note, however, that the task you've set for yourself is essentially impossible without extremely sophisticated (and not yet in existence) artificial intelligence. When a person speaks in a recording, there usually are not gaps between individual words that are numerically any different from the gaps between individual syllables within multi-syllabic words.