I am trying to automate the generation of timestamps for speech and silences in .wav files.
My Input: Multiple .wav files with speech in English. All these .wav files have already been manually transcribed.
My Goal: To generate timestamps for the start and end of spoken text, and for all silences longer than 2 seconds.
What I've tried so far: I've used Python to split my .wav file at silences longer than 2 seconds, which works. I used the code below, taken from Stack Overflow.
from pydub import AudioSegment
from pydub.silence import split_on_silence
import deepspeech  # used later for the transcription step
import numpy as np

def match_target_amplitude(sound, target_dBFS):
    # Normalize the clip to a target average loudness
    change_in_dBFS = target_dBFS - sound.dBFS
    return sound.apply_gain(change_in_dBFS)

sound = AudioSegment.from_wav("/content/gdrive/My Drive/Surf.wav")
normalized_sound = match_target_amplitude(sound, -20.0)

# Split wherever the audio stays below -30 dBFS for at least 2 seconds
chunks = split_on_silence(normalized_sound, min_silence_len=2000, silence_thresh=-30)

for i, chunk in enumerate(chunks):
    fullPath = "/content/gdrive/My Drive/{number}-Surf-{length}.wav".format(number=i+1, length=len(chunk))
    chunk.export(fullPath, format="wav")
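A side note on the timestamps themselves: since split_on_silence doesn't report where the cuts were made, I believe pydub's detect_nonsilent can give the start/end offsets directly (in milliseconds). A minimal sketch, assuming the same thresholds as above:

from pydub.silence import detect_nonsilent

# Start/end (in ms) of every non-silent stretch, using the same thresholds as above
speech_ranges = detect_nonsilent(normalized_sound, min_silence_len=2000, silence_thresh=-30)

for start_ms, end_ms in speech_ranges:
    print("speech: {:.2f}s -> {:.2f}s".format(start_ms / 1000, end_ms / 1000))

# Silences longer than 2 s are the gaps between consecutive speech ranges
for (_, prev_end), (next_start, _) in zip(speech_ranges, speech_ranges[1:]):
    print("silence: {:.2f}s -> {:.2f}s".format(prev_end / 1000, next_start / 1000))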
After splitting, I tried using DeepSpeech to transcribe the chunks of speech. But I wasn't able to run DeepSpeech because some of my chunks are too long, so the code just runs and then stops. I also don't know where to split them to make them shorter.
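For context, here is a minimal sketch of the kind of pipeline I'm attempting, with each chunk capped at an arbitrary 30 seconds and converted to the 16 kHz mono 16-bit audio DeepSpeech expects (the model/scorer file names below are placeholders for whichever release is installed):

import deepspeech
import numpy as np

# Placeholder model/scorer paths -- substitute whichever DeepSpeech release you have
model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

MAX_LEN_MS = 30 * 1000  # arbitrary cap; longer chunks are sliced into 30 s pieces

for chunk in chunks:
    # Further split any chunk that exceeds the cap (pydub slices by milliseconds)
    pieces = [chunk[i:i + MAX_LEN_MS] for i in range(0, len(chunk), MAX_LEN_MS)]
    for piece in pieces:
        # Convert to 16 kHz, mono, 16-bit PCM before passing to DeepSpeech
        audio = piece.set_frame_rate(16000).set_channels(1).set_sample_width(2)
        buffer = np.frombuffer(audio.raw_data, dtype=np.int16)
        print(model.stt(buffer))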
What I'm looking for at this point: a way to transcribe the chunks I've created by splitting at silences.
- Is there a way to train a model using my .wav files, so that the speech transcription becomes easy?
- Or is there a simpler way to use my .wav files along with their transcriptions, so that timestamp generation becomes easy? (I'd prefer offline methods to begin with...)
I hope my question is clear. Thanks!