Python for generating timestamps for a manually transcribed .wav file


I am trying to automate the generation of timestamps for speech and silences in .wav files.

My Input: Multiple .wav files with speech in English. All these .wav files have already been manually transcribed.

My Goal: To generate timestamps for the start and end of spoken text, and also for every silence longer than 2 seconds.

What I've tried so far: I've used Python to split my .wav file at silences longer than 2 seconds, which is working. I used the code below, taken from Stack Overflow.

from pydub import AudioSegment
from pydub.silence import split_on_silence
import deepspeech
import numpy as np

def match_target_amplitude(sound, target_dBFS):
    # Apply the gain needed to bring the average loudness to target_dBFS.
    change_in_dBFS = target_dBFS - sound.dBFS
    return sound.apply_gain(change_in_dBFS)

sound = AudioSegment.from_wav("/content/gdrive/My Drive/Surf.wav")
normalized_sound = match_target_amplitude(sound, -20.0)
# Split wherever the audio stays below -30 dBFS for at least 2000 ms.
chunks = split_on_silence(normalized_sound, min_silence_len=2000, silence_thresh=-30)
for i, chunk in enumerate(chunks):
    # len(chunk) is the chunk duration in milliseconds.
    fullPath = "/content/gdrive/My Drive/{number}-Surf-{length}.wav".format(number=i+1, length=len(chunk))
    chunk.export(fullPath, format="wav")

After this I tried using DeepSpeech to transcribe the split chunks of speech.

But I wasn't able to run DeepSpeech because some of my chunks are too long: the code just runs and then stops. I also don't know where to split them to make them shorter.
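To make the "where to split" problem concrete, here is a minimal sketch of one possible approach: cutting an over-long chunk into fixed-length pieces. The helper name split_long_interval and the 10-second cap are assumptions chosen for illustration, not something from the code above, and a fixed-length cut can land in the middle of a word.

```python
def split_long_interval(start_ms, end_ms, max_len_ms=10_000):
    """Cut [start_ms, end_ms) into consecutive pieces no longer than max_len_ms.

    Hypothetical helper: the 10-second default is an arbitrary assumption.
    """
    if max_len_ms <= 0:
        raise ValueError("max_len_ms must be positive")
    pieces = []
    cursor = start_ms
    while cursor < end_ms:
        piece_end = min(cursor + max_len_ms, end_ms)
        pieces.append((cursor, piece_end))
        cursor = piece_end
    return pieces


# Example: a 25-second chunk capped at 10 seconds per piece.
pieces = split_long_interval(0, 25_000, max_len_ms=10_000)
print(pieces)  # [(0, 10000), (10000, 20000), (20000, 25000)]

# With pydub, each piece could then be sliced out of a chunk by milliseconds,
# e.g. chunk[start:end], since AudioSegment slicing is millisecond-based.
```

Working on millisecond offsets rather than on exported files would also keep each piece's position in the original recording, which matters for the timestamps.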

What I'm looking for at this point: To find a way to transcribe the chunks that I've created by splitting at silences.

  1. Is there a way to train a model using my .wav files, so that the speech transcription becomes easier?
  2. Or is there a simpler way to use my .wav files along with their transcriptions, so that timestamp generation becomes easier? (I'd prefer offline methods to begin with...)
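For the timestamps themselves, the kind of output I have in mind could be sketched as follows, assuming the silence ranges are already known as [start_ms, end_ms] pairs. The function name label_segments and the sample numbers are made up for illustration. pydub's detect_silence returns ranges in this form, so it might be a usable source. This sketch does not use the transcripts at all.

```python
def label_segments(silences, total_ms):
    """Turn silence ranges into labelled (label, start_ms, end_ms) segments.

    Assumes the ranges do not overlap and lie within 0..total_ms.
    Everything between two silences is labelled as speech.
    """
    segments = []
    cursor = 0
    for start, end in sorted(silences):
        if start > cursor:
            segments.append(("speech", cursor, start))
        segments.append(("silence", start, end))
        cursor = end
    if cursor < total_ms:
        segments.append(("speech", cursor, total_ms))
    return segments


# Made-up silence ranges for a 15-second file. With pydub they could come from
# pydub.silence.detect_silence(sound, min_silence_len=2000, silence_thresh=-30),
# and total_ms from len(sound).
example_silences = [[3000, 5500], [9000, 12000]]
for label, start, end in label_segments(example_silences, 15_000):
    print(f"{label:8s} {start / 1000:6.2f}s -> {end / 1000:6.2f}s")
```

This only marks where speech and long silences begin and end. Matching individual transcript lines to those speech regions is the part I still don't know how to do.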

I hope my question is clear. Thanks!
