In my app, I'm getting an array of audio samples (sample rate = 8000) that was loaded with torchaudio.load.
I need to use this audio array to run Whisper (STT).
For efficiency, I want to avoid loading the wav file again with Whisper's load_audio, and instead resample the array to 16000 myself.
whisper.load_audio uses ffmpeg to load and resample the audio to 16000. I'm trying to use librosa or torchaudio to resample the audio array, but the resample methods never seem to produce the same result. (I assume that if I use a resample method other than the one the Whisper model was trained on, I can get bad results.)
Example:
Loading the test.wav file (with SR=8000) and printing the first 5 values:
whisper_audio = whisper.load_audio(file)
=> [-0.00082397 -0.00115967 -0.00186157 -0.00231934 -0.00222778, ...]
Loading with torchaudio and resampling with librosa:
librosa.resample(vad_audio, orig_sr=8000, target_sr=16000, scale=True, res_type='kaiser_best')
=> [-0.00082317 -0.0010577 -0.0013937 -0.0016688 -0.00186235
The values look different.
How can I resample the audio in exactly the same way ffmpeg does it?
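
For completeness, the full comparison I'm running looks roughly like this (a minimal sketch; test.wav is assumed to be a mono 8 kHz file, and vad_audio is just the torchaudio waveform as a numpy array):

```python
# Minimal sketch of the comparison (assumes test.wav is mono, 8 kHz).
import whisper
import torchaudio
import librosa

file = "test.wav"

# Reference path: whisper loads and resamples to 16 kHz via ffmpeg.
whisper_audio = whisper.load_audio(file)

# My path: load with torchaudio, then resample the numpy array with librosa.
waveform, sr = torchaudio.load(file)      # shape (channels, frames), sr == 8000
vad_audio = waveform.squeeze(0).numpy()   # mono float32 array
resampled = librosa.resample(
    vad_audio, orig_sr=sr, target_sr=16000, scale=True, res_type="kaiser_best"
)

print(whisper_audio[:5])
print(resampled[:5])
```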
You can use torchaudio.io.StreamReader to load and resample audio. This functionality is implemented with ffmpeg, so you might be able to produce the same waveform.
When you use the add_basic_audio_stream method with the sample_rate option, it uses FFmpeg's filter mechanism to apply the resampling.
https://pytorch.org/audio/2.1.1/generated/torchaudio.io.StreamReader.html#add-basic-audio-stream
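
For example, a minimal sketch (test.wav and the 16000 Hz target are taken from the question; the chunk size is an arbitrary choice):

```python
import torch
from torchaudio.io import StreamReader

reader = StreamReader("test.wav")
# sample_rate=16000 makes FFmpeg apply its resampling filter while decoding;
# format="fltp" gives float32 samples, like whisper.load_audio returns.
reader.add_basic_audio_stream(frames_per_chunk=4000, sample_rate=16000, format="fltp")

# Collect every chunk; each audio chunk has shape (frames, channels).
chunks = [chunk for (chunk,) in reader.stream()]
waveform = torch.cat(chunks, dim=0).squeeze(-1)  # mono file -> 1-D tensor
audio_16k = waveform.numpy()                     # compare with whisper.load_audio output
```

Note also that whisper.load_audio decodes to 16-bit PCM and converts to float by dividing by 32768, so even with an identical resampler there can be tiny quantization-level differences compared to an all-float pipeline.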
If the ffmpeg command is using a non-default resampling method, you need to construct the same filter description and pass it to the add_audio_stream method.
https://pytorch.org/audio/2.1.1/generated/torchaudio.io.StreamReader.html#add-audio-stream
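
For example, a sketch with an explicit filter description (the soxr resampler here is only an illustration of a non-default choice and requires an FFmpeg build with libsoxr):

```python
from torchaudio.io import StreamReader

reader = StreamReader("test.wav")
# Same idea as add_basic_audio_stream(sample_rate=16000), but the resampling
# filter is spelled out, so its options (here: the soxr resampler) can be changed.
reader.add_audio_stream(
    frames_per_chunk=4000,
    filter_desc="aresample=16000:resampler=soxr,aformat=sample_fmts=fltp",
)
# The chunks are then read with reader.stream(), as in the previous sketch.
```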