I want to use Python and ffmpeg-python to extract the audio from a video directly into a NumPy array. Currently, I first dump the audio to a WAV file using the ffmpeg CLI and then read it back into Python with scipy.io.wavfile:
$ ffmpeg -y -i {source_file} -qscale:a 0 -ac 1 -vn -threads 1 -ar 16000 out.wav
This is followed by this snippet in Python:

from scipy.io import wavfile

_, audio1 = wavfile.read("out.wav")
Now I want to modify the above to do everything through ffmpeg-python:

import ffmpeg
import numpy as np

out, err = (
    ffmpeg
    .input(in_filename)
    .output(
        '-', format='s16le',     # raw 16-bit little-endian PCM to stdout
        acodec='pcm_s16le',
        ac=1,
        ar='16k',
        **{'qscale:a': 0}
    )
    .run(capture_stdout=True, capture_stderr=True)
)
audio2 = np.frombuffer(out, dtype=np.int16)
(Ref: https://github.com/kkroening/ffmpeg-python/blob/master/examples/transcribe.py#L23)
However, when I compare audio1 and audio2, both the number of samples and the values differ. For the same file, reading through wavfile gives values in the range [-221, 212], while the second approach yields values in the range [-74, 72].
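For reference, the ranges above come from a simple min/max check on the two arrays (the exact numbers are from my test file):

print(audio1.min(), audio1.max())   # -221 212 with the wavfile approach
print(audio2.min(), audio2.max())   # -74 72 with the ffmpeg-python approach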
I also plotted the two signals (16000 samples starting at the 1-second mark), and there seems to be an issue with both delay and amplitude.
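In case it helps to reproduce, the plot was made roughly like this (a minimal sketch, assuming matplotlib; the styling doesn't matter):

import matplotlib.pyplot as plt

segment = slice(16000, 32000)       # 16000 samples starting 1 s in, at 16 kHz
plt.plot(audio1[segment], label='wavfile (audio1)')
plt.plot(audio2[segment], label='ffmpeg-python (audio2)')
plt.legend()
plt.show()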
A closer look at the start shows that there are also some 0 values at the beginning when I read with wavfile. The starting delay seems to be around 320 samples.
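The ~320 figure is only an estimate; I got it by looking at the index of the first non-zero sample, roughly like this:

import numpy as np

print(np.flatnonzero(audio1)[0])    # index of the first non-zero sample, ~320 here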
Finally, the number of samples in the two arrays also differs:

>>> print(audio1.shape, audio2.shape)
(2091648,) (2091008,)
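Interestingly, the difference in length is 2091648 - 2091008 = 640 samples, i.e. exactly twice the apparent 320-sample delay. If the delay were the only problem, I would expect a shifted comparison like this (offset taken from the estimate above) to succeed:

import numpy as np

offset = 320                                   # apparent delay estimated above
n = min(len(audio1) - offset, len(audio2))
print(np.array_equal(audio1[offset:offset + n], audio2[:n]))

but the amplitude mismatch suggests it won't. Why do the two approaches give different samples for the same file?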