Python - Extracting audio from video files to numpy array using ffmpeg

I want to use Python and ffmpeg-python to extract the audio from a video directly into a NumPy array.

Currently, I first dump the audio to a WAV file using the ffmpeg CLI and then read it back into Python using scipy.io.wavfile:

$ ffmpeg -y -i {source_file} -qscale:a 0 -ac 1 -vn -threads 1 -ar 16000 out.wav

followed by this snippet in Python:

from scipy.io import wavfile

_, audio1 = wavfile.read("out.wav")

Now I want to replace the above with:

import ffmpeg
import numpy as np

out, err = (
    ffmpeg
        .input(in_filename)
        .output(
            '-', format='s16le', 
            acodec='pcm_s16le', 
            ac=1, 
            ar='16k', 
            # sample_rate='16000',
            **{"qscale:a": 0}
        )
        # .overwrite_output()
        .run(capture_stdout=True, capture_stderr=True)
)

audio2 = np.frombuffer(out, dtype=np.int16)

(Ref: https://github.com/kkroening/ffmpeg-python/blob/master/examples/transcribe.py#L23)

However, when I compare audio1 and audio2, both the number of samples and the sample values differ. For the same file, the signal read through wavfile has values in the range [-221, 212], but the second approach yields values in the range [-74, 72].
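To make this kind of comparison repeatable, a small helper can report the length, duration, and value range of each array. This is only a sketch: `summarize` is a hypothetical helper name, the 16000 Hz default is taken from the commands above, and the demo arrays are synthetic stand-ins rather than the real audio1 and audio2.

```python
import numpy as np


def summarize(name, samples, sample_rate=16000):
    """Return basic statistics for a 1-D array of PCM samples."""
    samples = np.asarray(samples)
    return {
        "name": name,
        "dtype": str(samples.dtype),
        "num_samples": int(samples.size),
        "duration_s": samples.size / sample_rate,
        "min": int(samples.min()),
        "max": int(samples.max()),
    }


# Synthetic stand-ins (assumption): replace with the real audio1 / audio2.
demo_a = np.array([0, 0, -221, 212, 5], dtype=np.int16)
demo_b = np.array([-74, 72, 2], dtype=np.int16)

for stats in (summarize("audio1", demo_a), summarize("audio2", demo_b)):
    print(stats)
```

Checking the dtype as well as the range is useful here, since both arrays are expected to be int16 if the two paths really produce the same sample format.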

I also plotted the signal (16000 samples, starting at 1 s), and there appears to be a difference in both delay and amplitude (figure 1).

A closer look at the start of the signals shows that there are also some zero-valued samples at the beginning when I use wavfile.

The starting delay seems to be around 320 samples.
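Rather than reading the delay off a plot, it can be estimated with a cross-correlation over a short window. The following is a minimal sketch using `numpy.correlate`; `estimate_delay` and its `max_samples` parameter are hypothetical names, and the demo signal is synthetic noise with a known 320-sample shift, not the real audio.

```python
import numpy as np


def estimate_delay(delayed, reference, max_samples=16000):
    """Estimate by how many samples `delayed` lags behind `reference`.

    A positive result means `delayed` starts later than `reference`.
    Only the first `max_samples` of each array are used to keep the
    direct cross-correlation cheap.
    """
    # Convert to float so int16 products cannot overflow.
    a = np.asarray(delayed[:max_samples], dtype=np.float64)
    v = np.asarray(reference[:max_samples], dtype=np.float64)
    a = a - a.mean()
    v = v - v.mean()
    corr = np.correlate(a, v, mode="full")
    # In "full" mode, index 0 corresponds to a lag of -(len(v) - 1).
    return int(np.argmax(corr)) - (len(v) - 1)


# Synthetic check (assumption): noise delayed by 320 zero samples.
rng = np.random.default_rng(0)
reference = rng.standard_normal(2000)
delayed = np.concatenate([np.zeros(320), reference])[:2000]
print(estimate_delay(delayed, reference))
```

With the real data this would be called as `estimate_delay(audio1, audio2)`, assuming audio1 is the array that starts with the extra zeros.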

Finally, the number of samples in the two arrays also differs:

>>> print(audio1.shape, audio2.shape)
(2091648,) (2091008,)
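For scale, the sample counts can be converted to time at the 16 kHz rate used in both commands. This is plain arithmetic on the numbers reported above; `samples_to_ms` is a hypothetical helper name.

```python
SAMPLE_RATE = 16000  # Hz, from -ar 16000 / ar='16k' above


def samples_to_ms(num_samples, sample_rate=SAMPLE_RATE):
    """Convert a sample count to milliseconds."""
    return 1000 * num_samples / sample_rate


length_diff = 2091648 - 2091008  # audio1 length minus audio2 length
print(length_diff, samples_to_ms(length_diff))  # 640 samples, 40.0 ms
print(samples_to_ms(320))                       # 20.0 ms start delay
```

The length difference (640 samples) is exactly twice the observed start delay (320 samples), which may or may not be a coincidence.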