I'm trying to convert a spectrogram back to audio. I first used librosa.griffinlim, which worked well but was slow. I am therefore trying to use torchaudio on the GPU to speed up the conversion. However, the reconstruction differs from the one librosa produces.
This is my code:
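import librosa
import matplotlib.pyplot as plt
import numpy as np
import torch
import torchaudio
from scipy import signal

# Preprocess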
data, fs = librosa.load('waveform.wav', sr=44100)
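# Band-pass 20-1000 Hz; fs=fs lets the cutoffs be given in Hz
# (without it, Wn is interpreted relative to the Nyquist frequency, fs / 2)
b, a = signal.butter(3, [20, 1000], 'bandpass', fs=fs)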
data = signal.filtfilt(b, a, data)
plt.plot(data)
# STFT
DMatrix = librosa.stft(data, n_fft=2048, hop_length=int(2048 * 0.1), window='hann')
dbMatrix = librosa.amplitude_to_db(np.abs(DMatrix), ref=np.max)
And I obtained results similar to the original waveform using librosa:
spec = librosa.db_to_amplitude(dbMatrix)
re_wav = librosa.griffinlim(spec, n_iter=100, n_fft=2048, hop_length=int(2048 * 0.1), window='hann')
plt.plot(re_wav)
But when I switched to torchaudio, the result was different:
griffinlim = torchaudio.transforms.GriffinLim(n_fft=2048, n_iter=100, hop_length=int(2048 * 0.1)).to('cuda')
spec = librosa.db_to_amplitude(dbMatrix)
re_wav = griffinlim(torch.tensor(spec).to('cuda'))
plt.plot(re_wav.cpu().detach().numpy())
What am I missing?
There are three common representations for the values in a magnitude spectrogram: amplitude, power, and decibel. A Griffin-Lim implementation has to know which one it is given when converting back to a waveform. The call
spec = librosa.db_to_amplitude(dbMatrix)
produces an amplitude spectrogram. librosa.griffinlim expects an amplitude spectrogram by default, so you get a good reconstruction. torchaudio.transforms.GriffinLim expects a power spectrogram by default (power=2). To make it work with an amplitude spectrogram, pass the argument power=1.
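To see why the mismatch matters, here is a small standard-library sketch of what happens when amplitude values are read as if they were power values. It assumes the transform maps its input back to magnitude by raising it to 1 / power, which is my understanding of how torchaudio handles the power argument; treat that detail as an assumption. The helper names and the three bin values are made up for illustration.

```python
import math

def db_to_amplitude(db):
    # Same relation as librosa.db_to_amplitude with ref=1.0: 10 ** (dB / 20).
    return 10.0 ** (db / 20.0)

def amplitude_to_db(amplitude):
    # Inverse of db_to_amplitude, used here only to inspect the results.
    return 20.0 * math.log10(amplitude)

def to_magnitude(value, power):
    # Assumed rescaling: a spectrogram stored with exponent `power`
    # is mapped back to magnitude via value ** (1 / power).
    return value ** (1.0 / power)

# Hypothetical bins at 0, -20 and -40 dB relative to the peak.
db_bins = [0.0, -20.0, -40.0]
amplitude = [db_to_amplitude(d) for d in db_bins]

# power=1: the amplitudes pass through unchanged.
read_as_amplitude = [to_magnitude(a, power=1.0) for a in amplitude]
# power=2: a square root is applied to values that were already amplitudes.
read_as_power = [to_magnitude(a, power=2.0) for a in amplitude]

for name, values in [("power=1", read_as_amplitude), ("power=2", read_as_power)]:
    print(name, [round(amplitude_to_db(v), 1) for v in values])
```

Under that assumption, reading the amplitudes with power=2 halves every level in decibels, so the 0, -20 and -40 dB bins end up at 0, -10 and -20 dB and the quiet parts of the spectrogram become relatively louder. In the code from the question, the corresponding change is torchaudio.transforms.GriffinLim(n_fft=2048, n_iter=100, hop_length=int(2048 * 0.1), power=1).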