speechbrain & CUDA out of memory


I am trying to enhance an audio file (3:16 minutes in length, available here) using SpeechBrain. If I run the code below (from this tutorial), I get the following error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 39.59 GiB total capacity; 33.60 GiB already allocated; 3.19 MiB free; 38.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.

What is the recommended way to fix the issue? Should I just cut the audio file into pieces?

from speechbrain.pretrained import SepformerSeparation as separator
import torchaudio

model = separator.from_hparams(
    source="speechbrain/sepformer-wham-enhancement",
    savedir="pretrained_models/sepformer-wham-enhancement",
    run_opts={"device": "cuda"},
)

audio_file = "my_recording.wav"  # placeholder: path to the 3:16 file mentioned above

est_sources = model.separate_file(path=audio_file)

torchaudio.save("enhanced_wham.wav", est_sources[:, :, 0].detach().cpu(), 8000)
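
The error message itself suggests setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF. I have not tried that yet; if I read the PyTorch docs correctly, it would have to be set before the first CUDA allocation, something like:

import os

# Allocator hint from the error message: cap the allocator's split size to
# reduce fragmentation. The value 128 is just an example; this must be set
# before the first CUDA allocation (i.e., before the model is loaded).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

But since 33+ GiB of the 39.59 GiB card are already allocated, I suspect chunking the input may be unavoidable.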

1 Answer

Answered by Sanjay KhanSsk

Yes, splitting is the way to go. I tried to process a 1:30-minute audio file and got the same CUDA out-of-memory error in Google Colab; after splitting the audio into chunks, processing completed in a couple of seconds.

Attaching my code (not production-ready):

from pydub import AudioSegment
import os
import torch
import torchaudio
from speechbrain.pretrained import SepformerSeparation as separator

# Load the pretrained enhancement model (same setup as in the question)
model = separator.from_hparams(
    source="speechbrain/sepformer-wham-enhancement",
    savedir="pretrained_models/sepformer-wham-enhancement",
    run_opts={"device": "cuda"},
)

# Path to your input audio file
input_audio_path = 'custom-functions.wav'

# Path to the enhanced output file
enhanced_audio_path = 'custom-functions-exported.wav'

# Duration of each chunk: 10 seconds, in milliseconds (pydub indexes in ms)
split_duration = 10 * 1000

# Load the input audio file
audio = AudioSegment.from_file(input_audio_path)

# Split the audio into 10-second segments
segments = []
for i in range(0, len(audio), split_duration):
    segments.append(audio[i:i + split_duration])

output_enhanced_speech = []

for i, segment in enumerate(segments):
    # Save the segment as a temporary file
    temp_path = f'temp_segment_{i}.wav'
    segment.export(temp_path, format='wav')

    # Enhance the segment, then move the result to the CPU so the GPU
    # only ever holds one chunk at a time
    enhanced_speech = model.separate_file(path=temp_path)
    output_enhanced_speech.append(enhanced_speech.detach().cpu())

    # Clean up the temporary file
    os.remove(temp_path)

# Concatenate the enhanced segments along the time dimension (dim=1)
enhanced_speech_cumulative = torch.cat(output_enhanced_speech, dim=1)

# Collapse [batch, time, source] to [batch, time] for torchaudio.save
audio_data = torch.flatten(enhanced_speech_cumulative, start_dim=1)

sample_rate = 8000  # the sepformer-wham-enhancement model works at 8 kHz
torchaudio.save(enhanced_audio_path, audio_data, sample_rate)
print(f'Enhanced audio saved to {enhanced_audio_path}')
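
One more memory saver worth trying: if separate_file tracks gradients during inference (the .detach() calls above suggest it may), wrapping the call in torch.no_grad() should lower peak GPU memory further. A minimal sketch, reusing the model and temp_path names from the loop above:

import torch

# Inference only: disable autograd so intermediate activations are not
# retained for a backward pass that will never happen
with torch.no_grad():
    enhanced_speech = model.separate_file(path=temp_path)

Also be aware that enhancing each 10-second chunk independently can leave small discontinuities at the chunk boundaries; if that becomes audible, a common remedy is to overlap the chunks slightly and cross-fade the outputs.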