Efficient speaker diarization


I am running a VM instance on google cloud. My goal is to apply speaker diarization to several .wav files stored on cloud buckets.

I have tried the following alternatives with the subsequent problems:

  1. Speaker diarization with Google's Speech-to-Text API. This runs fast, but the results make no sense at all. I have seen similar issues and opened a thread myself, but got no answer. The output returns at most two speakers, with seemingly random labels. Here is the code I tried in Python:
from google.cloud import speech_v1p1beta1 as speech
from google.cloud import storage
import os
import json
import sys

storage_client = storage.Client()
client = speech.SpeechClient()


if "--channel" in sys.argv:
    index = sys.argv.index("--channel") + 1
    if index < len(sys.argv):
        channel = sys.argv[index]
        print("Channel:", channel)
    else:
        sys.exit("--channel option requires a value")
else:
    sys.exit("--channel option is required")

audio_folder=f'audio_{channel}'
# channel='tve'
transcript_folder=f'transcript_output'

bucket = storage_client.bucket(audio_folder)
bucket2 = storage_client.bucket(transcript_folder)
wav_files=[i.name for i in bucket.list_blobs()]
json_files=[i.name.split(f'{channel}/')[-1] for i in bucket2.list_blobs(prefix=channel)]



for file in wav_files:
    if not file.endswith('.wav'):
        continue
    transcript_name=file.replace('.wav','.json')
    if transcript_name in json_files:
        continue
    gcs_uri = f"gs://{audio_folder}/{file}"
    # gcs_uri = f"gs://{audio_folder}/out2.wav"
    audio = speech.RecognitionAudio(uri=gcs_uri)

    diarization_config = speech.SpeakerDiarizationConfig(
        enable_speaker_diarization=True,
        min_speaker_count=2,
        #max_speaker_count=10,
    )
    config = speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
            #sample_rate_hertz=8000,
            language_code="es-ES",
            diarization_config=diarization_config,
            #audio_channel_count = 2,
        )

    print("Waiting for operation to complete...")
    operation = client.long_running_recognize(config=config, audio=audio)
    response=operation.result()
    # For diarization, the last result contains all the words with speaker tags.
    result = response.results[-1]
    with open(transcript_name, 'w') as f:
        # Serialize the proto message as real JSON rather than json.dump(str(result)).
        f.write(type(result).to_json(result))
    os.system(f'gsutil cp {transcript_name} gs://transcript_output/{channel}')
    os.remove(transcript_name)
    print(f'File {file} processed. ')
    

No matter how min_speaker_count or max_speaker_count are changed, the results are the same.
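For context, when the API does return useful output, the per-word speaker tags arrive on the words of that last result. Below is a minimal sketch of collapsing such tags into speaker turns; the (word, tag) pairs are made-up stand-ins for `result.alternatives[0].words`, where each word carries a `speaker_tag` field:

```python
# Collapse consecutive words with the same speaker tag into one turn.
# The sample data imitates the shape of the API's word-level output.

def group_into_turns(tagged_words):
    """Return a list of (speaker_tag, joined_text) turns."""
    turns = []
    for word, tag in tagged_words:
        if turns and turns[-1][0] == tag:
            turns[-1][1].append(word)
        else:
            turns.append((tag, [word]))
    return [(tag, " ".join(words)) for tag, words in turns]

sample = [("hola", 1), ("buenas", 1), ("gracias", 2), ("adios", 1)]
print(group_into_turns(sample))
# → [(1, 'hola buenas'), (2, 'gracias'), (1, 'adios')]
```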

  2. pyannote:

As the above did not work, I decided to try pyannote. Its performance is very good, but there is one problem: it is extremely slow. For a 30-minute .wav file, it takes more than 3 hours to finish the diarization.

Here is my code:


# import packages
import os
import sys
from google.cloud import storage
from pyannote.audio import Pipeline
from pyannote.audio import Model

# channel='a3'
storage_client = storage.Client()
if "--channel" in sys.argv:
    index = sys.argv.index("--channel") + 1
    if index < len(sys.argv):
        channel = sys.argv[index]
        print("Channel:", channel)
    else:
        sys.exit("--channel option requires a value")
else:
    sys.exit("--channel option is required")

audio_folder=f'audio_{channel}'
transcript_folder=f'transcript_{channel}'
bucket = storage_client.bucket(audio_folder)
bucket2 = storage_client.bucket(transcript_folder)
wav_files=[i.name for i in bucket.list_blobs()]
rttm_files=[i.name for i in bucket2.list_blobs()]



token="XXX"
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2.1",
                                    use_auth_token=token)



for file in wav_files:
    if not file.endswith('.wav'):
        continue
    rttm_name=file.replace('.wav','.rttm')
    if rttm_name in rttm_files:
        continue
    if '2023' not in file:
        continue
    
    print(f'Doing file {file}')
    gcs_uri = f"gs://{audio_folder}/{file}"     
    os.system(f'gsutil cp {gcs_uri} {file}')   
    diarization = pipeline(file)                             
    with open(rttm_name, "w") as rttm:
        diarization.write_rttm(rttm)        
    os.system(f'gsutil cp {rttm_name} gs://transcript_{channel}/{rttm_name}')
    os.remove(file)
    os.remove(rttm_name)
    


I am running this with Python 3.9 on a VM instance with an NVIDIA T4 GPU.

Is this normal? I have seen reports that pyannote.audio is somewhat slow, on the order of a 1x real-time factor, but this is much worse than that, given that it should, in theory, be running on a dedicated GPU...
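A quick sanity check on those numbers: the real-time factor is processing time divided by audio duration, and 3 hours for a 30-minute file is a factor of 6, far above the ~1x expected even on CPU, which hints that the pipeline may never have reached the GPU:

```python
# Real-time factor (RTF) = processing time / audio duration.
# An RTF of 1.0 means processing takes as long as the audio itself.

def realtime_factor(processing_seconds, audio_seconds):
    return processing_seconds / audio_seconds

print(realtime_factor(3 * 3600, 30 * 60))  # → 6.0
```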

Are there any faster alternatives? Is there any way to improve the code, or to configure the VM, that might increase speed?

There is 1 answer

Answered by Fady's Cube (best answer):

To make this run quickly on a GPU (Google Colab is used as an example), first install pyannote:

!pip install -qq https://github.com/pyannote/pyannote-audio/archive/refs/heads/develop.zip

And then:

from pyannote.audio import Pipeline
import torch

pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization@2.1",
        use_auth_token='your hugging face token here')

pipeline.to(torch.device('cuda')) # switch to gpu
diarization = pipeline(audio_file_path)
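As a usage note (my addition, not part of the answer): the returned diarization can be iterated with `diarization.itertracks(yield_label=True)`, and the RTTM file produced by `write_rttm` is plain whitespace-separated text. Below is a small stand-alone parser for its SPEAKER lines, assuming the standard RTTM field order; `parse_rttm_line` is a hypothetical helper, not part of pyannote:

```python
# Parse one RTTM SPEAKER line back into a (start, end, label) segment.
# Standard field order:
#   SPEAKER <uri> <channel> <start> <duration> <NA> <NA> <label> <NA> <NA>

def parse_rttm_line(line):
    fields = line.split()
    start = float(fields[3])
    duration = float(fields[4])
    return (start, start + duration, fields[7])

line = "SPEAKER out2 1 0.500 2.300 <NA> <NA> SPEAKER_00 <NA> <NA>"
print(parse_rttm_line(line))  # → (0.5, 2.8, 'SPEAKER_00')
```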