Azure Speech diarization failing to tag speakers properly until a long 7-second statement is spoken


The Azure Speech private preview for diarization used to assign the speaker tag "Unknown" until it recognised a long (roughly 7-second) statement from a speaker. With the API in public preview, it has started tagging speakers as Guest-N straight away, which raises an accuracy concern: even when Guest-1 has already been detected, short sentences get tagged Guest-2 until Guest-2 speaks a long sentence, and so on.

Is there a solution to get the private preview behaviour back?

As per the documentation, shorter sentences should still be marked as Unknown:

https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-stt-diarization?tabs=windows&pivots=programming-language-csharp

SDK version used (Gradle dependency):

    implementation group: 'com.microsoft.cognitiveservices.speech', name: 'client-sdk', version: '1.34.0'
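For reference, the tag in question surfaces in the SDK's transcribed callback. Below is a minimal sketch of where it is observed (the file name is a placeholder and the stop handling is deliberately simplified; see the answer below for the full quickstart wiring):

    import com.microsoft.cognitiveservices.speech.SpeechConfig;
    import com.microsoft.cognitiveservices.speech.audio.AudioConfig;
    import com.microsoft.cognitiveservices.speech.transcription.ConversationTranscriber;

    public class SpeakerTagDemo {
        public static void main(String[] args) throws Exception {
            // Key and region are read from the environment; meeting.wav is a placeholder.
            SpeechConfig config = SpeechConfig.fromSubscription(
                    System.getenv("SPEECH_KEY"), System.getenv("SPEECH_REGION"));
            config.setSpeechRecognitionLanguage("en-US");
            AudioConfig audio = AudioConfig.fromWavFileInput("meeting.wav");

            ConversationTranscriber transcriber = new ConversationTranscriber(config, audio);
            transcriber.transcribed.addEventListener((s, e) -> {
                // Private preview: short utterances arrived as Speaker ID=Unknown.
                // Public preview: they arrive with a provisional Guest-N tag instead.
                System.out.println("Text=" + e.getResult().getText()
                        + " Speaker ID=" + e.getResult().getSpeakerId());
            });

            transcriber.startTranscribingAsync().get();
            Thread.sleep(30_000); // crude wait for the sketch; real code should stop on an event
            transcriber.stopTranscribingAsync().get();
            transcriber.close();
        }
    }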


1 Answer

Naveen Sharma answered:

Diarization is described as the process of segmenting audio containing multiple speakers into discrete speech segments based on the identity of the speaker during each segment.

  • It is crucial for understanding “who is speaking when” in a speech recognition pipeline.

Note: Real-time diarization is currently in public preview.

  • The documentation emphasizes the significance of diarization in various scenarios, including podcast sessions, call-center calls, doctor-patient interactions, and team meetings.
  • It also states that diarization is essential for providing context to downstream NLP systems, as it enables the modelling of conversations.
  • The code below is taken from the Real-time diarization quickstart on GitHub.
    import com.microsoft.cognitiveservices.speech.*;
    import com.microsoft.cognitiveservices.speech.audio.AudioConfig;
    import com.microsoft.cognitiveservices.speech.transcription.ConversationTranscriber;

    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.Semaphore;

    public class ConversationTranscription {

        // Read the key and region from the environment rather than hard-coding them.
        private static String speechKey = System.getenv("SPEECH_KEY");
        private static String speechRegion = System.getenv("SPEECH_REGION");

        public static void main(String[] args) throws InterruptedException, ExecutionException {

            SpeechConfig speechConfig = SpeechConfig.fromSubscription(speechKey, speechRegion);
            speechConfig.setSpeechRecognitionLanguage("en-US");
            AudioConfig audioInput = AudioConfig.fromWavFileInput("katiesteve.wav");

            Semaphore stopRecognitionSemaphore = new Semaphore(0);

            ConversationTranscriber conversationTranscriber = new ConversationTranscriber(speechConfig, audioInput);
            {
                // Subscribes to events.
                conversationTranscriber.transcribing.addEventListener((s, e) -> {
                    System.out.println("TRANSCRIBING: Text=" + e.getResult().getText());
                });

                // Final results carry the speaker tag (Unknown or Guest-N).
                conversationTranscriber.transcribed.addEventListener((s, e) -> {
                    if (e.getResult().getReason() == ResultReason.RecognizedSpeech) {
                        System.out.println("TRANSCRIBED: Text=" + e.getResult().getText() + " Speaker ID=" + e.getResult().getSpeakerId());
                    }
                    else if (e.getResult().getReason() == ResultReason.NoMatch) {
                        System.out.println("NOMATCH: Speech could not be transcribed.");
                    }
                });

                conversationTranscriber.canceled.addEventListener((s, e) -> {
                    System.out.println("CANCELED: Reason=" + e.getReason());

                    if (e.getReason() == CancellationReason.Error) {
                        System.out.println("CANCELED: ErrorCode=" + e.getErrorCode());
                        System.out.println("CANCELED: ErrorDetails=" + e.getErrorDetails());
                        System.out.println("CANCELED: Did you update the subscription info?");
                    }

                    // The service cancels when the WAV file is exhausted, releasing the wait below.
                    stopRecognitionSemaphore.release();
                });

                conversationTranscriber.sessionStarted.addEventListener((s, e) -> {
                    System.out.println("\n    Session started event.");
                });

                conversationTranscriber.sessionStopped.addEventListener((s, e) -> {
                    System.out.println("\n    Session stopped event.");
                });

                conversationTranscriber.startTranscribingAsync().get();

                // Waits for completion.
                stopRecognitionSemaphore.acquire();

                conversationTranscriber.stopTranscribingAsync().get();
            }

            speechConfig.close();
            audioInput.close();
            conversationTranscriber.close();

            System.exit(0);
        }
    }


Output:

(Screenshot: console output showing TRANSCRIBED lines with text and Guest-N speaker IDs.)
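There is no documented switch to restore the private-preview labelling, but it can be approximated on the client side. The sketch below is a hypothetical helper (not part of the SDK): it treats a Guest-N tag as unconfirmed until that guest has produced at least one utterance of roughly 7 seconds, and reports shorter segments as Unknown in the meantime, mirroring the behaviour described in the question.

    import java.math.BigInteger;
    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical client-side filter approximating the private-preview labelling:
    // a Guest-N tag is only trusted once that guest has spoken a long utterance.
    public class SpeakerTagFilter {
        private static final long CONFIRM_TICKS = 7L * 10_000_000L; // 7 s in 100-ns ticks
        private final Set<String> confirmedSpeakers = new HashSet<>();

        /** Returns the speaker tag to report for one transcribed segment. */
        public String filter(String speakerId, BigInteger durationTicks) {
            if (durationTicks.longValue() >= CONFIRM_TICKS) {
                // A long utterance confirms this guest tag for the rest of the session.
                confirmedSpeakers.add(speakerId);
            }
            return confirmedSpeakers.contains(speakerId) ? speakerId : "Unknown";
        }
    }

Inside the transcribed handler above, this would be called as filter.filter(e.getResult().getSpeakerId(), e.getResult().getDuration()), since getDuration() on a recognition result reports the utterance length in 100-nanosecond ticks.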