Azure - Speech To Text - detect speaker channel


I am using Azure Speech To Text continuous recognition to transcribe an audio file. My speakers are split into the left and right channels of a stereo WAV file. However, when I run the transcription I am not able to get the channel correctly. I tried to read it from PropertyId.SpeechServiceResponse_JsonResult, but that always returns 0. My expectation is 0 for the left channel and 1 for the right channel.

var speechConfig = SpeechConfig.FromSubscription(/*api key*/, /*region*/);
var audioConfig = AudioConfig.FromWavFileInput(filePath);
var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

Is there some hidden property or missing configuration to achieve this?

My attempt to read the channel from the JsonResult property:

var speechServiceResponseJsonResultJson =
    eventArgs.Result.Properties.GetProperty(PropertyId.SpeechServiceResponse_JsonResult);

var channel = 0;
if (!string.IsNullOrEmpty(speechServiceResponseJsonResultJson))
{
    var speechServiceResponseJsonResult =
        JsonConvert.DeserializeObject<JObject>(speechServiceResponseJsonResultJson);

    if (speechServiceResponseJsonResult.TryGetValue("Channel",
            StringComparison.InvariantCultureIgnoreCase, out var channelValue))
    {
        channel = channelValue.ToObject<int>();
    }
}

1 Answer

Answered by Rishabh Meshram (accepted answer):

It appears that the SpeechServiceResponse_JsonResult property does not carry the speaker channel information. The Azure Speech to Text service does not directly provide a way to differentiate between the left and right channels of a stereo audio file, and the documentation does not mention any property or configuration that would let you achieve this.

A possible workaround for transcribing a stereo audio file could be to split the stereo audio file into two separate mono audio files, transcribe each mono audio file separately using Azure Speech To Text, and then combine the transcriptions while keeping track of which channel the transcription came from.

This approach will allow you to know which channel the transcription is coming from, as you will be processing each channel separately.
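A minimal sketch of the channel-splitting step, assuming 16-bit PCM stereo input and the NAudio library (NAudio is not part of the Speech SDK; any WAV library or ffmpeg would work equally well):

```csharp
using NAudio.Wave;

// Split a 16-bit PCM stereo WAV into two mono WAV files, one per channel.
static void SplitStereo(string inputPath, string leftPath, string rightPath)
{
    using var reader = new WaveFileReader(inputPath);
    var monoFormat = new WaveFormat(reader.WaveFormat.SampleRate, 16, 1);
    using var left = new WaveFileWriter(leftPath, monoFormat);
    using var right = new WaveFileWriter(rightPath, monoFormat);

    // BlockAlign is 4 bytes for 16-bit stereo: one L sample + one R sample.
    var buffer = new byte[reader.WaveFormat.BlockAlign * 1024];
    int read;
    while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        // Samples are interleaved: L0 R0 L1 R1 ...
        for (int i = 0; i + 3 < read; i += 4)
        {
            left.Write(buffer, i, 2);      // left channel sample
            right.Write(buffer, i + 2, 2); // right channel sample
        }
    }
}
```

You can then run a separate SpeechRecognizer over left.wav and right.wav and tag each result with the channel it came from.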

Also, since you mentioned that you want to identify speaker IDs in the transcript, you can use conversation transcription with diarization, which helps distinguish between speakers and includes a speaker ID in the output.

With this approach, I was able to get the transcribed text along with speaker IDs.
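A minimal sketch of conversation transcription with diarization, assuming a recent Speech SDK version in which ConversationTranscriber accepts a SpeechConfig and AudioConfig directly (the speaker labels such as "Guest-1" are assigned by the service):

```csharp
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech.Transcription;

var speechConfig = SpeechConfig.FromSubscription(/*api key*/, /*region*/);
var audioConfig = AudioConfig.FromWavFileInput(filePath);
using var transcriber = new ConversationTranscriber(speechConfig, audioConfig);

var done = new TaskCompletionSource<bool>();

// Fired for each finalized utterance; the result carries a speaker ID.
transcriber.Transcribed += (s, e) =>
{
    if (e.Result.Reason == ResultReason.RecognizedSpeech)
    {
        Console.WriteLine($"Speaker {e.Result.SpeakerId}: {e.Result.Text}");
    }
};

// Stop waiting once the session ends (end of the audio file).
transcriber.SessionStopped += (s, e) => done.TrySetResult(true);

await transcriber.StartTranscribingAsync();
await done.Task;
await transcriber.StopTranscribingAsync();
```

Note that diarization groups utterances by who is speaking, not by audio channel, so it complements rather than replaces the channel-splitting workaround above.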