I am using Azure Speech to Text continuous recognition to transcribe an audio file. My speakers are split into the left and right channels of a stereo WAV file. However, when I run the transcription I am not able to get the channel correctly. I tried to read it from the PropertyId.SpeechServiceResponse_JsonResult property, but that always returns 0. My expectation is 0 for the left channel and 1 for the right channel.
var speechConfig = SpeechConfig.FromSubscription(/*api key*/, /*region*/);
var audioConfig = AudioConfig.FromWavFileInput(filePath);
var recognizer = new SpeechRecognizer(speechConfig, audioConfig);
Is there some hidden property or missing configuration to achieve this?
My attempt to read the channel from the JsonResult property:
    var speechServiceResponseJsonResultJson = eventArgs.Result.Properties.GetProperty(PropertyId.SpeechServiceResponse_JsonResult);
    var channel = 0;
    // Guard against an empty string as well as null before parsing.
    if (!string.IsNullOrEmpty(speechServiceResponseJsonResultJson))
    {
        // Parse the JSON string fetched above instead of reading the property a second time.
        var speechServiceResponseJsonResult = JsonConvert.DeserializeObject<JObject>(speechServiceResponseJsonResultJson);
        if (speechServiceResponseJsonResult.TryGetValue("Channel", StringComparison.InvariantCultureIgnoreCase, out var channelValue))
        {
            channel = channelValue.ToObject<int>();
        }
    }
It appears that the SpeechServiceResponse_JsonResult property does not provide the speaker channel information. The Azure Speech to Text service does not directly provide a way to differentiate between the left and right channels of a stereo audio file, and the documentation does not mention any property or configuration that would allow you to achieve this directly.
A possible workaround is to split the stereo audio file into two separate mono audio files, transcribe each mono file separately using Azure Speech to Text, and then combine the transcriptions while keeping track of which channel each one came from. Because each channel is processed separately, you always know which channel a transcription belongs to.
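As a rough illustration of the splitting step, here is a minimal sketch that de-interleaves raw stereo PCM data into a left and a right buffer. It assumes uncompressed PCM samples stored as alternating left/right frames, with the WAV header already stripped off. `StereoSplitter` and `Split` are hypothetical names for this example and are not part of the Speech SDK.

```csharp
using System;

public static class StereoSplitter
{
    // Splits interleaved stereo PCM (L R L R ...) into separate left and right buffers.
    // Assumption: 'interleaved' holds only sample data, with no WAV header.
    // bytesPerSample is 2 for 16-bit audio.
    public static (byte[] Left, byte[] Right) Split(byte[] interleaved, int bytesPerSample = 2)
    {
        if (interleaved == null) throw new ArgumentNullException(nameof(interleaved));
        if (bytesPerSample <= 0) throw new ArgumentOutOfRangeException(nameof(bytesPerSample));

        int frameSize = bytesPerSample * 2; // one sample per channel
        if (interleaved.Length % frameSize != 0)
            throw new ArgumentException("Data length is not a whole number of stereo frames.", nameof(interleaved));

        int frames = interleaved.Length / frameSize;
        var left = new byte[frames * bytesPerSample];
        var right = new byte[frames * bytesPerSample];

        for (int i = 0; i < frames; i++)
        {
            // The first sample of each frame is the left channel (channel 0), the second is the right (channel 1).
            Buffer.BlockCopy(interleaved, i * frameSize, left, i * bytesPerSample, bytesPerSample);
            Buffer.BlockCopy(interleaved, i * frameSize + bytesPerSample, right, i * bytesPerSample, bytesPerSample);
        }

        return (left, right);
    }
}
```

Each buffer could then be written out as a mono WAV file, or pushed into its own recognizer through a push audio input stream created with a mono PCM format. Either way you run one recognizer per channel and tag its results with 0 or 1 yourself.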
Also, since you mentioned that you want to identify speaker IDs in the transcript, you can use conversation transcription with diarization, which can help distinguish between speakers and provides output with a speaker ID.
With this sample code, I was able to get transcribed text with speaker IDs. Output: