Microsoft.CognitiveServices.Speech.SpeechRecognizer - getting time offsets of results in a file with continuous recognition


I'm testing out the new unified speech engine on Azure, and I'm working on a piece where I'm trying to transcribe a 10-minute audio file. I've created a recognizer with CreateSpeechRecognizerWithFileInput, and I've kicked off continuous recognition with StartContinuousRecognitionAsync. I created the recognizer with detailed results enabled.
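
For reference, the setup looks roughly like this (a sketch only; SpeechFactory.FromSubscription and the detailed-output overload are from the preview SDK I'm using, so exact names and parameters may differ between versions):

    // Rough outline of my setup; placeholder key/region/file names.
    var factory = SpeechFactory.FromSubscription("<subscription-key>", "<region>");

    // Overload that enables detailed results; the language and output-format
    // arguments are approximate and depend on the SDK version.
    var recognizer = factory.CreateSpeechRecognizerWithFileInput(
        "meeting-10min.wav", "en-US", OutputFormat.Detailed);

    recognizer.FinalResultsReceived += (s, ea) =>
    {
        // ea.Result is a SpeechRecognitionResult; no offset property is exposed here.
        Console.WriteLine(ea.Result.ToString());
    };

    await recognizer.StartContinuousRecognitionAsync();
    // ... wait until the whole file has been processed ...
    await recognizer.StopContinuousRecognitionAsync();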

In the FinalResultsReceived event, there doesn't seem to be a way to access the audio offset in the SpeechRecognitionResult. If I do this though:

// Workaround: scrape the offset out of the raw result string.
// Requires using System.Text.RegularExpressions;
string rawResult = ea.Result.ToString();   // can get access to the raw value this way
var r = new Regex(@"""Offset"":(\d+),");
// Use long: the offsets are large integers and can overflow Int32 later in the file.
long offset = Convert.ToInt64(r.Match(rawResult).Groups[1].Value);

Then I can extract the offset. The raw result looks something like this:

ResultId:4116b361141446a98f306fdc11c3a5bd Status:Recognized Recognized text:<OK, so what's your think it went well, let's look at number number is 104-828-1198.>. Json:{"Duration":129500000,"NBest":[{"Confidence":0.887861133,"Display":"OK, so what's your think it went well, let's look at number number is 104-828-1198.","ITN":"OK so what's your think it went well let's look at number number is 104-828-1198","Lexical":"OK so what's your think it went well let's look at number number is one zero four eight two eight one one nine eight","MaskedITN":"OK so what's your think it went well let's look at number number is 104-828-1198"}],"Offset":6900000,"RecognitionStatus":"Success"}
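
A slightly more robust variant of the same workaround is to deserialize the JSON tail of that string instead of regexing individual fields (sketch only; assumes Newtonsoft.Json and that the Json:{...} suffix keeps the shape shown above):

    // Parse the "Json:{...}" tail of the raw result instead of using a regex.
    // Requires using Newtonsoft.Json.Linq;
    string raw = ea.Result.ToString();
    int jsonStart = raw.IndexOf("Json:");
    if (jsonStart >= 0)
    {
        var payload = JObject.Parse(raw.Substring(jsonStart + "Json:".Length));
        long offset = (long)payload["Offset"];      // 6900000 in the sample above
        long duration = (long)payload["Duration"];  // 129500000 in the sample above
    }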

The challenge there is that the Offset is sometimes zero even for segments that are clearly not at the start of the file, so I get zeroes in the middle of a recognition stream.

I also tried submitting the same file through the batch transcription API, which gives me a different result entirely:

{
    "RecognitionStatus": "Success",
    "Offset": 531700000,
    "Duration": 91300000,
    "NBest": [{
        "Confidence": 0.87579143,
        "Lexical": "OK so what's your think it went well let's look at number number is one zero four eight two eight one",
        "ITN": "OK so what's your think it went well let's look at number number is 1048281",
        "MaskedITN": "OK so what's your think it went well let's look at number number is 1048281",
        "Display": "OK, so what's your think it went well, let's look at number number is 1048281."
    }]
},

So I have three questions on this:

  1. Is there a supported method to get the offset of a recognized section of a file in the recognizer API? The SpeechRecognitionResult doesn't expose this, nor does the Best() extension.
  2. Why is the offset coming back as 0 for a segment part way through the file?
  3. What are the units for the offsets in the bulk recognition and file recognition APIs, and why are they different? They don't appear to be ms or frames, at least from what I've found in Audacity. The result I posted was from roughly 59s into the file, which is roughly 800k samples.

1 Answer

Zhou

Chris,

Thanks for your feedback. To your questions:

  1. The offset as well as the duration have been added to the API. The upcoming release (very soon) will let you access both properties. Please stay tuned.
  2. This is probably due to a different recognition mode being used. We will also fix that in the next release.
  3. The time unit for both APIs is 100 ns (one tick), as in the conversion sketch below. Please also note that batch transcription uses a different model than online recognition, so the recognition results may differ slightly.
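
For example, since one tick is 100 ns, the values in the batch result above map directly onto .NET's TimeSpan, which uses the same unit (minimal sketch using the sample numbers; the SDK properties themselves are not exposed yet):

    // 100-ns ticks, taken from the batch transcription sample above.
    long offsetTicks = 531700000;    // "Offset"
    long durationTicks = 91300000;   // "Duration"

    // TimeSpan also counts in 100-ns ticks, so the conversion is direct.
    TimeSpan start = TimeSpan.FromTicks(offsetTicks);     // 00:00:53.17 into the file
    TimeSpan length = TimeSpan.FromTicks(durationTicks);  // 00:00:09.13 long
    Console.WriteLine($"starts at {start}, lasts {length}");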

Sorry for the inconvenience!

Thanks,