Problem
My project is a desktop application that records audio in real time, for which I intend to receive real-time recognition feedback from an API. With a microphone, a real-time implementation using Microsoft's new Speech-to-Text API is trivial; my scenario differs only in that my data is written to a MemoryStream object.
API Support
This article explains how to implement the API's Recognizer (link) with custom audio streams, which invariably requires implementing the abstract class PullAudioInputStreamCallback (link) in order to create the required AudioConfig object using the CreatePullStream method (link). In other words, to achieve what I require, a callback interface must be implemented.
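For reference, the wiring the documentation describes looks roughly like this (a minimal sketch; MyCallback stands in for whatever PullAudioInputStreamCallback implementation you supply, and the PCM parameters are illustrative):

var format = AudioStreamFormat.GetWaveFormatPCM(samplesPerSecond: 16000, bitsPerSample: 16, channels: 1);
// CreatePullStream wraps the callback in a PullAudioInputStream...
PullAudioInputStream pullStream = AudioInputStream.CreatePullStream(new MyCallback(), format);
// ...which is then used to build the AudioConfig for the recognizer.
AudioConfig audioConfig = AudioConfig.FromStreamInput(pullStream);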
Implementation attempt
Since my data is written to a MemoryStream (and the library I use will only record to files or Stream objects), the code below simply copies the buffer over to the implemented class (in a sloppy way, perhaps?), resolving the divergence in method signatures.
class AudioInputCallback : PullAudioInputStreamCallback
{
    private readonly MemoryStream memoryStream;

    public AudioInputCallback(MemoryStream stream)
    {
        this.memoryStream = stream;
    }

    // Called by the SDK whenever it wants more audio; 'size' is the
    // maximum number of bytes to fill into dataBuffer.
    public override int Read(byte[] dataBuffer, uint size)
    {
        // Read at most 'size' bytes (not dataBuffer.Length, which may be larger).
        return this.Read(dataBuffer, 0, (int)size);
    }

    private int Read(byte[] buffer, int offset, int count)
    {
        return memoryStream.Read(buffer, offset, count);
    }

    public override void Close()
    {
        memoryStream.Close();
        base.Close();
    }
}
The Recognizer implementation is as follows:
private SpeechRecognizer CreateMicrosoftSpeechRecognizer(MemoryStream memoryStream)
{
    var recognizerConfig = SpeechConfig.FromSubscription(SubscriptionKey, @"westus");
    recognizerConfig.SpeechRecognitionLanguage =
        _programInfo.CurrentSourceCulture.TwoLetterISOLanguageName;

    // Constants are used as constructor params
    var format = AudioStreamFormat.GetWaveFormatPCM(
        samplesPerSecond: SampleRate, bitsPerSample: BitsPerSample, channels: Channels);

    // Implementation of PullAudioInputStreamCallback
    var callback = new AudioInputCallback(memoryStream);
    AudioConfig audioConfig = AudioConfig.FromStreamInput(callback, format);

    // The actual recognizer is created with the required objects
    SpeechRecognizer recognizer = new SpeechRecognizer(recognizerConfig, audioConfig);

    // Event subscriptions. Most handlers are implemented for debugging purposes only.
    // A log window outputs the feedback from the event handlers.
    recognizer.Recognized += MsRecognizer_Recognized;
    recognizer.Recognizing += MsRecognizer_Recognizing;
    recognizer.Canceled += MsRecognizer_Canceled;
    recognizer.SpeechStartDetected += MsRecognizer_SpeechStartDetected;
    recognizer.SpeechEndDetected += MsRecognizer_SpeechEndDetected;
    recognizer.SessionStopped += MsRecognizer_SessionStopped;
    recognizer.SessionStarted += MsRecognizer_SessionStarted;

    return recognizer;
}
How the data is made available to the recognizer (using CSCore):
MemoryStream memoryStream = new MemoryStream(_finalSource.WaveFormat.BytesPerSecond / 2);
byte[] buffer = new byte[_finalSource.WaveFormat.BytesPerSecond / 2];

_soundInSource.DataAvailable += (s, e) =>
{
    int read;
    _programInfo.IsDataAvailable = true;
    // Writes to the MemoryStream as the event fires
    while ((read = _finalSource.Read(buffer, 0, buffer.Length)) > 0)
        memoryStream.Write(buffer, 0, read);
};

// Creates the MS recognizer from the MemoryStream
_msRecognizer = CreateMicrosoftSpeechRecognizer(memoryStream);

// Initializes the loopback capture instance
_soundIn.Start();

await Task.Delay(1000);

// Starts recognition
await _msRecognizer.StartContinuousRecognitionAsync();
Outcome
When the application is run, I don't get any exceptions, nor any response from the API other than SessionStarted and SessionStopped, as depicted below in the log window of my application.
I would welcome suggestions for different approaches to my implementation, as I suspect there is a timing problem in tying the recording's DataAvailable event to the actual sending of data to the API, which is making the session get discarded prematurely. With no detailed feedback on why my requests are unsuccessful, I can only guess at the reason.
Answer
The Read() callback of PullAudioInputStream should block if there is no data immediately available, and Read() returns 0 only when the stream reaches its end. The SDK will then close the stream after Read() returns 0 (find an API reference doc here). However, the behavior of Read() of the C# MemoryStream is different: it returns 0 if there is no data available in the buffer. This is why you only see SessionStart and SessionStop events, but no recognition events.
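To illustrate the difference (my own snippet, not from the question), a MemoryStream with no unread data returns 0 immediately instead of blocking:

var ms = new MemoryStream();
var buffer = new byte[1024];
// Nothing has been written (or everything written has already been read),
// so Read() returns 0 right away rather than waiting for data.
int read = ms.Read(buffer, 0, buffer.Length); // read == 0
// The SDK treats that 0 as end-of-stream and closes the session.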
In order to fix that, you need to add some kind of synchronization between PullAudioInputStream::Read() and MemoryStream::Write(), in order to make sure that PullAudioInputStream::Read() will wait until MemoryStream::Write() has written some data into the buffer.
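As a minimal sketch of one way to do that synchronization (my own illustration, not SDK code): a semaphore makes Read() block until Write() has produced data.

// Requires System.IO, System.Threading and Microsoft.CognitiveServices.Speech.Audio.
class BlockingAudioCallback : PullAudioInputStreamCallback
{
    private readonly MemoryStream buffer = new MemoryStream();
    private readonly SemaphoreSlim dataAvailable = new SemaphoreSlim(0);
    private readonly object sync = new object();
    private long readPosition;
    private volatile bool closed;

    // Called by the capture side, e.g. from the DataAvailable handler.
    public void Write(byte[] data, int offset, int count)
    {
        lock (sync)
        {
            buffer.Seek(0, SeekOrigin.End);
            buffer.Write(data, offset, count);
        }
        dataAvailable.Release();
    }

    public override int Read(byte[] dataBuffer, uint size)
    {
        while (!closed)
        {
            lock (sync)
            {
                if (buffer.Length > readPosition)
                {
                    buffer.Seek(readPosition, SeekOrigin.Begin);
                    int read = buffer.Read(dataBuffer, 0, (int)size);
                    readPosition = buffer.Position;
                    return read;
                }
            }
            // No unread data yet: block instead of returning 0.
            dataAvailable.Wait();
        }
        return 0; // 0 signals end-of-stream to the SDK.
    }

    public override void Close()
    {
        closed = true;
        dataAvailable.Release();
        base.Close();
    }
}

(The internal buffer here grows without bound; a production version would use a ring buffer or a queue instead.)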
Alternatively, I would recommend using PushAudioInputStream, which allows you to write your data directly into the stream. For your case, in the _soundInSource.DataAvailable event, instead of writing data into the MemoryStream, you can write it directly into the PushAudioInputStream. You can find samples for PushAudioInputStream here.
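A rough sketch of that approach, reusing the names from your code (the rest is illustrative, not a drop-in replacement):

var format = AudioStreamFormat.GetWaveFormatPCM(
    samplesPerSecond: SampleRate, bitsPerSample: BitsPerSample, channels: Channels);
// The push stream replaces both the MemoryStream and the callback class.
PushAudioInputStream pushStream = AudioInputStream.CreatePushStream(format);
AudioConfig audioConfig = AudioConfig.FromStreamInput(pushStream);
SpeechRecognizer recognizer = new SpeechRecognizer(recognizerConfig, audioConfig);

_soundInSource.DataAvailable += (s, e) =>
{
    int read;
    while ((read = _finalSource.Read(buffer, 0, buffer.Length)) > 0)
        pushStream.Write(buffer, read); // write captured bytes straight to the SDK
};

_soundIn.Start();
await recognizer.StartContinuousRecognitionAsync();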
We will update the documentation in order to provide best practices on how to use the pull and push AudioInputStream. Sorry for the inconvenience. Thank you!