How to send real-time Azure Text-to-Speech audio stream to Ozeki VoIP SIP SDK?


I am working on a project where I need to use the Azure Text-to-Speech service to generate speech from text, and then stream this speech audio in real time over a VoIP call using the Ozeki VoIP SIP SDK.

I am able to generate the speech audio from Azure and receive it as a byte array, but I am struggling to send this audio data to Ozeki in a way that it can be streamed over the VoIP call. I need to convert this byte array into a format that Ozeki can use, and then stream this audio data in real time.

I tried to convert the byte array from Azure TTS into a MemoryStream, and then attempted to convert this MemoryStream into a WaveStream using the NAudio library, with the intention of playing this WaveStream during the phone call.

I expected Ozeki to be able to play this WaveStream in real-time during the call. However, I am not sure how to correctly connect the WaveStream to the call, and I am not sure if this is the correct approach to achieve real-time streaming of the audio.

Here is the code I have tried so far:

using System;
using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech;
using System.IO;
using System.Threading.Tasks;
using NAudio.Wave;

namespace Adion.Media
{
    public class TextToSpeech
    {
       
        public async Task Speak(string text)
        {
            // create speech config
            var config = SpeechConfig.FromSubscription(az_key, az_reg);

            // create ssml
            var ssml = $@"<speak version='1.0' xml:lang='fr-FR' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:emo='http://www.w3.org/2009/10/emotionml' xmlns:mstts='http://www.w3.org/2001/mstts'><voice name='{az_voice}'><s /><mstts:express-as style='cheerful'>{text}</mstts:express-as><s /></voice></speak>";

            // Creates an audio out stream.
            using (var stream = AudioOutputStream.CreatePullStream())
            {
                // Creates a speech synthesizer using audio stream output.
                using (var streamConfig = AudioConfig.FromStreamOutput(stream))
                using (var synthesizer = new SpeechSynthesizer(config, streamConfig))
                {
                    while (true)
                    {
                        // Synthesizes the text into the pull audio output stream; the loop ends once text is empty.
                        if (string.IsNullOrEmpty(text))
                        {
                            break;
                        }

                        // Use the SSML built above so the voice and style are applied.
                        using (var result = await synthesizer.SpeakSsmlAsync(ssml))
                        {
                            if (result.Reason == ResultReason.SynthesizingAudioCompleted)
                            {
                                Console.WriteLine($"Speech synthesized for text [{text}], and the audio was written to output stream.");
                                text = null;
                            }
                            else if (result.Reason == ResultReason.Canceled)
                            {
                                var cancellation = SpeechSynthesisCancellationDetails.FromResult(result);
                                Console.WriteLine($"CANCELED: Reason={cancellation.Reason}");

                                if (cancellation.Reason == CancellationReason.Error)
                                {
                                    Console.WriteLine($"CANCELED: ErrorCode={cancellation.ErrorCode}");
                                    Console.WriteLine($"CANCELED: ErrorDetails=[{cancellation.ErrorDetails}]");
                                    Console.WriteLine($"CANCELED: Did you update the subscription info?");
                                }

                                // Clear the text so the loop ends instead of retrying forever.
                                text = null;
                            }
                        }
                    }
                }

                // Reads(pulls) data from the stream
                byte[] buffer = new byte[32000];
                uint filledSize = 0;
                uint totalSize = 0;
                MemoryStream memoryStream = new MemoryStream();
                while ((filledSize = stream.Read(buffer)) > 0)
                {
                    Console.WriteLine($"{filledSize} bytes received.");
                    totalSize += filledSize;
                    memoryStream.Write(buffer, 0, (int)filledSize);
                }

                Console.WriteLine($"Totally {totalSize} bytes received.");

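                // Rewind, then wrap the raw PCM in a WaveStream.
                // Assumes the SDK's default output of 16 kHz, 16-bit, mono PCM;
                // the parameterless WaveFormat() would declare 44.1 kHz stereo.
                memoryStream.Position = 0;
                WaveStream waveStream = new RawSourceWaveStream(memoryStream, new WaveFormat(16000, 16, 1));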

            }
        }

    }
}

And the call handler:

using Ozeki.VoIP;
using Ozeki.Media;
using Adion.Tools;
using Adion.Media;
using TextToSpeech = Adion.Media.TextToSpeech;

namespace Adion.SIP
{
    internal class call_handler
    {

        static MediaConnector connector = new MediaConnector();
        static PhoneCallAudioSender mediaSender = new PhoneCallAudioSender();

        public static void incoming_call(object sender, VoIPEventArgs<IPhoneCall> e)
        {
            var call = e.Item;
            Log.info("Incoming call from: " + call.DialInfo.CallerID);

            call.CallStateChanged += on_call_state_changed;

            call.Answer();
        }

        public static async void on_call_state_changed(object sender, CallStateChangedArgs e) 
        {

            var call = sender as IPhoneCall;
            
            switch (e.State)
            {
                case CallState.Answered:
                    Log.info("Call is answered");
                    break;
                case CallState.Completed:
                    Log.info("Call is completed");
                    break;
                case CallState.InCall:
                    Log.info("Call is in progress");
                    
                    var textToSpeech = new TextToSpeech();
                    
                    mediaSender.AttachToCall(call);
                    connector.Connect(textToSpeech, mediaSender);

                    textToSpeech.AddAndStartText("I can't understand why this texte can be hear in the voip cal !!!");
                    
                    break;
            }
        }
    }
}

I have looked at the Ozeki documentation, but I could not find any examples or guidance on how to do this. I also looked at the Azure TTS documentation, but it does not provide any information on how to stream the audio data to another service.

Does anyone have any suggestions or examples on how to accomplish this? Any help would be greatly appreciated.


There is 1 answer

Peter  Adriaenssens On

Just thinking outside the box: isn't it possible to save your Azure TTS output to an MP3 file and send that file to the other service?
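
If going through a file adds too much latency, another option is to keep the audio as raw PCM and cut it into packet-sized frames before sending. The following is only a minimal sketch under stated assumptions: the Azure output has been requested as 8 kHz, 16-bit, mono raw PCM (the Speech SDK's Raw8Khz16BitMonoPcm output format) and the call uses 20 ms packets. PcmFraming and SplitIntoFrames are hypothetical names, and handing the frames to Ozeki is left out because it depends on the SDK's media handler API.

```csharp
using System;
using System.Collections.Generic;

public static class PcmFraming
{
    // Splits 16-bit mono PCM into fixed-duration frames.
    // Assumption: 8000 Hz and 20 ms give 320 bytes per frame.
    public static IEnumerable<byte[]> SplitIntoFrames(
        byte[] pcm, int sampleRate = 8000, int frameMs = 20)
    {
        // 2 bytes per sample, one channel.
        int frameBytes = sampleRate * 2 * frameMs / 1000;

        for (int offset = 0; offset < pcm.Length; offset += frameBytes)
        {
            // A new array is zero-filled, so a short final frame
            // is padded with silence.
            var frame = new byte[frameBytes];
            int count = Math.Min(frameBytes, pcm.Length - offset);
            Buffer.BlockCopy(pcm, offset, frame, 0, count);
            yield return frame;
        }
    }
}
```

Sending one frame every 20 ms, for example from a timer, would pace playback at real-time speed instead of pushing the whole buffer at once.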