SSML tried:
<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-IN">
  <mstts:backgroundaudio src="https://cdn.yellowmessenger.com/3Ix9bm4Blriv1620110811289.wav" volume="0.3" fadein="3000" fadeout="4000"/>
  <voice name="Microsoft Server Speech Text to Speech Voice (en-IN, NeerjaNeural)">
    <prosody rate="-10.00%" volume="+30.00%" contour="(27%, -13%) (49%, +7%) (73%, -11%)">Hi, Good Morning. Welcome! </prosody>
    <prosody rate="-10.00%" volume="+30.00%" contour="(13%, -7%) (26%, +12%) (60%, -14%) (76%, +15%) (90%, -6%)">I can help you answer your queries regarding place an order, locate your order and do much more.</prosody>
    <prosody rate="+10.00%" volume="+30.00%" contour="(18%, -6%) (40%, +12%) (64%, -10%)"> Please note that this call might be recorded for internal quality and training purposes.</prosody>
    <prosody volume="+30.00%" contour="(41%, +6%) (74%, -26%)">Let us get started</prosody>
  </voice>
</speak>
Here is the documentation on improving synthesis with Speech Synthesis Markup Language (SSML):
https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/speech-synthesis-markup?tabs=csharp