SSML - Is it possible to remove automatic break pauses?


I'm trying to remove the automatic breaks added by the synthesis processor, to create speech files without any "linguistic pauses".

I'm using Microsoft's speech synthesis engine with the SpeechSynthesizer class in C#.

This is the output I get with "This is an example why do automatic breaks occur?" wrapped in <speak> tags with SpeechSynthesizer:

https://clyp.it/4nofhh3n

This is the output I want (achieved by using Oddcast's TTS Demo):

https://clyp.it/m55wt14u

I've read through w3.org's SSML documentation several times. Section 3.2.3 (break element) notes the following:

If the element is not present between tokens, the synthesis processor is expected to automatically determine a break based on the linguistic context. In practice, the break element is most often used to override the typical automatic behavior of a synthesis processor.

This is how my voice is currently behaving. I want to override or turn off this functionality so that the speech is completely uninterrupted. As the passage above suggests, I have tried overriding the automatic break by putting a <break> element with the attributes strength="none" and time="0ms" between the words where it occurs. I have also tried various other things, such as wrapping the whole text string in <s> tags, to no avail.
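For reference, a rough sketch of the kind of attempt described above. The exact position of the <break> element is not stated, so placing it between "example" and "why" is an assumption, as are the `version`, `xmlns` and `xml:lang` attributes on <speak> and the output file name.

```csharp
using System.Speech.Synthesis; // Windows-only System.Speech assembly

class BreakOverrideAttempt
{
    static void Main()
    {
        // Assumption: the automatic pause falls between "example" and "why".
        const string ssml =
            "<speak version=\"1.0\" xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-US\">" +
            "This is an example" +
            "<break strength=\"none\" time=\"0ms\"/>" +
            " why do automatic breaks occur?" +
            "</speak>";

        using (var synth = new SpeechSynthesizer())
        {
            synth.SetOutputToWaveFile("attempt.wav"); // hypothetical file name
            synth.SpeakSsml(ssml);                    // the pause is still audible in the result
        }
    }
}
```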

I also can't just remove the breaks in post-processing, because the voice speaks the surrounding words with a different tone when the automatic breaks are added.

I have read through several other SSML references. While they are often worded a bit differently from the w3 docs, none of them explain concretely how to override the automatic breaks, which is my issue.


There is 1 answer

3
Luke On

In my experiments with SpeechSynthesizer, a break of 50ms placed at the end is respected; anything shorter is ignored. However, it always treats the content of a <speak> element as its own clause, so it speaks that content as a complete sentence/clause rather than carrying the prosody through as in your 2nd example. You need to send all of your text in a single <speak> element (and a single voice) for it to be treated as one linguistic utterance.
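A minimal sketch of that suggestion, assuming the System.Speech assembly on Windows: the whole text goes into one <speak> element with a trailing 50ms break. The `version`, `xmlns` and `xml:lang` attributes and the output file name are assumptions, and whether this fully removes the pause will depend on the installed voice.

```csharp
using System.Speech.Synthesis; // Windows-only System.Speech assembly

class SingleUtterance
{
    static void Main()
    {
        // All text in ONE <speak> element so it is treated as a single utterance.
        // The trailing 50ms break is the shortest one observed to be respected.
        const string ssml =
            "<speak version=\"1.0\" xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-US\">" +
            "This is an example why do automatic breaks occur?" +
            "<break time=\"50ms\"/>" +
            "</speak>";

        using (var synth = new SpeechSynthesizer())
        {
            synth.SetOutputToWaveFile("output.wav"); // hypothetical file name
            synth.SpeakSsml(ssml);                   // synchronous; returns once the file is written
        }
    }
}
```

If the text is currently synthesized piece by piece, concatenating the pieces into one SSML string before a single `SpeakSsml` call is what lets the prosody carry across them.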