I am following the Google Cloud Text-to-Speech API Python tutorial. I would like to know if there is a way to return the phonemes and their durations, which are an intermediate step in generating the synthesized speech. Is that possible? If so, could you please point me to the documentation and, ideally, some sample code that does this? I searched and could not find anything that already answers my question.
Thanks! gma
Thanks for your reply @Akshansha. I know how to create an audio file of synthetic human speech. My question is more about how to get metadata such as phonemes or visemes. For example, with the Amazon Polly API you can get this kind of data when using text-to-speech.
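To make the kind of metadata I mean concrete, here is a minimal sketch of how Polly-style speech marks could be turned into per-viseme durations. Polly returns speech marks as line-delimited JSON with `time` (in milliseconds), `type`, and `value` fields. The sample values below and the helper name `viseme_durations` are made up for illustration, not real API output.

```python
import json

# Hypothetical sample in the line-delimited JSON shape used for speech marks.
# The values are invented for illustration; "time" is in milliseconds.
SAMPLE_MARKS = """\
{"time": 0, "type": "viseme", "value": "p"}
{"time": 125, "type": "viseme", "value": "E"}
{"time": 250, "type": "viseme", "value": "t"}
{"time": 375, "type": "viseme", "value": "sil"}
"""


def viseme_durations(marks_text):
    """Return (viseme, duration_ms) pairs from line-delimited speech marks.

    A viseme's duration is taken as the gap to the next viseme's start time,
    so the final viseme has no duration and is omitted.
    """
    marks = [json.loads(line) for line in marks_text.splitlines() if line.strip()]
    visemes = [m for m in marks if m["type"] == "viseme"]
    return [
        (current["value"], following["time"] - current["time"])
        for current, following in zip(visemes, visemes[1:])
    ]


if __name__ == "__main__":
    for value, duration_ms in viseme_durations(SAMPLE_MARKS):
        print(value, duration_ms)
```

Marks only carry start times, which is why the duration here is derived from consecutive entries.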
I was asking whether we can get a similar result with the Google Cloud Text-to-Speech API.
Thanks, gma