We're running into an issue trying to use Google Cloud Speech (GCS) for audio-indexing purposes. We've tried two different setups (a sketch of the request we send follows the list):
1. A single audio file containing multiple speakers (high SNR, only speech plus silence) is sent to GCS.
2. The audio file is split by speaker, the per-speaker segments are concatenated, and one audio file per speaker is sent to GCS.
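For reference, this is roughly how we send each file (a minimal sketch using the Python client library; the bucket URI, encoding, and sample rate are placeholders for our actual values):

```python
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    enable_word_time_offsets=True,  # so results can be mapped back to the timeline
)
# Placeholder URI; in setup 2 there is one such file per speaker.
audio = speech.RecognitionAudio(uri="gs://our-bucket/speaker_1.wav")

# Long-running recognition, since the files are longer than one minute.
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=600)

for result in response.results:
    best = result.alternatives[0]  # the top-ranked hypothesis
    print(best.confidence, best.transcript)
```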
The problem is that a large portion (~22%) of the speech gets no output hypotheses at all, regardless of setup (1 or 2 above).
The documentation states that "If the Speech API determines that an alternative has a sufficient Confidence Value, then that alternative is included in the response." Is this also true for the best hypothesis, i.e. is it only included if its confidence is high enough, and is that why parts of the speech are missing? To see which time ranges come back empty, we inspect the word time offsets of each returned result, as in the sketch below.
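This is how we check coverage (a sketch; it assumes `enable_word_time_offsets=True` was set in the config above and a recent client version where offsets are `timedelta` objects):

```python
# Print the time span and confidence of each returned result so the
# uncovered (~22%) regions of the timeline stand out.
for result in response.results:
    best = result.alternatives[0]
    if not best.words:
        continue
    start = best.words[0].start_time.total_seconds()
    end = best.words[-1].end_time.total_seconds()
    print(f"{start:8.2f}s - {end:8.2f}s  confidence={best.confidence:.2f}")
```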
And the actual question, as per the title: is it possible to get a best guess for the entire input audio from Google Cloud Speech?