Trim syllable audio recording to only the vowel part


For a Chinese learning app, we let users record a syllable and we use speech recognition to assess if the pronunciation was correct or not.

Every Chinese syllable can be pronounced with different tones (pitch differentials) that have different meanings. We found that neither Google Translate nor the Swift Speech framework is accurate enough to determine whether the pronounced tone was correct. Therefore, we use Beethoven to detect the pitch from the audio and assess the tone outside of the speech recognition API.

The challenge is that in Chinese the tone is only pronounced in the vowels of a syllable. So Beethoven works well if the user only pronounces a vowel, e.g. "a". But in a syllable such as "san", the results are clouded by the consonants "s" and "n".

So I'm looking for a way to trim the syllable recording to only the vowel so we can use Beethoven on the vowel only and detect the Chinese tone correctly. I'm also happy to learn if anyone has a better idea on how to tackle this challenge.

Best, Paul

1 answer

Answered by Phil Freihofner

One fact about vowels and consonants that might be helpful: vowels are generally thought of as having frequency content that tends to be harmonic and concentrated in formant areas (the first two formants being the most important, with the second lying below 3 kHz), while many consonants (fricatives, sibilants) have noisy energy at or above 4 kHz. Here is a good diagram from a lecture on the acoustics of fricatives where this can be seen.

[Image: ASA sonogram]
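The "harmonic versus noisy" distinction can be measured frame by frame. The following is a minimal sketch in Python (chosen for quick prototyping, not the app's Swift code) that scores a frame by the peak of its normalized autocorrelation over plausible pitch lags. The function name `periodicity` and the 75 to 500 Hz pitch range are assumptions made for illustration, not anything provided by Beethoven.

```python
import math

def periodicity(frame, sample_rate, f0_min=75.0, f0_max=500.0):
    """Peak normalized autocorrelation over lags in the assumed pitch range.

    Values near 1 suggest a harmonic (vowel-like) frame; values near 0
    suggest a noisy frame such as a fricative.
    """
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]              # remove any DC offset
    if not any(x):
        return 0.0                             # silent or constant frame
    lag_min = max(1, int(sample_rate / f0_max))
    lag_max = min(int(sample_rate / f0_min), n - 1)
    best = 0.0
    for lag in range(lag_min, lag_max + 1):
        head, tail = x[:n - lag], x[lag:]
        num = sum(a * b for a, b in zip(head, tail))
        den = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
        if den > 0.0:
            best = max(best, num / den)
    return best
```

Frames scoring above a threshold that you would tune on real recordings could be treated as vowel-like. Note that nasals such as "n" are voiced and therefore also periodic, so this measure alone is unlikely to separate them from the vowel.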

You might need a more sophisticated fast Fourier transform (FFT) analysis tool than Beethoven to detect when the frequency content of sibilants or fricatives is present. I've not used Beethoven and do not know what its capabilities are.
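As a rough illustration of that kind of analysis, here is a sketch in Python (a prototype, not Swift and not part of Beethoven) that computes, per frame, the fraction of spectral energy at or above 4 kHz and returns the longest run of frames where that fraction is low. The names `high_band_ratio` and `vowel_span`, the 256-sample frame, and the 0.3 ratio and 0.02 RMS thresholds are all assumptions to be tuned on real recordings. It uses a naive DFT from the standard library to stay self-contained; a real app would use an FFT routine.

```python
import cmath
import math

def high_band_ratio(frame, sample_rate, cutoff_hz=4000.0):
    """Fraction of the frame's spectral energy at or above cutoff_hz."""
    n = len(frame)
    # Hann window to limit spectral leakage between bins.
    win = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
           for i, s in enumerate(frame)]
    total = high = 0.0
    for k in range(1, n // 2 + 1):             # skip the DC bin
        coeff = sum(v * cmath.exp(-2j * math.pi * k * i / n)
                    for i, v in enumerate(win))
        power = abs(coeff) ** 2
        total += power
        if k * sample_rate / n >= cutoff_hz:
            high += power
    return high / total if total > 0.0 else 0.0

def vowel_span(samples, sample_rate, frame_len=256,
               max_ratio=0.3, min_rms=0.02):
    """(start, end) sample indices of the longest run of vowel-like frames."""
    n_frames = len(samples) // frame_len
    best_start, best_len = 0, 0
    run_start = None
    for f in range(n_frames + 1):              # extra pass closes a final run
        vowel_like = False
        if f < n_frames:
            frame = samples[f * frame_len:(f + 1) * frame_len]
            rms = math.sqrt(sum(s * s for s in frame) / frame_len)
            vowel_like = (rms >= min_rms and
                          high_band_ratio(frame, sample_rate) <= max_ratio)
        if vowel_like:
            if run_start is None:
                run_start = f
        elif run_start is not None:
            if f - run_start > best_len:
                best_start, best_len = run_start, f - run_start
            run_start = None
    return best_start * frame_len, (best_start + best_len) * frame_len
```

With `start, end = vowel_span(samples, sample_rate)`, the slice `samples[start:end]` could then be handed to the pitch detector. This is likely to cut off an initial "s", but a final "n" has mostly low-frequency energy and would probably stay inside the span.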

I don't know much about the nasals, though. A different chapter of the same lecture series ("Plosives and Nasals") gives this information:

The nasalisation of vowels is cued by the presence of a low-frequency resonance and an increase in formant damping.

It seems to me that it would be challenging to distinguish nasals from vowels by their spectrum.