Ideally, what I am looking for is a way to get a vector of probabilities, one per phone, that a particular segment of an audio file is that phone. Something like:
input:
- wavfile
- start position (e.g. @1.4 sec)
- duration (e.g. 500 ms)
output:
- SIL 2.324*10^-3
- AA 1.514*10^-4
- AE 1.482*10^-2
- ...
- ZH 5.03*10^-5
You can obtain the scores by running HVite in forced alignment mode. I am afraid you have to run this for every phoneme you have:

The output file acoustic_score_AA.mlf will contain the result. The contents of the words vocabulary file should be like:

and the phones file has to contain the list of the phonemes (HMM models), as far as I remember.

The trick here is the content of the input .mlf file. For instance, AA.mlf should be like:

This will force HVite to apply the AA model for the whole utterance. Chunking of the audio file has to be performed in advance.
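For what it's worth, here is a rough Python sketch of how the per-phoneme runs could be scripted. It is not the exact setup from above: the helper names (extract_chunk, write_forced_mlf, hvite_command, normalise) and the file names config, hmmdefs and chunks.scp are my own placeholders, and the HVite options used (-a, -m, -C, -H, -I, -i, -S) are the usual forced alignment ones, so check them against your HTK version. It also assumes that the words file maps each phone name to itself (e.g. a line "AA AA") and that your config tells HTK how to read the audio chunks. The scores HVite reports are log likelihoods, so the last helper turns them into a probability vector under the additional assumption of equal priors for all phones.

```python
import math
import wave
from pathlib import Path


def extract_chunk(src, dst, start_s, duration_s):
    """Copy the span [start_s, start_s + duration_s) of a PCM WAV file to dst."""
    with wave.open(str(src), "rb") as reader:
        params = reader.getparams()
        rate = reader.getframerate()
        reader.setpos(round(start_s * rate))
        frames = reader.readframes(round(duration_s * rate))
    with wave.open(str(dst), "wb") as writer:
        writer.setparams(params)  # the frame count in the header is fixed on close
        writer.writeframes(frames)


def write_forced_mlf(path, phone, chunk_stems):
    """Write an MLF that labels every chunk with the single label `phone`."""
    lines = ["#!MLF!#"]
    for stem in chunk_stems:
        lines += [f'"*/{stem}.lab"', phone, "."]
    Path(path).write_text("\n".join(lines) + "\n")


def hvite_command(phone, scp="chunks.scp", config="config",
                  hmmdefs="hmmdefs", words="words", phones="phones"):
    """Build (but do not run) one forced alignment call for `phone`.

    All file names here are placeholders for your own setup.
    """
    return ["HVite", "-a", "-m", "-C", config, "-H", hmmdefs,
            "-I", f"{phone}.mlf", "-i", f"acoustic_score_{phone}.mlf",
            "-S", scp, words, phones]


def normalise(log_likelihoods):
    """Turn per-phone log likelihoods into probabilities (equal priors assumed)."""
    peak = max(log_likelihoods.values())  # subtract the maximum to avoid underflow
    weights = {p: math.exp(s - peak) for p, s in log_likelihoods.items()}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}
```

You would call extract_chunk once per segment, then for each phone write its MLF with write_forced_mlf and pass the list from hvite_command to subprocess.run. Reading the score back out of acoustic_score_AA.mlf is left out here, since the exact columns depend on the output options you use.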