How does forced alignment happen in Kaldi?


I am going through the popular 'yesno' tutorial to get comfortable with the Kaldi toolkit. For acoustic model training, we need the start and end times of each utterance, the speaker ID of each utterance, and a list of all words and phonemes present in the transcript. However, in the Kaldi/egs/yesno/s5 directory I do not find any .lab files that contain the required start and end times of the utterances.

The sound files in that directory are named like 0_0_1_0_1_1.wav, meaning that the utterances in that .wav file are no no yes no yes yes.
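To make the naming convention concrete, here is a minimal sketch of how a word-level transcript (with no timing information) could be derived from such a file name. The function name and the digit-to-word mapping are my own illustration based on the convention described above, not code taken from the Kaldi recipe.

```python
from pathlib import Path

# Assumed mapping, inferred from the file-naming convention: 0 = "no", 1 = "yes".
DIGIT_TO_WORD = {"0": "no", "1": "yes"}


def transcript_from_filename(wav_path: str) -> list[str]:
    """Return the word sequence encoded in a yesno-style file name.

    Hypothetical helper: it only recovers *which* words were spoken and in
    what order, not when each word starts or ends.
    """
    stem = Path(wav_path).stem          # "0_0_1_0_1_1.wav" -> "0_0_1_0_1_1"
    return [DIGIT_TO_WORD[d] for d in stem.split("_")]


print(transcript_from_filename("0_0_1_0_1_1.wav"))
# -> ['no', 'no', 'yes', 'no', 'yes', 'yes']
```

So the file name gives the transcript, but the start and end time of each word still has to come from somewhere else.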

Without any .lab file, how does alignment happen?

CGPT: "During the training phase, the alignment between acoustic features and their corresponding labels is achieved through forced alignment."

How does forced alignment work in Kaldi?

Dan Povey: "Alignment basically means, for each word and phone, to find out which specific time, which specific frame, it corresponds to. So the way it's done in Kaldi is very similar to how you would decode data, except instead of a graph that contains a whole language model you have a graph that's just one sequence. So you kind of do beam search decoding, but just for that one sequence."
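As I read the quote, the idea can be sketched as a Viterbi search over a graph that is a single left-to-right chain. The code below is a simplified illustration under my own assumptions, not Kaldi's implementation: it uses one state per phone, no transition probabilities, and no beam pruning, and the names `force_align`, `log_likes`, and `state_seq` are hypothetical. Kaldi itself works on HMM states inside a compiled graph for each utterance.

```python
import math


def force_align(log_likes, state_seq):
    """Align frames to a known, fixed sequence of phone ids.

    log_likes[t][p] : acoustic log-likelihood of frame t under phone id p
    state_seq       : phone ids in the order given by the transcript
    Returns, for each frame, the position in state_seq it is assigned to.
    Simplification: at each frame we may only stay in the current position
    or advance to the next one (a single linear sequence, no alternatives).
    """
    T, N = len(log_likes), len(state_seq)
    if T < N:
        raise ValueError("need at least one frame per phone")
    NEG = -math.inf
    score = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]

    score[0][0] = log_likes[0][state_seq[0]]  # must start in the first phone
    for t in range(1, T):
        for j in range(N):
            stay = score[t - 1][j]
            move = score[t - 1][j - 1] if j > 0 else NEG
            if move > stay:
                best, back[t][j] = move, j - 1
            else:
                best, back[t][j] = stay, j
            if best > NEG:
                score[t][j] = best + log_likes[t][state_seq[j]]

    # Backtrace from the last phone at the last frame.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy example: 5 frames, 2 phones; frames 0-1 look like phone 0, frames 2-4 like phone 1.
toy = [[0, -5], [0, -5], [-5, 0], [-5, 0], [-5, 0]]
print(force_align(toy, [0, 1]))
# -> [0, 0, 1, 1, 1]
```

The output says phone 0 covers frames 0 to 1 and phone 1 covers frames 2 to 4, which is exactly the kind of start and end information a .lab file would otherwise supply. The search never considers any word sequence other than the one in the transcript; it only decides where the boundaries fall.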

I understand what beam-search decoding is. If I assume a beam width of 3, we start with the three most likely phones, feed them back to the lexicon transducer L.FST, get the next three most likely phones for each of the 3 initial phones, and so on until we reach EoW (End of Word). Then, out of all the paths, we choose the three most likely sequences based on the joint probability of all the phones in each sequence.

But still, I do not see how yes and no utterances are aligned to properly train the model.
