I'm working on a Kaldi project based on the existing example that uses the Tedlium dataset. Every step works well until the clean-up stage, where I get a length mismatch. After examining all the scripts, I found that the issue is in lattice_oracle_align.sh.
Reference: https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/steps/cleanup/lattice_oracle_align.sh
I believe the issue is line 142.
awk '{if ($2 == "#csid") print $1" "($4+$5+$6)}' $dir/analysis/per_utt_details.txt > $dir/edits.txt
The line above should read per_utt_details.txt line by line and, every time it reads a #csid line, write one line to edits.txt. The text in per_utt_details.txt looks like this:
ref
hyp
op
#csid 0 0 0 0
...repeat the above 4 lines.
There are 1073046 lines in per_utt_details.txt. I expect 268262 lines in edits.txt. However, only 48746 lines exist in edits.txt.
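One way to narrow this down is to count how often #csid appears in field 1 versus field 2, since the awk command only matches on field 2. The sketch below runs on a small made-up sample, not on the real file; the utterance ID utt1 and the mixed layout are assumptions for illustration only.

```shell
# Hypothetical stand-in for $dir/analysis/per_utt_details.txt.
# The first block has an utterance ID in field 1, the second does not.
tmp=$(mktemp -d)
printf '%s\n' \
  'utt1 ref a b' \
  'utt1 hyp a c' \
  'utt1 op C S' \
  'utt1 #csid 1 1 0 0' \
  'ref a' \
  'hyp a' \
  'op C' \
  '#csid 1 0 0 0' > "$tmp/per_utt_details.txt"

# Count lines whose 1st field is "#csid" and lines whose 2nd field is "#csid".
counts=$(awk '
  $1 == "#csid" { f1++ }
  $2 == "#csid" { f2++ }
  END { printf "field1=%d field2=%d\n", f1, f2 }
' "$tmp/per_utt_details.txt")

echo "$counts"
rm -rf "$tmp"
```

Running the same awk program on the real per_utt_details.txt would show whether the lines missing from edits.txt have #csid somewhere other than field 2.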
Judging by your samples, I believe you want to compare the 1st field, not the 2nd field (which is what your posted code does). If that is the case, try changing $2 to $1 in the awk command so that the comparison is made against the 1st field.
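A minimal sketch of that change, assuming the lines look exactly like the sample in the question (no utterance ID before #csid), is shown here. Two things in it are assumptions rather than part of the original script: the four numbers after #csid are taken to be correct, substitution, insertion and deletion counts, so the edit count becomes $3+$4+$5 instead of $4+$5+$6, and the line number NR is used as a hypothetical key because there is no ID field to print.

```shell
# Hypothetical one-utterance sample in the layout shown in the question.
tmp=$(mktemp -d)
printf '%s\n' \
  'ref a b c' \
  'hyp a x' \
  'op C S D' \
  '#csid 1 1 0 1' > "$tmp/per_utt_details.txt"

# Match "#csid" in field 1; the counts then sit in fields 2 to 5,
# so substitutions + insertions + deletions is $3 + $4 + $5 (assumed order).
awk '$1 == "#csid" { print NR " " ($3 + $4 + $5) }' \
  "$tmp/per_utt_details.txt" > "$tmp/edits.txt"

edits=$(cat "$tmp/edits.txt")
echo "$edits"
rm -rf "$tmp"
```

If the real file does carry an utterance ID in field 1, the original $2 comparison and $4+$5+$6 sum would be the matching form, so it is worth checking the actual layout before changing the script.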