weird awk outputs in reading/writing file


I'm working on a Kaldi project based on the existing example that uses the Tedlium dataset. Every step works well until the clean-up stage, where I hit a length mismatch. After examining all the scripts, I found that the issue is in lattice_oracle_align.sh.

Reference: https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/steps/cleanup/lattice_oracle_align.sh

I believe the issue is line 142.

  awk '{if ($2 == "#csid") print $1" "($4+$5+$6)}' $dir/analysis/per_utt_details.txt > $dir/edits.txt

The above line should read per_utt_details.txt line by line and write one line to edits.txt each time it sees a #csid line. The text in per_utt_details.txt looks like this:

     ref
     hyp
     op
     #csid 0 0 0 0
     ...repeat the above 4 lines.

There are 1073046 lines in per_utt_details.txt. I expect 268262 lines in edits.txt. However, only 48746 lines exist in edits.txt.
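One way to narrow this down is to count which field position actually holds `#csid`, since the script's condition only matches when it is the 2nd field. The following is a minimal sketch, not part of the Kaldi script: the sample file is hypothetical and assumes each line starts with an utterance id, which is what `$2 == "#csid"` and `print $1` in the script imply.

```shell
# Hypothetical stand-in for $dir/analysis/per_utt_details.txt.
# Assumption: each line starts with an utterance id (utt1, utt2).
tmp=$(mktemp -d)
cat > "$tmp/per_utt_details.txt" <<'EOF'
utt1 ref a b c
utt1 hyp a b d
utt1 op C C S
utt1 #csid 2 1 0 0
utt2 ref x y
utt2 hyp x y
utt2 op C C
utt2 #csid 2 0 0 0
EOF

# For every line, record the field position(s) where "#csid" appears,
# then print how many lines had it in each position.
counts=$(awk '{ for (i = 1; i <= NF; i++) if ($i == "#csid") n[i]++ }
              END { for (i in n) print "field " i ": " n[i] }' \
         "$tmp/per_utt_details.txt")
echo "$counts"
```

Run against the real file, a count that is spread over more than one field position, or that does not add up to the expected number of `#csid` lines, would suggest that some lines do not have the layout the script assumes.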

There is 1 answer

Answer by RavinderSingh13 (best answer)

Judging by your sample, I believe you want to compare the 1st field, not the 2nd field (which is what your code does). If that is the case, try running the following, where I have changed $2 to $1 so that the comparison is against the 1st field.

awk '($1 == "#csid"){print $1,($4+$5+$6)}' per_utt_details.txt > edits.txt
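A quick sanity check for the command above is to compare the number of lines it writes with the number of `#csid` lines in the input. This is only a sketch on a hypothetical sample that follows the layout shown in the question (no leading utterance id); the temporary paths are made up for illustration.

```shell
# Hypothetical sample in the layout from the question excerpt:
# "#csid" is the 1st field on every fourth line.
tmp=$(mktemp -d)
cat > "$tmp/per_utt_details.txt" <<'EOF'
ref a b c
hyp a b d
op C C S
#csid 2 1 0 0
ref x y
hyp x y
op C C
#csid 2 0 0 0
EOF

# The command from the answer, pointed at the sample file.
awk '($1 == "#csid"){print $1,($4+$5+$6)}' "$tmp/per_utt_details.txt" > "$tmp/edits.txt"

# Compare the number of "#csid" lines in the input with the lines written.
expected=$(grep -c '#csid' "$tmp/per_utt_details.txt")
actual=$(wc -l < "$tmp/edits.txt" | tr -d ' ')
echo "expected=$expected actual=$actual"
```

If the two numbers differ on the real file, the lines that are being skipped probably do not have `#csid` in the field the condition tests.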