Why is SRILM's ngram-merge taking the wrong input?


This is my first post here, and sorry for my poor English.

I'm currently working with Kaldi and the SRILM tools for my research, and I ran into a strange problem while using ngram-merge to merge the 3-gram .count files generated by ngram-count. (ngram-count and ngram-merge are two programs in SRILM.)

The code in my shell script is as follows:

ngram-merge \
 -write $dir_ngram/corpus_${ng}-gram.count \
 $dir_ngram/glsp_poj_tlu.txt_${ng}-gram.count /
 $dir_ngram/icorpus_tlu.txt_${ng}-gram.count /
 $dir_ngram/khkp_tlu.txt_${ng}-gram.count /
 $dir_ngram/nmtl_tlu.txt_${ng}-gram.count /
 $dir_ngram/total_tlu.txt_${ng}-gram.count /
 $dir_ngram/twbb_tlu.txt_${ng}-gram.count

Here $dir_ngram is the directory containing the .count files, and ${ng} is 3 because I'm using trigrams for my language model.

But when I run this part of the script, I get errors like these:

/kaldi/egs/simple_20190520/source/ngram/icorpus_tlu.txt_3-gram.count: line 1: unk: No such file or directory
/kaldi/egs/simple_20190520/source/ngram/icorpus_tlu.txt_3-gram.count: line 2: syntax error near unexpected token `<'
/kaldi/egs/simple_20190520/source/ngram/icorpus_tlu.txt_3-gram.count: line 2: `<unk> <unk> 11844000'
/kaldi/egs/simple_20190520/source/ngram/khkp_tlu.txt_3-gram.count: line 1: unk: No such file or directory
/kaldi/egs/simple_20190520/source/ngram/khkp_tlu.txt_3-gram.count: line 2: syntax error near unexpected token `<'
/kaldi/egs/simple_20190520/source/ngram/khkp_tlu.txt_3-gram.count: line 2: `<unk> <unk>    449400'
/kaldi/egs/simple_20190520/source/ngram/nmtl_tlu.txt_3-gram.count: line 1: unk: No such file or directory
/kaldi/egs/simple_20190520/source/ngram/nmtl_tlu.txt_3-gram.count: line 2: syntax error near unexpected token `<'
/kaldi/egs/simple_20190520/source/ngram/nmtl_tlu.txt_3-gram.count: line 2: `<unk> <unk>    13706200'
/kaldi/egs/simple_20190520/source/ngram/total_tlu.txt_3-gram.count: line 1: unk: No such file or directory
/kaldi/egs/simple_20190520/source/ngram/total_tlu.txt_3-gram.count: line 2: syntax error near unexpected token `<'
/kaldi/egs/simple_20190520/source/ngram/total_tlu.txt_3-gram.count: line 2: `<unk> <unk>   11155390'
/kaldi/egs/simple_20190520/source/ngram/twbb_tlu.txt_3-gram.count: line 1: unk: No such file or directory
/kaldi/egs/simple_20190520/source/ngram/twbb_tlu.txt_3-gram.count: line 2: syntax error near unexpected token `<'
/kaldi/egs/simple_20190520/source/ngram/twbb_tlu.txt_3-gram.count: line 2: `<unk> <unk>    7575840'

It seems that ngram-merge treated the first line of each file as a file name or directory, since the unk symbol is on the first line of every .count file (taking icorpus_tlu.txt_3-gram.count as an example):

<unk>   21952800
<unk> <unk>     11844000
<unk> <unk> <unk>       6161460
<unk> <unk> pó-tshî     660
<unk> <unk> pe̍h-liáu-kang       60
<unk> <unk> m̄-sī        3840
<unk> <unk> lîu-hîng    540
<unk> <unk> ē-sái       12900
<unk> <unk> uî-huat     1740
<unk> <unk> kín-tiunn   780
<unk> <unk> tâi-tiong-tshī      840
<unk> <unk> kuī 120
<unk> <unk> tsú-lâng    660
<unk> <unk> tsi̍t        38520
.
.
.

The unk symbol and the second line of the .count file appear in the first and third lines of each file's error messages. I don't understand why this happens: I would expect ngram-merge to open each file and read the n-grams, not to treat the content as a path to open. Another strange thing is that this "content taken as a path" problem only occurs for the last five files. The first file does not seem to cause any read or path error at all.

I know I could simply concatenate the corpora, since none of them is very big, but I'm curious about this problem. Does anybody know how to solve it?
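One possible explanation, offered as an assumption and not verified on this setup: in the script above, the lines after the first file name end with a forward slash (/), while the -write line ends with a backslash (\). Only a backslash at the very end of a line continues a shell command. With a trailing /, the command ends at the newline, and each following line is run as its own command, so the shell tries to execute the .count file itself as a script. Inside that file, <unk> is parsed as input redirection from a file named unk, which would match the "unk: No such file or directory" and "syntax error near unexpected token `<'" messages, and would also explain why the first file (still an argument to ngram-merge) shows no error. The sketch below uses a hypothetical stand-in function, count_args, instead of ngram-merge, and made-up file names, only to show how the two line endings differ.

```shell
#!/bin/sh
# Hypothetical stand-in for ngram-merge (not part of SRILM): it only
# reports how many arguments it was called with.
count_args() { echo "received $# arguments"; }

# Backslash at end of line: the shell joins the lines into ONE command,
# so every file name is passed as an argument.
joined=$(count_args \
  -write out.count \
  a.count \
  b.count \
  c.count)
echo "$joined"

# Forward slash at end of line: no continuation happens. The command ends
# here, and "/" is passed along as just another argument.
first_line_only=$(count_args -write out.count a.count /)
echo "$first_line_only"
```

If this is indeed the cause, replacing each trailing / in the ngram-merge call with \ (with no characters after it on the line) should make all six files arguments of a single command.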
