This is my first post here, and I apologize for my poor English.
I'm currently working with Kaldi and the SRILM tools for my research, and I ran into a strange problem while using ngram-merge to merge the 3-gram .count files generated by ngram-count. (ngram-count and ngram-merge are two programs in SRILM.)
The code in my shell script is as follows:
ngram-merge \
-write $dir_ngram/corpus_${ng}-gram.count \
$dir_ngram/glsp_poj_tlu.txt_${ng}-gram.count /
$dir_ngram/icorpus_tlu.txt_${ng}-gram.count /
$dir_ngram/khkp_tlu.txt_${ng}-gram.count /
$dir_ngram/nmtl_tlu.txt_${ng}-gram.count /
$dir_ngram/total_tlu.txt_${ng}-gram.count /
$dir_ngram/twbb_tlu.txt_${ng}-gram.count
Here $dir_ngram is the directory containing the .count files, and ${ng} is 3 because I'm using trigrams for my language model.
When I run this part of the script, I get errors like these:
/kaldi/egs/simple_20190520/source/ngram/icorpus_tlu.txt_3-gram.count: line 1: unk: No such file or directory
/kaldi/egs/simple_20190520/source/ngram/icorpus_tlu.txt_3-gram.count: line 2: syntax error near unexpected token `<'
/kaldi/egs/simple_20190520/source/ngram/icorpus_tlu.txt_3-gram.count: line 2: `<unk> <unk> 11844000'
/kaldi/egs/simple_20190520/source/ngram/khkp_tlu.txt_3-gram.count: line 1: unk: No such file or directory
/kaldi/egs/simple_20190520/source/ngram/khkp_tlu.txt_3-gram.count: line 2: syntax error near unexpected token `<'
/kaldi/egs/simple_20190520/source/ngram/khkp_tlu.txt_3-gram.count: line 2: `<unk> <unk> 449400'
/kaldi/egs/simple_20190520/source/ngram/nmtl_tlu.txt_3-gram.count: line 1: unk: No such file or directory
/kaldi/egs/simple_20190520/source/ngram/nmtl_tlu.txt_3-gram.count: line 2: syntax error near unexpected token `<'
/kaldi/egs/simple_20190520/source/ngram/nmtl_tlu.txt_3-gram.count: line 2: `<unk> <unk> 13706200'
/kaldi/egs/simple_20190520/source/ngram/total_tlu.txt_3-gram.count: line 1: unk: No such file or directory
/kaldi/egs/simple_20190520/source/ngram/total_tlu.txt_3-gram.count: line 2: syntax error near unexpected token `<'
/kaldi/egs/simple_20190520/source/ngram/total_tlu.txt_3-gram.count: line 2: `<unk> <unk> 11155390'
/kaldi/egs/simple_20190520/source/ngram/twbb_tlu.txt_3-gram.count: line 1: unk: No such file or directory
/kaldi/egs/simple_20190520/source/ngram/twbb_tlu.txt_3-gram.count: line 2: syntax error near unexpected token `<'
/kaldi/egs/simple_20190520/source/ngram/twbb_tlu.txt_3-gram.count: line 2: `<unk> <unk> 7575840'
It seems that ngram-merge treats the first line of each file as a file name or directory, since the <unk> symbol is on the first line of every .count file (taking icorpus_tlu.txt_3-gram.count as an example):
<unk> 21952800
<unk> <unk> 11844000
<unk> <unk> <unk> 6161460
<unk> <unk> pó-tshî 660
<unk> <unk> pe̍h-liáu-kang 60
<unk> <unk> m̄-sī 3840
<unk> <unk> lîu-hîng 540
<unk> <unk> ē-sái 12900
<unk> <unk> uî-huat 1740
<unk> <unk> kín-tiunn 780
<unk> <unk> tâi-tiong-tshī 840
<unk> <unk> kuī 120
<unk> <unk> tsú-lâng 660
<unk> <unk> tsi̍t 38520
.
.
.
The <unk> symbol from the first line and the whole second line of the .count file appear in the first and third lines of each file's error messages. I don't understand why this happens, because I would expect ngram-merge to simply open each file and read the n-grams, not to treat the content as a path to open. Another strange thing is that this problem only occurs for the last five files; the first file does not seem to cause any reading or directory problem at all.
I know I could simply merge the corpora themselves, since none of them is very large, but I'm curious about this problem. Does anybody know how to solve it?
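One detail in the command above may be worth checking: the first two lines end with a backslash, while the lines after them end with a forward slash. The sketch below assumes bash and uses a hypothetical count_args function as a stand-in for ngram-merge (it is not part of SRILM), with placeholder file names. It only illustrates how the shell treats the two line endings; whether this explains the errors above is an assumption, not something confirmed here.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for ngram-merge: prints how many arguments it received.
count_args() {
  echo "$#"
}

# A trailing backslash continues the command onto the next line,
# so all three placeholder file names reach count_args.
with_backslash=$(count_args \
  a.count \
  b.count \
  c.count)

# A trailing forward slash is only one more argument. The command ends at
# the newline, and the shell runs the following line as a new command.
with_slash=$(count_args a.count /
  echo "the next line ran as a separate command")

echo "backslash: $with_backslash"
echo "slash: $with_slash"
```

If the same thing happens in the original script, each line after the first forward slash would be run by the shell as a command of its own, which may be why the errors are reported with the .count file names as their prefix.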