Sphinxtrain: Unable to lookup word that exists in the dictionary


I'm adapting a Sphinx model for Brazilian Portuguese with my own data by following their tutorial, and I got stuck on the bw command in the "Accumulating observation counts" section. I made sure to add any missing words to the dictionary file, each followed by its pronunciation, and I created the transcription and file ID files.

The command that I'm executing looks like this:

bw -hmmdir ../../sphinx_pt_br \
-feat 1s_c_d_dd \
-agc none \
-cmn current \
-varnorm no \
-accumdir ../../accum \
-ctlfn train.fileids \
-lsnfn train.transcription \
-moddeffn ../../sphinx_pt_br/mdef \
-ts2cbfn .cont. \
-dictfn ./pt.dic

Upon execution I get many errors like this (I added the word "REDACTED" to cover sensitive information):

INFO: cmn.c(133): CMN: 60.75 18.08 -0.69  6.64 11.20  1.50 -6.34 -1.04 -7.86 -12.31  6.73 -1.48 -16.86 
WARN: "mk_phone_list.c", line 178: Unable to lookup word 'consulta' in the dictionary
WARN: "next_utt_states.c", line 83: Unable to produce phonetic transcription for the utterance '<s> consulta veiculo placa REDACTED </s>'
WARN: "main.c", line 798: Skipped utterance '<s> consulta veiculo placa REDACTED </s>'
utt>    52   PTT-20231106-WA0030.wav  909    0     0 utt 0.000x 1.000e upd 0.000x 1.132e fwd 0.000x 0.000e bwd 0.000x 0.000e gau 0.000x 0.000e rsts 0.000x 0.000e rstf 0.000x 0.000e rstu 0.000x 0.000e

Looking at the file pt.dic, I can clearly see that the word "consulta" is there:

...
cônsul   k oo s u w
consulta     k oo s u w t a
consultado   k oo s u w t a d u
consultas    k oo s u w t a s
consultar     k oo s u w t a xm
consultoras  k oo s u w t o r a s
...

Scrolling further up in the output, bw seems to be having some sort of problem with the phones:

ERROR: "lexicon.c", line 211: pronunciation for japão has undefined phones; skipping.
ERROR: "lexicon.c", line 90: Unknown phone zm
ERROR: "lexicon.c", line 211: pronunciation for japoneses has undefined phones; skipping.
ERROR: "lexicon.c", line 90: Unknown phone zm
ERROR: "lexicon.c", line 211: pronunciation for jaraguá has undefined phones; skipping.
ERROR: "lexicon.c", line 90: Unknown phone zm
ERROR: "lexicon.c", line 211: pronunciation for jararacuçus has undefined phones; skipping.
ERROR: "lexicon.c", line 90: Unknown phone zm
ERROR: "lexicon.c", line 211: pronunciation for jardim has undefined phones; skipping.
ERROR: "lexicon.c", line 90: Unknown phone zm
ERROR: "lexicon.c", line 211: pronunciation for jardes has undefined phones; skipping.
ERROR: "lexicon.c", line 90: Unknown phone zm
ERROR: "lexicon.c", line 211: pronunciation for jargão has undefined phones; skipping.
ERROR: "lexicon.c", line 90: Unknown phone zm
ERROR: "lexicon.c", line 211: pronunciation for jazidas has undefined phones; skipping.

The tutorial doesn't mention creating a lexicon, or how to point the bw command to one, so I'm stuck.
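To narrow this down, a check along the following lines could list which dictionary entries use phones that the model does not define. This is only a minimal sketch: the function name find_unknown_phones and the phone set below are hypothetical and made up for illustration, and the real phone set would have to be taken from the acoustic model itself. It also assumes each dictionary line is a word followed by whitespace-separated phones, as in the pt.dic excerpt above.

```python
def find_unknown_phones(dict_lines, known_phones):
    """Map each dictionary word to the phones it uses that are not in known_phones."""
    problems = {}
    for line in dict_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # blank line, or a word with no pronunciation
        word, phones = parts[0], parts[1:]
        missing = [p for p in phones if p not in known_phones]
        if missing:
            problems[word] = missing
    return problems


# Hypothetical phone set, for illustration only; the real one comes from the model.
known = {"k", "oo", "s", "u", "w", "t", "a", "d"}

# Two entries copied from the pt.dic excerpt above.
sample = [
    "consulta     k oo s u w t a",
    "consultar     k oo s u w t a xm",
]

print(find_unknown_phones(sample, known))
```

With the real phone set and the full pt.dic, an empty result for "consulta" would suggest the lookup failure has some other cause than an undefined phone.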
