I am trying to create a limited corpus and train a language model to use for a deepspeech scorer.
I have followed the information provided in the docs here.
I read a helpful guide, posted for an older version of DeepSpeech, on generating a language model here.
I have also read the playbook here.
It seems that this problem has been encountered before, but no answer was given there.
I have set up the docker environment for training and followed the docs to the letter.
I can train a model, and then convert this to a .scorer file, so the whole process is working.
The steps I take from inside the docker container are:
- create a vocab.txt file with my input sentences and store it in the deepspeech-data/input folder.
- run this script to build the language model in the output folder:
python3 generate_lm.py --input_txt ../../deepspeech-data/input/vocab.txt --output_dir ../../deepspeech-data/output --top_k 100 --kenlm_bins /DeepSpeech/native_client/kenlm/build/bin/ --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|0|0" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie --discount_fallback
- run this script to generate the scorer:
./generate_scorer_package --alphabet ../../deepspeech-data/input/alphabet.txt --lm ../../deepspeech-data/output/lm.binary --vocab ../../deepspeech-data/output/vocab-100.txt --package ../../deepspeech-data/output/deepspeech-0.9.3-models.scorer --default_alpha 0.9 --default_beta 0.9 --force_bytes_output_mode 1
- replace the default scorer with mine.
- run DeepSpeech.
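As a sanity check on the inputs to the steps above, here is a minimal Python sketch that lists any characters in vocab.txt that are not present in alphabet.txt. It assumes the usual DeepSpeech alphabet layout (one symbol per line, lines starting with # are comments, \# is a literal hash). The function names load_alphabet and unsupported_characters are made up for illustration and are not part of DeepSpeech.

```python
from pathlib import Path


def load_alphabet(path):
    """Read a DeepSpeech-style alphabet file into a set of characters.

    Assumed format: one symbol per line, '#' starts a comment line,
    and '\\#' is an escaped literal hash.
    """
    symbols = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.startswith("\\#"):
            symbols.add("#")      # escaped hash is a real label
        elif line.startswith("#"):
            continue              # comment line, not a label
        elif line:
            symbols.add(line)     # e.g. " ", "a", "'"
    return symbols


def unsupported_characters(corpus_path, alphabet):
    """Return the corpus characters (newlines excluded) missing from the alphabet."""
    text = Path(corpus_path).read_text(encoding="utf-8")
    return {ch for ch in text if ch != "\n" and ch not in alphabet}


# Example usage with the paths from the commands above:
# alphabet = load_alphabet("../../deepspeech-data/input/alphabet.txt")
# print(sorted(unsupported_characters("../../deepspeech-data/input/vocab.txt", alphabet)))
```

An empty result only means the corpus and the alphabet agree on characters (for example, no uppercase letters or punctuation that the alphabet lacks). It does not by itself show that the scorer is valid.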
Everything seems to work as it should, with no errors, but with my scorer DeepSpeech just returns an empty string. With the default scorer it works fine, but I need to restrict the vocabulary so that it detects only a few commands.
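To make the comparison reproducible outside the NodeJS example, here is a minimal Python sketch that transcribes the same audio with no scorer and then with each scorer in turn. It assumes the deepspeech 0.9.x Python package, whose Model class provides enableExternalScorer and stt. The helper name compare_scorers and the file names in the usage comments are placeholders, not anything from DeepSpeech itself.

```python
def compare_scorers(model, audio, scorer_paths):
    """Transcribe `audio` without a scorer, then once per scorer path.

    `model` is expected to behave like deepspeech.Model (0.9.x) and is
    assumed to have no external scorer enabled when this is called.
    Returns a dict mapping a label (or scorer path) to the transcript.
    """
    results = {"<no scorer>": model.stt(audio)}
    for path in scorer_paths:
        model.enableExternalScorer(path)  # replaces any previously enabled scorer
        results[path] = model.stt(audio)
    return results


# Example usage (paths are placeholders; audio must be 16 kHz mono int16):
# import wave
# import numpy as np
# from deepspeech import Model
# model = Model("deepspeech-0.9.3-models.pbmm")
# with wave.open("command.wav", "rb") as w:
#     audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)
# for name, text in compare_scorers(model, audio, ["default.scorer", "custom.scorer"]).items():
#     print(repr(name), "->", repr(text))
```

If the transcript is non-empty without a scorer and with the default scorer, but empty with the custom one, that points at the scorer package rather than the audio or the acoustic model.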
I have tried adjusting some of the flags, but I always get the same result.
I am using the --discount_fallback flag, as suggested, because it is a small corpus.
So my question is this: why would a DeepSpeech language model/scorer output an empty string, and how can I fix it?
I am running this inside the NodeJS example from the DeepSpeech examples on GitHub, but testing against any of the examples should reproduce it.