I'm working with Mozilla DeepSpeech in a Docker environment and have encountered an error during training. I'm seeking assistance to resolve this issue.
System Setup:
- Docker environment on a Windows 10 PC
- Using Ubuntu-20-04 in Docker
- NVIDIA GPU RTX 3090 with the --gpus all flag enabled
- CUDA 10.0 (version 10.0.130) with cuDNN v7.6.5 (November 5th, 2019) for CUDA 10.0
- Python 3.7.3
Steps Taken:
- Installed the official DeepSpeech training image for Docker (mozilla/deepspeech-train:v0.9.3), following the exact steps in the DeepSpeech Playbook (https://mozilla.github.io/deepspeech-playbook/ENVIRONMENT.html#contents).
- Successfully ran the provided script (./bin/run-ldc93s1.sh) in the Docker environment.
- Created a custom training script for my dataset.
- Ran into file path problems, which I resolved by mounting the WSL 2 directory into the Docker container.
- Updated script paths to match the mounted directory.
My Script:
```
root@b11bd0a278ee:/DeepSpeech# python -u DeepSpeech.py \
  --train_files /DeepSpeech/CSV/Training/training.csv \
  --dev_files /DeepSpeech/CSV/Validation/dev.csv \
  --test_files /DeepSpeech/CSV/Test/test.csv \
  --alphabet_config_path /DeepSpeech/data/alphabet.txt \
  --scorer_path /DeepSpeech/deepspeech-0.9.3-models.scorer \
  --checkpoint_dir /DeepSpeech/checkpoints_dir \
  --export_dir /DeepSpeech/CSV/exports_dir \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 100 \
  --epochs 200 \
  --noshow_progressbar
```
Issue: When running my custom training script, I encounter the following error:
```
Traceback (most recent call last):
File "DeepSpeech.py", line 12, in <module>
ds_train.run_script()
File "/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
absl.app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/DeepSpeech/training/deepspeech_training/train.py", line 949, in main
early_training_checks()
File "/DeepSpeech/training/deepspeech_training/train.py", line 934, in early_training_checks
FLAGS.scorer_path, Config.alphabet)
File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 36, in __init__
raise ValueError('Scorer initialization failed with error code 0x{:X}'.format(err))
ValueError: Scorer initialization failed with error code 0x2005
```
Tried looking for the path:
root@b11bd0a278ee:/DeepSpeech# ls /DeepSpeech/deepspeech-0.9.3-models.scorer
ls: cannot access '/DeepSpeech/deepspeech-0.9.3-models.scorer': No such file or directory
Found the path:
root@b11bd0a278ee:/DeepSpeech# find / -type f \( -name "alphabet.txt" -o -name "*.csv" -o -name "*.scorer" \)
/DeepSpeechData/DeepSpeech/deepspeech-0.9.3-models.scorer
/DeepSpeechData/DeepSpeech/data/alphabet.txt
/DeepSpeechData/DeepSpeech/CSV/Test/test.csv
/DeepSpeechData/DeepSpeech/CSV/Training/training.csv
/DeepSpeechData/DeepSpeech/CSV/Validation/dev.csv
/DeepSpeechData/DeepSpeech/CSV/Model Checkpoints/Model Checkpoints.csv
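The path mismatch above could have been caught before launching training. A minimal sketch, assuming every input file passed to a flag should already exist inside the container (`missing_paths` is a hypothetical helper of mine, not part of DeepSpeech; the paths are the ones from my first script):

```python
import os


def missing_paths(flag_to_path):
    """Return the subset of {flag: path} whose path does not exist on disk."""
    return {
        flag: path
        for flag, path in flag_to_path.items()
        if not os.path.exists(path)
    }


if __name__ == "__main__":
    # Input files from the first attempt; adjust to the mounted directory.
    inputs = {
        "--train_files": "/DeepSpeech/CSV/Training/training.csv",
        "--dev_files": "/DeepSpeech/CSV/Validation/dev.csv",
        "--test_files": "/DeepSpeech/CSV/Test/test.csv",
        "--alphabet_config_path": "/DeepSpeech/data/alphabet.txt",
        "--scorer_path": "/DeepSpeech/deepspeech-0.9.3-models.scorer",
    }
    for flag, path in missing_paths(inputs).items():
        print(f"{flag}: not found: {path}")
```

Run inside the container, this prints one line per flag whose file is missing and prints nothing when all inputs are present.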
2nd try:
root@b11bd0a278ee:/DeepSpeech# python -u DeepSpeech.py \
  --train_files /DeepSpeechData/DeepSpeech/CSV/Training/training.csv \
  --dev_files /DeepSpeechData/DeepSpeech/CSV/Validation/dev.csv \
  --test_files /DeepSpeechData/DeepSpeech/CSV/Test/test.csv \
  --alphabet_config_path /DeepSpeechData/DeepSpeech/data/alphabet.txt \
  --scorer_path /DeepSpeechData/DeepSpeech/deepspeech-0.9.3-models.scorer \
  --checkpoint_dir /DeepSpeechData/DeepSpeech/checkpoints_dir \
  --export_dir /DeepSpeechData/DeepSpeech/CSV/exports_dir \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 100 \
  --epochs 200 \
  --noshow_progressbar
I Loading best validating checkpoint from /DeepSpeechData/DeepSpeech/checkpoints_dir/best_dev-1466475
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
Traceback (most recent call last):
File "DeepSpeech.py", line 12, in <module>
ds_train.run_script()
File "/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
absl.app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/DeepSpeech/training/deepspeech_training/train.py", line 954, in main
train()
File "/DeepSpeech/training/deepspeech_training/train.py", line 529, in train
load_or_init_graph_for_training(session)
File "/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 137, in load_or_init_graph_for_training
_load_or_init_impl(session, methods, allow_drop_layers=True)
File "/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 98, in _load_or_init_impl
return _load_checkpoint(session, ckpt_path, allow_drop_layers, allow_lr_init=allow_lr_init)
File "/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 71, in _load_checkpoint
v.load(ckpt.get_tensor(v.op.name), session=session)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 1033, in load
session.run(self.initializer, {self.initializer.inputs[1]: value})
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1156, in _run
(np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (8192,) for Tensor 'cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Initializer/Const:0', which has shape '(400,)'
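A quick arithmetic check on the two shapes in that error. This is only a sketch under my own assumptions: that the LSTM bias stores four gate vectors per hidden unit, and that the cell width equals --n_hidden. `n_hidden_from_bias` is a hypothetical helper, not part of DeepSpeech:

```python
GATES_PER_LSTM_CELL = 4  # input, forget, cell candidate, output


def n_hidden_from_bias(bias_len):
    """Infer the hidden size from the length of an LSTM bias vector."""
    if bias_len % GATES_PER_LSTM_CELL != 0:
        raise ValueError(f"{bias_len} is not a multiple of {GATES_PER_LSTM_CELL}")
    return bias_len // GATES_PER_LSTM_CELL


print(n_hidden_from_bias(8192))  # 2048: shape stored in the checkpoint
print(n_hidden_from_bias(400))   # 100: shape built from --n_hidden 100
```

If those assumptions hold, the checkpoint in checkpoints_dir was saved with a hidden size of 2048, while this run built the graph with --n_hidden 100.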
3rd Try:
root@b11bd0a278ee:/DeepSpeech# python -u DeepSpeech.py \
  --train_files /DeepSpeechData/DeepSpeech/CSV/Training/training.csv \
  --dev_files /DeepSpeechData/DeepSpeech/CSV/Validation/dev.csv \
  --test_files /DeepSpeechData/DeepSpeech/CSV/Test/test.csv \
  --alphabet_config_path /DeepSpeechData/DeepSpeech/data/alphabet.txt \
  --scorer_path /DeepSpeechData/DeepSpeech/deepspeech-0.9.3-models.scorer \
  --checkpoint_dir /DeepSpeechData/DeepSpeech/checkpoints_dir \
  --export_dir /DeepSpeechData/DeepSpeech/CSV/exports_dir \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 2048 \
  --epochs 200 \
  --noshow_progressbar
I Loading best validating checkpoint from /DeepSpeechData/DeepSpeech/checkpoints_dir/best_dev-1466475
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam
Traceback (most recent call last):
File "DeepSpeech.py", line 12, in <module>
ds_train.run_script()
File "/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
absl.app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/DeepSpeech/training/deepspeech_training/train.py", line 954, in main
train()
File "/DeepSpeech/training/deepspeech_training/train.py", line 529, in train
load_or_init_graph_for_training(session)
File "/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 137, in load_or_init_graph_for_training
_load_or_init_impl(session, methods, allow_drop_layers=True)
File "/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 98, in _load_or_init_impl
return _load_checkpoint(session, ckpt_path, allow_drop_layers, allow_lr_init=allow_lr_init)
File "/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 71, in _load_checkpoint
v.load(ckpt.get_tensor(v.op.name), session=session)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 915, in get_tensor
return CheckpointReader_GetTensor(self, compat.as_bytes(tensor_str))
tensorflow.python.framework.errors_impl.NotFoundError: Key cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam not found in checkpoint
4th Try:
root@b11bd0a278ee:/DeepSpeech# python -u DeepSpeech.py \
  --train_files /DeepSpeechData/DeepSpeech/CSV/Training/training.csv \
  --dev_files /DeepSpeechData/DeepSpeech/CSV/Validation/dev.csv \
  --test_files /DeepSpeechData/DeepSpeech/CSV/Test/test.csv \
  --alphabet_config_path /DeepSpeechData/DeepSpeech/data/alphabet.txt \
  --scorer_path /DeepSpeechData/DeepSpeech/deepspeech-0.9.3-models.scorer \
  --checkpoint_dir /DeepSpeechData/DeepSpeech/checkpoints_dir \
  --export_dir /DeepSpeechData/DeepSpeech/CSV/exports_dir \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 2048 \
  --epochs 200 \
  --noshow_progressbar \
  --use_cudnn_rnn
FATAL Flags parsing error: Unknown command line flag 'use_cudnn_rnn'
Pass --helpshort or --helpfull to see help on flags.
5th Try: Added the --train_cudnn flag, but the command produced no output:
root@0123a1149260:/DeepSpeech# python -u DeepSpeech.py \
  --train_files /DeepSpeechData/DeepSpeech/CSV/Training/training.csv \
  --dev_files /DeepSpeechData/DeepSpeech/CSV/Validation/dev.csv \
  --test_files /DeepSpeechData/DeepSpeech/CSV/Test/test.csv \
  alphabet_config_path /DeepSpeechData/DeepSpeech/data/alphabet.txt \
  --scorer_path /DeepSpeechData/DeepSpeech/deepspeech-0.9.3-models.scorer \
  --checkpoint_dir /DeepSpeechData/DeepSpeech/checkpoints_dir \
  --export_dir /DeepSpeechData/DeepSpeech/CSV/exports_dir \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 100 \
  --epochs 200 \
  --noshow_progressbar --train_cudnn
root@0123a1149260:/DeepSpeech#
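Because several of my attempts were affected by line wrapping and stray backslashes, one option I am considering is building the argument list in Python instead of typing it by hand. A minimal sketch; `build_argv` is a hypothetical helper of mine, and the flag names are only the ones already used in the commands above:

```python
import shlex


def build_argv(script, options, switches=()):
    """Build an argv list: every option and switch gets a leading '--'."""
    argv = ["python", "-u", script]
    for name, value in options.items():
        argv += [f"--{name}", str(value)]
    argv += [f"--{name}" for name in switches]
    return argv


if __name__ == "__main__":
    base = "/DeepSpeechData/DeepSpeech"
    argv = build_argv(
        "DeepSpeech.py",
        {
            "train_files": f"{base}/CSV/Training/training.csv",
            "dev_files": f"{base}/CSV/Validation/dev.csv",
            "test_files": f"{base}/CSV/Test/test.csv",
            "alphabet_config_path": f"{base}/data/alphabet.txt",
            "scorer_path": f"{base}/deepspeech-0.9.3-models.scorer",
            "checkpoint_dir": f"{base}/checkpoints_dir",
            "export_dir": f"{base}/CSV/exports_dir",
            "train_batch_size": 1,
            "test_batch_size": 1,
            "n_hidden": 100,
            "epochs": 200,
        },
        switches=["noshow_progressbar", "train_cudnn"],
    )
    # Print the command for inspection; pass argv to subprocess.run to execute it.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

The list can be handed to `subprocess.run(argv, check=True)` from /DeepSpeech, which avoids shell quoting and line-continuation issues entirely.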
Questions:
- What could be causing these errors in my setup?
- Are there specific considerations or best practices when setting up DeepSpeech training in a Docker environment that I might be missing?
Any insights or suggestions to resolve this error would be greatly appreciated.