Error Training DeepSpeech inside Docker with Ubuntu 20.04 integration on Windows 10 (NVIDIA GPU RTX 3090)


I'm working with Mozilla DeepSpeech in a Docker environment and have encountered an error during training. I'm seeking assistance to resolve this issue.

System Setup:

  • Docker environment on a Windows 10 PC
  • Using Ubuntu-20-04 in Docker
  • NVIDIA GPU RTX 3090 with --gpus all flag enabled
  • CUDA 10.0 (version 10.0.130) with cuDNN v7.6.5 (November 5th, 2019) for CUDA 10.0
  • Python 3.7.3

Steps Taken:

  1. Installed the official DeepSpeech training image for Docker (mozilla/deepspeech-train:v0.9.3), following the exact steps in the DeepSpeech Playbook (https://mozilla.github.io/deepspeech-playbook/ENVIRONMENT.html#contents).
  2. Successfully ran the provided script (./bin/run-ldc93s1.sh) in the Docker environment.
  3. Created a custom training script for my dataset.
  4. Faced challenges with file paths, which I resolved by mounting the WSL 2 directory into the Docker container.
  5. Updated script paths to match the mounted directory.

My Script:

```
root@b11bd0a278ee:/DeepSpeech# python -u DeepSpeech.py \
  --train_files /DeepSpeech/CSV/Training/training.csv \
  --dev_files /DeepSpeech/CSV/Validation/dev.csv \
  --test_files /DeepSpeech/CSV/Test/test.csv \
  --alphabet_config_path /DeepSpeech/data/alphabet.txt \
  --scorer_path /DeepSpeech/deepspeech-0.9.3-models.scorer \
  --checkpoint_dir /DeepSpeech/checkpoints_dir \
  --export_dir /DeepSpeech/CSV/exports_dir \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 100 \
  --epochs 200 \
  --noshow_progressbar
```

Issue: When running my custom training script, I encounter the following error:

```
Traceback (most recent call last):
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/DeepSpeech/training/deepspeech_training/train.py", line 949, in main
    early_training_checks()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 934, in early_training_checks
    FLAGS.scorer_path, Config.alphabet)
  File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 36, in __init__
    raise ValueError('Scorer initialization failed with error code 0x{:X}'.format(err))
ValueError: Scorer initialization failed with error code 0x2005
```

Tried looking for the path:

    root@b11bd0a278ee:/DeepSpeech# ls /DeepSpeech/deepspeech-0.9.3-models.scorer
    ls: cannot access '/DeepSpeech/deepspeech-0.9.3-models.scorer': No such file or directory

Found the path:

    root@b11bd0a278ee:/DeepSpeech# find / -type f \( -name "alphabet.txt" -o -name "*.csv" -o -name "*.scorer" \)
    /DeepSpeechData/DeepSpeech/deepspeech-0.9.3-models.scorer
    /DeepSpeechData/DeepSpeech/data/alphabet.txt
    /DeepSpeechData/DeepSpeech/CSV/Test/test.csv
    /DeepSpeechData/DeepSpeech/CSV/Training/training.csv
    /DeepSpeechData/DeepSpeech/CSV/Validation/dev.csv
    /DeepSpeechData/DeepSpeech/CSV/Model Checkpoints/Model Checkpoints.csv
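Since the first failure came down to a path that did not exist inside the container, a small pre-flight check could catch this before training starts. The following is a minimal sketch and not part of DeepSpeech; the helper name `missing_paths` and the dictionary layout are assumptions made for illustration, and the paths are the ones reported by `find`.

```python
import os


def missing_paths(flag_paths):
    """Return the {flag: path} entries whose path does not exist."""
    return {flag: path for flag, path in flag_paths.items()
            if not os.path.exists(path)}


if __name__ == "__main__":
    # Paths as located by `find` inside the container.
    flag_paths = {
        "--train_files": "/DeepSpeechData/DeepSpeech/CSV/Training/training.csv",
        "--dev_files": "/DeepSpeechData/DeepSpeech/CSV/Validation/dev.csv",
        "--test_files": "/DeepSpeechData/DeepSpeech/CSV/Test/test.csv",
        "--alphabet_config_path": "/DeepSpeechData/DeepSpeech/data/alphabet.txt",
        "--scorer_path": "/DeepSpeechData/DeepSpeech/deepspeech-0.9.3-models.scorer",
    }
    # Report every flag that points at a path the container cannot see.
    for flag, path in missing_paths(flag_paths).items():
        print(f"{flag}: not found: {path}")
```

Run inside the container, this prints nothing when every path resolves, and one line per flag otherwise.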

2nd try:

    root@b11bd0a278ee:/DeepSpeech# python -u DeepSpeech.py \
    >   --train_files /DeepSpeechData/DeepSpeech/CSV/Training/training.csv \
    >   --dev_files /DeepSpeechData/DeepSpeech/CSV/Validation/dev.csv \
    >   --test_files /DeepSpeechData/DeepSpeech/CSV/Test/test.csv \
    >   --alphabet_config_path /DeepSpeechData/DeepSpeech/data/alphabet.txt \
    >   --scorer_path /DeepSpeechData/DeepSpeech/deepspeech-0.9.3-models.scorer \
    >   --checkpoint_dir /DeepSpeechData/DeepSpeech/checkpoints_dir \
    >   --export_dir /DeepSpeechData/DeepSpeech/CSV/exports_dir \
    >   --train_batch_size 1 \
    >   --test_batch_size 1 \
    >   --n_hidden 100 \
    >   --epochs 200 \
    >   --noshow_progressbar
    I Loading best validating checkpoint from /DeepSpeechData/DeepSpeech/checkpoints_dir/best_dev-1466475
    I Loading variable from checkpoint: beta1_power
    I Loading variable from checkpoint: beta2_power
    I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
    Traceback (most recent call last):
      File "DeepSpeech.py", line 12, in <module>
        ds_train.run_script()
      File "/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
        absl.app.run(main)
      File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
        _run_main(main, args)
      File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
        sys.exit(main(argv))
      File "/DeepSpeech/training/deepspeech_training/train.py", line 954, in main
        train()
      File "/DeepSpeech/training/deepspeech_training/train.py", line 529, in train
        load_or_init_graph_for_training(session)
      File "/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 137, in load_or_init_graph_for_training
        _load_or_init_impl(session, methods, allow_drop_layers=True)
      File "/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 98, in _load_or_init_impl
        return _load_checkpoint(session, ckpt_path, allow_drop_layers, allow_lr_init=allow_lr_init)
      File "/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 71, in _load_checkpoint
        v.load(ckpt.get_tensor(v.op.name), session=session)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 324, in new_func
        return func(*args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 1033, in load
        session.run(self.initializer, {self.initializer.inputs[1]: value})
      File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
        run_metadata_ptr)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1156, in _run
        (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
    ValueError: Cannot feed value of shape (8192,) for Tensor 'cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Initializer/Const:0', which has shape '(400,)'
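The two shapes in this error are consistent with an LSTM bias that holds four gate vectors of `n_hidden` values each, so its length is `4 * n_hidden`. A minimal sketch of that arithmetic follows; the function name is hypothetical, and nothing here calls DeepSpeech or TensorFlow.

```python
def n_hidden_from_lstm_bias(bias_len):
    """Infer the LSTM width from a bias vector length (4 gates per unit)."""
    if bias_len % 4 != 0:
        raise ValueError("bias length %d is not a multiple of 4" % bias_len)
    return bias_len // 4


checkpoint_bias = 8192  # shape of the value read from the checkpoint
graph_bias = 400        # shape of the tensor built for --n_hidden 100

print(n_hidden_from_lstm_bias(checkpoint_bias))  # 2048
print(n_hidden_from_lstm_bias(graph_bias))       # 100
```

Under that reading, the checkpoint in checkpoints_dir was written by a model with `n_hidden` 2048, while this command built a graph for `n_hidden` 100.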

3rd Try:

    root@b11bd0a278ee:/DeepSpeech# python -u DeepSpeech.py   --train_files /DeepSpeechData/DeepSpeech/CSV/Training/training.csv   --dev_files /DeepSpeechData/DeepSpeech/CSV/Validation/dev.csv   --test_files /DeepSpeechData/DeepSpeech/CSV/Test/test.csv   --alphabet_config_path /DeepSpeechData/DeepSpeech/data/alphabet.txt   --scorer_path /DeepSpeechData/DeepSpeech/deepspeech-0.9.3-models.scorer   --checkpoint_dir /DeepSpeechData/DeepSpeech/checkpoints_dir   --export_dir /DeepSpeechData/DeepSpeech/CSV/exports_dir   --train_batch_size 1   --test_batch_size 1   --n_hidden 2048   --epochs 200   --noshow_progressbar
    I Loading best validating checkpoint from /DeepSpeechData/DeepSpeech/checkpoints_dir/best_dev-1466475
    I Loading variable from checkpoint: beta1_power
    I Loading variable from checkpoint: beta2_power
    I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
    I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam
    Traceback (most recent call last):
      File "DeepSpeech.py", line 12, in <module>
        ds_train.run_script()
      File "/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
        absl.app.run(main)
      File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
        _run_main(main, args)
      File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
        sys.exit(main(argv))
      File "/DeepSpeech/training/deepspeech_training/train.py", line 954, in main
        train()
      File "/DeepSpeech/training/deepspeech_training/train.py", line 529, in train
        load_or_init_graph_for_training(session)
      File "/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 137, in load_or_init_graph_for_training
        _load_or_init_impl(session, methods, allow_drop_layers=True)
      File "/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 98, in _load_or_init_impl
        return _load_checkpoint(session, ckpt_path, allow_drop_layers, allow_lr_init=allow_lr_init)
      File "/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 71, in _load_checkpoint
        v.load(ckpt.get_tensor(v.op.name), session=session)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 915, in get_tensor
        return CheckpointReader_GetTensor(self, compat.as_bytes(tensor_str))
    tensorflow.python.framework.errors_impl.NotFoundError: Key cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam not found in checkpoint
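To see which variables the checkpoint actually contains, and whether optimizer slots such as `.../bias/Adam` are among them, the checkpoint can be listed directly. This is a minimal sketch, assuming the TensorFlow 1.x build shipped in the training image and its `tf.train.list_variables` function; the helper names are hypothetical, and the naming convention for Adam slots (`/Adam`, `/Adam_1`) is the usual TensorFlow 1.x one rather than anything confirmed for this particular checkpoint.

```python
def split_adam_slots(names):
    """Separate Adam optimizer slot variables from all other variables by name."""
    slots = [n for n in names if n.endswith("/Adam") or n.endswith("/Adam_1")]
    others = [n for n in names if n not in slots]
    return others, slots


def checkpoint_variable_names(checkpoint_path):
    """List the variable names stored in a checkpoint (needs TensorFlow)."""
    import tensorflow as tf  # imported lazily so split_adam_slots works without it
    return [name for name, _shape in tf.train.list_variables(checkpoint_path)]


def report(checkpoint_path):
    """Print how many Adam slot variables and other variables a checkpoint holds."""
    others, slots = split_adam_slots(checkpoint_variable_names(checkpoint_path))
    print("other variables:", len(others))
    print("Adam slot variables:", len(slots))
```

Calling `report("/DeepSpeechData/DeepSpeech/checkpoints_dir/best_dev-1466475")` inside the container would show whether the checkpoint named in the log carries any Adam slots at all.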

4th Try:

    root@b11bd0a278ee:/DeepSpeech# python -u DeepSpeech.py   --train_files /DeepSpeechData/DeepSpeech/CSV/Training/training.csv   --dev_files /DeepSpeechData/DeepSpeech/CSV/Validation/dev.csv   --test_files /DeepSpeechData/DeepSpeech/CSV/Test/test.csv   --alphabet_config_path /DeepSpeechData/DeepSpeech/data/alphabet.txt   --scorer_path /DeepSpeechData/DeepSpeech/deepspeech-0.9.3-models.scorer   --checkpoint_dir /DeepSpeechData/DeepSpeech/checkpoints_dir   --export_dir /DeepSpeechData/DeepSpeech/CSV/exports_dir   --train_batch_size 1   --test_batch_size 1   --n_hidden 2048   --epochs 200   --noshow_progressbar --use_cudnn_rnn

    FATAL Flags parsing error: Unknown command line flag 'use_cudnn_rnn'
    Pass --helpshort or --helpfull to see help on flags.

5th Try: Added the --train_cudnn flag, but the command produced no output:

    root@0123a1149260:/DeepSpeech# python -u DeepSpeech.py \ --train_files /DeepSpeechData/DeepSpeech/CSV/Training/training.csv \ --dev_files /DeepSpeechData/DeepSpeech/CSV/Validation/dev.csv \ --test_files /DeepSpeechData/DeepSpeech/CSV/Test/test.csv \ alphabet_config_path /DeepSpeechData/DeepSpeech/data/alphabet.txt \ --scorer_path /DeepSpeechData/DeepSpeech/deepspeech-0.9.3-models.scorer \ --checkpoint_dir /DeepSpeechData/DeepSpeech/checkpoints_dir \ --export_dir /DeepSpeechData/DeepSpeech/CSV/exports_dir \ --train_batch_size 1 \ --test_batch_size 1 \ --n_hidden 100 \ --epochs 200 \ --noshow_progressbar --train_cudnn

    root@0123a1149260:/DeepSpeech#
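One thing worth checking in this last command is how the shell splits it. If each `\` was followed by a space rather than a line break, as the transcript suggests, the backslash escapes the space instead of continuing the line. The sketch below uses Python's `shlex` module in its default POSIX mode, which treats backslash-space in a similar way; the shortened path is hypothetical.

```python
import shlex

# A one-line command where each backslash is followed by a space.
one_line = r"python -u DeepSpeech.py \ --train_files /data/training.csv \ --n_hidden 100"

tokens = shlex.split(one_line)
print(tokens)
# ['python', '-u', 'DeepSpeech.py', ' --train_files', '/data/training.csv', ' --n_hidden', '100']
```

The escaped space becomes part of the next token, so those tokens begin with a space instead of `--`, and a flag parser may treat them as positional arguments rather than flags.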

Question:

  • What could be causing this error in my setup?
  • Are there specific considerations or best practices when setting up DeepSpeech training in a Docker environment that I might be missing?

Any insights or suggestions to resolve this error would be greatly appreciated.
