Torch's loss.backward() hangs on ParlAI


I am interested in Memory Networks and Movie Dialog QA. Facebook recently announced an AI training framework called ParlAI, which supports many models and datasets. When I ran the command below in ParlAI, training stalled at the first loss.backward() call in memnn.py. I waited almost a day, but loss.backward() never finished. I verified this with debug prints (and the [Using Cuda] message was printed). The GPU itself appeared to be active, since it was holding some memory; I confirmed that with nvidia-smi -l 1.

python examples/train_model.py -m memnn -t "#moviedd-qa" -bs 32 --gpu 0 -e 10

Then I switched to a simpler task, and it finished in a few minutes.

python examples/train_model.py -m memnn -t "babi:task1k:1" -bs 32 --gpu 0 -e 10

I realize #moviedd-qa is more complex than the bAbI task, but how long does training this model usually take on my setup? Has anyone trained this model with ParlAI? I suspect this is not a bug in ParlAI. Could you advise me on how to proceed?
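
One way to tell a genuine deadlock apart from a very slow epoch is to dump the Python stack while the process appears stuck. A minimal sketch, assuming it is added near the top of examples/train_model.py (faulthandler is in the Python 3 standard library):

import faulthandler
import signal

# Dump the Python traceback of every thread when the process receives
# SIGUSR1, e.g. via `kill -USR1 <pid>` from another terminal. If the dump
# always points at loss.backward(), it is a real hang; if it moves between
# data loading, forward and backward, training is just slow.
faulthandler.register(signal.SIGUSR1, all_threads=True)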

My Environment

  • Ubuntu 16.04.3 LTS, 64-bit
  • Python 3.6.1 (Anaconda 4.4.0, 64-bit)
  • GPU: GTX 1080 Ti
  • CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
  • torch.__version__: '0.2.0_3'

I have also asked the ParlAI developers on their GitHub, but have received no response so far.
