CNTK Distributed Crash - Beta 7


I am running a variation of the distributed CIFAR-10 example, adapted to use my own data.

I get the following error:

Traceback (most recent call last):
  File "", line 158, in <module>
    checkpoint_path = "C:/projects/RoboLabs/CognitiveServices/ML_Models/DocSuite/Doc_Classify/checkpoints/CNTK_VGG9")
  File "", line 80, in train_and_evaluate
    trainer.save_checkpoint(os.path.join(checkpoint_path + "_{}.dnn".format(current_epoch)))
  File "C:\local\Anaconda3-4.1.1-Windows-x86_64\envs\cntk-py34\lib\site-packages\cntk\", line 138, in save_checkpoint
    super(Trainer, self).save_checkpoint(filename, _py_dict_to_cntk_dict(external_state))
  File "C:\local\Anaconda3-4.1.1-Windows-x86_64\envs\cntk-py34\lib\site-packages\cntk\", line 1774, in save_checkpoint
    return _cntk_py.Trainer_save_checkpoint(self, *args)
RuntimeError: Runtime exception

The code I am using for the training loop with checkpoints is here:

while updated:
    data=train_reader.next_minibatch(minibatch_size, input_map=input_map) # fetch minibatch.
    updated=trainer.train_minibatch(data)                                 # update model with it
    progress_printer.update_with_trainer(trainer, with_metric=True)       # log progress
    epoch_index = int(trainer.total_number_of_samples_seen/epoch_size)
    if current_epoch != epoch_index:                                      # new epoch reached
        current_epoch = epoch_index                                       # remember the epoch just reached
        if current_epoch % 25 == 0:                                       # checkpoint every 25 epochs
            trainer.save_checkpoint(os.path.join(checkpoint_path + "_{}.dnn".format(current_epoch)))

Insights welcome. I am actively debugging.


There are 2 answers

David Crook (best answer)

This appears to be resolved in the latest version, which performs checkpointing in a different manner. The solution is to upgrade your CNTK version and use the session APIs, which were introduced in version 9.

Sayan Pathak

Is it possible that you are running in a Windows environment but specifying the path in Linux style? On Windows the path should look something like 'X:\Repos\CNTK\Examples\Image\Classification\ResNet\Python\Models\resnet20_0.dnn'. I suggest using os.path.join instead of hardcoding / or \ in the path string passed to save_model.
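
As a rough sketch of that suggestion, the checkpoint file name could be built from its parts with os.path.join and then normalized for the current platform. The helper name checkpoint_file and the directory names "checkpoints" and "run1" are made up for illustration and are not part of CNTK; whether the separator style is what actually triggers the exception is only a guess.

```python
import os

def checkpoint_file(checkpoint_dir, prefix, epoch):
    """Hypothetical helper: build '<checkpoint_dir>/<prefix>_<epoch>.dnn'."""
    filename = "{}_{}.dnn".format(prefix, epoch)
    # os.path.join inserts the platform's separator; os.path.normpath
    # additionally converts any forward slashes to backslashes on Windows
    # and leaves POSIX-style paths unchanged.
    return os.path.normpath(os.path.join(checkpoint_dir, filename))

# Illustrative directory and prefix; substitute your own locations.
path = checkpoint_file(os.path.join("checkpoints", "run1"), "CNTK_VGG9", 25)
print(path)
```

The resulting string can then be passed to trainer.save_checkpoint in place of the hand-concatenated one.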