Horovod: running summary_op fails with "One or more tensors were submitted to be reduced"


I am trying to apply hvd.allreduce to the loss and AUC and feed the averaged values into summary_op for TensorBoard.

self.avg_loss = hvd.allreduce(self.loss)
self.auc, self.auc_update_op = tf.metrics.auc(
        labels=self.label,
        predictions=self.sigmoid_prediction,
        name=keys.AUC,
        summation_method='careful_interpolation',
    )
self.avg_auc = hvd.allreduce(self.auc)

tf.summary.scalar("loss", self.avg_loss)
tf.summary.scalar("auc", self.avg_auc)
self.summary_op = tf.summary.merge_all()


hooks = [
    tf.train.StopAtStepHook(last_step=self.steps_per_epoch * args.num_epochs),
    tf.train.LoggingTensorHook({
        'step': self.global_step,
        'loss': self.loss,
        'auc': self.auc,
    }, every_n_iter=100),
    tf.train.LoggingTensorHook({
        'auc_update_op': self.auc_update_op,
    }, formatter=lambda _: "...", every_n_iter=100),
    tf.train.NanTensorHook(self.loss),
    tf.train.SummarySaverHook(
        save_steps=100,
        output_dir=args.tensorboard_dir if hvd.rank() == 0 else None,
        summary_op=self.summary_op,
    ),
]

with tf.train.MonitoredTrainingSession(
        config=config,
        save_checkpoint_secs=60,
        save_summaries_steps=None,
        save_summaries_secs=None,
        checkpoint_dir=args.checkpoint if hvd.rank() == 0 else None,
        hooks=hooks) as session:
    while not session.should_stop():
        session.run(self.train_op)

But I keep encountering this error:

One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
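
The condition this message describes can be illustrated without Horovod. A collective op such as an allreduce only completes when every rank submits the same tensor, so if the op that contains it (here, summary_op) runs on different steps on different ranks, some submissions are left waiting. The sketch below is a simplified plain-Python model, not the Horovod API. The function name and the per-rank schedules are hypothetical and are chosen only to show the mismatch. Whether the hooks in the code above actually fire on different steps per rank is an assumption that would need to be checked.

```python
# Plain-Python model of a collective op; this is NOT the Horovod API.
from collections import defaultdict


def find_stalled_submissions(schedules):
    """Return {step: ranks that submitted} for steps where only a subset submitted.

    schedules maps rank -> iterable of steps at which that rank submits the
    tensor to the collective op (for example, an allreduce inside a summary op).
    """
    all_ranks = set(schedules)
    submitted = defaultdict(set)
    for rank, steps in schedules.items():
        for step in steps:
            submitted[step].add(rank)
    # A collective completes only if every rank takes part; anything else waits.
    return {
        step: sorted(ranks)
        for step, ranks in sorted(submitted.items())
        if ranks != all_ranks
    }


# Hypothetical schedules: rank 0 submits every 100 steps, rank 1 every 50 steps.
mismatched = find_stalled_submissions({
    0: range(0, 300, 100),
    1: range(0, 300, 50),
})
print(mismatched)  # {50: [1], 150: [1], 250: [1]}
```

In this model, rank 1 would be left waiting at steps 50, 150, and 250. One way to check a real job for the same pattern would be to log, on each rank, the steps at which summary_op is actually executed and compare them across ranks.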
