Assume a CNN (ResNet50) trained on ImageNet with multi-node distributed training, where each epoch is expected to iterate over every training sample exactly once across the nodes via data parallelism.
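
For concreteness, here is a minimal sketch of the kind of setup I mean, assuming `tf.distribute.MultiWorkerMirroredStrategy` with `TF_CONFIG` set on each node and a TFRecord-based ImageNet pipeline (file paths and the feature spec are placeholders, not my actual code):

```python
import tensorflow as tf

# Multi-worker data parallelism; cluster membership comes from TF_CONFIG on each node.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

def parse_example(serialized):
    # Hypothetical TFRecord feature spec; the real schema may differ.
    features = tf.io.parse_single_example(serialized, {
        "image/encoded": tf.io.FixedLenFeature([], tf.string),
        "image/class/label": tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.io.decode_jpeg(features["image/encoded"], channels=3)
    image = tf.image.resize(image, [224, 224])
    return image, features["image/class/label"]

def make_dataset(global_batch_size):
    files = tf.data.Dataset.list_files("/data/imagenet/train-*.tfrecord",
                                       shuffle=False)
    ds = files.interleave(tf.data.TFRecordDataset,
                          num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.shuffle(10_000).batch(global_batch_size)
    # Let tf.data decide how to shard the input across workers (by FILE or by DATA).
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = \
        tf.data.experimental.AutoShardPolicy.AUTO
    return ds.with_options(options).prefetch(tf.data.AUTOTUNE)

with strategy.scope():
    model = tf.keras.applications.ResNet50(weights=None, classes=1000)
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

# model.fit(make_dataset(global_batch_size=256), epochs=90)
```

Given that kind of pipeline, my questions are: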

  1. Is "iterate each sample once AND only once" always guaranteed, or is it only a possibility depending on how the input pipeline is configured?
  2. If it is guaranteed, does TF require a coordinator (e.g. node0) to coordinate across all nodes before each mini-batch, for example to partition the samples so that node0 loads samples 1-10K, node1 loads samples 10K-20K, and so on?
  3. If so, does that mean a given node always loads the same (fixed) subset of dataset files across epochs 0...N, even though the actual sample order within each epoch/step could still be shuffled? (See the sketch after this list.)
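
For questions 2 and 3, the fixed-partitioning scheme I have in mind looks roughly like the following (a sketch only; `worker_index`, `num_workers`, and the file pattern are placeholders, and I am using `tf.data.Dataset.shard` to pin each node to a fixed subset of files):

```python
import tensorflow as tf

def make_worker_dataset(worker_index, num_workers, batch_size, seed=42):
    # Deterministic file listing so every worker sees the same global order.
    files = tf.data.Dataset.list_files("/data/imagenet/train-*.tfrecord",
                                       shuffle=False)
    # Each worker keeps a fixed subset of files for all epochs.
    files = files.shard(num_workers, worker_index)
    ds = files.interleave(tf.data.TFRecordDataset,
                          num_parallel_calls=tf.data.AUTOTUNE)
    # Same files every epoch, but sample order is reshuffled on each iteration.
    ds = ds.shuffle(10_000, seed=seed, reshuffle_each_iteration=True)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```

Is this (fixed file shards per node, with only the within-shard order reshuffled) effectively what TF does, or is there per-step coordination as described in question 2?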