Assume a CNN (ResNet50) trained on ImageNet with multi-node distributed training, where each epoch is supposed to iterate over every training sample across nodes via data parallelism.
- Is "iterate each sample once and only once" per epoch always guaranteed, or is it merely possible (i.e., best effort)?
- If it is guaranteed, does TF require a coordinator (e.g., node0) to coordinate across all nodes before each mini-batch, such as partitioning the samples (e.g., node0 loads samples 1-10K, node1 loads samples 10K-20K)?
- If so, does it mean that a given node always loads the same (fixed) subset of datasets/files across epochs 0...N, even though the actual sample order within each step could be shuffled? (See the sketch after this list for what that partitioning would look like.)
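
To make the question concrete, here is a minimal sketch of the kind of deterministic per-worker sharding I have in mind, assuming TF 2.x's `tf.distribute` API with `MultiWorkerMirroredStrategy`. The file pattern, global batch size, and shuffle buffer size are hypothetical placeholders, and this is just an illustration of the semantics being asked about, not a claimed answer:

```python
import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()

def dataset_fn(input_context):
    # List the TFRecord shards; the pattern is a placeholder.
    files = tf.data.Dataset.list_files("imagenet/train-*.tfrecord", shuffle=False)
    # Deterministic file-level shard: worker i always sees the same files,
    # which would correspond to the "fixed subset across epochs" behavior
    # asked about above.
    files = files.shard(
        num_shards=input_context.num_input_pipelines,
        index=input_context.input_pipeline_id,
    )
    ds = files.interleave(tf.data.TFRecordDataset, cycle_length=4)
    # Per-epoch shuffling reorders samples only *within* this worker's shard.
    ds = ds.shuffle(buffer_size=10_000)
    # 256 is a hypothetical global batch size, split across replicas.
    ds = ds.batch(input_context.get_per_replica_batch_size(256))
    return ds

dist_ds = strategy.distribute_datasets_from_function(dataset_fn)
```

With this pattern each worker shards by file index rather than asking a coordinator before every mini-batch, so the question reduces to: is this (or `tf.data`'s auto-sharding policy) what TF actually does, and does it give the once-and-only-once guarantee per epoch?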