This article suggests there are three options for distributed training:
- Data-parallel training with synchronous updates.
- Data-parallel training with asynchronous updates.
- Model-parallel training.
The tutorial then goes on to say that the code that follows performs data-parallel training with asynchronous updates on Cloud ML Engine, which behaves as follows: "If you distribute 10,000 batches among 10 worker nodes, each node works on roughly 1,000 batches."
However, it's not clear what portion of the code actually specifies that this is data-parallel training with asynchronous updates. Is this simply the default for ML Engine if you run it in distributed training mode with a custom `tf.estimator`?
The short answer is that `tf.estimator` is currently mostly built around data-parallel training (2). You get model-parallel training simply by using `with tf.device()` statements in your code. You could try using `tf.train.SyncReplicasOptimizer`, which would probably accomplish synchronous training (1).
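To illustrate the model-parallel case: placement is just a matter of wrapping parts of the computation in `tf.device()` blocks. This is a minimal sketch, not code from the tutorial — the helper name is made up, and both stages are pinned to CPU here so it runs without GPUs; in practice you would use distinct devices such as `"/GPU:0"` and `"/GPU:1"`.

```python
import tensorflow as tf

def model_parallel_forward(x):
    """Hypothetical two-stage forward pass split across devices."""
    # Stage 1: first matmul pinned to one device.
    # (Using CPU so the sketch runs anywhere; swap in "/GPU:0" etc.)
    with tf.device("/CPU:0"):
        w1 = tf.ones([4, 8])
        h = tf.matmul(x, w1)
    # Stage 2: second matmul, which could live on a different device.
    with tf.device("/CPU:0"):
        w2 = tf.ones([8, 2])
        y = tf.matmul(h, w2)
    return y
```

Each `tf.device()` scope tells TensorFlow where to place the ops created inside it; splitting a large model's layers across devices this way is the essence of model-parallel training.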
All of the above applies generally to `tf.estimator`; nothing is different for Cloud ML Engine.