Is it an anti-pattern to do multi-node Spot-enabled distributed GPU training on SageMaker?
I'm afraid that several issues will slow things down or even make them infeasible:
- the interruption detection lag
- the increased probability of interruption (with N instances, any single node being reclaimed interrupts the whole job)
- the need to re-download data at every interruption
- the need to start/stop whole clusters instead of just replacing interrupted nodes
- the fact that SageMaker doesn't support variable-size clusters
Additionally, the EC2 Spot documentation deters users from using Spot for multi-node workflows where nodes are tightly coupled (which is the case in data-parallel and model-parallel training): "Spot Instances are not suitable for workloads that are inflexible, stateful, fault-intolerant, or tightly coupled between instance nodes."
Anybody here have experience doing Spot-enabled distributed GPU training on SageMaker happily?
The short answer is that Spot training works well when the instance type you need, in the region you need, has enough free capacity at the time you need it. Otherwise you either won't be able to start the job, or you'll get interruptions too frequently.
Why not just try it for yourself? Once you have a working on-demand training job, you can enable Spot training by adding three relevant parameters to the job's Estimator definition and implementing checkpoint save/load (good to have anyway); see the sketch below. Then if it works well, great! If not, switch back.
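For reference, here's a minimal sketch of what that looks like with the SageMaker Python SDK. The role ARN, bucket, entry point, framework version, and instance settings are placeholders; the Spot-related parameters are `use_spot_instances`, `max_wait` (which must be at least `max_run`), and `checkpoint_s3_uri` for checkpoint sync:

```python
from sagemaker.pytorch import PyTorch

# Same Estimator as an on-demand job, plus the Spot-related parameters.
# Role, bucket, entry point, and instance settings are placeholders.
estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    framework_version="2.1",
    py_version="py310",
    instance_type="ml.p3.16xlarge",
    instance_count=4,                            # multi-node data-parallel
    use_spot_instances=True,                     # request Spot capacity
    max_run=24 * 3600,                           # hard cap on training time (seconds)
    max_wait=36 * 3600,                          # total time incl. waiting for capacity; >= max_run
    checkpoint_s3_uri="s3://my-bucket/ckpts/",   # synced to/from /opt/ml/checkpoints
)
estimator.fit("s3://my-bucket/training-data/")
```

And inside `train.py`, checkpoint save/load against the local directory SageMaker syncs with that S3 URI (`/opt/ml/checkpoints` by default). The model and optimizer below are stand-ins for your real ones:

```python
import os
import torch
import torch.nn as nn

CKPT_DIR = "/opt/ml/checkpoints"   # default checkpoint_local_path on SageMaker
CKPT_PATH = os.path.join(CKPT_DIR, "latest.pt")
os.makedirs(CKPT_DIR, exist_ok=True)

model = nn.Linear(10, 1)           # stand-in for your real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# If a checkpoint exists, we're resuming after a Spot interruption.
start_epoch = 0
if os.path.exists(CKPT_PATH):
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_epoch = state["epoch"] + 1

for epoch in range(start_epoch, 100):
    ...  # one epoch of training
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        CKPT_PATH,                 # SageMaker uploads this dir to checkpoint_s3_uri
    )
```

The interrupted job restarts from the last saved checkpoint rather than from scratch, which addresses the "re-download and restart everything" concern as long as you checkpoint often enough.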