Is it an anti-pattern to do multi-node Spot-enabled distributed GPU training on SageMaker?
I'm afraid that several issues will slow things down or even make them infeasible:
- the interruption detection lag
- the increased probability of interruption (with N instances, any single node being reclaimed interrupts the whole job)
- the need to re-download data at every interruption
- the need to start/stop whole clusters instead of just replacing interrupted nodes
- the fact that SageMaker doesn't support variable-size clusters
Additionally, the EC2 Spot documentation deters users from using Spot for multi-node workflows where nodes are tightly coupled (which is the case in data-parallel and model-parallel training): "Spot Instances are not suitable for workloads that are inflexible, stateful, fault-intolerant, or tightly coupled between instance nodes."
Anybody here have experience doing Spot-enabled distributed GPU training on SageMaker happily?
The short answer is that Spot training works well when the instance type you need, in the region you need, has enough free capacity at the time you need it. Otherwise you either won't be able to start the job, or you'll get interruptions too frequently.
Why not just try it for yourself? Once you have a working on-demand training job, you can enable Spot training by adding three relevant parameters to the job's Estimator definition and implementing checkpoint save/load (good to have anyway); see the sketch below. Then if it works well, great! If not, switch back.
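For reference, here's a minimal sketch of what that looks like with the SageMaker Python SDK. The role ARN, bucket, entry point, framework version, and instance settings are placeholders; the Spot-related parameters are `use_spot_instances`, `max_wait` (which must be at least `max_run`), and `checkpoint_s3_uri` for checkpoint sync:

```python
from sagemaker.pytorch import PyTorch

# Same Estimator as an on-demand job, plus the Spot-related parameters.
# Role, bucket, entry point, and instance settings are placeholders.
estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    framework_version="2.1",
    py_version="py310",
    instance_type="ml.p3.16xlarge",
    instance_count=4,                            # multi-node data-parallel
    use_spot_instances=True,                     # request Spot capacity
    max_run=24 * 3600,                           # hard cap on training time (seconds)
    max_wait=36 * 3600,                          # total time incl. waiting for capacity; >= max_run
    checkpoint_s3_uri="s3://my-bucket/ckpts/",   # synced to/from /opt/ml/checkpoints
)
estimator.fit("s3://my-bucket/training-data/")
```

And inside `train.py`, checkpoint save/load against the local directory SageMaker syncs with that S3 URI (`/opt/ml/checkpoints` by default). The model and optimizer below are stand-ins for your real ones:

```python
import os
import torch
import torch.nn as nn

CKPT_DIR = "/opt/ml/checkpoints"   # default checkpoint_local_path on SageMaker
CKPT_PATH = os.path.join(CKPT_DIR, "latest.pt")
os.makedirs(CKPT_DIR, exist_ok=True)

model = nn.Linear(10, 1)           # stand-in for your real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# If a checkpoint exists, we're resuming after a Spot interruption.
start_epoch = 0
if os.path.exists(CKPT_PATH):
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_epoch = state["epoch"] + 1

for epoch in range(start_epoch, 100):
    ...  # one epoch of training
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        CKPT_PATH,                 # SageMaker uploads this dir to checkpoint_s3_uri
    )
```

The interrupted job restarts from the last saved checkpoint rather than from scratch, which addresses the "re-download and restart everything" concern as long as you checkpoint often enough.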