I see here that the SageMaker distributed data parallel library only supports three instance types: ml.p3.16xlarge, ml.p3dn.24xlarge, and ml.p4d.24xlarge.
Why is this? I would have thought there would be use cases for parallel training on other GPUs, and potentially even on CPUs.
SageMaker distributed data parallel is designed to work with GPUs only, and it uses the NVIDIA Collective Communications Library (NCCL) for its all-reduce approach. It delivers good performance when used with instances that have more GPUs and higher network bandwidth, which I believe is why only a few instance types are supported.
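
For reference, here is a minimal sketch of how the library is enabled from the SageMaker Python SDK. The entry point, role ARN, and framework/Python versions are placeholders, but the `distribution` argument is the documented switch for the data parallel library, and the job launch will reject `instance_type` values outside the supported list:

```python
# A minimal sketch, assuming a training script train.py and a valid
# SageMaker execution role; framework_version/py_version are illustrative.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",  # assumed training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    framework_version="1.8.1",
    py_version="py36",
    instance_type="ml.p3.16xlarge",  # must be one of the supported GPU types
    instance_count=2,
    # This dict enables the SageMaker distributed data parallel library;
    # with it set, the job fails at launch on unsupported instance types.
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

estimator.fit()
```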