I'm reading up on how to use AWS Batch to train deep learning models. The idea is that, once a model is built, I'd like to submit several jobs to explore the hyperparameter space a bit.
In this interesting blog post, the author set up a compute environment of P2 instances and used it to train a convolutional neural network on MNIST. I am now wondering whether it's possible to request a specific number of GPUs, rather than vCPUs, in my job definition, so that I can be sure each job gets the GPUs it needs. If not, is there any workaround?
AWS Batch has supported GPU allocation and scheduling since April 2019. With this feature, you can specify the number of GPUs your job needs, and Batch also does GPU pinning for your jobs. If an instance has multiple GPUs, Batch can place multiple jobs (each asking for one GPU) on the same instance and have them run concurrently. Here is an example of running GPU jobs with Batch's GPU support: https://aws.amazon.com/blogs/compute/gpu-workloads-on-aws-batch/
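For illustration, here is a minimal boto3 sketch of registering a job definition that asks for one GPU and submitting a job against it. The job definition name, container image, and job queue name are placeholders you would replace with your own; the key part is the `resourceRequirements` entry with `type: GPU`.

```python
import boto3

batch = boto3.client("batch")

# Register a job definition that requests 1 GPU via resourceRequirements.
# The job definition name and image below are placeholders.
resp = batch.register_job_definition(
    jobDefinitionName="mnist-gpu-training",
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/mnist:latest",  # placeholder image
        "vcpus": 4,
        "memory": 16000,
        "command": ["python", "train.py"],
        "resourceRequirements": [
            {"type": "GPU", "value": "1"},  # ask Batch to schedule 1 GPU for this job
        ],
    },
)

# Submit a job to a queue backed by a GPU compute environment (placeholder queue name).
batch.submit_job(
    jobName="mnist-gpu-run",
    jobQueue="gpu-job-queue",
    jobDefinition=resp["jobDefinitionArn"],
)
```

The same `resourceRequirements` block can go directly in a JSON job definition if you register it through the console or CLI instead of boto3.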