I have a CloudFormation template that I created to spin up an EC2 instance with the dependencies needed to use GPU hardware inside a Docker container. The dependencies are installed by a bash script in UserData. The main ones are: 1) NVIDIA drivers, 2) Docker, and 3) nvidia-docker2.
The first two dependencies install as expected and, once the instance has been running for a few moments, can be verified with 1) nvidia-smi and 2) docker --version. The third dependency, however, consistently fails to install.
For reference, here are the relevant parts of my UserData bash:
# install gpu stuff
apt-get install linux-headers-$(uname -r)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-$distribution.pin
mv cuda-$distribution.pin /etc/apt/preferences.d/cuda-repository-pin-600
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/7fa2af80.pub
echo "deb http://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64 /" | tee /etc/apt/sources.list.d/cuda.list
apt-get update
apt-get -y install cuda-drivers
# install docker on system
curl https://get.docker.com | sh
systemctl start docker && systemctl enable docker
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list
apt-get -y install nvidia-docker2 > /var/log/mason
# add nvidia runtime stuff
# echo "{ \"runtimes\": { \"nvidia\": { \"path\": \"/usr/bin/nvidia-container-runtime\", \"runtimeArgs\": [] } } }" >> /etc/docker/daemon.json
systemctl restart docker
I have tried redirecting the stdout of apt-get -y install nvidia-docker2 to a log file, but the log only shows:
Reading package lists...
Building dependency tree...
Reading state information...
and the install seems to be stuck there.
Other potentially helpful bits:
- AMI: Ubuntu 18.04 image
I will also note that I am able to SSH into the instance and run apt-get -y install nvidia-docker2 in the terminal without a hitch (and without any user prompt).
Can anyone help me figure out how to troubleshoot this issue, or does anyone see any potential problems in what I have shared above? Redirecting stdout to a file is about the only trick I know for debugging an issue like this. Please let me know if I can update or edit this post to make the issue easier to debug.
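A note on the logging approach: `>` redirects only stdout, so anything apt-get writes to stderr (where its error messages normally go) never reaches the log file. That may be why the log appears to stop after "Reading state information...". Below is a minimal sketch of a helper that captures both streams. The function name `log_cmd` and the `LOG_FILE` path are hypothetical, introduced only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical log location; choose any path the UserData script can write to.
LOG_FILE="${LOG_FILE:-/var/log/user-data-debug.log}"

# Run a command, appending its stdout AND stderr to LOG_FILE.
# The command's exit status is passed back to the caller.
log_cmd() {
  echo "+ $*" >> "$LOG_FILE"
  "$@" >> "$LOG_FILE" 2>&1
}
```

In UserData this could be used as `log_cmd apt-get -y install nvidia-docker2`. On Ubuntu AMIs that use cloud-init, the output of the UserData script is typically also collected in /var/log/cloud-init-output.log, which is worth checking first.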
Based on the comments.
The issue was caused by not updating Ubuntu's package index after adding the nvidia-docker2 repo. The solution was to run apt-get update after adding the repo and before installing the package.
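As a sketch of the corrected ordering, the nvidia-docker2 part of the UserData script could look roughly like this. The commands are the ones from the question. The wrapper function `install_nvidia_docker2` is a hypothetical name added here so the steps can be read (and called) as one unit, and the script is assumed to run as root, as UserData normally does.

```shell
#!/usr/bin/env bash
# install_nvidia_docker2 is a hypothetical wrapper name used for illustration.
install_nvidia_docker2() {
  local distribution
  # e.g. "ubuntu18.04" on an Ubuntu 18.04 AMI
  distribution=$(. /etc/os-release; echo "$ID$VERSION_ID")

  # Register the nvidia-docker signing key and apt repository.
  curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
  curl -s -L "https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list" \
    | tee /etc/apt/sources.list.d/nvidia-docker.list

  # Refresh the package index so apt can see the newly added repository.
  # This is the step that was missing from the original script.
  apt-get update

  apt-get -y install nvidia-docker2
  systemctl restart docker
}
```

Calling `install_nvidia_docker2` after the Docker install step would replace the corresponding lines of the original script. This would also explain why the manual install over SSH worked, if an apt-get update had run in between.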