nvidia-docker2 won't install in CloudFormation UserData bash script


I have a CloudFormation template that spins up an EC2 instance and installs, via a bash script in UserData, the dependencies needed to use GPU hardware from inside a Docker container. The main dependencies are: 1) NVIDIA drivers, 2) Docker, and 3) nvidia-docker2.

The first two dependencies install as expected; after the instance has been running for a few minutes I can verify them with 1) nvidia-smi and 2) docker --version. The third dependency, however, consistently fails to install.

For reference here are the relevant parts of my UserData bash:

          # install gpu stuff
          apt-get install linux-headers-$(uname -r)
          distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
          wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-$distribution.pin
          mv cuda-$distribution.pin /etc/apt/preferences.d/cuda-repository-pin-600
          apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/7fa2af80.pub
          echo "deb http://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64 /" | tee /etc/apt/sources.list.d/cuda.list
          apt-get update
          apt-get -y install cuda-drivers

          # install docker on system
          curl https://get.docker.com | sh
          systemctl start docker && systemctl enable docker

          distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
          curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
          curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list

          apt-get -y install nvidia-docker2 > /var/log/mason

          # add nvidia runtime stuff
          # echo "{ \"runtimes\": { \"nvidia\": { \"path\": \"/usr/bin/nvidia-container-runtime\", \"runtimeArgs\": [] } } }" >> /etc/docker/daemon.json

          systemctl restart docker

I have tried redirecting the stdout of apt-get -y install nvidia-docker2 to a log file, but the log only shows:

Reading package lists...
Building dependency tree...
Reading state information...

and seems to be stuck there.

Other potential helpful bits:

  • AMI: Ubuntu 18.04 image

I will also note that I am able to SSH into the instance and run apt-get -y install nvidia-docker2 from the terminal without a hitch (and without any user prompt).

Can anyone help me figure out how to troubleshoot this issue, or does anyone see any potential problems in what I have shared above? Redirecting stdout to a file is about the only trick I know for debugging an issue like this. Please let me know if I can update or edit this post to make the issue easier to debug.

There are 2 answers

Marcin On

Based on the comments.

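The issue was caused by not refreshing Ubuntu's package index after adding the nvidia-docker2 repository, so apt did not yet know about the packages in the new repo.

The solution was to run apt-get update after adding the repo and before running apt-get -y install nvidia-docker2.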
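
As a rough sketch of that fix, the nvidia-docker2 part of the UserData script could look like the following. The function name install_nvidia_docker2 is hypothetical and only keeps the steps together; the URLs and the /var/log/mason path come from the question. The added 2>&1 is an assumption about what was hiding the failure: apt-get writes its error messages to stderr, so a plain > redirect would likely capture only the three "Reading..." lines shown in the question.

```shell
#!/usr/bin/env bash
# Sketch only: install nvidia-docker2 non-interactively.
# install_nvidia_docker2 is a hypothetical helper name, not part of any tool.
install_nvidia_docker2() {
  local distribution
  # Same expression as in the question, e.g. "ubuntu18.04".
  distribution=$(. /etc/os-release; echo "$ID$VERSION_ID")

  # Register the nvidia-docker repository (commands taken from the question).
  curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
  curl -s -L "https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list" \
    | tee /etc/apt/sources.list.d/nvidia-docker.list

  # The missing step: refresh the package index so apt sees the new repo.
  apt-get update

  # Send stderr to the log as well, so apt errors are not lost.
  apt-get -y install nvidia-docker2 > /var/log/mason 2>&1

  systemctl restart docker
}
```

In the UserData script, the function would be called once (install_nvidia_docker2) after the Docker install step. On Ubuntu cloud images, the combined output of UserData scripts is normally also written to /var/log/cloud-init-output.log, which is worth checking when a step fails silently.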

Shailesh On

replace:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')

with:

distribution=ubuntu18.04
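
To make the difference between the two forms concrete, here is a small sketch showing what each expression from the question produces for Ubuntu 18.04. The helper names cuda_distribution and docker_distribution are hypothetical; they take the ID and VERSION_ID values as arguments instead of reading /etc/os-release, so the snippet runs anywhere. Note that the question's script uses the dot-stripped form for the CUDA repo URL and the dotted form for the nvidia-docker repo URL, so hardcoding one value for both may not be what you want.

```shell
#!/usr/bin/env bash
# Hypothetical helpers mirroring the two expressions in the question.

# Form used for the CUDA repository URL: dots are stripped by sed.
cuda_distribution() {
  printf '%s%s\n' "$1" "$2" | sed -e 's/\.//g'
}

# Form used for the nvidia-docker repository URL: dots are kept.
docker_distribution() {
  printf '%s%s\n' "$1" "$2"
}

cuda_distribution ubuntu 18.04     # prints: ubuntu1804
docker_distribution ubuntu 18.04   # prints: ubuntu18.04
```

Also note that bash assignments must not have spaces around the equals sign; with spaces, the shell tries to run a command named distribution.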