Segfault when using pytorch with rocm in linux


I have an AMD RX 6600 and I am trying to use PyTorch with ROCm. I am running Arch Linux and using the package provided by the distro. When I try to read back GPU memory, the program crashes: memory appears to be allocated, but any access to it segfaults. Here's the code:

```python
import torch

# check for amd hip
print(torch.cuda.is_available())
print(torch.version.hip)

device = torch.device('cuda')
id = torch.cuda.current_device()
# print gpu name
print(torch.cuda.get_device_name(id))
# no memory is allocated at first
print(torch.cuda.memory_allocated(id))

# store some variable in gpu memory
r = torch.rand(16).to(device)
# memory is allocated
print(torch.cuda.memory_allocated(id))
# crashes when accessing r
print(r[0])
```

And here's the output:

```sh
 ~ > python test.py
True                                                    # gpu compute is available
5.4.22804-                                              # rocm version
AMD Radeon RX 6600                                      # name of gpu
0                                                       # memory allocation at start
512                                                     # memory allocation after storing variable
zsh: segmentation fault (core dumped)  python test.py   # program crashes when reading variable
```

Is there something wrong with my code? How do I debug this? I want to be sure before submitting a bug report to the package maintainer. Any help is appreciated.
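For what it's worth, one stdlib way to get a Python-side traceback out of a native crash like this (assuming the segfault happens in-process, which it does here) is the built-in `faulthandler` module; enabling it before the crashing code makes the interpreter dump tracebacks for all threads when it receives a fatal signal such as SIGSEGV:

```python
import faulthandler

# Dump Python tracebacks for all threads if the process
# receives a fatal signal such as SIGSEGV or SIGABRT.
faulthandler.enable()

print(faulthandler.is_enabled())  # True once enabled
```

The same effect is available without editing the script via `python -X faulthandler test.py` or `PYTHONFAULTHANDLER=1 python test.py`.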


There is 1 answer

robin1010101

FYI, I just have a simple AMD desktop and am a fairly Linux-proficient user, though by no means an expert. Still, I was able to resolve this exact issue with the steps below. Note that this is a workaround, not a root-cause fix. For anyone who finds themselves here, I hope it helps as a hotfix while the underlying problem gets solved elsewhere.

System specs:

  • OS: Ubuntu 22.04.2 LTS x86_64
  • Kernel: 5.19.0-41-generic
  • CPU: AMD Ryzen 9 7900X (24) @ 4.700GHz
  • GPU: AMD Radeon RX 6950 XT

tl;dr: try with caution! Make sure your Linux kernel builds properly, etc.

```sh
sudo apt update
sudo apt install rocm-libs miopen-hip rccl  # installs the rocm dependencies, 14GB worth, so be patient!
pip install torch==1.13.0 torchvision==0.14.0 torchaudio==0.13.0 --index-url "https://download.pytorch.org/whl/rocm5.2"
pip install ipython
ipython
```

```sh
 In [1]: import torch
      ...: 
      ...: # check for amd hip
      ...: print(torch.cuda.is_available())
      ...: print(torch.version.hip)
      ...: 
      ...: device = torch.device('cuda')
      ...: id = torch.cuda.current_device()
      ...: # print gpu name
      ...: print(torch.cuda.get_device_name(id))
      ...: # no memory is allocated at first
      ...: print(torch.cuda.memory_allocated(id))
      ...: 
      ...: # store some variable in gpu memory
      ...: r = torch.rand(16).to(device)
      ...: # memory is allocated
      ...: print(torch.cuda.memory_allocated(id))
      ...: # crashes when accessing r
      ...: print(r[0])
      True
      5.2.21151-afdc89f8
      AMD Radeon RX 6950 XT
      0
      512
      tensor(0.3706, device='cuda:0')
 In [3]: torch.__version__
 Out[3]: '1.13.0+rocm5.2'

```
  1. Ensure that your ROCm install works correctly. I've had to fix broken installs and had only terrible experiences with unresolvable package dependencies when using amdgpu-install. On top of that, holy moly, the ROCm docs are hard to read for a simple general user (as I identify as), and in general amdgpu-install does a poor job unless you're already familiar with AMD GPU architectures. I had better luck with plain old apt-get.
  2. Unfortunately, no matter what I have done, I cannot get ROCm 5.4.2 to work properly with torch 2.0.0. This is a shame, since 2.0.0 brings considerable speedups and fancy new things to play around with, but 1.13.0 is the latest version I could get working without a segfault when sending anything to the GPU. I haven't tried building from source, though.
  3. This step is optional (I'm not sure it actually fixed the problem), but a previous attempt using amdgpu-install had errors building the DKMS kernel module (yikes), so I went ahead and followed the steps here to ensure nothing surprising happened on restart: https://askubuntu.com/a/1240434/1416884.
  4. Check your ROCm version. There apparently isn't an easy way to do this (no idea why), but this is the best I found:
$ ll -ah /etc/alternatives/rocminfo

lrwxrwxrwx 1 root root 28 Apr 25 22:31 /etc/alternatives/rocminfo -> /opt/rocm-5.4.3/bin/rocminfo
  5. Install PyTorch 1.13.0 built against ROCm 5.2. ROCm is backward compatible, so anything at or after 5.2 should be alright; see the compatibility chart here, and note that 1.13 is the highest supported PyTorch version for ROCm 5.2.
  6. Lastly, if you want a sample install script that installs alongside your Python package, feel free to check this one out [here]. It's under construction, but it should work with just a bash install.sh.
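As a small sketch of that compatibility check (the helper name is just an illustration, and the path is the one from the symlink check above), the rocminfo symlink target can be parsed into a version tuple and compared against the 5.2 baseline the wheels were built for:

```python
import re

def rocm_version_from_path(path):
    # Pull e.g. "5.4.3" out of "/opt/rocm-5.4.3/bin/rocminfo"
    m = re.search(r"rocm-(\d+)\.(\d+)\.(\d+)", path)
    return tuple(int(x) for x in m.groups()) if m else None

ver = rocm_version_from_path("/opt/rocm-5.4.3/bin/rocminfo")
print(ver)                # (5, 4, 3)
print(ver >= (5, 2, 0))   # True: wheels built against rocm5.2 should load
```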

Alright, hope this helps!