I have an AMD RX 6600 and I am trying to use PyTorch with ROCm. I am running Arch Linux and using the package provided by the distro. When I try to access the GPU memory, the program crashes. It seems that memory is being allocated, but I cannot read it. Here's the code:
import torch
# check for amd hip
print(torch.cuda.is_available())
print(torch.version.hip)
device = torch.device('cuda')
id = torch.cuda.current_device()
# print gpu name
print(torch.cuda.get_device_name(id))
# no memory is allocated at first
print(torch.cuda.memory_allocated(id))
# store some variable in gpu memory
r = torch.rand(16).to(device)
# memory is allocated
print(torch.cuda.memory_allocated(id))
# crashes when accessing r
print(r[0])
And here's the output:
~ > python test.py
True # gpu compute is available
5.4.22804- # rocm version
AMD Radeon RX 6600 # name of gpu
0 # memory allocation at start
512 # memory allocation after storing variable
zsh: segmentation fault (core dumped) python test.py # program crashes when reading variable
Is there something wrong with my code? How do I debug this? I want to be sure before submitting a bug report to the package maintainer. Any help is appreciated.
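One way to narrow this down before filing a report is to run each step in its own interpreter, so a segfault in one step does not take the whole test down with it. The sketch below is only an illustration: `run_isolated`, `STEPS`, and `diagnose` are hypothetical helper names, not part of PyTorch or ROCm. It uses only the standard library's `subprocess` and the interpreter's `-X faulthandler` option, which prints the Python traceback when the process receives a fatal signal. It assumes a POSIX system, where a negative return code means the child was killed by a signal.

```python
import signal
import subprocess
import sys


def run_isolated(code, timeout=120):
    """Run `code` in a fresh interpreter with faulthandler enabled.

    Returns (status, stderr), where status is "ok", "error (exit N)",
    or "killed by SIGNAME".
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode == 0:
        return "ok", proc.stderr
    if proc.returncode < 0:
        # On POSIX, a negative return code is the signal that killed the child.
        name = signal.Signals(-proc.returncode).name
        return f"killed by {name}", proc.stderr
    return f"error (exit {proc.returncode})", proc.stderr


# Hypothetical step list mirroring the test script above: each entry
# repeats the setup and then does exactly one more thing.
SETUP = "import torch; r = torch.rand(16).to('cuda'); "
STEPS = {
    "allocate only": SETUP + "print(r.shape)",
    "read one element": SETUP + "print(r[0])",
    "copy back to CPU": SETUP + "print(r.cpu())",
}


def diagnose():
    """Run every step in isolation and print which ones crash."""
    for name, code in STEPS.items():
        status, stderr = run_isolated(code)
        print(f"{name}: {status}")
        if status != "ok":
            # Show the tail of stderr, where faulthandler writes its traceback.
            print(stderr.strip()[-500:])
```

Calling `diagnose()` prints one status line per step. If "allocate only" passes and "read one element" is killed by SIGSEGV, that points at the GPU kernel or copy path rather than at the Python code, which is useful detail for a bug report.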
FYI, I have just a simple AMD desktop and am a fairly Linux-proficient user, but am by no means knowledgeable. However, I was able to resolve this exact same issue with the following steps, although this is just a workaround and does not fix the underlying cause. For anyone who finds themselves here, I hope this helps as a hotfix while this is solved elsewhere.
System specs:
tl;dr: try with caution! Ensure your Linux kernel builds properly, etc.
amdgpu-install, not to mention, holy moly, the ROCm docs are hard to read for a simple general user (as I identify as), and in general I've found it to do a terrible install unless you're super familiar with AMD GPU architectures. Had better luck with just a plain old apt-get.
amdgpu-install had errors building the dkms kernel (yikes), so I just went ahead and followed the steps here to ensure nothing surprising happened on restart (https://askubuntu.com/a/1240434/1416884).
[here], although it's under construction, it should work with just a bash install.sh
Alright, hope this helps!