Pytorch ROCm fails to train even though it's setup correctly

401 views Asked by At

I have installed pytorch on my Arch machine using the recommended snippet from the official site:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.6

Set the following environment variables:

export PYTORCH_ROCM_ARCH="gfx1031"
export HSA_OVERRIDE_GFX_VERSION=10.3.1
export HIP_VISIBLE_DEVICES=0
export ROCM_PATH=/opt/rocm
export HIP_PLATFORM=amd
export HIP_DEVICE=0

and running this test script yields the following status of pytorch:

Checking ROCM support...
GOOD: ROCM devices found:  2
Checking PyTorch...
GOOD: PyTorch is working fine.
Checking user groups...
GOOD: The user enchilada is in RENDER and VIDEO groups.
GOOD: PyTorch ROCM support found.
Testing PyTorch ROCM support...
Everything fine! You can run PyTorch code inside of:
--->  Intel(R) Core(TM) i7-10700KF CPU @ 3.80GHz
--->  gfx1031

After all of that, if i try to train a model on the gpu, i get the error:

File "/storage/Programs/PythonVenvs/transformers-rocm/lib/python3.11/site-packages/torch/nn/init.py", line 19, in _no_grad_normal_
    return tensor.normal_(mean, std)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: HIP error: invalid device function

I am using a slightly modified version of TinyLlama's finetune script, which runs perfectly on cpu, so i believe that my rocm setup is at fault.

I would really appreciate any tips as to how i can solve this.

rocminfo output (i deleted the cpu agent part to make it shorter):

ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE

*******
Agent 2
*******
  Name:                    gfx1031
  Uuid:                    GPU-XX
  Marketing Name:          AMD Radeon RX 6700 XT
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      3072(0xc00) KB
    L3:                      98304(0x18000) KB
  Chip ID:                 29663(0x73df)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2855
  BDFID:                   768
  Internal Node ID:        1
  Compute Unit:            40
  SIMDs per CU:            2
  Shader Engines:          2
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    12566528(0xbfc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1031
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***
0

There are 0 answers