I have been troubleshooting this issue for over a week because it leaves no trace of an error in any log. I'm asking this question to see if anyone else has experienced it.
No matter which notebook I use, or which modules I install, upgrade, or uninstall, the Trainer() call causes the VM to shut down instantly.
I suspect it is GPU related, since the same code runs on the CPU with no problems.
I have made the devices visible (0,1). I have also tried enabling/disabling wandb and setting report_to="none".
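For reference, this is roughly the configuration I mean; the model and dataset below are placeholders, not my actual notebook (a minimal sketch, assuming a standard transformers text-classification run):

# Minimal sketch of the setup described above; model/dataset are placeholders.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"   # make devices 0 and 1 visible
os.environ["WANDB_DISABLED"] = "true"        # one way of disabling wandb

from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "distilbert-base-uncased"       # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

dataset = load_dataset("imdb", split="train[:200]")   # tiny placeholder dataset
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length",
                         max_length=128),
    batched=True,
)

args = TrainingArguments(
    output_dir="out",
    report_to="none",                        # no wandb / tensorboard reporting
    per_device_train_batch_size=8,
    num_train_epochs=1,
)

trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()   # on GPU the VM powers off almost as soon as this starts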
Is cuda available? True
Cuda torch version? 12.1
Is cuDNN version: 8902
cuDNN enabled? True
Device count? 1
Current device? 0
Device name? NVIDIA A30
tensor([[0.4543, 0.0545, 0.9293],
        [0.7722, 0.6535, 0.1276],
        [0.9957, 0.5621, 0.1621],
        [0.3164, 0.2845, 0.6874],
        [0.5489, 0.7582, 0.7139]])
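For completeness, the output above comes from the usual PyTorch sanity checks, roughly like this (a sketch of the checks, not the exact cell):

import torch

print('Is cuda available?', torch.cuda.is_available())
print('Cuda torch version?', torch.version.cuda)
print('Is cuDNN version:', torch.backends.cudnn.version())
print('cuDNN enabled?', torch.backends.cudnn.enabled)
print('Device count?', torch.cuda.device_count())
print('Current device?', torch.cuda.current_device())
print('Device name?', torch.cuda.get_device_name(0))

# the 5x3 random tensor above is the standard install check
print(torch.rand(5, 3))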
# setting device on GPU if available, else CPU
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

# additional info when using cuda
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3, 1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_reserved(0)/1024**3, 1), 'GB')
Has anyone experienced this before?
This is likely one of two things.
I'm leaning towards #2: your GPU passthrough setting is likely incorrect. Check out the link below, as it may help with your problem.
https://mathiashueber.com/windows-virtual-machine-gpu-passthrough-ubuntu/
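As a quick way to confirm whether the GPU is actually healthy inside the guest (a sketch; it assumes nvidia-smi is on the PATH in the VM), compare what the driver reports with what torch reports:

import subprocess

# If the passthrough is misconfigured, the first symptoms usually show up
# here (errors, a missing device, or a hang) rather than in the Python-side
# CUDA checks.
result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
print(result.stdout or result.stderr)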