u/Illustrious_Tap9300 — reddlx

Hello everyone,

I have a Bosgame mini with 128 GB of unified memory (UMA, shared between RAM and VRAM). I am trying to train a neural network, but the script hangs whenever I reach the stage of sending something to the device via PyTorch:

import torch
import sys


print("1. PyTorch loaded.")
device = torch.device('cuda')
print(f"2. Device selected: {torch.cuda.get_device_name(0)}")


print("3. Creating CPU Tensor...")
x = torch.ones((100, 100))


print("4. Attempting Memory Allocation...")
try:
    x = x.to(device)
    print("5. SUCCESS! GPU Memory Allocated.")
    print("6. Doing math...")
    y = x * 2
    print("7. Math successful. Hardware is fully operational.")
except Exception as e:
    print(f"FAILED: {e}")

This is a quick example to show my situation. The script hangs completely on the line x = x.to(device), so the last thing printed in the terminal is "4. Attempting Memory Allocation..." and nothing after it. No exception is raised, so the except branch never runs.
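Because the hang never raises, one thing I could try is running the transfer in a child process with a timeout, so the hang at least becomes a reportable result instead of a frozen terminal. This is only a rough sketch: run_probe and PROBE are names I made up for illustration, and if the child is stuck inside the driver in an uninterruptible state, killing it on timeout may itself block.

```python
import subprocess
import sys

# Hypothetical probe: the same transfer as in the script above, run in a child process.
PROBE = """
import torch
x = torch.ones((100, 100))
x = x.to(torch.device('cuda'))
torch.cuda.synchronize()  # wait for the device so a stall shows up here
print('transfer ok')
"""


def run_probe(code, timeout_s=30.0):
    """Run `code` in a fresh Python process; return (status, combined output).

    status is "ok" (exit code 0), "error" (non-zero exit code),
    or "timeout" (the child did not finish within timeout_s seconds).
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising this exception.
        return "timeout", ""
    status = "ok" if result.returncode == 0 else "error"
    return status, (result.stdout + result.stderr).strip()


if __name__ == "__main__":
    status, output = run_probe(PROBE, timeout_s=30.0)
    print(f"status: {status}")
    print(output)
```

A "timeout" status here would confirm that the stall is in the transfer itself and not in anything else the training script does.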

PyTorch detects the device correctly and everything else looks fine, but then I hit this roadblock, and I've been scratching my head over it for a while.

I followed AMD's documentation for ROCm. I am using Ubuntu 24.04 with the latest kernel version. Has anyone come across this issue, and if you solved it, how did you do it?
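In case exact versions help with diagnosing this, here is roughly how I would collect them. It is a small sketch, not an official diagnostic tool: collect_env_info is a name I made up, and it only reads values PyTorch and the standard library already expose (torch.version.hip is None on builds without ROCm).

```python
import platform


def collect_env_info():
    """Gather kernel, Python, PyTorch and ROCm/HIP version details into a dict."""
    info = {
        "kernel": platform.release(),
        "python": platform.python_version(),
        "torch": None,
        "hip": None,
        "device": None,
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment; report what we have.
        return info

    info["torch"] = str(torch.__version__)
    # Set on ROCm builds of PyTorch, None otherwise.
    info["hip"] = getattr(torch.version, "hip", None)
    if torch.cuda.is_available():
        # Only queries the device name; no tensor is moved to the device.
        info["device"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print(f"{key}: {value}")
```

This does not allocate any device memory, so it should run even on the setup where the transfer hangs.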

I would really appreciate the help, thank you all.