I'm asking for help. I am repurposing an older PC with an Intel Pentium G4400 CPU, 16 GB of RAM, and seven GTX 1080 Ti GPUs, and I want to train a YOLOv8 model on it. I know the CPU and RAM impose some limitations, but I hoped it would not be that bad. However, I've run into some issues. I've installed CUDA and Ultralytics, and training works when I use a single GPU. When I train with more GPUs, though, the results are not what I expected: I thought the time per epoch would go down, but it stays the same as with one GPU, even though I scale the batch size with the number of GPUs in use. Multi-GPU training is just slow and inefficient. I also get warnings like this:
[W socket.cpp:697] [c10d] The client socket has failed to connect to ["name_of_pc"]: 54920 (system error: 10049 - The requested address is not valid in its context.).
I've googled the error but found no answer. I don't know whether the firewall is to blame. This is the code I'm using:
import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'
import torch
from ultralytics import YOLO

if __name__ == '__main__':
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        cuda_devices = list(range(torch.cuda.device_count()))
        print("Using GPUs:", ", ".join([torch.cuda.get_device_name(i) for i in cuda_devices]))
    else:
        print("Using CPU")
    # num_workers = 0
    # Load the YOLOv8 model
    model = YOLO('yolov8n.pt')  # Automatic device selection
    # Train the model with specified and adjusted parameters
    results = model.train(
        data=r"D:\AI_Training\Datasets\siemens_numbers_mk3\data.yaml",
        imgsz=1280,  # Image size
        epochs=100,  # Number of epochs
        batch=64,  # Batch size, adjust based on your GPU memory capacity
        name='yolov8_b16_e100_sz1280_dtmk3',  # Custom run name
        workers=2,  # Dataloader workers (per GPU process in multi-GPU training)
        amp=True,  # Automatic Mixed Precision
        device=[0, 1, 2, 3]  # GPU indices to train on
    )
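As I understand it, Ultralytics treats `batch` as the total batch size and splits it evenly across the GPUs listed in `device` during multi-GPU (DDP) training, so `batch=64` on four GPUs should mean 16 images per GPU. The rough sketch below shows the arithmetic behind my expectation, assuming that even split and a hypothetical dataset size of 5000 images (not my real dataset): keeping the per-GPU batch fixed while adding GPUs should cut the number of optimizer steps per epoch.

```python
import math


def per_gpu_batch(total_batch: int, num_gpus: int) -> int:
    """Images each GPU processes per step, assuming the total batch is split evenly."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return total_batch // num_gpus


def steps_per_epoch(num_images: int, total_batch: int) -> int:
    """Optimizer steps needed to see every image once (a final partial batch counts as a step)."""
    return math.ceil(num_images / total_batch)


if __name__ == "__main__":
    NUM_IMAGES = 5000  # hypothetical dataset size, for illustration only
    PER_GPU = 16       # keep the per-GPU batch fixed while adding GPUs
    for gpus in (1, 2, 4, 7):
        total = PER_GPU * gpus
        print(f"{gpus} GPU(s): batch={total}, per GPU={per_gpu_batch(total, gpus)}, "
              f"steps/epoch={steps_per_epoch(NUM_IMAGES, total)}")
```

If the epoch time stays flat while the step count drops like this, each step must be getting slower, which would suggest the bottleneck is possibly somewhere other than the GPUs themselves (for example data loading on the 2-core CPU, or communication between the GPU processes).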
I change the number of devices for each attempt. I've tried different batch sizes, different numbers of GPUs, and all sorts of other settings, but nothing has worked. Does anyone have any ideas on how to make it work? Thanks!
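To check whether the firewall (or something else on the machine) blocks local TCP connections in general, a minimal sketch like the one below can be run on its own. It only opens a listener on the loopback address and connects to it, and separately tries to resolve the PC's own hostname. It does not reproduce PyTorch's `c10d` rendezvous, so a success here does not rule out the problem in the warning above. `loopback_tcp_works` and `resolve_hostname` are hypothetical helper names, not part of PyTorch or Ultralytics.

```python
import socket


def loopback_tcp_works(timeout: float = 2.0) -> bool:
    """Open a listener on 127.0.0.1 (OS-chosen port) and try to connect to it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        try:
            # The TCP handshake completes against the listening socket even without accept()
            with socket.create_connection(("127.0.0.1", port), timeout=timeout):
                return True
        except OSError:
            return False


def resolve_hostname() -> str:
    """Return the IPv4 address the PC's own hostname resolves to, or the resolution error."""
    name = socket.gethostname()
    try:
        return f"{name} -> {socket.gethostbyname(name)}"
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return f"{name} could not be resolved: {exc}"


if __name__ == "__main__":
    print("Loopback TCP connection works:", loopback_tcp_works())
    print("Hostname resolution:", resolve_hostname())
```

If the loopback check passes but the hostname resolves to an address I wouldn't expect, that might be worth comparing against the host the warning says it failed to connect to.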