Error when training a Hugging Face model on a machine with multiple GPUs

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0

The error above occurred while running trainer.train(). Because I was working in a Jupyter environment, I tried the following:

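import os

import torch

# Order GPUs by PCI bus ID so the indices match nvidia-smi.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

# Expose only GPU 0 to this process (must run before CUDA is initialized).
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

device = torch.device('cuda:0')

# model is the Hugging Face model created earlier in the notebook.
model.to(device)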

Even after doing this, the same error still occurred.
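One possible reason the environment variables had no effect in a notebook: CUDA_VISIBLE_DEVICES is read when CUDA is initialized, so assigning it after an earlier cell has already touched the GPU changes nothing for that kernel. The sketch below uses a hypothetical helper, pin_visible_gpu (not part of PyTorch), that sets both variables and reports whether CUDA had already been initialized. It assumes this was the cause here, which the post does not confirm.

```python
import os
import sys


def pin_visible_gpu(index="0"):
    """Expose a single GPU to this process.

    Returns True if the setting can still take effect, and False if
    torch had already initialized CUDA (a kernel restart is then needed).
    pin_visible_gpu is a hypothetical helper name, not a PyTorch API.
    """
    # Only inspect torch if it has already been imported; do not import it here.
    torch_mod = sys.modules.get("torch")
    already_initialized = bool(
        torch_mod is not None and torch_mod.cuda.is_initialized()
    )

    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return not already_initialized


if __name__ == "__main__":
    if not pin_visible_gpu("0"):
        print("CUDA was already initialized; restart the kernel and run this first.")
```

In a notebook, this would go in the very first cell, before any cell that creates a model or calls into torch.cuda.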

print('Device:', device)

print('Current cuda device:', torch.cuda.current_device())

print('Count of using GPUs:', torch.cuda.device_count())

Printing these gives:

Device: cud

The device settings looked correct, which was maddening.

Then I printed training_args and saw:

TrainingArguments( _n_gpu=1, -> this is what it showed

So, after setting the device and training_args as above, I explicitly assigned training_args._n_gpu = 1, and training finally started.
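The workaround can be wrapped in a small helper. Trainer wraps the model in torch.nn.DataParallel when training_args.n_gpu is greater than 1, which is one way tensors can end up on both cuda:0 and cuda:1. The sketch below uses a hypothetical function, force_single_gpu, that reads n_gpu and overrides the private _n_gpu attribute when more than one GPU was detected. Since _n_gpu is an internal attribute of TrainingArguments, treat this as a stopgap that may stop working in other transformers versions.

```python
def force_single_gpu(training_args):
    """Force a TrainingArguments-like object to report one GPU.

    Returns the GPU count that was detected before the override.
    force_single_gpu is a hypothetical helper, not a transformers API.
    """
    # Reading n_gpu makes TrainingArguments run its device setup first,
    # so the override below is not overwritten by a later lazy setup.
    detected = training_args.n_gpu

    if detected > 1:
        # Private attribute: this mirrors the manual fix from the post.
        training_args._n_gpu = 1

    return detected
```

It would be called once after TrainingArguments is constructed and before trainer.train(), for example: detected = force_single_gpu(training_args).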
