
Trainer RuntimeError CUDA error #19862

Closed
devozs opened this issue Oct 25, 2022 · 4 comments


devozs commented Oct 25, 2022

System Info

Versions

I've tried using:

  • transformers==4.15.0, 4.8.0, and the latest release
  • Python 3.8, 3.9, and 3.10

nvidia-smi

NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7

Thanks!

Who can help?

@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Actual error:

File: transformers/trainer.py
Failing at: tr_loss = torch.tensor(0.0).to(args.device)
Returns: RuntimeError: CUDA error: invalid argument

Code

I am following several examples and getting the same error above, at the same line.
As a reference, you can reproduce it according to this code sample.
The exact same error occurs with other Hugging Face training examples.

Also tried

  • Running the line tr_loss = torch.tensor(0.0).to(args.device) as a standalone statement works fine.
  • I also ran this line as part of gpt_neo.py in the example above; it worked fine there but later failed inside transformers/trainer.py.
  • I made sure CUDA is available: torch.cuda.is_available() returns True.
  • Running torch.tensor(0.0) alone works fine; it fails only once .to(device) is added (see the sketch below).
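For reference, a minimal sketch of the failing pattern (assuming a single-GPU setup where args.device resolves to cuda:0, as shown later in the thread):

```python
import torch

# What args.device resolves to on this machine (per the env details below).
device = torch.device("cuda:0")
print(torch.cuda.is_available())  # True here, so CUDA is visible

loss = torch.tensor(0.0)  # creating the CPU tensor works fine on its own
loss = loss.to(device)    # reported to raise "RuntimeError: CUDA error: invalid argument"
                          # when reached via transformers/trainer.py, but not standalone
```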

Expected behavior

No error at tr_loss = torch.tensor(0.0).to(args.device).

Collaborator

sgugger commented Oct 25, 2022

This looks linked to your particular setup. Can you add a print of args.device in the script you are running and copy-paste the result of transformers-cli env (as was requested in the template)?
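A minimal sketch of that debugging print (the name training_args is an assumption; it stands for whatever TrainingArguments instance the script passes to Trainer):

```python
# In the training script, just before constructing the Trainer
# (training_args is the TrainingArguments instance; the name is assumed):
print("args.device =", training_args.device)
```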

Author

devozs commented Oct 25, 2022

Thanks for the prompt reply, and sorry for missing the transformers-cli env output.

args.device: cuda:0

Env:

  • transformers version: 4.23.1
  • Platform: Linux-5.15.0-52-generic-x86_64-with-glibc2.35
  • Python version: 3.8.15
  • Huggingface_hub version: 0.10.1
  • PyTorch version (GPU?): 1.12.1+cu116 (True)
  • Tensorflow version (GPU?): 2.10.0 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.6.1 (cpu)
  • Jax version: 0.3.23
  • JaxLib version: 0.3.22
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

(I've also tried other PyTorch and transformers versions.)

Author

devozs commented Oct 26, 2022

I hope it's OK that I'm putting a link to a different ML library (in case it's not, I'll delete it).
This issue seems to be similar.

Collaborator

sgugger commented Oct 26, 2022

It doesn't look exactly similar, in the sense that that issue occurs in an environment without a GPU, whereas yours shows one. Unless, of course, you are not executing the script within the exact same env that produced the command output above.
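One quick way to rule that out (a minimal sketch, not from the thread): run the following with the exact interpreter used for training and compare against the transformers-cli env output above:

```python
# Run with the exact interpreter used for training, then compare
# against the transformers-cli env output above.
import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
```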

devozs closed this as completed Oct 31, 2022