
Trainer RuntimeError CUDA error #19862

Closed
devozs opened this issue Oct 25, 2022 · 4 comments


devozs commented Oct 25, 2022

System Info

Versions

I've tried using:

  • transformers==4.15.0, 4.8.0, and the latest release
  • Python 3.8, 3.9, and 3.10

nvidia-smi

NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7

Thanks!

Who can help?

@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Actual error:

File: transformers/trainer.py
Failing at: tr_loss = torch.tensor(0.0).to(args.device)
Returns: RuntimeError: CUDA error: invalid argument

Code

I am following several examples and getting the same error above, at the same line.
As a reference, you can reproduce it according to this code sample.
The exact same error occurs with other Hugging Face training examples.

Also tried

  • Running the line tr_loss = torch.tensor(0.0).to(args.device) as a standalone statement works fine.
  • I also ran this line as part of gpt_neo.py in the example above; it worked fine there but later failed inside transformers/trainer.py.
  • I made sure CUDA is available: torch.cuda.is_available() returns True.
  • Running torch.tensor(0.0) alone works fine; it fails only once .to(device) is added (see the sketch below).
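For reference, a minimal sketch of the failing pattern (assuming a single-GPU setup where args.device resolves to cuda:0, as shown later in the thread):

```python
import torch

# What args.device resolves to on this machine (per the env details below).
device = torch.device("cuda:0")
print(torch.cuda.is_available())  # True here, so CUDA is visible

loss = torch.tensor(0.0)  # creating the CPU tensor works fine on its own
loss = loss.to(device)    # reported to raise "RuntimeError: CUDA error: invalid argument"
                          # when reached via transformers/trainer.py, but not standalone
```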

Expected behavior

No error at tr_loss = torch.tensor(0.0).to(args.device).

Collaborator

sgugger commented Oct 25, 2022

This looks linked to your particular setup. Can you add a print of args.device in the script you are running and copy-paste the result of transformers-cli env (as was requested in the template)?
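A minimal sketch of that debugging print (the name training_args is an assumption; it stands for whatever TrainingArguments instance the script passes to Trainer):

```python
# In the training script, just before constructing the Trainer
# (training_args is the TrainingArguments instance; the name is assumed):
print("args.device =", training_args.device)
```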

Author

devozs commented Oct 25, 2022

Thanks for the prompt reply, and sorry for missing the transformers-cli env output.

args.device: cuda:0

Env:

  • transformers version: 4.23.1
  • Platform: Linux-5.15.0-52-generic-x86_64-with-glibc2.35
  • Python version: 3.8.15
  • Huggingface_hub version: 0.10.1
  • PyTorch version (GPU?): 1.12.1+cu116 (True)
  • Tensorflow version (GPU?): 2.10.0 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.6.1 (cpu)
  • Jax version: 0.3.23
  • JaxLib version: 0.3.22
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

(I've also tried other PyTorch and transformers versions.)

Author

devozs commented Oct 26, 2022

I hope it's OK that I'm putting a link to a different ML library (in case it's not, I'll delete it).
This issue seems to be similar.

Collaborator

sgugger commented Oct 26, 2022

It doesn't look exactly similar, in the sense that that issue occurs in an environment without a GPU, whereas yours shows one. Unless, of course, you are not executing the script within the exact same env that produced the command output above.
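One quick way to rule that out (a minimal sketch, not from the thread): run the following with the exact interpreter used for training and compare against the transformers-cli env output above:

```python
# Run with the exact interpreter used for training, then compare
# against the transformers-cli env output above.
import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
```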

devozs closed this as completed Oct 31, 2022