Apex issue #14
Comments
Hi! Did you follow the commands in this script for setting up apex? https://github.com/salesforce/GeDi/blob/master/scripts/setup.sh
Yes, I did follow those commands, but they did not help. I identified the issue and worked around it by commenting out the exception that was raised; after that, apex installs without any errors. The next problem I hit is that training raises another exception: Epoch: 0% 0/1 [00:01<?, ?it/s] I cannot find any clue on how to solve this. I found no resources online, and I have tried altering as much code as I can, but to no avail.
I was able to resolve this error by changing `loss_a*=loss_mask` to `loss_a = loss_a * loss_mask` in train_GeDi.py at line 355. The error occurs because the first form performs an in-place operation on the tensor, while the second creates a new tensor.
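To illustrate why the two forms behave differently, here is a toy model in plain Python that needs no PyTorch. `ToyTensor`, its `version` counter, and `backward_check` are hypothetical names invented for this sketch. PyTorch's real autograd bookkeeping is more involved, but it likewise detects when a tensor saved for the backward pass has been changed in place.

```python
class ToyTensor:
    """Toy stand-in for a tensor (hypothetical, NOT the PyTorch API)."""

    def __init__(self, values):
        self.values = list(values)
        self.version = 0  # bumped on every in-place change

    def __mul__(self, other):
        # Out-of-place: build a new object and leave self untouched.
        return ToyTensor(v * m for v, m in zip(self.values, other.values))

    def __imul__(self, other):
        # In-place: overwrite self and bump the version counter.
        self.values = [v * m for v, m in zip(self.values, other.values)]
        self.version += 1
        return self


def backward_check(saved, saved_version):
    """Raise if a value saved for the backward pass was modified afterwards."""
    if saved.version != saved_version:
        raise RuntimeError("saved value was modified by an in-place operation")
    return True


mask = ToyTensor([1, 0, 1])

# In-place form (`*=`): the saved object itself is changed.
loss_a = ToyTensor([0.5, 0.2, 0.3])
saved, saved_version = loss_a, loss_a.version
loss_a *= mask
try:
    backward_check(saved, saved_version)
    inplace_ok = True
except RuntimeError:
    inplace_ok = False

# Out-of-place form: the name is rebound to a new object, the saved one is intact.
loss_b = ToyTensor([0.5, 0.2, 0.3])
saved_b, saved_b_version = loss_b, loss_b.version
loss_b = loss_b * mask
out_of_place_ok = backward_check(saved_b, saved_b_version)
```

In this sketch the `*=` path fails the check while the `loss_b = loss_b * mask` path passes, which mirrors the fix described above.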
I am running my code on Google Colab with 12 GB of RAM and on CUDA, but it gives me this error: RuntimeError: CUDA out of memory. Tried to allocate 12.00 MiB (GPU 0; 11.17 GiB total capacity; 10.42 GiB already allocated; 1.81 MiB free; 10.66 GiB reserved in total by PyTorch) The GPU runs out of memory on an allocation of just 12 MiB. How can I free up the memory that PyTorch has reserved, since it holds most of it? What I have tried on my end is
But to no avail.
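As a reading aid for the error above, here is a small sketch that parses the quoted message and converts every figure to MiB. The helper names `to_mib` and `parse_oom` are made up for this example, and the regular expressions assume the message wording shown in this thread, which may differ in other PyTorch versions.

```python
import re

# Error text copied from the comment above.
msg = (
    "CUDA out of memory. Tried to allocate 12.00 MiB (GPU 0; 11.17 GiB total capacity; "
    "10.42 GiB already allocated; 1.81 MiB free; 10.66 GiB reserved in total by PyTorch)"
)

UNITS = {"MiB": 1.0, "GiB": 1024.0}


def to_mib(value, unit):
    """Convert a (number, unit) pair from the message to MiB."""
    return float(value) * UNITS[unit]


def parse_oom(message):
    """Extract the memory figures from the message, all in MiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    return {key: to_mib(*re.search(p, message).groups()) for key, p in patterns.items()}


stats = parse_oom(msg)
# Memory PyTorch holds in its cache but that is not occupied by live tensors.
cached_unused = stats["reserved"] - stats["allocated"]
print(stats)
print("reserved but not allocated (MiB):", round(cached_unused, 2))
```

Read this way, the 12 MiB request is only the last straw: about 10.42 GiB is already occupied by live tensors, and only a small part of the reserved memory is unused cache. `torch.cuda.empty_cache()` releases unused cached blocks but does not free memory held by live tensors, so lowering per-step memory use (for example a smaller batch size or sequence length) is usually the more effective direction to try.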
When I run `!bash run_training.sh` after `%cd scripts`, I get the following error.
`09/03/2021 08:57:05 - INFO - __main__ - Saving features into cached file ../data/AG-news/cached_train_gpt2-medium_192_sst-2
Traceback (most recent call last):
  File "../train_GeDi.py", line 193, in train
    from apex import amp
ModuleNotFoundError: No module named 'apex'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "../train_GeDi.py", line 1103, in <module>
    main()
  File "../train_GeDi.py", line 1052, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "../train_GeDi.py", line 195, in train
    raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
ImportError: Please install apex from https://www.github.com/nvidia/apex to use fp16 training.`
This happens even though apex is installed. How can I resolve this issue?
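One quick diagnostic, sketched here under the assumption that the training script may be running under a different Python interpreter than the one apex was installed into, is to check from the same environment whether `apex` can be located at all. The helper name `module_available` is made up for this example; `importlib.util.find_spec` and `sys.executable` are standard library.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the top-level module `name` can be located by this interpreter."""
    return importlib.util.find_spec(name) is not None


# Show which interpreter is running and whether it can see apex.
print("interpreter:", sys.executable)
print("apex importable:", module_available("apex"))
```

If this prints `False` when run with the same `python` that `run_training.sh` invokes, the package was most likely installed into a different environment, or the installation did not complete.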