
Seems unable to utilize multiple GPUs #11

Open
jerermyyoung opened this issue Jun 5, 2022 · 1 comment


@jerermyyoung

Hi there.

I have tried running this code on one of my machines with four RTX 3090 GPUs (24 GB of memory each):

python -m torch.distributed.launch --nproc_per_node=4 script/run.py -c config/inductive/wn18rr.yaml --gpus [0,1,2,3]

I did not change any other part of this repo, but I encountered a CUDA error saying that more GPU memory is needed. I later changed the command as follows:

python script/run.py -c config/inductive/wn18rr.yaml --gpus [0]

and ran it on a machine with one A100 GPU (40 GB of memory). The code ran successfully and used roughly 32 GB of GPU memory. This puzzles me: why doesn't the code properly utilize the combined 24 GB × 4 = 96 GB of GPU memory instead of reporting a memory error? Is there something wrong with my setup?

@KiddoZhu
Member

KiddoZhu commented Nov 5, 2022

Hi! Sorry for the late reply.

In the multi-GPU setup, the batch size is proportional to the number of GPUs. That is, each GPU uses the same per-GPU batch size (and thus the same GPU memory) as in the single-GPU case. Since our default hyperparameter configuration is tuned on 32 GB V100 GPUs, it may not fit into 24 GB of GPU memory. You can reduce the batch size to make it fit.
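For intuition, here is a minimal sketch (not from this repo; the model and numbers are placeholders) of why data parallelism behaves this way: under torch.nn.parallel.DistributedDataParallel, every rank holds a full replica of the model plus the activations for its own batch, so adding GPUs multiplies throughput and the effective batch size, not the memory available to any single process.

import torch.nn as nn

per_gpu_batch = 64  # placeholder; each GPU sees the same batch size as a single-GPU run
world_size = 4      # matches --nproc_per_node=4
effective_batch = per_gpu_batch * world_size  # what one optimizer step aggregates across ranks

model = nn.Linear(128, 128)  # stand-in for the actual model
# In real training each rank would wrap its own replica, e.g.
#   model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
# Per-GPU memory = one full model replica + activations for per_gpu_batch samples,
# which is why 4 x 24 GB does not behave like a single 96 GB pool.
print(f"per-GPU batch: {per_gpu_batch}, effective batch: {effective_batch}")

In practice that means lowering the batch size in config/inductive/wn18rr.yaml (the exact key name depends on this repo's config schema) until one GPU's share fits in 24 GB.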
