Training Not Completing in 03-Session-based-Yoochoose-multigpu-training-PyT.ipynb with Multiple GPUs #787

Open
abhilashadavi opened this issue on Sep 10, 2024 · 0 comments
Labels: bug, status/needs-triage

Bug description

While running the 03-Session-based-Yoochoose-multigpu-training-PyT.ipynb notebook on multiple NVIDIA A100 GPUs (40 GB each), the training process gets stuck and never completes under certain configurations. Training works correctly on a single GPU or with fewer training days, but it stalls when using 25 or 30 training days across multiple GPUs, suggesting a potential issue with how batch sizes scale across GPUs.

Steps/Code to reproduce bug

  1. Run the 03-Session-based-Yoochoose-multigpu-training-PyT.ipynb notebook using the NVIDIA container nvcr.io/nvidia/merlin/merlin-pytorch:23.12.
  2. Set up the environment with 2 NVIDIA A100 GPUs (40GB each).
  3. Use a training batch size of 512 and an evaluation batch size of 256.
  4. Increase the number of training days to 25 or 30.
  5. Observe that the training process gets stuck and does not proceed beyond the evaluation step (a reproduction sketch follows this list):
    eval_metrics = recsys_trainer.evaluate(metric_key_prefix='eval')
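
For context, a minimal sketch of how the stalled step is reached, assuming the standard Transformers4Rec Trainer API used in the notebook. The model, schema, output path, and parquet paths below are placeholders for the objects the notebook builds earlier, the batch-size arguments are the Hugging Face TrainingArguments fields that T4RecTrainingArguments inherits, and the launch command in the comment is assumed rather than copied from the notebook:

    # Assumed multi-GPU launch, e.g.:
    #   python -m torch.distributed.run --nproc_per_node 2 <training_script>.py
    from transformers4rec.config.trainer import T4RecTrainingArguments
    from transformers4rec.torch import Trainer

    training_args = T4RecTrainingArguments(
        output_dir="./tmp",                 # placeholder output path
        per_device_train_batch_size=512,    # training batch size from step 3
        per_device_eval_batch_size=256,     # evaluation batch size from step 3
    )

    # `model` and `schema` stand in for the objects the notebook constructs earlier.
    recsys_trainer = Trainer(
        model=model,
        args=training_args,
        schema=schema,
        compute_metrics=True,
    )

    # Hypothetical parquet paths for one training day of the preprocessed data.
    recsys_trainer.train_dataset_or_path = "sessions_by_day/1/train.parquet"
    recsys_trainer.eval_dataset_or_path = "sessions_by_day/1/valid.parquet"

    recsys_trainer.train()
    # With 2 x A100 and 25-30 training days, execution never returns from here:
    eval_metrics = recsys_trainer.evaluate(metric_key_prefix='eval')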

Expected behavior

The training process should complete successfully with the specified batch sizes and number of GPUs without stalling.

Environment details

  • Transformers4Rec version: 23.12.00
  • Platform: Using NVIDIA container nvcr.io/nvidia/merlin/merlin-pytorch:23.12
  • Python version: 3.10.12
  • Huggingface Transformers version: 4.27.1
  • PyTorch version: 2.1.0a0+4136153 (GPU)

Additional context

  • The training process completes with a single GPU and with fewer training days.
  • No error messages are produced; the training simply stalls.
  • With 2 GPUs, reducing the evaluation batch size to 128 or 64 allows training to complete (see the sketch after this list).
  • With 3 GPUs, even reducing the training batch size to 256 with an evaluation batch size of 128 still results in the training getting stuck.
  • GPU memory usage remains relatively low during the hang (around 14 GB and 12 GB of the 40 GB per GPU).
  • The issue may be related to how training scales with multiple GPUs and larger batch sizes or more training days.
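
For reference, the 2-GPU workaround amounts to lowering only the per-device evaluation batch size in the training arguments; a minimal sketch, assuming the same T4RecTrainingArguments setup as in the reproduction sketch above (the output path is again a placeholder):

    from transformers4rec.config.trainer import T4RecTrainingArguments

    # 2-GPU workaround: keep the training batch size, shrink the evaluation batch size.
    training_args = T4RecTrainingArguments(
        output_dir="./tmp",                 # placeholder output path
        per_device_train_batch_size=512,
        per_device_eval_batch_size=128,     # 128 or 64 completes; 256 hangs
    )
    # Note: on 3 GPUs, even per_device_train_batch_size=256 combined with
    # per_device_eval_batch_size=128 still hangs, so this only helps with 2 GPUs.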