Training Not Completing in 03-Session-based-Yoochoose-multigpu-training-PyT.ipynb with Multiple GPUs #787

Open
abhilashadavi opened this issue on Sep 10, 2024 · 0 comments
Labels: bug, status/needs-triage

Bug description

While running the 03-Session-based-Yoochoose-multigpu-training-PyT.ipynb notebook on multiple NVIDIA A100 GPUs (40 GB each), the training process gets stuck and never completes under certain configurations. Training works correctly on a single GPU or with fewer training days, but it stalls when using 25 or 30 training days across multiple GPUs, suggesting a potential issue with how batch sizes scale across GPUs.

Steps/Code to reproduce bug

  1. Run the 03-Session-based-Yoochoose-multigpu-training-PyT.ipynb notebook using the NVIDIA container nvcr.io/nvidia/merlin/merlin-pytorch:23.12.
  2. Set up the environment with 2 NVIDIA A100 GPUs (40GB each).
  3. Use a training batch size of 512 and an evaluation batch size of 256.
  4. Increase the number of training days to 25 or 30.
  5. Observe that the training process gets stuck and does not proceed beyond the evaluation step (a reproduction sketch follows this list):
    eval_metrics = recsys_trainer.evaluate(metric_key_prefix='eval')
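
For context, a minimal sketch of how the stalled step is reached, assuming the standard Transformers4Rec Trainer API used in the notebook. The model, schema, output path, and parquet paths below are placeholders for the objects the notebook builds earlier, the batch-size arguments are the Hugging Face TrainingArguments fields that T4RecTrainingArguments inherits, and the launch command in the comment is assumed rather than copied from the notebook:

    # Assumed multi-GPU launch, e.g.:
    #   python -m torch.distributed.run --nproc_per_node 2 <training_script>.py
    from transformers4rec.config.trainer import T4RecTrainingArguments
    from transformers4rec.torch import Trainer

    training_args = T4RecTrainingArguments(
        output_dir="./tmp",                 # placeholder output path
        per_device_train_batch_size=512,    # training batch size from step 3
        per_device_eval_batch_size=256,     # evaluation batch size from step 3
    )

    # `model` and `schema` stand in for the objects the notebook constructs earlier.
    recsys_trainer = Trainer(
        model=model,
        args=training_args,
        schema=schema,
        compute_metrics=True,
    )

    # Hypothetical parquet paths for one training day of the preprocessed data.
    recsys_trainer.train_dataset_or_path = "sessions_by_day/1/train.parquet"
    recsys_trainer.eval_dataset_or_path = "sessions_by_day/1/valid.parquet"

    recsys_trainer.train()
    # With 2 x A100 and 25-30 training days, execution never returns from here:
    eval_metrics = recsys_trainer.evaluate(metric_key_prefix='eval')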

Expected behavior

The training process should complete successfully with the specified batch sizes and number of GPUs without stalling.

Environment details

  • Transformers4Rec version: 23.12.00
  • Platform: Using NVIDIA container nvcr.io/nvidia/merlin/merlin-pytorch:23.12
  • Python version: 3.10.12
  • Huggingface Transformers version: 4.27.1
  • PyTorch version: 2.1.0a0+4136153 (GPU)

Additional context

  • The training process completes with a single GPU and with fewer training days.
  • No error messages are produced; the training simply stalls.
  • With 2 GPUs, reducing the evaluation batch size to 128 or 64 allows training to complete (see the sketch after this list).
  • With 3 GPUs, even reducing the training batch size to 256 with an evaluation batch size of 128 still results in the training getting stuck.
  • GPU memory usage remains relatively low during the hang (around 14 GB and 12 GB of the 40 GB per GPU).
  • The issue may be related to how training scales with multiple GPUs and larger batch sizes or more training days.
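
For reference, the 2-GPU workaround amounts to lowering only the per-device evaluation batch size in the training arguments; a minimal sketch, assuming the same T4RecTrainingArguments setup as in the reproduction sketch above (the output path is again a placeholder):

    from transformers4rec.config.trainer import T4RecTrainingArguments

    # 2-GPU workaround: keep the training batch size, shrink the evaluation batch size.
    training_args = T4RecTrainingArguments(
        output_dir="./tmp",                 # placeholder output path
        per_device_train_batch_size=512,
        per_device_eval_batch_size=128,     # 128 or 64 completes; 256 hangs
    )
    # Note: on 3 GPUs, even per_device_train_batch_size=256 combined with
    # per_device_eval_batch_size=128 still hangs, so this only helps with 2 GPUs.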