Bug description
While running the 03-Session-based-Yoochoose-multigpu-training-PyT.ipynb notebook on multiple NVIDIA A100 GPUs (40 GB each), training gets stuck and never completes under certain configurations. Training works correctly when using a single GPU or when running with fewer training days; however, with 25 or 30 training days and multiple GPUs the process stalls, which points to an issue with how the batch sizes interact with multi-GPU scaling.
Steps/Code to reproduce bug
1. Run the 03-Session-based-Yoochoose-multigpu-training-PyT.ipynb notebook inside the NVIDIA container nvcr.io/nvidia/merlin/merlin-pytorch:23.12.
2. Set up the environment with 2 NVIDIA A100 GPUs (40 GB each).
3. Use a training batch size of 512 and an evaluation batch size of 256.
4. Increase the number of training days to 25 or 30.
5. Observe that training gets stuck and never proceeds past the evaluation step eval_metrics = recsys_trainer.evaluate(metric_key_prefix='eval') (a sketch of this configuration follows below).
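For reference, here is a minimal sketch of the trainer configuration used above. It is not a self-contained repro: model, schema, train_paths, and eval_paths are assumed to be built earlier in the notebook series, and the output directory is a placeholder.

```python
# Minimal sketch of the configuration described in the steps above -- not a
# self-contained repro. `model`, `schema`, `train_paths`, and `eval_paths`
# are assumed to come from the earlier notebooks; output_dir is a placeholder.
from transformers4rec.config.trainer import T4RecTrainingArguments
from transformers4rec.torch import Trainer

training_args = T4RecTrainingArguments(
    output_dir="./tmp",                # placeholder path
    per_device_train_batch_size=512,   # training batch size from step 3
    per_device_eval_batch_size=256,    # evaluation batch size from step 3
    num_train_epochs=1,
)

recsys_trainer = Trainer(
    model=model,          # session-based model defined earlier (assumed)
    args=training_args,
    schema=schema,        # dataset schema from the ETL notebook (assumed)
    compute_metrics=True,
)

# Point the trainer at the 25 (or 30) days of preprocessed data, then train
# and evaluate. With 2 A100s, the evaluate() call below is where the run hangs.
recsys_trainer.train_dataset_or_path = train_paths
recsys_trainer.eval_dataset_or_path = eval_paths
recsys_trainer.train()
eval_metrics = recsys_trainer.evaluate(metric_key_prefix="eval")
```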
Expected behavior
The training process should complete successfully with the specified batch sizes and number of GPUs without stalling.
Environment details
Transformers4Rec version: 23.12.00
Platform: Using NVIDIA container nvcr.io/nvidia/merlin/merlin-pytorch:23.12
Python version: 3.10.12
Huggingface Transformers version: 4.27.1
PyTorch version (GPU?): 2.1.0a0+4136153 (GPU)
Additional context
Training works with a single GPU and with fewer training days.
No error messages are generated; the training simply stalls.
With 2 GPUs, reducing the evaluation batch size to 128 or 64 allows training and evaluation to complete.
With 3 GPUs, even reducing the training batch size to 256 and the evaluation batch size to 128 still results in the run getting stuck.
GPU memory usage during the hang stays relatively low (around 14 GB and 12 GB on the two GPUs), which suggests this is not an out-of-memory condition.
The issue may be related to how training scales with multiple GPUs, larger batch sizes, or more training days; see the diagnostic sketch below.
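To help pin down whether this is a data-sharding problem rather than a memory problem, below is a hypothetical diagnostic sketch (not taken from the notebook). It assumes the Transformers4Rec Trainer exposes the standard Hugging Face get_eval_dataloader() method and that the evaluation data paths are already assigned, as in the notebook. If the ranks report different batch counts, the hang is consistent with one process finishing evaluation early while the others block on a collective operation.

```python
# Hypothetical diagnostic (not part of the notebook): count how many
# evaluation batches each rank sees. If the counts differ across processes,
# the rank that runs out of batches first stops participating in the
# collective ops the other ranks are still waiting on, which shows up as a
# silent hang rather than an error.
import torch.distributed as dist

def count_eval_batches(trainer):
    """Print and return the number of eval batches this process will iterate."""
    dataloader = trainer.get_eval_dataloader()  # standard HF Trainer API
    n_batches = sum(1 for _ in dataloader)
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"rank {rank}: {n_batches} eval batches")
    return n_batches

# Example usage (run on every process, before calling evaluate()):
# count_eval_batches(recsys_trainer)
```

This would also explain why smaller evaluation batch sizes sometimes let the run finish: changing the batch size changes how the evaluation data divides across ranks.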