From db6a1bdb2c98f5a5d93f803cd54c51be9f8bd064 Mon Sep 17 00:00:00 2001
From: Christian Schou Oxvig
Date: Mon, 18 Mar 2024 14:18:15 +0100
Subject: [PATCH] Updated sfantao example README for new test runs.

---
 examples/LUMI/sfantao_pytorch/README.md | 24 +++---------------------
 1 file changed, 3 insertions(+), 21 deletions(-)

diff --git a/examples/LUMI/sfantao_pytorch/README.md b/examples/LUMI/sfantao_pytorch/README.md
index 6df3e01..19b3667 100644
--- a/examples/LUMI/sfantao_pytorch/README.md
+++ b/examples/LUMI/sfantao_pytorch/README.md
@@ -28,29 +28,11 @@ Running the above SLURM batch scripts and looking through the output files, you
 
 | Approach | NCCL INFO NET/* | GPU training time |
 | -------- | ----------------- | ----------------- |
-| `run_cotainr_docker_base_mnist_example.sh` | Using network Socket | 0:00:16 |
-| `run_cotainr_lumisif_base_mnist_example.sh` | Using aws-ofi-rccl 1.4.0 / Selected Provider is cxi | 0:00:20 |
-| `run_lumisif_mnist_example.sh` | Using aws-ofi-rccl 1.4.0 / Selected Provider is cxi | 0:00:21 |
+| `run_cotainr_docker_base_mnist_example.sh` | Using network Socket | 0:00:17 |
+| `run_cotainr_lumisif_base_mnist_example.sh` | Using aws-ofi-rccl 1.4.0 / Selected Provider is cxi | 0:00:17 |
+| `run_lumisif_mnist_example.sh` | Using aws-ofi-rccl 1.4.0 / Selected Provider is cxi | 0:00:17 |
 
 ## Notes
 
 - Only submit one of the above SLURM batch scripts at a time since they overwrite the run-script.sh, etc.
 - The --gpus-per-task flag is not working correctly on LUMI - see #5 under https://lumi-supercomputer.github.io/LUMI-training-materials/1day-20230921/notes_20230921/#running-jobs
-- When relying on sockets for NCCL communication, it seems that it sometimes randomly crashes with an error like (this needs more debugging on its own):
-
-```python
-Traceback (most recent call last):
-  File "/workdir/mnist/mnist_DDP.py", line 261, in <module>
-    run(modelpath=args.modelpath, gpu=args.gpu)
-  File "/workdir/mnist/mnist_DDP.py", line 205, in run
-    average_gradients(model)
-  File "/workdir/mnist/mnist_DDP.py", line 170, in average_gradients
-    group = dist.new_group([0])
-            ^^^^^^^^^^^^^^^^^^^
-  File "/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3544, in new_group
-    _store_based_barrier(global_rank, default_store, timeout)
-  File "/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 456, in _store_based_barrier
-    worker_count = store.add(store_key, 0)
-                   ^^^^^^^^^^^^^^^^^^^^^^^
-RuntimeError: Connection reset by peer
-```