This example demonstrates how to run distributed GPU training for PyTorch using the Gloo backend in Batch AI.
- The Gloo backend is set up using the Batch AI shared job temporary directory, which is visible to all GPU nodes in the job.
- Will use the Batch AI generated AZ_BATCHAI_PYTORCH_INIT_METHOD environment variable for shared file-system initialization.
- Will use the Batch AI generated AZ_BATCHAI_TASK_INDEX environment variable as the rank of each worker process (see the sketch after this list).
- Standard output of the job will be stored on an Azure File Share.
- The attached PyTorch training script mnist_trainer.py trains a CNN on the MNIST dataset.
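For illustration, here is a minimal sketch of how a training script can consume these Batch AI environment variables to join the Gloo process group. The environment variable names come from this recipe; the `--world-size` argument and everything else in the snippet are assumptions and may differ from the actual mnist_trainer.py.

```python
import argparse
import os

import torch.distributed as dist


def init_distributed(world_size):
    # AZ_BATCHAI_PYTORCH_INIT_METHOD points to a location on the shared job
    # temporary directory that every GPU node in the job can see.
    init_method = os.environ["AZ_BATCHAI_PYTORCH_INIT_METHOD"]
    # AZ_BATCHAI_TASK_INDEX is this worker's rank within the job.
    rank = int(os.environ["AZ_BATCHAI_TASK_INDEX"])

    dist.init_process_group(
        backend="gloo",
        init_method=init_method,
        rank=rank,
        world_size=world_size,
    )
    return rank


if __name__ == "__main__":
    # --world-size is a hypothetical argument; the real recipe may pass the
    # worker count to the script in a different way.
    parser = argparse.ArgumentParser()
    parser.add_argument("--world-size", type=int, required=True)
    args = parser.parse_args()

    rank = init_distributed(args.world_size)
    print(f"worker {rank} of {args.world_size} joined the Gloo process group")
```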
Note: Due to a known bug in the PyTorch Gloo backend, the job may fail with the following error:
```
terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what(): [enforce fail at /pytorch/torch/lib/gloo/gloo/cuda.cu:249] error == cudaSuccess. 29 vs 0. Error at: /pytorch/torch/lib/gloo/gloo/cuda.cu:249: driver shutting down
```
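The "driver shutting down" message suggests the failure occurs during process teardown rather than during training itself. A commonly suggested mitigation, not a guaranteed fix for the underlying bug, is to destroy the process group explicitly before the interpreter exits; the following is a hedged sketch and not part of the recipe's mnist_trainer.py.

```python
import torch.distributed as dist


def main():
    # ... distributed training using the Gloo process group ...
    pass


if __name__ == "__main__":
    try:
        main()
    finally:
        # Release Gloo resources explicitly before the CUDA driver is torn
        # down at interpreter exit; this may avoid the shutdown-time error,
        # but it does not fix the underlying bug.
        if dist.is_initialized():
            dist.destroy_process_group()
```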
You can find the Jupyter Notebook for this sample in PyTorch-GPU-Distributed-Gloo.ipynb.
You can find the Azure CLI 2.0 instructions for this recipe in cli-instructions.md.
If you have any problems or questions, you can reach the Batch AI team at [email protected] or you can create an issue on GitHub.
We also welcome your contributions of additional sample notebooks, scripts, or other examples of working with Batch AI.