This example demonstrates how to run the standard TensorFlow sample (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dist_test/python/mnist_replica.py) on an Azure Batch AI cluster of 2 nodes.
- For demonstration purposes, the MNIST dataset and mnist_replica.py will be deployed to an Azure File Share;
- Standard output of the job will be stored on the Azure File Share;
- MNIST dataset (http://yann.lecun.com/exdb/mnist/) is archived and uploaded into the blob.
- The recipe modifies the official mnist_replica.py (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dist_test/python/mnist_replica.py) to generate model checkpoints and TensorBoard event output files.
- Please refer to the official tutorial on distributed TensorFlow training
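To make the 2-node layout concrete, here is a minimal sketch of the per-task command lines such a distributed run might use. It assumes one parameter server on the first node and one worker on every node; the host addresses, the ports, and the helper name `build_replica_commands` are hypothetical and are not taken from this recipe. The flags `--ps_hosts`, `--worker_hosts`, `--job_name` and `--task_index` are the ones the upstream mnist_replica.py defines; the modified script in this recipe may take additional arguments.

```python
from typing import Dict, List


def build_replica_commands(
    hosts: List[str],
    ps_port: int = 2222,      # hypothetical port for the parameter server
    worker_port: int = 2223,  # hypothetical port for the workers
    script: str = "mnist_replica.py",
) -> Dict[str, List[str]]:
    """Build one command line per task (illustrative helper, not part of the recipe).

    Assumed layout: a single parameter server on the first host and
    one worker on every host.
    """
    if not hosts:
        raise ValueError("at least one host is required")

    ps_hosts = f"{hosts[0]}:{ps_port}"
    worker_hosts = ",".join(f"{host}:{worker_port}" for host in hosts)

    # Every task receives the same cluster description.
    common = [
        "python",
        script,
        f"--ps_hosts={ps_hosts}",
        f"--worker_hosts={worker_hosts}",
    ]

    # Only the job name and the task index differ between tasks.
    commands = {"ps:0": common + ["--job_name=ps", "--task_index=0"]}
    for index in range(len(hosts)):
        commands[f"worker:{index}"] = common + [
            "--job_name=worker",
            f"--task_index={index}",
        ]
    return commands


if __name__ == "__main__":
    # Two hypothetical node addresses standing in for the 2-node cluster.
    for task, command in build_replica_commands(["10.0.0.4", "10.0.0.5"]).items():
        print(task, " ".join(command))
```

Each task gets an identical view of the cluster and differs only in its job name and task index, which is what lets the same script act as either a parameter server or a worker.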
You can find the Jupyter Notebook for this recipe in TensorFlow-GPU-Distributed.ipynb.
You can find the Azure CLI 2.0 instructions for this recipe in cli-instructions.md.
Under construction...
If you have any problems or questions, you can reach the Batch AI team at [email protected] or you can create an issue on GitHub.
We also welcome your contributions of additional sample notebooks, scripts, or other examples of working with Batch AI.