From bc10faef67d308add6812d8267752b7701d89d8a Mon Sep 17 00:00:00 2001
From: Peter Heywood
Date: Tue, 8 Feb 2022 17:52:51 +0000
Subject: [PATCH] Document Open-CE, update Tensorflow, PyTorch and deprecate WMLCE

+ Adds Open-CE documentation page
  + Marks as successor to WMLCE
  + Lists the key features no longer available from WMLCE
  + Describes why to use Open-CE
  + Provides instructions for installing Open-CE packages into conda environments
+ Updates TensorFlow page to refer to/use Open-CE not WMLCE
  + Replaces quickstart with installation via conda section
+ Updates PyTorch page to refer to/use Open-CE not WMLCE
  + Replaces quickstart with installation via conda section
+ Updates WMLCE page
  + Refer to Open-CE as successor, emphasising that WMLCE is deprecated / no longer supported
  + Update/Tweak tensorflow-benchmarks resnet50 usage+description.
+ Expands Conda documentation
  + Includes upgrading installation instructions to source the preferred etc/profile.d/conda.sh
    + https://github.com/conda/conda/blob/master/CHANGELOG.md#recommended-change-to-enable-conda-in-your-shell
  + conda python version selection should only use a single '='
+ Updates usage page emphasising ddlrun is not supported on RHEL 8

This does not include benchmarking of Open-CE or RHEL 7/8 comparisons of WMLCE benchmarking due to ddlrun errors on RHEL 8.

Closes #63 Closes #72 --- software/applications/conda.rst | 44 ++- software/applications/open-ce.rst | 117 ++++++++ software/applications/pytorch.rst | 63 +++-- software/applications/tensorflow.rst | 61 ++-- software/applications/wmlce.rst | 262 +++++++++--------- .../applications/wmlce/sbatch_resnet50base.sh | 50 ++++ usage/index.rst | 8 +- 7 files changed, 421 insertions(+), 184 deletions(-) create mode 100644 software/applications/open-ce.rst create mode 100644 software/applications/wmlce/sbatch_resnet50base.sh diff --git a/software/applications/conda.rst b/software/applications/conda.rst index 1e66b4e..b22efa6 100644 --- a/software/applications/conda.rst +++ b/software/applications/conda.rst @@ -5,6 +5,9 @@ Conda `Conda `__ is an open source package management system and environment management system that runs on Windows, macOS and Linux. Conda quickly installs, runs and updates packages and their dependencies. + +.. _software-applications-conda-installing: + Installing Miniconda ~~~~~~~~~~~~~~~~~~~~ @@ -26,7 +29,7 @@ The simplest way to install Conda for use on Bede is through the `miniconda /$USER # Update this with your code. - source $CONDADIR/miniconda/bin/activate + source $CONDADIR/miniconda/etc/profile.d/conda.sh Creating a new Conda Environment ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -45,7 +48,7 @@ I.e. to create a new conda environment named `example`, with `python 3.9` you ca .. code-block:: bash - conda create -y --name example python==3.9 + conda create -y --name example python=3.9 Once created, the environment can be activated using ``conda activate``. @@ -53,6 +56,20 @@ Once created, the environment can be activated using ``conda activate``. conda activate example +Alternatively, Conda environments can be created outside of the conda/miniconda install, using the ``-p`` / ``--prefix`` option of ``conda create``. + +I.e. 
if you have installed miniconda to your home directory, but wish to create a conda environment within the ``/project//$USER/`` directory named ``example`` you can use: + +.. code-block:: bash + + conda create -y --prefix /project//$USER/example python=3.9 + +This can subsequently be loaded via: + +.. code-block:: bash + + conda activate /project//$USER/example + Listing and Activating existing Conda Environments ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -64,6 +81,27 @@ Existing conda environments can be listed via: ``conda activate`` can then be used to activate one of the listed environments. +Adding Conda Channels to an Environment +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The default conda channel does not contain all packages or may not contain versions of packages you may wish to use. + +In this case, third-party conda channels can be added to conda environments to provide access to these packages, such as the :ref:`Open-CE ` Conda channel hosted by Oregon State University. + +It is recommended to add channels to specific conda environments, rather than your global conda configuration. + +I.e. to add the `OSU Open-CE Conda channel `__ to the currently loaded conda environment: + +.. code-block:: bash + + conda config --env --prepend channels https://ftp.osuosl.org/pub/open-ce/current/ + +You may also wish to enable `strict channel priority `__ to speed up conda operations and reduce incompatibility which will be default from Conda 5.0. This may break old environment files. + +.. code-block:: bash + + conda config --env --set channel_priority strict + Installing Conda Packages ~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/software/applications/open-ce.rst b/software/applications/open-ce.rst new file mode 100644 index 0000000..dc47895 --- /dev/null +++ b/software/applications/open-ce.rst @@ -0,0 +1,117 @@ +.. 
_software-applications-open-ce: + +Open-CE +======= + +The `Open Cognitive Environment (Open-CE) `__ is a community driven software distribution for machine learning and deep learning frameworks. + +Open-CE software is distributed via :ref:`Conda`, with all included packages for a given Open-CE release being installable into the same conda environment. + +Open-CE conda channels suitable for use on Bede's IBM Power architecture systems are hosted by `Oregon State University `__ and `MIT `__. + +It is the successor to :ref:`IBM WMLCE ` which was archived on 2020-11-10, with IBM WMLCE 1.7.0 being the final release. + +Open-CE includes the following software packages, amongst others: + +* :ref:`TensorFlow ` +* :ref:`PyTorch ` +* `Horovod `__ +* `ONNX `__ + +.. note:: + + Open-CE does not include all features from WMLCE, such as Large Model Support or Distributed Deep Learning (DDL). + +Using Open-CE +------------- + +Open-CE provides software packages via :ref:`Conda`, which you must first :ref:`install`. +Conda installations of the packages provided by Open-CE can become quite large (multiple GBs), so you may wish to use a conda installation in ``/nobackup/projects/`` or ``/projects/`` as described in the :ref:`Installing Conda section `. + +With a working Conda install, Open-CE packages can be installed from either the OSU or MIT Conda channels for PPC64LE systems such as Bede. + +* OSU: ``https://ftp.osuosl.org/pub/open-ce/current/`` +* MIT: ``https://opence.mit.edu/`` + +Using Conda environments is recommended when working with Open-CE. + +I.e. to install ``tensorflow`` and ``pytorch`` from the OSU Open-CE conda channel into a conda environment named ``open-ce``: + +..
code-block:: bash + + # Create a new conda environment named open-ce within your conda installation + conda create -y --name open-ce python=3.9 # Older Open-CE may require older Python versions + + # Activate the conda environment + conda activate open-ce + + # Add the OSU Open-CE conda channel to the current environment config + conda config --env --prepend channels https://ftp.osuosl.org/pub/open-ce/current/ + # Also use strict channel priority + conda config --env --set channel_priority strict + + # Install the required conda package, using the channels set within the conda env. This may take some time. + conda install -y tensorflow + conda install -y pytorch + +Once installed into a conda environment, the Open-CE provided software packages can be used interactively on login nodes or within batch jobs by activating the named conda environment. + +.. code-block:: bash + + # Activate the conda environment + conda activate open-ce + + # Run a python command or script which makes use of the installed packages + # I.e. to output the version of tensorflow: + python3 -c "import tensorflow;print(tensorflow.__version__)" + + # I.e. or to output the version of pytorch: + python3 -c "import torch;print(torch.__version__)" + +Using older versions of Open-CE +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The OSU conda distribution provides an archive of older Open-CE releases, beginning at version ``1.0.0``. + +The available versions are listed at https://ftp.osuosl.org/pub/open-ce/. + +Using versions other than ``current`` can be done by modifying the channel URI when adding the channel to the current conda environment with the desired version number. + +I.e. to explicitly use Open-CE ``1.4.1`` the command to add the conda channel to the current environment would be: + +.. code-block:: bash + + conda config --env --prepend channels https://ftp.osuosl.org/pub/open-ce/1.4.1/ + +Using older Open-CE versions may require older python versions. 
+See the `OSU Open-CE page `__ for further version information. + +The MIT Open-CE channel provides multiple versions of Open-CE in the same Conda channel. If using the MIT Open-CE distribution, older versions of packages can be requested by specifying the version of the desired package. + +Why use Open-CE +--------------- + +Modern machine learning packages like TensorFlow and PyTorch have large dependency trees which can conflict with one another due to their independent release schedules. +This has made it difficult to use multiple competing packages within the same environment. + +Open-CE solves this issue by ensuring that packages included in a given Open-CE distribution are compatible with one another, and can be installed at the same time, simplifying the distribution of these packages. + +It also provides pre-compiled distributions of these packages for PPC64LE architecture machines, which are not always available from upstream sources, reducing the time required to install these packages. + +For more information on the potential benefits of using Open-CE see `this blog post from the OpenPOWER foundation `__. + +Differences from WMLCE +---------------------- + +:ref:`IBM WMLCE` included several features not available in upstream TensorFlow and PyTorch distributions, such as Large Model Support. + +Unfortunately, LMS is not available in TensorFlow or PyTorch provided by Open-CE. + +Features or packages which were included in WMLCE but are absent from Open-CE include: + +* Large Model Support (LMS) +* IBM DDL +* Caffe (IBM-enhanced) +* IBM SnapML +* NVIDIA RAPIDS + diff --git a/software/applications/pytorch.rst b/software/applications/pytorch.rst index 3e358d1..75de01f 100644 --- a/software/applications/pytorch.rst +++ b/software/applications/pytorch.rst @@ -6,44 +6,57 @@ PyTorch `PyTorch `__ is an end-to-end machine learning framework.
PyTorch enables fast, flexible experimentation and efficient production through a user-friendly front-end, distributed training, and ecosystem of tools and libraries. -The main method of distribution for PyTorch is via :ref:`Conda `. +The main method of distribution for PyTorch is via :ref:`Conda `, with :ref:`Open-CE` providing a simple method for installing multiple machine learning frameworks into a single conda environment. -For more information on the usage of PyTorch, see the `Online Documentation `__. +The upstream Conda and pip distributions do not provide ppc64le PyTorch packages at this time. -PyTorch Quickstart -~~~~~~~~~~~~~~~~~~ +Installing via Conda +~~~~~~~~~~~~~~~~~~~~ + +With a working Conda installation (see :ref:`Installing Miniconda`) the following instructions can be used to create a Python 3.8 conda environment named ``torch-env`` with the latest Open-CE provided PyTorch: + +.. note:: + + PyTorch installations via conda can be relatively large. Consider installing your miniconda (and therefore your conda environments) to the ``/nobackup`` file store. -The following should get you set up with a working conda environment (replacing with your project code): .. code-block:: bash - export DIR=/nobackup/projects//$USER - # rm -rf ~/.conda ~/.condarc $DIR/miniconda # Uncomment if you want to remove old env - mkdir $DIR - pushd $DIR + # Create a new conda environment named torch-env within your conda installation + conda create -y --name torch-env python=3.8 + + # Activate the conda environment + conda activate torch-env + + # Add the OSU Open-CE conda channel to the current environment config + conda config --env --prepend channels https://ftp.osuosl.org/pub/open-ce/current/ - # Download the latest miniconda installer for ppcle64 - wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-ppc64le.sh - # Validate the file checksum matches is listed on https://docs.conda.io/en/latest/miniconda_hashes.html.
- sha256sum Miniconda3-latest-Linux-ppc64le.sh + # Also use strict channel priority + conda config --env --set channel_priority strict + + # Install the latest available version of PyTorch + conda install -y pytorch + +In subsequent interactive sessions, and when submitting batch jobs which use PyTorch, you will then need to re-activate the conda environment. + +For example, to verify that PyTorch is available and print the version: + +.. code-block:: bash - sh Miniconda3-latest-Linux-ppc64le.sh -b -p $DIR/miniconda - source miniconda/bin/activate - conda update conda -y - conda config --set channel_priority strict + # Activate the conda environment + conda activate torch-env - conda config --prepend channels \ - https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/ + # Invoke python + python3 -c "import torch;print(torch.__version__)" - conda config --prepend channels \ - https://opence.mit.edu - conda create --name opence pytorch=1.7.1 -y - conda activate opence +Installation via the upstream Conda channel is not currently possible, due to the lack of ``ppc64le`` or ``noarch`` distributions. -This has some limitations such as not supporting large model support. -If you require LMS, please see the :ref:`WMLCE ` page. +.. note:: + + The :ref:`Open-CE` distribution of PyTorch does not include IBM technologies such as DDL or LMS, which were previously available via :ref:`WMLCE`. + WMLCE is not supported on RHEL 8. Further Information diff --git a/software/applications/tensorflow.rst b/software/applications/tensorflow.rst index 0f301cc..eba6ffb 100644 --- a/software/applications/tensorflow.rst +++ b/software/applications/tensorflow.rst @@ -1,43 +1,58 @@ -.. _software-python-tensorflow: +.. _software-applications-tensorflow: TensorFlow ---------- `TensorFlow `__ is an end-to-end open source platform for machine learning. 
It has a comprehensive, flexible ecosystem of tools, libraries and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML powered applications. -TensorFlow Quickstart -~~~~~~~~~~~~~~~~~~~~~ +TensorFlow can be installed through a number of Python package managers such as :ref:`Conda` or ``pip``. + +For use on Bede, the simplest method is to install TensorFlow using the :ref:`Open-CE Conda distribution`. + + +Installing via Conda (Open-CE) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +With a working Conda installation (see :ref:`Installing Miniconda`) the following instructions can be used to create a Python 3.8 conda environment named ``tf-env`` with the latest Open-CE provided TensorFlow: + +.. note:: + + TensorFlow installations via conda can be relatively large. Consider installing your miniconda (and therefore your conda environments) to the ``/nobackup`` file store. -The following should get you set up with a working conda environment (replacing ```` with your project code): .. code-block:: bash - export DIR=/nobackup/projects//$USER - # rm -rf ~/.conda ~/.condarc $DIR/miniconda # Uncomment if you want to remove old env - mkdir $DIR - pushd $DIR + # Create a new conda environment named tf-env within your conda installation + conda create -y --name tf-env python=3.8 - # Download the latest miniconda installer for ppcle64 - wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-ppc64le.sh - # Validate the file checksum matches is listed on https://docs.conda.io/en/latest/miniconda_hashes.html.
- sha256sum Miniconda3-latest-Linux-ppc64le.sh + # Activate the conda environment + conda activate tf-env - sh Miniconda3-latest-Linux-ppc64le.sh -b -p $DIR/miniconda - source miniconda/bin/activate - conda update conda -y + # Add the OSU Open-CE conda channel to the current environment config + conda config --env --prepend channels https://ftp.osuosl.org/pub/open-ce/current/ - conda config --prepend channels \ - https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/ + # Also use strict channel priority + conda config --env --set channel_priority strict + + # Install the latest available version of Tensorflow + conda install -y tensorflow + +In subsequent interactive sessions, and when submitting batch jobs which use TensorFlow, you will then need to re-activate the conda environment. + +For example, to verify that TensorFlow is available and print the version: + +.. code-block:: bash - conda config --prepend channels \ - https://opence.mit.edu + # Activate the conda environment + conda activate tf-env - conda create --name opence tensorflow -y - conda activate opence + # Invoke python + python3 -c "import tensorflow;print(tensorflow.__version__)" .. note:: - - This conflicts with the :ref:`PyTorch ` instructions as they set the conda channel_priority to be strict which seems to cause issues when installing TensorFlow. + + The :ref:`Open-CE` distribution of TensorFlow does not include IBM technologies such as DDL or LMS, which were previously available via :ref:`WMLCE`. + WMLCE is not supported on RHEL 8. Further Information ~~~~~~~~~~~~~~~~~~~ diff --git a/software/applications/wmlce.rst b/software/applications/wmlce.rst index 18f50ee..8c0eb9e 100644 --- a/software/applications/wmlce.rst +++ b/software/applications/wmlce.rst @@ -1,124 +1,188 @@ .. 
_software-applications-wmlce: -IBM WMLCE -========= - -`IBM WMLCE `__ is the Watson Machine Learning Community Edition, a software distribution for machine learning which included some technology previews such as `Large Model Support for TensorFlow `__. +IBM WMLCE (End of Life) +======================= .. warning:: - WMLCE was archived by IBM on 2020-11-10 and is no longer updated or maintained. + WMLCE was archived by IBM on 2020-11-10 and is no longer updated, maintained or supported. + + It has been replaced by :ref:`Open Cognitive Environment (Open-CE) `, a community driven software distribution for machine learning. - It has been replaced by `Open-CE `__, a community driven software distribution for machine learning, which does not support all features of WMLCE. + Open-CE does not support all features of WMLCE. + + Please refer to the :ref:`Open-CE ` documentation for more information. - The remainder of this document refers to WMLCE, so may be considered out of date. + Alternatively, consider moving to upstream sources for Python packages such as :ref:`TensorFlow ` or :ref:`PyTorch` where available. .. warning:: - WMLCE 1.7 may not be compatible with RHEL 8. + WMLCE 1.7 only supported RHEL 7.6 and 7.7. + It is unsupported on RHEL 8, and may not behave correctly once the RHEL 8 migration has completed. + + Consider migrating to :ref:`Open Cognitive Environment (Open-CE) `. + +`IBM WMLCE `__ was the *Watson Machine Learning Community Edition* - a software distribution for machine learning which included IBM technology previews such as `Large Model Support for TensorFlow `__. +WMLCE is also known as PowerAI. + +It included a number of popular machine learning tools and frameworks such as :ref:`TensorFlow ` and :ref:`PyTorch `, enhanced for use on IBM POWER9 + Nvidia GPU based systems.
+The use of :ref:`Conda` to enable simple installation of multiple machine learning frameworks into a single software environment without users needing to manage complex dependency trees was another key feature of IBM WMLCE. + +For more information, refer to the `IBM WMLCE documentation `__. + +Using IBM WMLCE (End of Life) +----------------------------- -PyTorch and TensorFlow: IBM PowerAI and wmlce [Possibly Out of Date] --------------------------------------------------------------------- +IBM WMLCE provided software packages via a hosted :ref:`Conda` channel. -IBM have done a lot of work to port common Machine Learning tools to the -POWER9 system, and to take advantage of the GPUs abililty to directly -access main system memory on the POWER9 architecture using its "Large -Model Support". +Conda installations of the packages provided by WMLCE can become quite large (multiple GBs), so you may wish to use a conda installation in ``/nobackup/projects/`` or ``/projects/`` as described in the :ref:`Installing Conda section `. -This has been packaged up into what is variously known as IBM Watson -Machine Learning Community Edition (wmlce) or the catchier name PowerAI. +With a working Conda install, IBM WMLCE packages can be installed from the IBM WMLCE conda channel: ``https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/``. -Documentation on wmlce can be found here: -https://www.ibm.com/support/pages/get-started-ibm-wml-ce +Using Conda environments is recommended when working with IBM WMLCE. -Installation is via the IBM channel of the anaconda package management tool. **Note: -if you do not use this channel you will not find all of the available packages.** -First install anaconda (can be quite large - so using the /nobackup area): +I.e. to install all WMLCE packages into a conda environment named ``wmlce``: + +.. note:: + + IBM WMLCE requires Python 3.6 or Python 3.7. This may require an older Conda installation. + +..
note:: + + Installation of the full ``powerai`` package can take a considerable amount of time (hours) and consume a large amount of disk storage space. .. code-block:: bash - cd /nobackup/projects/ + # Create a new Python 3.6 conda environment named wmlce within your conda installation. + # Your conda installation should be in the /nobackup filesystem. + conda create -y --name wmlce python=3.6 + + # Activate the conda environment + conda activate wmlce + + # Add the IBM WMLCE channel to the environment + conda config --env --prepend channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/ + + # Enable strict channel priority for the environment + conda config --env --set channel_priority strict - # Download the latest miniconda installer for ppcle64 - wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-ppc64le.sh - # Validate the file checksum matches is listed on https://docs.conda.io/en/latest/miniconda_hashes.html. - sha256sum Miniconda3-latest-Linux-ppc64le.sh - sh Miniconda3-latest-Linux-ppc64le.sh - conda update conda - conda config --set channel_priority strict - conda config --prepend channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/ - conda create --name wmlce + # Install specific conda packages + conda install -y tensorflow + conda install -y pytorch + + # or the full powerai package, or powerai-cpu for the cpu version + conda install -y powerai -Then login again and install wmlce (GPU version by default - substitute -``powerai-cpu`` for ``powerai`` for the CPU version): +Once packages are installed into a named conda environment, the packages can be used interactively or within batch jobs by activating the conda environment. .. code-block:: bash + # activate the conda environment + conda activate wmlce + + # Run a python command or script which makes use of the installed packages + # I.e.
to output the version of tensorflow: + python3 -c "import tensorflow;print(tensorflow.__version__)" + + # I.e. or to output the version of pytorch: + python3 -c "import torch;print(torch.__version__)" + +IBM WMLCE includes `IBM Distributed Deep Learning (DDL) `__ which is an MPI-based library optimised for deep learning. +When an application is integrated with DDL, it becomes an MPI application which should be launched via a special command. +In WMLCE, DDL is integrated into PowerAI IBM Caffe, PyTorch, and TensorFlow. +This allows the use of multiple nodes when running machine learning models to support larger models and improved performance. + +On Bede, this command is ``bede-ddlrun``. For example: + +.. code-block:: slurm + + #!/bin/bash + + # Generic options: + + #SBATCH --account= # Run job under project + #SBATCH --time=1:0:0 # Run for a max of 1 hour + + # Node resources: + + #SBATCH --partition=gpu # Choose either "gpu" or "infer" node type + #SBATCH --nodes=2 # Resources from two nodes + #SBATCH --gres=gpu:4 # Four GPUs per node (plus 100% of node CPU and RAM per node) + + # Run commands: + conda activate wmlce - conda install powerai ipython -Running ``ipython`` on the login node will then allow you to experiment -with this feature using an interactive copy of Python and the GPUs on -the login node. Demanding work should be packaged into a job and -launched with the ``python`` command. + bede-ddlrun python $CONDA_PREFIX/ddl-tensorflow/examples/keras/mnist-tf-keras-adv.py + +.. warning:: + + IBM DDL is not supported on RHEL 8 and will likely error on use. + + Consider migrating away from DDL via :ref:`Open-CE` and regular ``bede-mpirun``. + +WMLCE resnet50 benchmark (RHEL 7 only) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The WMLCE conda channel includes a package ``tensorflow-benchmarks`` which provides a TensorFlow implementation of the resnet-50 model for benchmarking purposes.
+ +When the ``tensorflow-benchmarks`` conda package is installed into the current conda environment, the documentation for this benchmark can be found at ``$CONDA_PREFIX/tensorflow-benchmarks/resnet50/README.md``. +Subsequent sections are based on the contents of the readme. -If a single node with 4 GPUs and 512GB RAM isn't enough, the Distributed -Deep Learning feature of PowerAI should allow you to write code that can -take advantage of multiple nodes. +The remainder of this section describes how to execute this benchmark on Bede, +using a conda environment named ``wmlce`` with ``tensorflow`` and ``tensorflow-benchmarks`` installed. -WMLCE resnet50 benchmark [Possibly out of date] -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The necessary data from ImageNet has been downloaded and processed. +It is stored in ``/nobackup/datasets/resnet50/TFRecords`` and is universally readable. -This Bede specific README file is based upon options laid out in the README.MD file in the WMLCE -resnet50 benchmark directory. The necessary data from ImageNet has been downloaded and processed. -It is stored in /nobackup/datasets/resnet50/TFRecords and is universally readable. +.. note:: -NOTE: As written, the associated sbatch script must be run in a directory that is writable -by the user. It creates a directory with the default name run_results into which it will write -the results of the computation. The results data will use up to 1.2GB of space. The run -directory must also be accessible by the compute nodes, so using /tmp on a login node is not -suitable. + As written, the associated sbatch script must be run in a directory that is writeable by the user. -The main WMLCE README.MD file suggests the following parameters are appropriate for a 4 node -(possibly 16 GPU) run: + It creates a directory with the default name run_results into which it will write the results of the computation. + The results data will use up to 1.2GB of space. 
+ + The run directory must also be accessible by the compute nodes, so using ``/tmp`` on a login node is not suitable. + +The main WMLCE README.MD file suggests the following parameters are appropriate for a 4 node (up to 16 GPU) run: .. code-block:: bash # Run a training job - ddlrun -H host1,host2,host3,host4 python benchmarks/tensorflow-benchmarks/resnet50/main.py \ + ddlrun -H host1,host2,host3,host4 python $CONDA_PREFIX/benchmarks/tensorflow-benchmarks/resnet50/main.py \ --mode=train_and_evaluate --iter_unit=epoch --num_iter=50 --batch_size=256 --warmup_steps=100 \ --use_cosine_lr --label_smoothing 0.1 --lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 \ --weight_decay=3.0517578125e-05 --data_dir=/data/imagenetTF/ --results_dir=run_results \ --use_xla --precision=fp16 --loss_scale=1024 --use_static_loss_scaling -ddlrun by itself is not integrated with Slurm and will not run directly on Bede. A wrapper-script -called bede-ddlrun is available and that is what is used in the following. +``ddlrun`` is not integrated with Slurm and will not run directly on Bede. +A wrapper-script called ``bede-ddlrun`` is available and that is what is used in the following. -It is easy to define a single GPU run based on the above set of parameters (basically -remove the ddlrun command at the front and specify the correct paths). The associated run -takes about 16 hours to complete. +A single GPU run of this benchmark can be completed without requiring ``ddlrun`` or ``bede-ddlrun``, using the above set of parameters. +The associated run takes about 16 hours to complete; however, the job may be killed due to insufficient host memory when only a single GPU is requested. -The related sbatch script ( :ref:`sbatch_resnet50base.sh `) is configured to use 4 GPUs on one node. +The related ``sbatch`` script (:download:`sbatch_resnet50base.sh` +) is configured to use 4 GPUs on one node. Changing the script to use 4 nodes, 16 GPUs, requires changing one line. - The sbatch script specifies: ..
code-block:: bash # ... - #SBATCH -p gpu + #SBATCH --partition=gpu #SBATCH --gres=gpu:4 - #SBATCH -N1 + #SBATCH --nodes=1 # ... - module load slurm/dflt - export PYTHON_HOME=/opt/software/apps/anaconda3/ - source $PYTHON_HOME/bin/activate wmlce_env + export CONDADIR=/nobackup/projects//$USER # Update this with your code. + source $CONDADIR/miniconda/etc/profile.d/conda.sh + # Activate the wmlce conda environment + conda activate wmlce export OMP_NUM_THREADS=1 # Disable multithreading - bede-ddlrun python $PYTHON_HOME/envs/wmlce_env/tensorflow-benchmarks/resnet50/main.py \ + bede-ddlrun python $CONDA_PREFIX/tensorflow-benchmarks/resnet50/main.py \ --mode=train_and_evaluate --iter_unit=epoch --num_iter=50 --batch_size=256 \ --warmup_steps=100 --use_cosine_lr --label_smoothing 0.1 --lr_init=0.256 \ --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \ @@ -194,65 +258,5 @@ be a just over an hour and during the 16 GPU run, about 18000 images per second be processed. Unfortunately, the basic parameters used with the resnet50 run do not allow this -job to scale much beyond 16 GPUs. Indeed, there is no speedup with this configuration -using 32 GPUs. Improving scalability is left as an exercise for the user. - - - -.. _sbatch_resenet50base.sh: - -sbatch_resent50base.sh -^^^^^^^^^^^^^^^^^^^^^^ - -.. code-block:: bash - - #!/bin/bash -l - #SBATCH -A bdXXXYY - #SBATCH -p gpu - #SBATCH --gres=gpu:4 - #SBATCH -N1 - #SBATCH -o multix1.o%j - #SBATCH -t 4:20:00 - # - # Author: C. Addison - # Initial version: 2020-11-19 - # - # Please read the file bede-README-batch.txt for details on this - # script.
- # - echo ========================================================= - echo SLURM job: submitted date = `date` - date_start=`date +%s` - - echo Nodes involved: - echo $SLURM_NODELIST - echo ========================================================= - echo Job output begins - echo ----------------- - echo - module load slurm/dflt - export PYTHON_HOME=/opt/software/apps/anaconda3/ - source $PYTHON_HOME/bin/activate wmlce_env - - export OMP_NUM_THREADS=1 # Disable multithreading - - bede-ddlrun python $PYTHON_HOME/envs/wmlce_env/tensorflow-benchmarks/resnet50/main.py \ - --mode=train_and_evaluate --iter_unit=epoch --num_iter=50 --batch_size=256 \ - --warmup_steps=100 --use_cosine_lr --label_smoothing 0.1 --lr_init=0.256 \ - --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \ - --data_dir=/nobackup/datasets/resnet50/TFRecords/ --results_dir=run_results \ - --use_xla --precision=fp16 --loss_scale=1024 --use_static_loss_scaling - - echo - echo --------------- - echo Job output ends - date_end=`date +%s` - seconds=$((date_end-date_start)) - minutes=$((seconds/60)) - seconds=$((seconds-60*minutes)) - hours=$((minutes/60)) - minutes=$((minutes-60*hours)) - echo ========================================================= - echo SLURM job: finished date = `date` - echo Total run time : $hours Hours $minutes Minutes $seconds Seconds - echo ========================================================= \ No newline at end of file +job to scale much beyond 16 GPUs. +Indeed, there is no speedup with this configuration using 32 GPUs. diff --git a/software/applications/wmlce/sbatch_resnet50base.sh b/software/applications/wmlce/sbatch_resnet50base.sh new file mode 100644 index 0000000..7cfddd6 --- /dev/null +++ b/software/applications/wmlce/sbatch_resnet50base.sh @@ -0,0 +1,50 @@ +#!/bin/bash -l +#SBATCH --account=bdXXXYY +#SBATCH --partition=gpu +#SBATCH --gres=gpu:4 +#SBATCH --nodes=1 +#SBATCH -o multix1.o%j +#SBATCH -t 4:20:00 +# +# Author: C.
Addison +# Initial version: 2020-11-19 +# +# Please read the file bede-README-batch.txt for details on this +# script. +# +echo ========================================================= +echo SLURM job: submitted date = `date` +date_start=`date +%s` + +echo Nodes involved: +echo $SLURM_NODELIST +echo ========================================================= +echo Job output begins +echo ----------------- +echo +module load slurm/dflt +export PYTHON_HOME=/opt/software/apps/anaconda3/ +source $PYTHON_HOME/bin/activate wmlce_env + +export OMP_NUM_THREADS=1 # Disable multithreading + +bede-ddlrun python $PYTHON_HOME/envs/wmlce_env/tensorflow-benchmarks/resnet50/main.py \ +--mode=train_and_evaluate --iter_unit=epoch --num_iter=50 --batch_size=256 \ +--warmup_steps=100 --use_cosine_lr --label_smoothing 0.1 --lr_init=0.256 \ +--lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \ +--data_dir=/nobackup/datasets/resnet50/TFRecords/ --results_dir=run_results \ +--use_xla --precision=fp16 --loss_scale=1024 --use_static_loss_scaling + +echo +echo --------------- +echo Job output ends +date_end=`date +%s` +seconds=$((date_end-date_start)) +minutes=$((seconds/60)) +seconds=$((seconds-60*minutes)) +hours=$((minutes/60)) +minutes=$((minutes-60*hours)) +echo ========================================================= +echo SLURM job: finished date = `date` +echo Total run time : $hours Hours $minutes Minutes $seconds Seconds +echo ========================================================= \ No newline at end of file diff --git a/usage/index.rst b/usage/index.rst index 8b3910c..d28c1c2 100644 --- a/usage/index.rst +++ b/usage/index.rst @@ -215,6 +215,10 @@ Examples: Multiple nodes (IBM PowerAI DDL) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +.. warning:: + + IBM PowerAI DDL is only supported on RHEL 7. + IBM PowerAI DDL (Distributed Deep Learning) is a method of using the GPUs in more than one node to perform calculations. 
Example job script: @@ -244,10 +248,6 @@ GPUs in more than one node to perform calculations. Example job script: echo "end of job" -.. warning:: - - IBM PowerAI DDL is only supported on RHEL 7 - .. _usage-maximum-job-runtime: Maximum Job Runtime