Adding IBM WML-CE Documentation

gmarkomanolis · Oct 29, 2019 · b602789 · b602789
1 parent 90bbc92
commit b602789
Show file tree

Hide file tree

Showing 4 changed files with 264 additions and 2 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1 +1,2 @@
 _build
+.vscode
diff --git a/images/ibm_wml_ddl_resnet50.png b/images/ibm_wml_ddl_resnet50.png
diff --git a/software/analytics/ibm-wml-ce.rst b/software/analytics/ibm-wml-ce.rst
@@ -1,5 +1,264 @@
 
+*************************************************************************************
 IBM Watson Machine Learning CE
-===============================
+*************************************************************************************
 
-(Coming soon!)
+Getting Started
+===============
+
+IBM Watson Machine Learning Community Edition 1.6.1 is provided on Summit
+through the module ``ibm-wml-ce``. This module includes a license for IBM
+Distributed Deep Learning (DDL) allowing execution across up to 954 nodes.
+
+To access the IBM WML CE packages use the ``module load`` command:
+
+.. code-block:: bash
+
+    module load ibm-wml-ce
+
+This will activate a conda environment which is pre-loaded with the following
+packages, and their dependencies:
+
+* `IBM DDL 1.4.0 <https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.1/navigation/wmlce_getstarted_ddl.html>`_
+
+* `Tensorflow 1.14 <https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.1/navigation/wmlce_getstarted_tensorflow.html>`_
+
+* `Pytorch 1.1.0 <https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.1/navigation/wmlce_getstarted_pytorch.html>`_
+
+* `Caffe(IBM-enhanced) 1.0.0 <https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.1/navigation/wmlce_getstarted_caffe.html>`_
+
+* `Horovod (IBM-DDL Backend) <https://github.com/horovod/horovod>`_
+
+For a complete list of packages and their versions please see: `WMLC CE Software Packages <https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.1/navigation/wmlce_software_pkgs.html>`_.
+
+Running DDL Jobs
+================
+
+IBM DDL provides a tool called ``ddlrun`` which facilitates the launching of
+DDL jobs on Summit. When called inside of a ``bsub`` script, ``ddlrun`` will
+automatically distribute the training job across all the compute hosts in the
+reservation. ``ddlrun`` automatically selects default arguments optimized
+for performance.
+
+Basic DDL BSUB Script
+---------------------
+
+The following bsub script will run a distributed Tensorflow resnet50 training job
+across 2 nodes.
+
+.. code-block:: bash
+    :caption: script.bash
+
+    #BSUB -P <PROJECT>
+    #BSUB -W 0:10
+    #BSUB -nnodes 2
+    #BSUB -q batch
+    #BSUB -J ddl_test_job
+    #BSUB -o /ccs/home/<user>/job%J.out
+    #BSUB -e /ccs/home/<user>/job%J.out
+
+    module load ibm-wml-ce
+
+    ddlrun python $CONDA_PREFIX/tf_cnn_benchmarks/tf_cnn_benchmarks.py --variable_update=ddl --model=resnet50
+
+``bsub`` is used to launch the script as follows:
+
+.. code-block:: bash
+
+    bsub script.bash
+
+For more information on ``bsub`` and job submission
+please see: :ref:`running-jobs`.
+
+Troubleshooting Tips
+--------------------
+
+Full command
+^^^^^^^^^^^^
+
+The output from ``ddlrun`` includes the exact command used to launch the
+distributed job. This is useful if a user wants to see exactly what ``ddlrun``
+is doing. The following is the first line of the output from the above script:
+
+.. code-block:: console
+
+    $ module load ibm-wml-ce
+    (ibm-wml-ce-1.6.1) $ ddlrun python $CONDA_PREFIX/tf_cnn_benchmarks/tf_cnn_benchmarks.py --variable_update=ddl --model=resnet50
+    + /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/xl-16.1.1-3/spectrum-mpi-10.3.0.1-20190611-aqjt3jo53mogrrhcrd2iufr435azcaha/bin/mpirun \
+      -x LSB_JOBID -x PATH -x PYTHONPATH -x LD_LIBRARY_PATH -x LSB_MCPU_HOSTS -x NCCL_LL_THRESHOLD=0 -x NCCL_TREE_THRESHOLD=0 \
+      -disable_gdr -gpu --rankfile /tmp/DDLRUN/DDLRUN.xoObgjtixZfp/RANKFILE -x "DDL_OPTIONS=-mode p:6x2x1x1 " -n 12 \
+      -mca plm_rsh_num_concurrent 12 -x DDL_HOST_PORT=2200 -x "DDL_HOST_LIST=g28n14:0,2,4,6,8,10;g28n15:1,3,5,7,9,11" bash \
+      -c 'source /sw/summit/ibm-wml-ce/anaconda-base/etc/profile.d/conda.sh && conda activate /sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.1-1 \
+      > /dev/null 2>&1 && python /sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.1-1/ddl-tensorflow/examples/mnist/mnist-env.py'
+    ...
+
+
+Verbose mode
+^^^^^^^^^^^^
+
+Using the verbose flag (``-v``) with ``ddlrun`` displays much more debugging
+information. This should be the first step to troubleshoot errors when
+launching a distributed job.
+
+Setting up Custom Environments
+==============================
+
+The ``IBM-WML-CE-1.6.1`` conda environment is read-only. Therefore, users
+cannot install any additional packages that may be needed. If users need
+any additional conda or pip packages, they can clone the ``IBM-WML-CE-1.6.1``
+conda environment into their home directory and then add any packages they
+need.
+
+.. code-block:: console
+
+    $ module load ibm-wml-ce
+    (ibm-wml-ce-1.6.1) $ conda create --name cloned_env --clone ibm-wml-ce-1.6.1
+    (ibm-wml-ce-1.6.1) $ conda activate cloned_env
+    (cloned_env) $
+
+By default this should create the cloned environment in
+``/ccs/home/${USER}/.conda/envs/cloned_env``.
+
+To activate the new environment you should still load the module first. This
+will ensure that all of the conda settings remain the same.
+
+.. code-block:: console
+
+    $ module load ibm-wml-ce
+    (ibm-wml-ce-1.6.1) $ conda activate cloned_env
+    (cloned_env) $
+
+To use Horovod with the IBM DDL backend in a cloned environment, the user must
+``pip`` install Horovod using the following command:
+
+.. code-block:: bash
+
+    HOROVOD_CUDA_HOME="${CONDA_PREFIX}" HOROVOD_GPU_ALLREDUCE=DDL pip install --no-cache-dir git+https://github.com/horovod/horovod.git@9f87459ead9ebb7331e1cd9cf8e9a5543ecfb784
+
+Best DDL Performance
+====================
+
+Most users will get good performance using LSF basic job submission, and
+specifying the node count with ``-nnodes N``. However, users trying
+to squeeze out the final few percent of performance can use the following
+techniques.
+
+Reserving Whole Racks
+---------------------
+
+When making node reservations for DDL jobs, it is best to reserve nodes in a
+rack-contiguous manner. IBM DDL optimizes communication with knowledge of the
+node layout.
+
+In order to instruct BSUB to reserve nodes in the same rack, expert mode must
+be used (``-csm y``), and the user needs to explicitly specify the reservation
+string. For more information on Expert mode see: :ref:`easy_mode_v_expert_mode`
+
+The following BSUB arguments and reservation string instruct ``bsub`` to
+reserve 2 compute nodes within the same rack:
+
+.. code-block:: bash
+
+    #BSUB -csm y
+    #BSUB -n 85
+    #BSUB -R 1*{select[((LN)&&(type==any))]order[r15s:pg]span[hosts=1]cu[type=rack:pref=config]}+84*{select[((CN)&&(type==any))]order[r15s:pg]span[ptile=42]cu[type=rack:maxcus=1]}
+
+``-csm y`` enables 'expert mode'.
+
+``-n 85`` the total number of slots must be requested, as ``-nnodes`` is not
+compatible with expert mode.
+
+We can break the reservation string down to understand each piece.
+
+1. The first term is needed to include a launch node in the reservation.
+
+    .. code-block:: bash
+
+        1*{select[((LN)&&(type==any))]order[r15s:pg]span[hosts=1]cu[type=rack:pref=config]}
+
+2. The second term specifies how many compute slots and how many racks.
+
+    .. code-block:: bash
+
+        +84*{select[((CN)&&(type==any))]order[r15s:pg]span[ptile=42]cu[type=rack:maxcus=1]}
+
+    * Here the ``84`` slots represents 2 compute nodes. Each compute node has 42 compute slots.
+
+    * The ``maxcus=1`` specifies that the nodes can come from at most 1 rack.
+
+Best DDL Arguments
+------------------
+
+Summit is comprised of 256 racks of 18 nodes with 6 GPUs each. For more
+information about the hardware of Summit please see: :ref:`system-overview`.
+
+DDL works best with topological knowledge of the cluster.
+``GPUs per Node X Nodes per Rack X Racks Per Aisle X Aisles`` Some of this
+information can be acquired automatically, but some has to be specified
+by the user.
+
+To get the best performance reservations should be made in multiples of 18,
+and the user should pass topology arguments to ``DDLRUN``.
+
+* ``--nodes 18`` informs DDL that there are 18 nodes per rack. Specifying 18
+  nodes per rack gave the best performance in preliminary testing, but it may
+  be that logically splitting racks in half (``--nodes 9``) or logically
+  grouping racks (``--nodes 36``) could lead to better performance on other
+  workloads.
+
+* ``--racks 4`` informs DDL that there are 4 racks per aisle. Summit is a
+  fat tree, but preliminary testing showed that grouping racks into logical
+  aisles of 4 racks gave the best performance.
+
+* ``--aisles 2`` informs DDL that there are 2 total aisles.
+  ``Nodes X Racks X Aisles`` must equal the total number of nodes in the LSF
+  reservation.
+
+If running on 144 nodes, the following ``ddlrun`` command should
+give good performance.
+
+.. code-block:: bash
+
+    ddlrun --nodes 18 --racks 4 --aisles 2 python script.py
+
+For more information on ``ddlrun``, please see: `DDLRUN <https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.1/navigation/wmlce_ddlrun.html>`_.
+
+
+Example
+===================
+
+The following graph shows the scaling performance of the
+``tf_cnn_benchmarks`` implementation of the Resnet50 model
+running on Summit during initial benchmark testing.
+
+.. figure:: /images/ibm_wml_ddl_resnet50.png
+   :align: center
+
+   Figure 1. Performance Scaling of IBM DDL on Summit
+
+The following LSF script can be used to reproduce the results for 144 nodes:
+
+.. code-block:: bash
+
+    #BSUB -P <PROJECT>
+    #BSUB -W 1:00
+    #BSUB -csm y
+    #BSUB -n 6049
+    #BSUB -R "1*{select[((LN) && (type == any))] order[r15s:pg] span[hosts=1] cu[type=rack:pref=config]}+6048*{select[((CN) && (type == any))] order[r15s:pg] span[ptile=42] cu[type=rack:maxcus=8]}"
+    #BSUB -q batch
+    #BSUB -J <PROJECT>
+    #BSUB -o /ccs/home/user/job%J.out
+    #BSUB -e /ccs/home/user/job%J.out
+
+    module load ibm-wml-ce
+
+    ddlrun --nodes 18 --racks 4 --aisles 2 python $CONDA_PREFIX/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
+        --variable_update=horovod\
+        --model=resnet50 \
+        --num_gpus=1 \
+        --batch_size=256 \
+        --num_batches=100 \
+        --num_warmup_batches=10 \
+        --data_name=imagenet \
+        --allow_growth=True \
+        --use_fp16
diff --git a/systems/summit_user_guide.rst b/systems/summit_user_guide.rst
@@ -1888,6 +1888,8 @@ functionality.
 | ``#BSUB -alloc_flags gpumps``   | No equivalent (set via environment variable)   | Enable multiple MPI tasks to simultaneously access a GPU   |
 +---------------------------------+------------------------------------------------+------------------------------------------------------------+
 
+.. _easy_mode_v_expert_mode:
+
 Easy Mode vs. Expert Mode
 -------------------------
-Original file line number
+Diff line change
@@ Expand Up / @@ -1888,6 +1888,8 @@ functionality. @@
     | ``#BSUB -alloc_flags gpumps``   | No equivalent (set via environment variable)   | Enable multiple MPI tasks to simultaneously access a GPU   |
     +---------------------------------+------------------------------------------------+------------------------------------------------------------+
+    .. _easy_mode_v_expert_mode:
     Easy Mode vs. Expert Mode
     -------------------------
@@ Expand Down @@