confusion with wmcle #63

Closed
loveshack opened this issue Jul 30, 2021 · 3 comments · Fixed by #102

@loveshack
Collaborator

I had a user who was confused by "Powerai and wmlce" saying "Possibly Out of Date" and going to the IBM site. I guess it should say that's superseded by opence, which dropped the large model support.
There could be a pointer to the LM patches, and the discussions about (not) merging them, in case someone is motivated to update them.

@markdturner
Contributor

We're waiting for the upgrade to RHEL 8 before making these changes.

@ptheywood
Member

ptheywood commented Jan 31, 2022

I'd started working on this in #67, but to avoid blocking that from being merged I'll defer figuring out the exact state of opence and wmlce until later. For now I'm just moving the existing (potentially out of date) wmlce docs to their new location.


My current understanding is that WMLCE (or PowerAI, its other name) 1.7 was the final release, from 2020-02-21. It only officially supports RHEL 7.6/7.7 with CUDA driver 440 on Power9 hosts.
I do not know if it works with RHEL8 or not.

It included / supported TensorFlow 2.1, PyTorch 1.3.1, and Horovod 0.19 amongst others (i.e. more recent versions of these packages do not support any IBM-specific features, unless upstreamed).

TensorFlow LMS could be enabled by tf.config.experimental.set_lms_enabled(True) in that version, but as far as I could tell when I last looked, the LMS patches were never upstreamed.
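
A minimal sketch of guarding that call, assuming a script that may run against either the WMLCE build of TensorFlow 2.1 or an upstream build without the LMS patches. `enable_lms_if_available` is a hypothetical helper name, not part of TensorFlow or WMLCE; it only calls `set_lms_enabled` when the attribute is present.

```python
def enable_lms_if_available(tf_module):
    """Enable TensorFlow Large Model Support if this build provides it.

    Returns True if LMS was enabled, False if the build has no LMS hook
    (which appears to be the case for upstream TensorFlow).
    """
    config = getattr(tf_module, "config", None)
    experimental = getattr(config, "experimental", None)
    set_lms_enabled = getattr(experimental, "set_lms_enabled", None)
    if set_lms_enabled is None:
        return False
    set_lms_enabled(True)
    return True


# Intended usage (requires TensorFlow to be installed):
#   import tensorflow as tf
#   if not enable_lms_if_available(tf):
#       print("LMS is not available in this TensorFlow build")
```

The helper takes the module as an argument so the same script can fall back cleanly on builds without LMS rather than raising `AttributeError`.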

Open-CE (An open cognitive environment) is a non-IBM set of conda packages designed to work together, and be easily distributed by a single conda channel.

https://github.com/open-ce
https://github.com/open-ce/open-ce

It supports multiple CPU architectures, including x86 and Power.
OSU provide a hosted x86/power conda channel, while MIT host a power channel.

https://ftp.osuosl.org/pub/open-ce/current/
https://opence.mit.edu/

Open-CE requires conda >= 3.8.3, and supports Python 3.7 to 3.9 and CUDA 10.2, 11.0 and 11.2 (as of when I originally looked into this; it may have changed since).
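
As a rough illustration of what usage might look like, the sketch below builds a `conda create` command that pulls a package from the OSU-hosted channel. The helper name `open_ce_create_command`, the environment name and the default Python version are assumptions for illustration; the channel URL and the supported Python range are taken from the notes above and may be out of date.

```python
import shlex

# OSU-hosted Open-CE channel, as linked above.
OPEN_CE_CHANNEL = "https://ftp.osuosl.org/pub/open-ce/current/"

# Python versions listed as supported above (may have changed since).
SUPPORTED_PYTHON = ("3.7", "3.8", "3.9")


def open_ce_create_command(env_name, package, python_version="3.8"):
    """Return the argument list for creating a conda env from the Open-CE channel."""
    if python_version not in SUPPORTED_PYTHON:
        raise ValueError(
            f"Python {python_version} is not one of {SUPPORTED_PYTHON}"
        )
    return [
        "conda", "create",
        "--name", env_name,
        "--channel", OPEN_CE_CHANNEL,
        # A single '=' lets conda match any patch release of this version.
        f"python={python_version}",
        package,
    ]


if __name__ == "__main__":
    # Print the command rather than running it, so it can be inspected first.
    print(shlex.join(open_ce_create_command("tf-env", "tensorflow")))
```

Returning an argument list keeps the command easy to pass to `subprocess.run` later without shell quoting issues.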

OpenCE releases support specific versions of tensorflow etc.

In general, LMS doesn't look like it is supported outside of wmlce.
ddlrun (and therefore bede-ddlrun) don't appear to be supported either.
It might be nice to run some benchmarks with and without ddlrun before the migration away from RHEL 7 progresses, to see how much of an impact losing ddlrun might have.


For changes to the docs post #67 , I'd lean towards:

  • open-ce.rst - a new page
    • Describes OpenCE as a stand-alone project, independent of WMLCE, including usage.
  • wmlce.rst updates
    • Update the current WMLCE docs to clearly state that WMLCE is no longer an active project, and that post RHEL 8 migration it might not work at all (it would be good to verify this). Then refer to the Open-CE docs.
  • tensorflow.rst
    • Add cross-reference(s) to open-ce.rst as one mechanism for installing tensorflow
    • Describe DDLrun/LMS status (i.e. probably no longer usable) once confirmed
  • pytorch.rst
    • Add cross-reference(s) to open-ce.rst as one mechanism for installing pytorch
    • Describe DDLrun status (i.e. probably no longer usable) once confirmed
  • usage/index.rst
    • Update usage of bede-ddlrun for RHEL 8, once known what the status will be.

Summit's documentation suggests using jsrun in place of the deprecated ddlrun. jsrun is the IBM scheduler command, so this would map to srun/sbatch on Bede (with appropriate flags?).

https://docs.olcf.ornl.gov/software/analytics/ibm-wml-ce.html#running-distributed-deep-learning-jobs


For my reference in the future, my WIP comments about this were as follows

.. WMLCE /PowerAI 1.7 is the final release, from 2020-02-21. Archived on 2020-11-10. 
.. https://www.ibm.com/support/pages/get-started-ibm-wml-ce
.. Only supported RHEL 7.6 and 7.7, with driver 440.
.. TF 2.1, PyTorch 1.3.1, Horovod 0.19, TFLMS (via tf.config.experimental.set_lms_enabled(True))


.. Open-CE (Open Cognitive Environment) replaces wmlce. 
.. https://github.com/open-ce
.. https://github.com/open-ce/open-ce
.. Supports Power/x86. Python 3.7 to 3.9. CUDA 10.2, 11.0, 11.2.
.. Requires conda >= 3.8.3
.. Oregon state hosts pre-build for power and x86 https://ftp.osuosl.org/pub/open-ce/current/
.. MIT hosts pre-build OpenCE https://opence.mit.edu/
.. OpenCE 1.2.2 TF 2.4.2, pytorch 1.7.1, horovod 0.21.0, 
.. OpenCE 1.0.0 has TF 2.3.1 , pytorch 1.6.0, horovod 0.19.5

.. Docs plan:
.. Main section will be OpenCE. Blurb stating formerly WMLCE, but no longer supported, and will be no longer available from RHEL 8 upgrade. 
.. List the missing features? 
.. * LMS doesn't appear to have been upstreamed for tf or pytorch.
.. * ddlrun/bede-ddlrun - These are probably not supported either.  
.. Update the tf/torch docs to include this?
.. It may be worth benchmarking resnet50 again with and without ddlrun?

.. Satori docs may provide additional context https://mit-satori.github.io/satori-ai-frameworks.html

ptheywood added a commit that referenced this issue Jan 31, 2022
…ich are much more manageable, with their own easier to find rendered pages.

Closes #61

Whilst splitting this file into many smaller files, a number of additions and changes were made to the documentation, including:

+ Adds (basic) documentation for:
  + IBM XL compilers (Closes #61)
  + Amber (Part of #78)
  + EMAN2
  + GRACE
  + Gromacs (Closes #37, part of #79)
  + NAMD
  + OpenMM
  + PLUMED
  + Singularity (Apptainer) (Closes #49)
  + Generic python information, with more detailed conda usage (Closes #47)
  + nvidia-smi (Closes #75)
  + HECBioSim project
  + IBM Collaboration project
  + Boost Module
  + FFTW module
  + NVTX library
  + PLUMED library
  + VTK
  + CMake
  + Make
+ Creates new `guides` section
  + Migrates the `profiling` documentation into the guides section
  + Migrates the `wanderings` about CUDA into the guides section
+ Adds some notes/warnings about potential WMLCE + RHEL 8 incompatibility. Larger changes still required (#63)
+ CSS/JS/_templates changes for a useful sidebar with the bootstrap theme with split source files
  + New issue #87 opened to consider replacing the theme to an actively maintained theme.
+ Removes relations.html from the sidebar, as styling issues were difficult to resolve nicely (Closes #77)
+ Adds sphinxext-rediraffe plugin for redirects for moved .html files (see conf.py)
+ Assorted RST improvements (links, crossrefs, quoteblocks, code-block, note, etc.)
+ Clarify module loads for RHEL 7 vs RHEL 8 where appropriate (Part of #73).
+ Assorted other improvements throughout the documentation
+ Adds the sphinx-copybutton plugin, for easy to copy code-block contents.

History was a little messy, so has been squashed to avoid `.git` bloat.
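
For context, a minimal sketch of what the two Sphinx plugins named in this commit message might look like in `conf.py`. The extension module names are the ones those plugins document; the redirect paths are hypothetical and are not the repository's actual file layout.

```python
# conf.py (fragment)

extensions = [
    "sphinxext.rediraffe",  # generates redirects for pages that have moved
    "sphinx_copybutton",    # adds a copy button to code-block contents
]

# Map old source paths to their new locations, relative to the docs source
# directory. These paths are hypothetical placeholders.
rediraffe_redirects = {
    "old/wmlce.rst": "new/wmlce.rst",
}
```

rediraffe also accepts a path to a redirects file instead of a dict, which may be easier to maintain when many pages move at once.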
@ptheywood ptheywood self-assigned this Feb 21, 2022
ptheywood added a commit that referenced this issue Mar 7, 2022
+ Adds Open-CE documentation page
  + Marks as successor to WMLCE
  + Lists the key features no longer available from WMLCE
  + Describes why to use Open-CE
  + provides instructions for installing Open-CE packages into conda environments
+ Updates TensorFlow page to refer to/use Open-CE not WMLCE
  + Replaces quickstart with installation via conda section
+ Updates PyTorch page to refer to/use Open-CE not WMLCE
  + Replaces quickstart with installation via conda section
+ Updates WMLCE page
  + Refer to Open-CE as successor, emphasising that WMLCE is deprecated / no longer supported
  + Update/Tweak tensorflow-benchmarks resnet50 usage+description.
+ Expands Conda documentation
  + Includes updating installation instructions to source the preferred etc/profile.d/conda.sh
    + https://github.com/conda/conda/blob/master/CHANGELOG.md#recommended-change-to-enable-conda-in-your-shell
  + conda python version selection should only use a single '='
+ Updates usage page emphasising ddlrun is not supported on RHEL 8

This does not include benchmarking of open-CE or RHEL 7/8 comparisons of WMLCE benchmarking due to ddlrun errors on RHEL 8.

Closes #63
Closes #72
@ptheywood
Member

PR #102 is now ready for review, which documents Open-CE and adds a number of updates to the WMLCE section to clearly show it is deprecated / not supported and will not (fully) work on RHEL 8.

@loveshack I've requested your review to see if you feel it has clarified the concerns you raised, but no pressure to provide a review.

@ptheywood ptheywood moved this to New in Documentation Mar 24, 2022
@ptheywood ptheywood moved this from New to In Progress in Documentation Mar 24, 2022
Repository owner moved this from In Progress to Done in Documentation Apr 19, 2022