confusion with wmcle #63
Comments
We're waiting for the upgrade to RHEL 8 before making these changes.
I'd started working on this in #67, but to avoid blocking that from being merged I'll defer figuring out the exact state of Open-CE and WMLCE until later; for now I'm just moving the existing (potentially out of date) WMLCE docs to their new location.

My current understanding is that WMLCE (or PowerAI, its other name) 1.7 was the final release, from 2020-02-21. It only officially supports RHEL 7.6/7.7 with CUDA driver 440 on Power9 hosts. It included / supported TensorFlow 2.1, PyTorch 1.3.1, and Horovod 0.19 amongst others (i.e. more recent versions do not support any IBM-specific features, unless upstreamed). TensorFlow LMS could be enabled by `tf.config.experimental.set_lms_enabled(True)`.

Open-CE (Open Cognitive Environment) is a non-IBM set of conda packages designed to work together, and be easily distributed via a single conda channel. https://github.com/open-ce It supports multiple CPU architectures, including x86 and Power. https://ftp.osuosl.org/pub/open-ce/current/ Open-CE requires conda >= 3.8.3. Open-CE releases support specific versions of TensorFlow etc.

In general, LMS doesn't look like it is supported outside of WMLCE.

For changes to the docs post #67, I'd lean towards:
Summit's documentation suggests using https://docs.olcf.ornl.gov/software/analytics/ibm-wml-ce.html#running-distributed-deep-learning-jobs

For my reference in the future, my WIP comments about this were as follows:

.. WMLCE / PowerAI 1.7 is the final release, from 2020-02-21. Archived on 2020-11-10.
.. https://www.ibm.com/support/pages/get-started-ibm-wml-ce
.. Only supported RHEL 7.6 and 7.7, with driver 440.
.. TF 2.1, PyTorch 1.3.1, Horovod 0.19, TFLMS (via tf.config.experimental.set_lms_enabled(True))
.. Open-CE (Open Cognitive Environment) replaces wmlce.
.. https://github.com/open-ce
.. https://github.com/open-ce/open-ce
.. Supports Power/x86. Python 3.7 to 3.9. CUDA 10.2, 11.0, 11.2.
.. Requires conda >= 3.8.3
.. Oregon State hosts pre-built packages for Power and x86: https://ftp.osuosl.org/pub/open-ce/current/
.. MIT hosts pre-built Open-CE packages: https://opence.mit.edu/
.. Open-CE 1.2.2 has TF 2.4.2, PyTorch 1.7.1, Horovod 0.21.0
.. Open-CE 1.0.0 has TF 2.3.1, PyTorch 1.6.0, Horovod 0.19.5
.. Docs plan:
.. Main section will be Open-CE. Blurb stating it was formerly WMLCE, which is no longer supported and will no longer be available after the RHEL 8 upgrade.
.. List the missing features?
.. * LMS doesn't appear to have been upstreamed for tf or pytorch.
.. * ddlrun/bede-ddlrun - These are probably not supported either.
.. Update the tf/torch docs to include this?
.. It may be worth benchmarking resnet50 again with and without ddlrun?
.. Satori docs may provide additional context https://mit-satori.github.io/satori-ai-frameworks.html
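Since the notes above say LMS was enabled through `tf.config.experimental.set_lms_enabled(True)` in WMLCE and does not appear to have been upstreamed, a script that has to run under both WMLCE and Open-CE could probe for the call before using it. The following is only a minimal sketch under that assumption: `try_enable_lms` is a hypothetical helper name, and the stand-in objects below merely imitate the two kinds of TensorFlow build so the sketch runs without TensorFlow installed.

```python
from types import SimpleNamespace


def try_enable_lms(tf):
    """Enable Large Model Support if this TensorFlow build exposes it.

    `tf` is the imported tensorflow module (or a stand-in with the same shape).
    Returns True if set_lms_enabled(True) was called, False if the build has
    no such function (assumed here to be the case outside WMLCE).
    """
    experimental = getattr(getattr(tf, "config", None), "experimental", None)
    enable = getattr(experimental, "set_lms_enabled", None)
    if enable is None:
        return False
    enable(True)
    return True


# Stand-ins for illustration only, not real TensorFlow builds.
wmlce_like = SimpleNamespace(
    config=SimpleNamespace(
        experimental=SimpleNamespace(set_lms_enabled=lambda flag: None)
    )
)
upstream_like = SimpleNamespace(
    config=SimpleNamespace(experimental=SimpleNamespace())
)

print(try_enable_lms(wmlce_like))     # LMS call is present
print(try_enable_lms(upstream_like))  # LMS call is missing
```

In real use the argument would be the result of `import tensorflow as tf`; a `False` return would then be the cue to fall back to a smaller batch size or model rather than relying on LMS.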
…ich are much more manageable, with their own easier to find rendered pages. Closes #61

Whilst splitting this file into many smaller files, a number of additions and changes were made to the documentation, including:

+ Adds (basic) documentation for:
  + IBM XL compilers (Closes #61)
  + Amber (Part of #78)
  + EMAN2
  + GRACE
  + Gromacs (Closes #37, part of #79)
  + NAMD
  + OpenMM
  + PLUMED
  + Singularity (Apptainer) (Closes #49)
  + Generic python information, with more detailed conda usage (Closes #47)
  + nvidia-smi (Closes #75)
  + HECBioSim project
  + IBM Collaboration project
  + Boost Module
  + FFTW module
  + NVTX library
  + PLUMED library
  + VTK
  + CMake
  + Make
+ Creates new `guides` section
  + Migrates the `profiling` documentation into the guides section
  + Migrates the `wanderings` about CUDA into the guides section
+ Adds some notes/warnings about potential WMLCE + RHEL 8 incompatibility. Larger changes still required (#63)
+ CSS/JS/_templates changes for a useful sidebar with the bootstrap theme with split source files
  + New issue #87 opened to consider replacing the theme to an actively maintained theme.
+ Removes relations.html from the sidebar, as styling issues were difficult to resolve nicely (Closes #77)
+ Adds sphinxext-rediraffe plugin for redirects for moved .html files (see conf.py)
+ Assorted RST improvements (links, crossrefs, quoteblocks, code-block, note, etc.)
+ Clarify module loads for RHEL 7 vs RHEL 8 where appropriate (Part of #73).
+ Assorted other improvements throughout the documentation
+ Adds the sphinx-copybutton plugin, for easy to copy code-block contents.

History was a little messy, so has been squashed to avoid `.git` bloat.
+ Adds Open-CE documentation page
  + Marks as successor to WMLCE
  + Lists the key features no longer available from WMLCE
  + Describes why to use Open-CE
  + Provides instructions for installing Open-CE packages into conda environments
+ Updates TensorFlow page to refer to/use Open-CE not WMLCE
  + Replaces quickstart with installation via conda section
+ Updates PyTorch page to refer to/use Open-CE not WMLCE
  + Replaces quickstart with installation via conda section
+ Updates WMLCE page
  + Refer to Open-CE as successor, emphasising that WMLCE is deprecated / no longer supported
  + Update/Tweak tensorflow-benchmarks resnet50 usage+description.
+ Expands Conda documentation
  + Includes upgrading installation instructions to source the preferred etc/profile.d/conda.sh
    + https://github.com/conda/conda/blob/master/CHANGELOG.md#recommended-change-to-enable-conda-in-your-shell
  + conda python version selection should only use a single '='
+ Updates usage page emphasising ddlrun is not supported on RHEL 8

This does not include benchmarking of Open-CE or RHEL 7/8 comparisons of WMLCE benchmarking due to ddlrun errors on RHEL 8.

Closes #63
Closes #72
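To illustrate the conda points in this commit message (installing Open-CE packages into a conda environment, and selecting the python version with a single '='), here is a rough sketch that only builds the `conda create` command line rather than running it. The channel URL is the Oregon State mirror mentioned earlier in this thread; the environment name `opence-tf`, the python version `3.8`, and the helper name `conda_create_args` are illustrative assumptions, not taken from the actual docs.

```python
import shlex

# Pre-built Open-CE channel mentioned earlier in the thread.
OPEN_CE_CHANNEL = "https://ftp.osuosl.org/pub/open-ce/current/"


def conda_create_args(env_name, python_version, packages, channel=OPEN_CE_CHANNEL):
    """Build the argument list for a `conda create` call.

    The python version is pinned with a single '=' (e.g. python=3.8),
    as recommended in the commit message above.
    """
    return [
        "conda", "create", "-y",
        "--name", env_name,
        "--channel", channel,
        f"python={python_version}",
        *packages,
    ]


# Hypothetical environment name and package choice, for illustration only.
args = conda_create_args("opence-tf", "3.8", ["tensorflow"])
print(shlex.join(args))
```

The resulting list could be passed to `subprocess.run(args, check=True)` from a shell where conda has already been enabled (for example by sourcing `etc/profile.d/conda.sh`); which package versions actually resolve depends on the Open-CE release hosted on the channel.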
PR #102 is now ready for review. It documents Open-CE and adds a number of updates to the WMLCE section to clearly show that it is deprecated / not supported and will not (fully) work on RHEL 8. @loveshack I've requested your review to see if you feel it has clarified the concerns you raised, but no pressure to provide a review.
I had a user who was confused by "Powerai and wmlce" saying "Possibly Out of Date" and going to the IBM site. I guess it should say that's superseded by opence, which dropped the large model support.
There could be a pointer to the LM patches, and the discussions about (not) merging them, in case someone is motivated to update them.