Revised WMLCE + Open-CE documentation. #102
Conversation
This may also have to include updates to the general Miniconda installation instructions. The current instructions for installing into a chosen directory are:

    sh Miniconda3-latest-Linux-ppc64le.sh -b -p $(pwd)/miniconda

The above will install silently into that directory.
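For reference, a minimal sketch of the full non-interactive install, assuming the installer is fetched from the standard repo.anaconda.com download location:

```bash
# Fetch the ppc64le Miniconda installer (assumed standard download location).
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-ppc64le.sh
# -b runs the installer in batch (silent) mode; -p sets the install prefix.
sh Miniconda3-latest-Linux-ppc64le.sh -b -p "$(pwd)/miniconda"
# Enable `conda` in the current shell from the new install.
source "$(pwd)/miniconda/etc/profile.d/conda.sh"
```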
Additionally, the current WMLCE instructions add the wmlce channel to the user's global conda channel configuration, not per environment. This breaks subsequent use of Open-CE, so there is likely a need to add instructions on how to deal with this. It would also be better to adjust the WMLCE instructions to only set the channel within the environment; the same applies to Open-CE (via the per-environment conda channel configuration).
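A minimal sketch of per-environment channel configuration; the environment name and the IBM channel URL are illustrative assumptions, not taken from the current docs:

```bash
# Create and activate the environment first.
conda create -y -n wmlce python=3.7
conda activate wmlce
# --env writes to the active environment's own .condarc rather than the
# user's global ~/.condarc, so other environments (e.g. Open-CE ones)
# are unaffected.
conda config --env --prepend channels \
    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/
```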
01fae60 to fe80467
The existing WMLCE benchmark on RHEL 7 with ddlrun and 4 GPUs in a single node behaves as described, although it suspiciously took 15 seconds less than the 5 hours I requested (compared to the 4 hours previously suggested).
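For context, the run in question was along these lines; the environment name, benchmark path, and flags below are assumptions rather than the exact invocation from the docs:

```bash
# Inside a single-node, 4-GPU job on the RHEL 7 image.
conda activate wmlce
# bede-ddlrun launches one learner per GPU via IBM DDL.
bede-ddlrun python "$CONDA_PREFIX/tensorflow-benchmarks/resnet50/main.py" \
    --batch_size 256 --num_iter 50 --iter_unit epoch
```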
Attempting to run this via bede-ddlrun on RHEL 8 resulted in errors. This appears to confirm that ddlrun/bede-ddlrun does not work on the RHEL 8 nodes, though I'll double-check this via Slack later. Attempting to run the benchmark on RHEL 8 without ddlrun, using a single GPU in a single node, resulted in the job being killed for OOM.
I.e. it requires more than 1/4 of the node's memory, but otherwise WMLCE TensorFlow does appear to work on RHEL 8, though given that it is very much EOL most users should migrate to Open-CE or upstream TensorFlow (and lose LMS).

Running the single-GPU version on RHEL 7 also died due to OOM, but several hours further into the run. To benchmark this correctly would require requesting a full node but only making one device available, e.g. via CUDA_VISIBLE_DEVICES as sketched below. Finding a more recent / more open benchmark to run might be a better plan.
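A sketch of such a job script, assuming a Slurm batch job on a 4-GPU node (directives, names, and paths are illustrative):

```bash
#!/bin/bash
# Request the whole node (all four GPUs and all of its memory)...
#SBATCH --exclusive
#SBATCH --gres=gpu:4
#SBATCH --time=05:00:00

# ...but expose only one device, so the single-GPU benchmark is not
# limited to a quarter of the node's memory.
export CUDA_VISIBLE_DEVICES=0

source "$(pwd)/miniconda/etc/profile.d/conda.sh"
conda activate wmlce
python "$CONDA_PREFIX/tensorflow-benchmarks/resnet50/main.py" --batch_size 256
```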
+ Adds Open-CE documentation page
  + Marks it as the successor to WMLCE
  + Lists the key features no longer available from WMLCE
  + Describes why to use Open-CE
  + Provides instructions for installing Open-CE packages into conda environments
+ Updates the TensorFlow page to refer to/use Open-CE, not WMLCE
  + Replaces the quickstart with an installation-via-conda section
+ Updates the PyTorch page to refer to/use Open-CE, not WMLCE
  + Replaces the quickstart with an installation-via-conda section
+ Updates the WMLCE page
  + Refers to Open-CE as the successor, emphasising that WMLCE is deprecated / no longer supported
  + Updates/tweaks the tensorflow-benchmarks resnet50 usage and description
+ Expands the Conda documentation
  + Includes upgrading installation instructions to source the preferred etc/profile.d/conda.sh, per https://github.com/conda/conda/blob/master/CHANGELOG.md#recommended-change-to-enable-conda-in-your-shell
  + conda python version selection should only use a single '=' (see the sketch below)
+ Updates the usage page, emphasising that ddlrun is not supported on RHEL 8

This does not include benchmarking of Open-CE, or RHEL 7/8 comparisons of WMLCE benchmarking, due to ddlrun errors on RHEL 8.

Closes #63
Closes #72
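For example, the recommended activation plus version selection might look like the following; the environment name and Python version are illustrative:

```bash
# Source the preferred activation script from the install prefix rather
# than prepending the install's bin/ directory to PATH.
source "$(pwd)/miniconda/etc/profile.d/conda.sh"

# A single '=' pins the version prefix (3.8.x) while still allowing patch
# releases; '==' would pin one exact version.
conda create -y -n open-ce python=3.8
conda activate open-ce
```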
I don't now remember the context, but I guess if it's documented to use open-ce rather than wmlce, that's OK. For what it's worth, there's something about it in the Summit docs (specifically about RHEL8, I think).
Improves the state of the WMLCE documentation, adds Open-CE documentation, and updates the TensorFlow/PyTorch pages to reflect this change.
Linked issues:
+ bede-ddlrun on RHEL 8 [Possibly Out of Date]
+ WMLCE resnet50 benchmark section
+ Benchmarking: use the resnet benchmark from above on <= 4 nodes?

Closes #63
Closes #72