
Revised WMLCE + Open-CE documentation. #102

Merged
3 commits merged from wmlce-opence into main on Apr 19, 2022
Conversation

@ptheywood (Member) commented Feb 17, 2022

Improves the state of the WMLCE documentation, adds Open-CE documentation, and updates the TensorFlow/PyTorch pages to reflect this change.

  • Updates WMLCE page
    • Document current project status (Deprecated / unsupported)
    • Clearly state that users should probably switch to Open-CE or upstream distributions
    • Verify if it is still usable on RHEL 8 or not, update accordingly
      • Errors occurred when attempting to use bede-ddlrun on RHEL 8.
    • Update usage instructions to no longer be [Possibly Out of Date].
    • Update WMLCE resnet50 benchmark section
      • This is only available via WMLCE, with a licence which may prevent distribution outside of WMLCE, so it has not been re-run to generate results.
      • ddlrun errors on RHEL 8 as expected, so it is not useful for comparing WMLCE on RHEL 7 vs 8.
  • Adds a page documenting Open-CE, the successor to WMLCE.
    • Basic Description
    • Usage with example
      • Verify instructions
    • Why should someone use Open-CE rather than upstream TensorFlow/PyTorch?
    • Clear description of missing WMLCE features (LMS, DDLRUN?, Others?)
    • Benchmarking, use the resnet benchmark from above on <= 4 nodes?
      • Not benchmarking, as WMLCE's tensorflow-benchmarks is not openly licensed by IBM, and ddlrun doesn't work on RHEL 8.
    • Cross-reference TensorFlow, PyTorch and WMLCE.
  • Add cross-references to Open-CE from the TensorFlow, Conda and PyTorch pages

Closes #63
Closes #72

@ptheywood (Member, Author) commented:

This may also have to include updates to general miniconda installation instructions.

The current instructions for installing into /nobackup/projects/<project> will actually install into /users/.

sh Miniconda3-latest-Linux-ppc64le.sh -b -p $(pwd)/miniconda

The above will install silently into a miniconda directory within the current directory, and will not update the user's .bashrc, which may or may not be desirable.
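
For comparison, a minimal sketch of a silent install into the project's /nobackup area (this assumes the standard Miniconda download URL, and <project> is a placeholder for the actual project code):

# Download the ppc64le Miniconda installer.
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-ppc64le.sh

# -b installs silently (no licence prompt, no .bashrc changes); -p sets an
# explicit prefix under the project directory rather than the current directory.
sh Miniconda3-latest-Linux-ppc64le.sh -b -p /nobackup/projects/<project>/miniconda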

@ptheywood (Member, Author) commented:

Additionally, the current WMLCE instructions add the WMLCE channel to the user's global conda channel configuration, not per-environment.

This breaks subsequent use of Open-CE. Instructions on how to deal with this will likely be needed (i.e. if an UnsatisfiableError is raised).

It would also be better to adjust the WMLCE instructions to only set the channel within the environment. The same applies to Open-CE (via the --env flag on conda config?).
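
As a minimal sketch of the per-environment approach (the environment name and <channel> are placeholders; conda config --env writes to the active environment's .condarc rather than the global one):

# Create and activate an environment first, so that --env targets it.
conda create -y -n opence python=3.9
conda activate opence

# Add the channel to this environment's .condarc only, not the global config.
# <channel> is a placeholder for the WMLCE or Open-CE channel in question.
conda config --env --prepend channels <channel>

# Show which .condarc files are in effect and what each one sets.
conda config --show-sources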

@ptheywood (Member, Author) commented Mar 2, 2022

The WMLCE tensorflow-benchmarks/resnet50 benchmark script is not included in Open-CE. It is Apache 2 licensed by Nvidia, but with IBM modifications under the following restrictive / unclear licence, so I am not going to make this available for use outside of WMLCE and can't use it for a comparative benchmark.

Licensed Materials - Property of IBM
(C) Copyright IBM Corp. 2020. All Rights Reserved.
US Government Users Restricted Rights - Use, duplication or
disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

The existing WMLCE benchmark on RHEL 7 with ddlrun and 4 GPUs in a single node behaves as described, although it suspiciously took 15 seconds less than the 5 hours I requested (compared to the 4 hours previously suggested).

:::NVLOGv0.2.3 resnet 1646176253.965135813 (training_hooks.py:101) imgs_per_sec: 3737.334556206633
:::NVLOGv0.2.3 resnet 1646176253.967950821 (training_hooks.py:102) cross_entropy: 1.9711014032363892
:::NVLOGv0.2.3 resnet 1646176253.970749617 (training_hooks.py:103) l2_loss: 0.3724885880947113
:::NVLOGv0.2.3 resnet 1646176253.973543644 (training_hooks.py:104) total_loss: 2.343590021133423
:::NVLOGv0.2.3 resnet 1646176253.976323843 (training_hooks.py:105) learning_rate: 6.103515914901436e-08
:::NVLOGv0.2.3 resnet 1646176256.492153883 (training_hooks.py:112) epoch: 49
:::NVLOGv0.2.3 resnet 1646176256.495479345 (training_hooks.py:113) final_cross_entropy: 1.8775297403335571
:::NVLOGv0.2.3 resnet 1646176256.498734951 (training_hooks.py:114) final_l2_loss: 0.3724885582923889
:::NVLOGv0.2.3 resnet 1646176256.502001047 (training_hooks.py:115) final_total_loss: 2.250018358230591
:::NVLOGv0.2.3 resnet 1646176256.505250216 (training_hooks.py:116) final_learning_rate: 0.0
:::NVLOGv0.2.3 resnet 1646176265.872462511 (runner.py:488) Ending Model Training ...
:::NVLOGv0.2.3 resnet 1646176265.874462605 (runner.py:221) XLA is activated - Experimental Feature
:::NVLOGv0.2.3 resnet 1646176266.303135633 (runner.py:555) Starting Model Evaluation...
:::NVLOGv0.2.3 resnet 1646176266.304392338 (runner.py:556) Evaluation Epochs: 1.0
:::NVLOGv0.2.3 resnet 1646176266.305591822 (runner.py:557) Evaluation Steps: 195.0
:::NVLOGv0.2.3 resnet 1646176266.306783438 (runner.py:558) Decay Steps: 195.0
:::NVLOGv0.2.3 resnet 1646176266.307974577 (runner.py:559) Global Batch Size: 256
2022-03-01 23:11:09.188563: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 0 with properties: 
pciBusID: 0004:04:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.50GiB deviceMemoryBandwidth: 836.37GiB/s
2022-03-01 23:11:09.252778: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2022-03-01 23:11:09.357202: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2022-03-01 23:11:09.391700: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2022-03-01 23:11:09.391726: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2022-03-01 23:11:09.391747: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2022-03-01 23:11:09.391764: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2022-03-01 23:11:09.410918: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2022-03-01 23:11:09.413685: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Adding visible gpu devices: 0
2022-03-01 23:11:09.413734: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1099] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-03-01 23:11:09.413743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105]      0 
2022-03-01 23:11:09.413750: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1118] 0:   N 
2022-03-01 23:11:09.419813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1244] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30294 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0004:04:00.0, compute capability: 7.0)
2022-03-01 23:11:11.753567: I tensorflow/core/grappler/optimizers/generic_layout_optimizer.cc:345] Cancel Transpose nodes around Pad: transpose_before=resnet50_v1.5/input_reshape/transpose pad=resnet50_v1.5/conv2d/Pad transpose_after=resnet50_v1.5/conv2d/conv2d/Conv2D-0-TransposeNCHWToNHWC-LayoutOptimizer
:::NVLOGv0.2.3 resnet 1646176309.121007442 (runner.py:610) Top-1 Accuracy: 75.797
:::NVLOGv0.2.3 resnet 1646176309.122350454 (runner.py:611) Top-5 Accuracy: 92.817
:::NVLOGv0.2.3 resnet 1646176309.123546600 (runner.py:630) Ending Model Evaluation ...

---------------
Job output ends
=========================================================
SLURM job: finished date = Tue 1 Mar 23:11:50 GMT 2022
Total run time : 4 Hours 59 Minutes 45 Seconds
=========================================================

Attempting to run this via bede-ddlrun on the RHEL 8 image errored with the following:

No active IB device ports detected
[gpu013.bede.dur.ac.uk:98383] Error: common_pami.c:1087 - ompi_common_pami_init() 0: Unable to create 1 PAMI communication context(s) rc=1
No active IB device ports detected
[gpu013.bede.dur.ac.uk:98384] Error: common_pami.c:1087 - ompi_common_pami_init() 1: Unable to create 1 PAMI communication context(s) rc=1
No active IB device ports detected
[gpu013.bede.dur.ac.uk:98386] Error: common_pami.c:1087 - ompi_common_pami_init() 3: Unable to create 1 PAMI communication context(s) rc=1
No active IB device ports detected
[gpu013.bede.dur.ac.uk:98385] Error: common_pami.c:1087 - ompi_common_pami_init() 2: Unable to create 1 PAMI communication context(s) rc=1
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      gpu013
  Framework: pml
--------------------------------------------------------------------------
[gpu013.bede.dur.ac.uk:98383] PML pami cannot be selected
[gpu013.bede.dur.ac.uk:98357] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2079
[gpu013.bede.dur.ac.uk:98357] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2079
[gpu013.bede.dur.ac.uk:98357] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2079
[gpu013.bede.dur.ac.uk:98386] PML pami cannot be selected
[gpu013.bede.dur.ac.uk:98384] PML pami cannot be selected
[gpu013.bede.dur.ac.uk:98385] PML pami cannot be selected
[gpu013.bede.dur.ac.uk:98357] 3 more processes have sent help message help-mca-base.txt / find-available:none found
[gpu013.bede.dur.ac.uk:98357] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

---------------
Job output ends
=========================================================
SLURM job: finished date = Tue 1 Mar 18:22:12 GMT 2022
Total run time : 0 Hours 1 Minutes 29 Seconds
=========================================================

This appears to confirm that ddlrun/bede-ddlrun does not work on the RHEL 8 nodes, though I'll double-check this via Slack later.

Attempting to run the benchmark on RHEL 8 without ddlrun, using a single GPU in a single node, resulted in the job being killed due to OOM.

---------------
Job output ends
=========================================================
SLURM job: finished date = Tue 1 Mar 22:13:38 GMT 2022
Total run time : 0 Hours 52 Minutes 24 Seconds
=========================================================
slurmstepd: error: Detected 1 oom-kill event(s) in step 356980.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

I.e. it requires more than 1/4 of the node's memory, but otherwise WMLCE TensorFlow does appear to work on RHEL 8, though given that it is well past end of life, most users should migrate to Open-CE or upstream TensorFlow (and lose LMS).

The single GPU version on RHEL 7 also died due to OOM, but several hours further into the run.

To benchmark this correctly, requesting a full node but only exposing one device via CUDA_VISIBLE_DEVICES (or using an inner srun) might be required; a sketch of this is included after the log below. However, if this is not going to be directly comparable to a WMLCE or RHEL 8 benchmark of the same model, it's probably not worthwhile reproducing / benchmarking.

Finding a more recent / more open benchmark to run might be a better plan.

=========================================================
SLURM job: finished date = Wed 2 Mar 00:49:40 GMT 2022
Total run time : 3 Hours 15 Minutes 36 Seconds
=========================================================
slurmstepd: error: Detected 1 oom-kill event(s) in step 356990.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
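
For reference, a minimal sketch of the full-node, single-GPU approach mentioned above (the account, partition, time and script names are placeholders, and the exact Slurm options may differ on Bede):

#!/bin/bash
#SBATCH --account=<project>      # placeholder project/account code
#SBATCH --partition=gpu          # placeholder partition name
#SBATCH --nodes=1
#SBATCH --gres=gpu:4             # request all 4 GPUs of the node
#SBATCH --exclusive              # take the whole node, and with it all of the node's memory
#SBATCH --time=05:00:00

# Only expose a single device to the benchmark, despite the full-node allocation.
export CUDA_VISIBLE_DEVICES=0

python resnet50_benchmark.py     # placeholder for the actual benchmark invocation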

@ptheywood ptheywood marked this pull request as ready for review March 7, 2022 20:40
+ Adds Open-CE documentation page
  + Marks as successor to WMLCE
  + Lists the key features no longer available from WMLCE
  + Describes why to use Open-CE
  + Provides instructions for installing Open-CE packages into conda environments
+ Updates TensorFlow page to refer to/use Open-CE not WMLCE
  + Replaces quickstart with installation via conda section
+ Updates PyTorch page to refer to/use Open-CE not WMLCE
  + Replaces quickstart with installation via conda section
+ Updates WMLCE page
  + Refer to Open-CE as successor, emphasising that WMLCE is deprecated / no longer supported
  + Update/Tweak tensorflow-benchmarks resnet50 usage+description.
+ Expands Conda documentation
  + Includes updating the installation instructions to source the preferred etc/profile.d/conda.sh (see the sketch below)
    + https://github.com/conda/conda/blob/master/CHANGELOG.md#recommended-change-to-enable-conda-in-your-shell
  + conda python version selection should only use a single '='
+ Updates usage page emphasising ddlrun is not supported on RHEL 8

This does not include benchmarking of Open-CE, or RHEL 7/8 comparisons of the WMLCE benchmark, due to ddlrun errors on RHEL 8.

Closes #63
Closes #72
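
As an illustration of the conda changes above, a minimal sketch (the Miniconda install prefix and environment name are placeholders):

# Activate conda by sourcing the preferred etc/profile.d/conda.sh rather than
# modifying PATH directly; <conda-prefix> is a placeholder for the install location.
source <conda-prefix>/etc/profile.d/conda.sh
conda activate

# Select a python version with a single '=' (any matching 3.9.x release),
# rather than '==' which pins an exact version.
conda create -y -n example-env python=3.9
conda activate example-env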
@ptheywood ptheywood requested review from loveshack and a team March 7, 2022 20:42
@ptheywood ptheywood mentioned this pull request Mar 7, 2022
@loveshack (Collaborator) commented Mar 8, 2022 via email

@jsteyn jsteyn merged commit 0782bad into main Apr 19, 2022
@ptheywood ptheywood deleted the wmlce-opence branch April 20, 2022 09:21