
Revised WMLCE + Open-CE documentation. #102

Merged
3 commits merged from wmlce-opence into main on Apr 19, 2022
Conversation

@ptheywood (Member) commented Feb 17, 2022

Improves the state of the WMLCE documentation, adds Open-CE documentation, and updates the TensorFlow/PyTorch pages to reflect this change.

  • Updates WMLCE page
    • Document current project status (Deprecated / unsupported)
    • Clearly state that users should probably switch to Open-CE or upstream distributions
    • Verify if it is still usable on RHEL 8 or not, update accordingly
      • Errors occurred when attempting to use bede-ddlrun on RHEL 8.
    • Update usage instructions to no longer be [Possibly Out of Date].
    • Update WMLCE resnet50 benchmark section
      • This is only available via WMLCE, with a licence which may prevent distribution outside of WMLCE, so it has not been re-run to generate results.
      • ddlrun errors on RHEL 8 as expected, so it is not useful for comparing WMLCE on RHEL 7 vs 8.
  • Adds a page documenting Open-CE, the successor to WMLCE.
    • Basic Description
    • Usage with example
      • Verify instructions
    • Why should someone use Open-CE rather than upstream TensorFlow/PyTorch?
    • Clear description of missing WMLCE features (LMS, DDLRUN?, Others?)
    • Benchmarking, use the resnet benchmark from above on <= 4 nodes?
      • Not benchmarking, as WMLCE's tensorflow-benchmarks is not openly licensed by IBM, and ddlrun doesn't work on RHEL 8.
    • Cross-reference TensorFlow, PyTorch and WMLCE.
  • Add cross-references to Open-CE from the TensorFlow, Conda and PyTorch pages

Closes #63
Closes #72

@ptheywood (Member, Author) commented:

This may also have to include updates to general miniconda installation instructions.

The current instructions for installing into /nobackup/projects/<project> will actually install into /users/.

sh Miniconda3-latest-Linux-ppc64le.sh -b -p $(pwd)/miniconda

The above will install silently into a miniconda directory within the current directory, and will not update the user's .bashrc, which may or may not be desirable.
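
For comparison, a minimal sketch of a silent install into the project's /nobackup area (this assumes the standard Miniconda download URL, and <project> is a placeholder for the actual project code):

# Download the ppc64le Miniconda installer.
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-ppc64le.sh

# -b installs silently (no licence prompt, no .bashrc changes); -p sets an
# explicit prefix under the project directory rather than the current directory.
sh Miniconda3-latest-Linux-ppc64le.sh -b -p /nobackup/projects/<project>/miniconda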

@ptheywood (Member, Author) commented:

Additionally, the current WMLCE instructions add the WMLCE channel to the user's global conda channel configuration, not per-environment.

This breaks subsequent use of Open-CE. Instructions on how to deal with this will likely be needed (i.e. if an UnsatisfiableError is raised).

It would also be better to adjust the WMLCE instructions to only set the channel within the environment. The same applies to Open-CE (via the --env flag on conda config?).
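
As a minimal sketch of the per-environment approach (the environment name and <channel> are placeholders; conda config --env writes to the active environment's .condarc rather than the global one):

# Create and activate an environment first, so that --env targets it.
conda create -y -n opence python=3.9
conda activate opence

# Add the channel to this environment's .condarc only, not the global config.
# <channel> is a placeholder for the WMLCE or Open-CE channel in question.
conda config --env --prepend channels <channel>

# Show which .condarc files are in effect and what each one sets.
conda config --show-sources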

@ptheywood (Member, Author) commented Mar 2, 2022

The WMLCE tensorflow-benchmarks/resnet50 benchmark script is not included in Open-CE. It is Apache 2 licensed by Nvidia, but with IBM modifications under the following restrictive / unclear licence, so I am not going to make this available for use outside of WMLCE and can't use it for a comparative benchmark.

Licensed Materials - Property of IBM
(C) Copyright IBM Corp. 2020. All Rights Reserved.
US Government Users Restricted Rights - Use, duplication or
disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

The existing WMLCE benchmark on RHEL 7 with ddlrun and 4 GPUs in a single node behaves as described, although it suspiciously took 15 seconds less than the 5 hours I requested (compared to the 4 hours previously suggested).

:::NVLOGv0.2.3 resnet 1646176253.965135813 (training_hooks.py:101) imgs_per_sec: 3737.334556206633
:::NVLOGv0.2.3 resnet 1646176253.967950821 (training_hooks.py:102) cross_entropy: 1.9711014032363892
:::NVLOGv0.2.3 resnet 1646176253.970749617 (training_hooks.py:103) l2_loss: 0.3724885880947113
:::NVLOGv0.2.3 resnet 1646176253.973543644 (training_hooks.py:104) total_loss: 2.343590021133423
:::NVLOGv0.2.3 resnet 1646176253.976323843 (training_hooks.py:105) learning_rate: 6.103515914901436e-08
:::NVLOGv0.2.3 resnet 1646176256.492153883 (training_hooks.py:112) epoch: 49
:::NVLOGv0.2.3 resnet 1646176256.495479345 (training_hooks.py:113) final_cross_entropy: 1.8775297403335571
:::NVLOGv0.2.3 resnet 1646176256.498734951 (training_hooks.py:114) final_l2_loss: 0.3724885582923889
:::NVLOGv0.2.3 resnet 1646176256.502001047 (training_hooks.py:115) final_total_loss: 2.250018358230591
:::NVLOGv0.2.3 resnet 1646176256.505250216 (training_hooks.py:116) final_learning_rate: 0.0
:::NVLOGv0.2.3 resnet 1646176265.872462511 (runner.py:488) Ending Model Training ...
:::NVLOGv0.2.3 resnet 1646176265.874462605 (runner.py:221) XLA is activated - Experimental Feature
:::NVLOGv0.2.3 resnet 1646176266.303135633 (runner.py:555) Starting Model Evaluation...
:::NVLOGv0.2.3 resnet 1646176266.304392338 (runner.py:556) Evaluation Epochs: 1.0
:::NVLOGv0.2.3 resnet 1646176266.305591822 (runner.py:557) Evaluation Steps: 195.0
:::NVLOGv0.2.3 resnet 1646176266.306783438 (runner.py:558) Decay Steps: 195.0
:::NVLOGv0.2.3 resnet 1646176266.307974577 (runner.py:559) Global Batch Size: 256
2022-03-01 23:11:09.188563: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 0 with properties: 
pciBusID: 0004:04:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.50GiB deviceMemoryBandwidth: 836.37GiB/s
2022-03-01 23:11:09.252778: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2022-03-01 23:11:09.357202: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2022-03-01 23:11:09.391700: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2022-03-01 23:11:09.391726: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2022-03-01 23:11:09.391747: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2022-03-01 23:11:09.391764: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2022-03-01 23:11:09.410918: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2022-03-01 23:11:09.413685: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Adding visible gpu devices: 0
2022-03-01 23:11:09.413734: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1099] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-03-01 23:11:09.413743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105]      0 
2022-03-01 23:11:09.413750: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1118] 0:   N 
2022-03-01 23:11:09.419813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1244] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30294 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0004:04:00.0, compute capability: 7.0)
2022-03-01 23:11:11.753567: I tensorflow/core/grappler/optimizers/generic_layout_optimizer.cc:345] Cancel Transpose nodes around Pad: transpose_before=resnet50_v1.5/input_reshape/transpose pad=resnet50_v1.5/conv2d/Pad transpose_after=resnet50_v1.5/conv2d/conv2d/Conv2D-0-TransposeNCHWToNHWC-LayoutOptimizer
:::NVLOGv0.2.3 resnet 1646176309.121007442 (runner.py:610) Top-1 Accuracy: 75.797
:::NVLOGv0.2.3 resnet 1646176309.122350454 (runner.py:611) Top-5 Accuracy: 92.817
:::NVLOGv0.2.3 resnet 1646176309.123546600 (runner.py:630) Ending Model Evaluation ...

---------------
Job output ends
=========================================================
SLURM job: finished date = Tue 1 Mar 23:11:50 GMT 2022
Total run time : 4 Hours 59 Minutes 45 Seconds
=========================================================

Attempting to run this via bede-ddlrun on the RHEL 8 image errored with the following:

No active IB device ports detected
[gpu013.bede.dur.ac.uk:98383] Error: common_pami.c:1087 - ompi_common_pami_init() 0: Unable to create 1 PAMI communication context(s) rc=1
No active IB device ports detected
[gpu013.bede.dur.ac.uk:98384] Error: common_pami.c:1087 - ompi_common_pami_init() 1: Unable to create 1 PAMI communication context(s) rc=1
No active IB device ports detected
[gpu013.bede.dur.ac.uk:98386] Error: common_pami.c:1087 - ompi_common_pami_init() 3: Unable to create 1 PAMI communication context(s) rc=1
No active IB device ports detected
[gpu013.bede.dur.ac.uk:98385] Error: common_pami.c:1087 - ompi_common_pami_init() 2: Unable to create 1 PAMI communication context(s) rc=1
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      gpu013
  Framework: pml
--------------------------------------------------------------------------
[gpu013.bede.dur.ac.uk:98383] PML pami cannot be selected
[gpu013.bede.dur.ac.uk:98357] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2079
[gpu013.bede.dur.ac.uk:98357] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2079
[gpu013.bede.dur.ac.uk:98357] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2079
[gpu013.bede.dur.ac.uk:98386] PML pami cannot be selected
[gpu013.bede.dur.ac.uk:98384] PML pami cannot be selected
[gpu013.bede.dur.ac.uk:98385] PML pami cannot be selected
[gpu013.bede.dur.ac.uk:98357] 3 more processes have sent help message help-mca-base.txt / find-available:none found
[gpu013.bede.dur.ac.uk:98357] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

---------------
Job output ends
=========================================================
SLURM job: finished date = Tue 1 Mar 18:22:12 GMT 2022
Total run time : 0 Hours 1 Minutes 29 Seconds
=========================================================

This appears to confirm that ddlrun/bede-ddlrun does not work on the RHEL 8 nodes, though I'll double-check this via Slack later.

Attempting to run the benchmark on RHEL 8 without ddlrun, using a single GPU in a single node, resulted in the job being killed due to OOM.

---------------
Job output ends
=========================================================
SLURM job: finished date = Tue 1 Mar 22:13:38 GMT 2022
Total run time : 0 Hours 52 Minutes 24 Seconds
=========================================================
slurmstepd: error: Detected 1 oom-kill event(s) in step 356980.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

I.e. it requires more than 1/4 of the node's memory, but otherwise WMLCE TensorFlow does appear to work on RHEL 8, though given that it is well past end of life, most users should migrate to Open-CE or upstream TensorFlow (and lose LMS).

The single GPU version on RHEL 7 also died due to OOM, but several hours further into the run.

To benchmark this correctly, requesting a full node but only exposing one device via CUDA_VISIBLE_DEVICES (or using an inner srun) might be required; a sketch of this is included after the log below. However, if this is not going to be directly comparable to a WMLCE or RHEL 8 benchmark of the same model, it's probably not worthwhile reproducing / benchmarking.

Finding a more recent / more open benchmark to run might be a better plan.

=========================================================
SLURM job: finished date = Wed 2 Mar 00:49:40 GMT 2022
Total run time : 3 Hours 15 Minutes 36 Seconds
=========================================================
slurmstepd: error: Detected 1 oom-kill event(s) in step 356990.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
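
For reference, a minimal sketch of the full-node, single-GPU approach mentioned above (the account, partition, time and script names are placeholders, and the exact Slurm options may differ on Bede):

#!/bin/bash
#SBATCH --account=<project>      # placeholder project/account code
#SBATCH --partition=gpu          # placeholder partition name
#SBATCH --nodes=1
#SBATCH --gres=gpu:4             # request all 4 GPUs of the node
#SBATCH --exclusive              # take the whole node, and with it all of the node's memory
#SBATCH --time=05:00:00

# Only expose a single device to the benchmark, despite the full-node allocation.
export CUDA_VISIBLE_DEVICES=0

python resnet50_benchmark.py     # placeholder for the actual benchmark invocation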

@ptheywood ptheywood marked this pull request as ready for review March 7, 2022 20:40
+ Adds Open-CE documentation page
  + Marks as successor to WMLCE
  + Lists the key features no longer available from WMLCE
  + Describes why to use Open-CE
  + Provides instructions for installing Open-CE packages into conda environments
+ Updates TensorFlow page to refer to/use Open-CE not WMLCE
  + Replaces quickstart with installation via conda section
+ Updates PyTorch page to refer to/use Open-CE not WMLCE
  + Replaces quickstart with installation via conda section
+ Updates WMLCE page
  + Refer to Open-CE as successor, emphasising that WMLCE is deprecated / no longer supported
  + Update/Tweak tensorflow-benchmarks resnet50 usage+description.
+ Expands Conda documentation
  + Includes updating the installation instructions to source the preferred etc/profile.d/conda.sh (see the sketch below)
    + https://github.com/conda/conda/blob/master/CHANGELOG.md#recommended-change-to-enable-conda-in-your-shell
  + conda python version selection should only use a single '='
+ Updates usage page emphasising ddlrun is not supported on RHEL 8

This does not include benchmarking of Open-CE, or RHEL 7/8 comparisons of the WMLCE benchmark, due to ddlrun errors on RHEL 8.

Closes #63
Closes #72
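
As an illustration of the conda changes above, a minimal sketch (the Miniconda install prefix and environment name are placeholders):

# Activate conda by sourcing the preferred etc/profile.d/conda.sh rather than
# modifying PATH directly; <conda-prefix> is a placeholder for the install location.
source <conda-prefix>/etc/profile.d/conda.sh
conda activate

# Select a python version with a single '=' (any matching 3.9.x release),
# rather than '==' which pins an exact version.
conda create -y -n example-env python=3.9
conda activate example-env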
@ptheywood ptheywood requested review from loveshack and a team March 7, 2022 20:42
@ptheywood ptheywood mentioned this pull request Mar 7, 2022
@loveshack (Collaborator) commented Mar 8, 2022 via email

@jsteyn jsteyn merged commit 0782bad into main Apr 19, 2022
@ptheywood ptheywood deleted the wmlce-opence branch April 20, 2022 09:21