
Update ml-slurm examples to use recent copies of pytorch and tensorflow #3226

Merged
merged 3 commits into GoogleCloudPlatform:develop on Nov 6, 2024

Conversation

tpdownes (Member) commented on Nov 6, 2024

Adopt recent versions of pytorch and tensorflow from pip, which have improved predictability of CUDA adoption.
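The "Using device: cpu" / "Using device: cuda" lines in the outputs below suggest the example script selects its device based on CUDA availability. A minimal sketch of that pattern is shown here; the exact contents of torch_test.py are not part of this PR description, so this is an assumption, and the sketch falls back gracefully when torch is not installed so it stays runnable anywhere.

```python
# Hedged sketch of the device-selection logic implied by the example output.
# Assumes torch is installed; falls back to CPU-only reporting if it is not.
try:
    import torch

    # Prefer CUDA when the runtime and a GPU are both present.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device}")
    if device == "cuda":
        # Matches the "NVIDIA L4" / "NVIDIA A100-SXM4-40GB" lines below.
        print(torch.cuda.get_device_name(0))
except ImportError:
    device = "cpu"
    print("Using device: cpu (torch not installed; assuming CPU)")
```

On the login node (no GPU) this prints the CPU branch; on the g2 and a2 nodes it would report the attached accelerator.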

Example outputs

Running example on login node (CPU)

ext_tpdownes_google_com@mlexamplev-slurm-login-001:~$ conda activate pytorch
(pytorch) ext_tpdownes_google_com@mlexamplev-slurm-login-001:~$ python3 torch_test.py
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Using device: cpu
<torch.utils.benchmark.utils.common.Measurement object at 0x7f6df96daa90>
batched_dot_mul_sum(x, x)
setup: from __main__ import batched_dot_mul_sum
  350.64 us
  1 measurement, 100 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7f6dfadd88d0>
batched_dot_bmm(x, x)
setup: from __main__ import batched_dot_bmm
  659.48 us

Running example on g2 node

Using device: cuda
NVIDIA L4
<torch.utils.benchmark.utils.common.Measurement object at 0x7fe93b7f5310>
batched_dot_mul_sum(x, x)
setup: from __main__ import batched_dot_mul_sum
  420.50 us
  1 measurement, 100 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7fe93bdc2990>
batched_dot_bmm(x, x)
setup: from __main__ import batched_dot_bmm
  863.33 us
  1 measurement, 100 runs , 1 thread

Running example on a2 node

Using device: cuda
NVIDIA A100-SXM4-40GB
<torch.utils.benchmark.utils.common.Measurement object at 0x7f18773cbd50>
batched_dot_mul_sum(x, x)
setup: from __main__ import batched_dot_mul_sum
  419.16 us
  1 measurement, 100 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7f1877b00550>
batched_dot_bmm(x, x)
setup: from __main__ import batched_dot_bmm
  866.42 us
  1 measurement, 100 runs , 1 thread
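The timings above can be compared directly. The short pure-Python sketch below (numbers copied from the outputs in this PR) computes each GPU run's speedup relative to the login-node CPU run; ratios below 1.0 mean the GPU run was slower, which is plausible for a workload this small, where kernel-launch overhead dominates. The interpretation that this example is a smoke test of CUDA device selection rather than a performance benchmark is my reading, not a claim made in the PR.

```python
# Timings (microseconds per call) copied from the example outputs above.
timings_us = {
    "cpu":  {"batched_dot_mul_sum": 350.64, "batched_dot_bmm": 659.48},
    "l4":   {"batched_dot_mul_sum": 420.50, "batched_dot_bmm": 863.33},
    "a100": {"batched_dot_mul_sum": 419.16, "batched_dot_bmm": 866.42},
}

def speedup_vs_cpu(device: str, kernel: str) -> float:
    """CPU time divided by device time; values > 1.0 mean faster than CPU."""
    return timings_us["cpu"][kernel] / timings_us[device][kernel]

for device in ("l4", "a100"):
    for kernel in ("batched_dot_mul_sum", "batched_dot_bmm"):
        print(f"{device:5s} {kernel}: {speedup_vs_cpu(device, kernel):.2f}x")
```

All four ratios come out below 1.0 for these inputs, consistent with the GPU runs confirming correct CUDA adoption rather than demonstrating acceleration.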

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

Adopt recent versions of pytorch and tensorflow from pip which have
improved predictability of CUDA adoption.
@tpdownes added the release-version-updates label (shown in release notes under the "Version Updates" heading) on Nov 6, 2024
harshthakkar01 (Contributor) commented:

g2g after test passes.

@tpdownes merged commit c06fa10 into GoogleCloudPlatform:develop on Nov 6, 2024
11 of 57 checks passed
@tpdownes deleted the fix_ml_slurm branch on November 6, 2024 at 19:44
@rohitramu mentioned this pull request on Nov 20, 2024