Enable autograd graph to propagate after multi-device syncing (for loss functions in `ddp`) #2754

cw-tan · 2024-09-17T18:08:54Z

What does this PR do?

Single-line enhancement proposed in #2745: enable the autograd graph to propagate through the `all_gather` operation. This is useful for syncing loss functions in a `ddp` setting.

Before submitting

Was this discussed/agreed via a Github issue? (no need for typos and docs improvements)
Did you read the contributor guideline, Pull Request section?
Did you make sure to update the docs?
Did you write any new necessary tests?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

📚 Documentation preview 📚: https://torchmetrics--2754.org.readthedocs.build/en/2754/

Borda · 2024-09-17T18:23:11Z

That sounds good to me, but can we add a test for this enhancement?

cw-tan · 2024-09-17T18:42:25Z

> That sounds good to me, but can we add a test for this enhancement?

Thanks for the prompt response @Borda.

I'm thinking that `_test_ddp_gather_uneven_tensors` and `_test_ddp_gather_uneven_tensors_multidim` in `tests/unittests/bases/test_ddp.py` already cover the correctness of `gather_all_tensors`. I'm not sure what other ddp tests there are, but those tests should tell us whether the change I made breaks existing functionality. Let me know if you had something else in mind for this.

I can add a unittest in `tests/unittests/bases/test_ddp.py` that passes a tensor with `requires_grad=True` to `gather_all_tensors`, computes a scalar from the gathered tensors (a proxy for a loss), and computes gradients two ways (one going through the `all_gather`, one that doesn't) and compares them. This would test that the change achieves the desired effect. How does that sound?
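To make the proposed check concrete, here is a minimal single-process sketch of the idea. It is not the test added in this PR: `simulated_gather` is a hypothetical stand-in for `gather_all_tensors` that mimics a collective returning buffers without autograd history, and the rank, world size, and scaling factor are illustrative assumptions.

```python
import torch


def simulated_gather(local: torch.Tensor, rank: int, world_size: int) -> list:
    """Hypothetical stand-in for a gather across ranks, run in one process."""
    # A collective fills plain buffers, so the results carry no autograd history.
    with torch.no_grad():
        gathered = [local.clone() for _ in range(world_size)]
    # Put the local tensor back so this rank's entry keeps its graph.
    gathered[rank] = local
    return gathered


def grads_two_ways(rank: int = 0, world_size: int = 2):
    x = torch.arange(1.0, 5.0, requires_grad=True)
    a = 3.0  # illustrative scaling factor

    # Way 1: the scalar is computed from the gathered list.
    loss_gathered = torch.stack(simulated_gather(a * x, rank, world_size)).sum()
    grad_gathered = torch.autograd.grad(loss_gathered, x)[0]

    # Way 2: the scalar is computed from the local tensor only.
    loss_local = (a * x).sum()
    grad_local = torch.autograd.grad(loss_local, x)[0]
    return grad_gathered, grad_local


grad_gathered, grad_local = grads_two_ways()
```

Only the local rank's entry contributes a gradient, so both ways should agree if the graph survives the gather.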

codecov · 2024-09-17T19:01:40Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68%. Comparing base (d87aff7) to head (2033395).

Additional details and impacted files

@@           Coverage Diff           @@
##           master   #2754    +/-   ##
=======================================
- Coverage      68%     68%    -1%     
=======================================
  Files         324     324            
  Lines       18164   18168     +4     
=======================================
- Hits        12417   12314   -103     
- Misses       5747    5854   +107

Borda · 2024-09-17T19:04:33Z

> I can make an additional unittest in tests/unittests/bases/test_ddp.py to give a tensor that requires_grad to gather_all_tensors, compute some scalar from them (proxy for a loss), and compute grads two ways (one going through the all_gather, one that doesn't) and compare. So this tests that the change achieves the desired effect. How does that sound?

yeah, that sounds good to me :)

cw-tan · 2024-09-18T04:27:50Z

Update: to accommodate both cases where tensors from different ranks have the same/different shape, the line to put the original tensor (holding the AD graph) back into the gathered list was added in two places in the code.
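For the different-shape case, a rough sketch of how the same "put the original tensor back" step could sit after a pad, gather, and trim sequence. This is an assumption-laden illustration, not the code in `src/torchmetrics/utilities/distributed.py`: `simulated_uneven_gather` and its `all_sizes` argument are hypothetical, and the collective is replaced by local copies so the snippet runs in one process.

```python
import torch
import torch.nn.functional as F


def simulated_uneven_gather(local: torch.Tensor, rank: int, all_sizes: list) -> list:
    """Hypothetical single-process stand-in for gathering 1D tensors of unequal length."""
    max_len = max(all_sizes)
    with torch.no_grad():
        # Pad the local tensor on the right up to the largest length.
        padded = F.pad(local, (0, max_len - local.shape[0]))
        # Stand-in for the collective: one padded buffer per rank.
        buffers = [padded.clone() for _ in all_sizes]
        # Trim each buffer back to that rank's true length.
        trimmed = [buf[:n] for buf, n in zip(buffers, all_sizes)]
    # Re-insert the untouched local tensor so its autograd graph is kept.
    trimmed[rank] = local
    return trimmed


x = torch.ones(3, requires_grad=True)
out = simulated_uneven_gather(2 * x, rank=0, all_sizes=[3, 5])
loss = sum(t.sum() for t in out)
grad = torch.autograd.grad(loss, x)[0]
```

Because padding and trimming happen on copies without gradient tracking, only the re-inserted entry connects the loss back to `x`.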

Because of the two cases, I wrote two unittests, one for each. Interestingly, both pass on 2.X stable, but on 1.X LTS the "same shape" test passes while the "different shape" test fails, and on 1.10 oldest the "different shape" test passes while the "same shape" test fails 😅. I'll double-check for bugs, but the actual code change is just two lines (and all other tests pass, so existing functionality still works), and the unittests are pretty short. Since which test passes depends on the torch version, this might be a torch versioning issue, maybe to do with ddp behavior? Any thoughts, @Borda?

Borda · 2024-09-19T09:16:26Z

> I wrote two unittests to account for each. Interestingly, both pass 2.X stable, but for 1.X LTS, the "same shape" test passes but "different shape" test fails, and for 1.10 oldest, the "different shape" test passes but "same shape" test fails😅.

That is strange and worth some more investigation...
cc: @SkafteNicki

SkafteNicki

I looked briefly into why the tests do not pass on older versions of PyTorch but could not find a reason.

I think we should only support this for PyTorch > 2.0 and then add this to the documentation.
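One way such a version restriction could be expressed is sketched below. The helper names are hypothetical, and the sketch assumes that "PyTorch > 2.0" is read as release 2.1 or newer; the thread does not spell out the exact cutoff.

```python
import re


def parse_major_minor(version: str) -> tuple:
    """Extract (major, minor) from a version string such as '2.1.0+cu118'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def autograd_gather_supported(torch_version: str) -> bool:
    # Assumption: "> 2.0" means major.minor strictly greater than (2, 0).
    return parse_major_minor(torch_version) > (2, 0)
```

In practice the string passed in would be `torch.__version__`, and a test could be skipped when the helper returns `False`.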

src/torchmetrics/utilities/distributed.py

tests/unittests/bases/test_ddp.py

SkafteNicki

It seems the two test functions are now included twice in the `test_ddp.py` file, please check.

src/torchmetrics/utilities/distributed.py

SkafteNicki · 2024-10-10T08:43:19Z

@cw-tan this is really strange. I am trying to debug this locally and I am seeing the tests fail at random. E.g., if I run them 10 times in a row I get an output from pytest like this:

FAILED tests/unittests/bases/test_ddp.py::test_ddp_autograd[_test_ddp_gather_autograd_same_shape-1-10] - AssertionError
FAILED tests/unittests/bases/test_ddp.py::test_ddp_autograd[_test_ddp_gather_autograd_same_shape-3-10] - AssertionError
FAILED tests/unittests/bases/test_ddp.py::test_ddp_autograd[_test_ddp_gather_autograd_same_shape-7-10] - AssertionError
FAILED tests/unittests/bases/test_ddp.py::test_ddp_autograd[_test_ddp_gather_autograd_same_shape-9-10] - AssertionError
FAILED tests/unittests/bases/test_ddp.py::test_ddp_autograd[_test_ddp_gather_autograd_different_shape-5-10] - AssertionError
FAILED tests/unittests/bases/test_ddp.py::test_ddp_autograd[_test_ddp_gather_autograd_different_shape-6-10] - AssertionError
FAILED tests/unittests/bases/test_ddp.py::test_ddp_autograd[_test_ddp_gather_autograd_different_shape-8-10] - AssertionError
FAILED tests/unittests/bases/test_ddp.py::test_ddp_autograd[_test_ddp_gather_autograd_different_shape-9-10] - AssertionError
FAILED tests/unittests/bases/test_ddp.py::test_ddp_autograd[_test_ddp_gather_autograd_different_shape-10-10] - AssertionError

with 4/10 of the "same shape" runs failing and 5/10 of the "different shape" runs failing. But I cannot see any randomization going on in the tests.
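When comparing repeated runs like these, it can help to tally the failures per test case instead of counting by eye. The sketch below is a hypothetical helper, not part of the PR; it assumes test ids of the form `[<case>-<run>-<total>]`, as in the report above.

```python
import re
from collections import Counter

# Matches lines such as "FAILED path::test_name[<case>-<run>-<total>] - AssertionError".
FAILED_LINE = re.compile(r"^FAILED \S+::\w+\[(?P<case>\w+)-(?P<run>\d+)-(?P<total>\d+)\]")


def tally_failures(report: str) -> dict:
    """Count FAILED lines per parametrized test case."""
    counts = Counter()
    for line in report.splitlines():
        match = FAILED_LINE.match(line.strip())
        if match:
            counts[match["case"]] += 1
    return dict(counts)


# Three lines taken from the report above, as sample input.
report = """\
FAILED tests/unittests/bases/test_ddp.py::test_ddp_autograd[_test_ddp_gather_autograd_same_shape-1-10] - AssertionError
FAILED tests/unittests/bases/test_ddp.py::test_ddp_autograd[_test_ddp_gather_autograd_same_shape-3-10] - AssertionError
FAILED tests/unittests/bases/test_ddp.py::test_ddp_autograd[_test_ddp_gather_autograd_different_shape-5-10] - AssertionError
"""
failures = tally_failures(report)
```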

cw-tan · 2024-10-10T23:36:56Z

@SkafteNicki indeed this is an odd one. After adding the `with torch.no_grad():` in my recent commit, only the "different shape" test failed. Were both the "same shape" and "different shape" tests failing before? I'm thinking it might be worth trying to cover more of the code with `torch.no_grad()`, leaving out only the parts where we want the autograd graph to propagate. Though I don't actually know a priori why it would help.

cw-tan · 2024-10-11T04:09:06Z

@SkafteNicki sorry for the mess, I'm rerunning the CI tests on all torch versions, this time with several trials (to check for nondeterminism) and with the `no_grad` changes I made.

SkafteNicki · 2024-10-11T09:12:52Z

> @SkafteNicki sorry for the mess, I'm just trying to use the CI tests on all torch versions again but hopefully incorporating several trials (to check for indeterminism) and with the no_grad changes I made.

@cw-tan That is completely okay, whatever it takes to debug the issue. If you want to run the tests multiple times locally, I recommend installing pytest-repeat:
https://pypi.org/project/pytest-repeat/
and then running pytest with
pytest --count=X
where X is the number of repetitions.

cw-tan requested review from SkafteNicki, Borda, justusschock and stancld as code owners September 17, 2024 18:08

Borda added the enhancement New feature or request label Sep 17, 2024

cw-tan force-pushed the all_gather_ad branch 4 times, most recently from 6c926d7 to 1d0dabe Compare September 18, 2024 02:54

SkafteNicki reviewed Oct 8, 2024

src/torchmetrics/utilities/distributed.py

tests/unittests/bases/test_ddp.py (outdated)

propagate rank result to gathered result for autograd compatibility

8122e9f

cw-tan force-pushed the all_gather_ad branch 2 times, most recently from dc35370 to e693ace Compare October 8, 2024 16:28

Borda requested a review from SkafteNicki October 8, 2024 17:26

cw-tan force-pushed the all_gather_ad branch 2 times, most recently from ce5dca1 to ffc67f6 Compare October 8, 2024 18:47

add unittest for dpp gather autograd compatibility

c2b6d19

cw-tan force-pushed the all_gather_ad branch from ffc67f6 to c2b6d19 Compare October 8, 2024 19:24

SkafteNicki reviewed Oct 9, 2024

src/torchmetrics/utilities/distributed.py

Merge branch 'master' into all_gather_ad

7dec9b4

SkafteNicki added this to the v1.4.x milestone Oct 9, 2024

SkafteNicki added 2 commits October 9, 2024 10:06

changelog

d1e64e4

add to docs

fc366b8

SkafteNicki requested a review from lantiga as a code owner October 9, 2024 08:35

mergify bot removed the ready label Oct 10, 2024

try no_grad for the all gather

f854bf2

cw-tan and others added 3 commits October 10, 2024 23:59

retry with all tested torch versions

25ffff2

[pre-commit.ci] auto fixes from pre-commit.com hooks

e82c70e

for more information, see https://pre-commit.ci

incorporate trials

b5f285d

mergify bot added the has conflicts label Oct 11, 2024

Merge branch 'master' into all_gather_ad

4e1e836

mergify bot removed the has conflicts label Oct 12, 2024

Borda and others added 6 commits October 14, 2024 20:10

Merge branch 'master' into all_gather_ad

5b9f79d

Merge branch 'master' into all_gather_ad

5164e1d

lint

91cff5e

Merge branch 'master' into all_gather_ad

8fdc912

try adding contiguous

4c13d6c

Merge branch 'master' into all_gather_ad

74bf6b2

mergify bot added the has conflicts label Oct 18, 2024

Merge branch 'master' into all_gather_ad

00935f1

mergify bot removed the has conflicts label Oct 18, 2024

cw-tan and others added 3 commits October 18, 2024 11:39

try using float64

150251c

Merge branch 'master' into all_gather_ad

70967ba

try using random numbers

9b17d6f

SkafteNicki modified the milestones: v1.4.x, v1.5.x Oct 21, 2024

Borda added 3 commits October 21, 2024 17:23

Merge branch 'master' into all_gather_ad

6e476ea

Merge branch 'master' into all_gather_ad

c20f07c

Merge branch 'master' into all_gather_ad

2033395

Borda requested a review from baskrahmer October 25, 2024 07:53
