[REVIEW]Multi-trainers cugraph-DGL examples #3212

Merged (18 commits) on Apr 2, 2023

Member

@VibhuJawa VibhuJawa commented Feb 1, 2023

This PR adds a working example of a multi-GPU graph (hosted on 2 Dask workers) being loaded and trained on by multiple (3) PyTorch trainers.
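As a rough illustration of what "multiple trainers" implies for the data side, here is a minimal sketch of splitting seed (training) node IDs across trainer ranks. This is not code from the PR: `shard_for_trainer`, `rank`, and `world_size` are hypothetical names, and the strided split is only one common convention for giving each trainer a disjoint share of the seeds.

```python
def shard_for_trainer(seed_ids, rank, world_size):
    """Return the strided slice of seed_ids handled by trainer `rank`.

    Hypothetical helper: every trainer gets a disjoint subset, and the
    subsets together cover all seeds exactly once.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return seed_ids[rank::world_size]


# Toy example: 10 training nodes split across 3 trainers.
train_ids = list(range(10))
shards = [shard_for_trainer(train_ids, rank, 3) for rank in range(3)]
for rank, shard in enumerate(shards):
    print(f"trainer {rank}: {shard}")
```

Each trainer would then request samples only for its own shard, while the graph itself stays on the shared Dask cluster.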

Todo:

  • Verify it works on multiple trainers and multiple dask workers
  • Show scaling as you increase training GPUs

At roughly 1 second per epoch we become bottlenecked by the Dask sampling cluster, but we do see a performance improvement going from 1 GPU to 2 GPUs.
Results on OGBN-Products:

| Number of Training GPUs | Time per epoch |
|-------------------------|----------------|
| 1                       | 2.3 s          |
| 2                       | 0.582 s        |
| 4                       | 0.792 s        |
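To make the scaling in the table easier to read, the following sketch computes speedup and parallel efficiency relative to the 1-GPU run. The epoch times are copied from the table; `scaling_report` is a hypothetical helper written for this illustration, not part of the PR.

```python
# Epoch times copied from the table above (seconds per epoch).
epoch_times = {1: 2.3, 2: 0.582, 4: 0.792}


def scaling_report(times):
    """Return {n_gpus: (speedup, efficiency)} relative to the 1-GPU run."""
    baseline = times[1]
    report = {}
    for n_gpus, seconds in sorted(times.items()):
        speedup = baseline / seconds
        # Efficiency of 1.0 would mean perfectly linear scaling.
        report[n_gpus] = (speedup, speedup / n_gpus)
    return report


for n_gpus, (speedup, efficiency) in scaling_report(epoch_times).items():
    print(f"{n_gpus} GPU(s): {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

With these numbers the 2-GPU run is about 3.95x faster than 1 GPU, while the 4-GPU run is only about 2.90x faster. That regression from 2 to 4 GPUs is consistent with the sampling bottleneck described above. The better-than-linear 2-GPU figure is not explained in the PR.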

This PR depends upon: #3393
CC: @rlratzel , @alexbarghi-nv , @BradReesWork

@VibhuJawa VibhuJawa requested a review from a team as a code owner February 1, 2023 03:23
@VibhuJawa VibhuJawa added the non-breaking (Non-breaking change) and improvement (Improvement / enhancement to an existing function) labels Feb 1, 2023
@codecov-commenter

codecov-commenter commented Feb 1, 2023

Codecov Report

❗ No coverage uploaded for pull request base (branch-23.04@1543356).
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-23.04    #3212   +/-   ##
===============================================
  Coverage                ?   56.27%           
===============================================
  Files                   ?      153           
  Lines                   ?     9662           
  Branches                ?        0           
===============================================
  Hits                    ?     5437           
  Misses                  ?     4225           
  Partials                ?        0           


@alexbarghi-nv
Member

@VibhuJawa is this still WIP? I thought you said this was ready for review.

@VibhuJawa
Member Author

> @VibhuJawa is this still WIP? I thought you said this was ready for review.

Sorry for the confusion. This is still WIP. I will let you know when it is ready for review.

@VibhuJawa VibhuJawa changed the title from "[WIP]Multi-trainers cugraph-DGL" to "[WIP]Multi-trainers cugraph-DGL examples" Mar 29, 2023
@VibhuJawa VibhuJawa changed the title from "[WIP]Multi-trainers cugraph-DGL examples" to "[REVIEW]Multi-trainers cugraph-DGL examples" Mar 29, 2023
@VibhuJawa
Member Author

Please note that this example is currently failing because of a bug that I am trying to triage. That said, the bug is independent of this PR, so the PR can probably still be reviewed.

@BradReesWork BradReesWork added this to the 23.04 milestone Mar 30, 2023
Member

@alexbarghi-nv alexbarghi-nv left a comment

Looks OK. I just had a couple of comments that don't need to hold up merging this PR.

@BradReesWork BradReesWork added the Blocked (Cannot progress due to external reasons) label Mar 31, 2023
@alexbarghi-nv alexbarghi-nv removed the Blocked (Cannot progress due to external reasons) label Apr 2, 2023
@alexbarghi-nv
Member

/merge

@rapids-bot rapids-bot bot merged commit f4a0778 into rapidsai:branch-23.04 Apr 2, 2023
Labels
improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)

4 participants