[BUG] Reindex Start Vertices and Batch Ids Prior to Sampling Call #3393

Merged

@alexbarghi-nv alexbarghi-nv commented Mar 29, 2023

This PR fixes a bug in which the batch ids in the sampling output do not match the expected ones when using the bulk sampler, producing subgraphs that are incorrect and larger than expected. Without reindexing, start vertices are assigned the wrong batch ids. Reindexing both before the sampling call ensures that batch ids and start vertices keep the same order.
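
To illustrate the failure mode, here is a minimal sketch that uses pandas as a stand-in for cuDF. The data and the helper name `pair_by_position` are made up for illustration, and this is not the code changed in this PR. It only shows how two columns that are in the same order but carry different index labels get paired incorrectly unless both are reindexed first.

```python
import pandas as pd


def pair_by_position(start_vertices: pd.Series, batch_ids: pd.Series) -> pd.DataFrame:
    """Pair start vertices with batch ids by position (hypothetical helper).

    Both inputs are reindexed to 0..n-1 first, so stale index labels
    cannot change which batch id ends up next to which start vertex.
    """
    starts = start_vertices.reset_index(drop=True)
    batches = batch_ids.reset_index(drop=True)
    return pd.DataFrame({"start": starts, "batch_id": batches})


# Same positional order, but the index labels disagree
# (for example after a shuffle or a filter on one of the inputs).
start_vertices = pd.Series([10, 11, 12, 13], index=[0, 1, 2, 3])
batch_ids = pd.Series([0, 0, 1, 1], index=[3, 2, 1, 0])

# Without reindexing, the constructor aligns on index labels,
# so vertex 10 is paired with the batch id stored under label 0.
misaligned = pd.DataFrame({"start": start_vertices, "batch_id": batch_ids})

# With reindexing, the pairing follows the positional order.
aligned = pair_by_position(start_vertices, batch_ids)

print(misaligned)
print(aligned)
```

In the misaligned frame, vertices 10 and 11 land in batch 1 instead of batch 0, which is the kind of mismatch described above.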

This PR also reorders the columns of the empty dataframe passed to dask in uniform_neighbor_sample so that batch_id and hop_id appear in the correct order. This keeps the columns correctly named, rather than inadvertently renamed because they were created in a different order.
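
As a rough sketch of why the column order of an empty template dataframe matters, the pandas-only example below applies a template's column names to a result by position. The column names other than batch_id and hop_id, the helper names, and the data are assumptions made for illustration. It does not reproduce dask's internals or the actual code in uniform_neighbor_sample.

```python
import pandas as pd

# Assumed result layout, for illustration only.
RESULT_COLUMNS = ["sources", "destinations", "batch_id", "hop_id"]


def empty_template(columns):
    """Build an empty dataframe that only declares column names and dtypes."""
    return pd.DataFrame({name: pd.Series(dtype="int64") for name in columns})


def label_by_position(result: pd.DataFrame, template: pd.DataFrame) -> pd.DataFrame:
    """Apply the template's column names to the result, position by position."""
    relabeled = result.copy()
    relabeled.columns = template.columns
    return relabeled


result = pd.DataFrame(
    {
        "sources": [1, 2],
        "destinations": [3, 4],
        "batch_id": [0, 1],
        "hop_id": [5, 6],
    }
)

# Template created with batch_id and hop_id in the opposite order.
wrong_template = empty_template(["sources", "destinations", "hop_id", "batch_id"])
# Template created in the same order as the result.
right_template = empty_template(RESULT_COLUMNS)

mislabeled = label_by_position(result, wrong_template)
correct = label_by_position(result, right_template)

print(mislabeled)
print(correct)
```

With the wrong template, the hop values end up under the batch_id label and vice versa. Creating the template in the same order as the real output avoids that.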

This PR is non-breaking because it restores the original behavior of bulk sampling and fixes a regression that was inadvertently introduced with the dask updates.

Resolves #3390

@alexbarghi-nv alexbarghi-nv self-assigned this Mar 29, 2023
@alexbarghi-nv alexbarghi-nv added bug Something isn't working non-breaking Non-breaking change labels Mar 29, 2023
@alexbarghi-nv alexbarghi-nv added this to the 23.04 milestone Mar 29, 2023
@alexbarghi-nv alexbarghi-nv marked this pull request as ready for review March 29, 2023 23:21
@alexbarghi-nv alexbarghi-nv requested a review from a team as a code owner March 29, 2023 23:21

@VibhuJawa VibhuJawa left a comment

PR looks good. Thanks for debugging, but please add a test to catch this.

rapids-bot bot pushed a commit that referenced this pull request Apr 2, 2023
This PR adds a working multi-GPU graph (on 2 dask workers) that is trained/loaded on multiple PyTorch trainers. (3)

Todo: 
- [x] Verify works on multiple trainers and multiple dask workers
- [x] Show scaling as you increase training GPUs 

At 1 second we become bottlenecked by the sampling dask cluster, but we still see a performance improvement when going from `1 GPU` to `2 GPUs`.
**On OGBN-Products**
```md
| Number of Training GPUs | Time per epoch |
|-------------------------|----------------|
| 1                       | 2.3 s          |
| 2                       | 0.582 s        |
| 4                       | 0.792 s        |
```

This PR depends upon:  #3393
CC: @rlratzel , @alexbarghi-nv , @BradReesWork

Authors:
  - Vibhu Jawa (https://github.com/VibhuJawa)
  - Alex Barghi (https://github.com/alexbarghi-nv)

Approvers:
  - Alex Barghi (https://github.com/alexbarghi-nv)

URL: #3212
@alexbarghi-nv

/merge

@rapids-bot rapids-bot bot merged commit 1281bb8 into rapidsai:branch-23.04 Apr 3, 2023
Labels
bug Something isn't working non-breaking Non-breaking change
Development

Successfully merging this pull request may close these issues.

Test failures present in MG bulk sampler tests
4 participants