Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate renumbering and compression to cugraph-dgl to accelerate MFG creation #3887

Merged
merged 108 commits into from
Oct 3, 2023

Conversation

tingyu66
Copy link
Member

@tingyu66 tingyu66 commented Sep 26, 2023

Allow cugraph-dgl dataloader to consume sampled outputs from BulkSampler in CSC format.

@tingyu66 tingyu66 marked this pull request as draft September 26, 2023 19:11
@tingyu66 tingyu66 self-assigned this Sep 27, 2023
@BradReesWork BradReesWork added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Sep 27, 2023
@BradReesWork BradReesWork added this to the 23.10 milestone Sep 27, 2023
@seunghwak
Copy link
Contributor

I think all the C/C++ updates in this PR is from #3841, this won't require C/C++ review once #3841 gets merged to the main branch.

@tingyu66 tingyu66 marked this pull request as ready for review September 28, 2023 16:21
Copy link
Contributor

@seunghwak seunghwak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has no C/C++ updates and approving without review.

@tingyu66
Copy link
Member Author

tingyu66 commented Oct 2, 2023

Using ogbn-products dataset, the time spent on MFG creation (excluding IO) for each epoch is reduced from 4.34 s to 2.95 s. The script used for the comparison is here. CC: @VibhuJawa

batch_size: 128, batches_per_partition: 50.

COO

image

CSC

image

Copy link
Member

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor changes but mostly looks great. Thanks again @tingyu66 for the great work here.

Comment on lines 54 to 55
batch_size: int = 1024,
drop_last: bool = False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tingyu66 , Can we also change seeds_per_call to 100_000 to make a better default based on your testing ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we change it to the default value of BulkSampler: 200_000? After our call the other day, I tested a wide range of seeds_per_call values and none of the runs threw a OOM error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, if it just works we can probably set the default to None and let upstream handle it ? What do you think, any default which is reasonable and just works is fine by me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the value to 200_000 in 45f93f2 to align with BulkSampler. Did not set to None to avoid the extra step of handling None case.

Comment on lines +447 to +451
major_offsets = df.major_offsets.dropna().values
label_hop_offsets = df.label_hop_offsets.dropna().values
renumber_map_offsets = df.renumber_map_offsets.dropna().values
renumber_map = df.map.dropna().values
minors = df.minors.dropna().values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this assumes that the length of the renumber_map is smaller than major_offsets. I will check this again if possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we are making any assumptions here. renumber_map and major_offsets are simply two different tensors that happen to be stored in a single df.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, My bad I just re-read the code below and I think we should be fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For each batch:

  • The length of renumber_map = number of distinct nodes (i.e., the number of sources in hop 0)
  • The length of major_offsets = sum of number of destination nodes in each hop

Copy link
Member

@alexbarghi-nv alexbarghi-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@VibhuJawa
Copy link
Member

Using ogbn-products dataset, the time spent on MFG creation (excluding IO) for each epoch is reduced from 4.34 s to 2.95 s. The script used for the comparison is here. CC: @VibhuJawa

Curious: What is the batch size being tested here , @tingyu66 ?

@tingyu66
Copy link
Member Author

tingyu66 commented Oct 2, 2023

Using ogbn-products dataset, the time spent on MFG creation (excluding IO) for each epoch is reduced from 4.34 s to 2.95 s. The script used for the comparison is here. CC: @VibhuJawa

Curious: What is the batch size being tested here , @tingyu66 ?

batch_size: 128, batches_per_partition: 50.

Copy link
Member

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rlratzel
Copy link
Contributor

rlratzel commented Oct 3, 2023

/merge

@rapids-bot rapids-bot bot merged commit a863835 into rapidsai:branch-23.10 Oct 3, 2023
73 checks passed
@tingyu66 tingyu66 deleted the dgl-mfg-integration branch October 3, 2023 18:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants