PG: join new vertex data by vertex ids #2796
Conversation
Fixes rapidsai#2793. We are using the assumption that there should be a single row for each vertex id.
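As an illustration only (not the PR's actual implementation), a minimal cudf sketch of what joining new vertex data by vertex id looks like under the single-row-per-vertex assumption; the column names here are made up:

```python
import cudf

# Existing vertex property table, keyed by vertex id (one row per vertex).
existing = cudf.DataFrame({"vertex_id": [0, 1, 2], "age": [30, 25, 40]})

# New vertex data to join in; vertex ids 1 and 2 already exist, 3 is new.
new = cudf.DataFrame({"vertex_id": [1, 2, 3], "score": [0.5, 0.9, 0.1]})

# An outer merge on the vertex id keeps a single row per vertex:
# existing rows gain the new "score" column, and vertex 3 is added
# with a null "age" instead of creating duplicate rows.
merged = existing.merge(new, on="vertex_id", how="outer")
```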
Codecov Report
Additional details and impacted files

@@            Coverage Diff            @@
##           branch-22.12    #2796   +/-   ##
===============================================
  Coverage           ?      62.60%
===============================================
  Files              ?         118
  Lines              ?        6570
  Branches           ?           0
===============================================
  Hits               ?        4113
  Misses             ?        2457
  Partials           ?           0
===============================================

☔ View full report at Codecov.
This should fix the quadratic scaling we're seeing when adding new data. CC @VibhuJawa. I'm still trying to improve the merges for MG to be like #2796, but I'm encountering issues.

Authors:
- Erik Welch (https://github.com/eriknw)

Approvers:
- Vibhu Jawa (https://github.com/VibhuJawa)
- Rick Ratzel (https://github.com/rlratzel)
- Alex Barghi (https://github.com/alexbarghi-nv)

URL: #2805
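The mini-benchmark referenced in the next comment isn't reproduced here; purely as a hypothetical sketch (assuming the experimental PropertyGraph import path and its add_vertex_data(df, vertex_col_name=...) signature), a timing loop along these lines would show whether per-batch insert time grows with the number of batches already added:

```python
import time

import cudf
import numpy as np
from cugraph.experimental import PropertyGraph  # assumed import path

pg = PropertyGraph()
batch_size = 100_000
for i in range(10):
    batch = cudf.DataFrame(
        {
            "vertex_id": np.arange(i * batch_size, (i + 1) * batch_size),
            "feat": np.random.rand(batch_size),
        }
    )
    t0 = time.perf_counter()
    pg.add_vertex_data(batch, vertex_col_name="vertex_id")
    # If scaling is quadratic, each batch takes noticeably longer than the last.
    print(f"batch {i}: {time.perf_counter() - t0:.3f}s")
```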
This PR needs rapidsai/cudf#11998 to pass tests. I think we ought to try to benchmark this before merging. Here is the mini-benchmark,
and here is the benchmark for the current branch.
We shouldn't read too much into these numbers yet, but this PR actually isn't as fast as I was expecting. I'll investigate further and experiment with the MAG240 dataset. @rlratzel, this reworks the merge in
Minor clarification. Looks good otherwise.
if df.npartitions > 2 * self.__num_workers:
    # TODO: better understand behavior of npartitions argument in join
    df = df.repartition(npartitions=self.__num_workers).persist()
With dask_cudf merges/operations, I think we should not decrease the number of partitions to less than or equal to _num_workers unless we actually need to, because it reduces parallelization. The shuffle cost of the merge is n log n, but if the input partitions are fewer than the available workers, worker starvation can become a problem. Maybe let's pick 2*_num_workers or something?
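A sketch of the heuristic being suggested, with illustrative names (the original code uses a private self.__num_workers attribute): keep at least 2 * num_workers partitions so workers aren't starved, but cap growth so merges don't slow down as partitions accumulate.

```python
def maybe_repartition(df, num_workers):
    # Illustrative sketch of the suggestion above: keep at least 2 * num_workers
    # partitions so workers are not starved, but cap growth so the merge shuffle
    # does not slow down as partitions accumulate across repeated inserts.
    # `df` is assumed to be a dask_cudf DataFrame.
    target = 2 * num_workers
    if df.npartitions > target:
        df = df.repartition(npartitions=target).persist()
    return df
```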
That sounds reasonable. Done. I think we may want to explore these values as we scale out.
I know it's bad, though, if we don't repartition and the number of partitions grows too large.
Agreed. Yup, it becomes really slow as you increase partitions. In case you are curious about how the shuffle works under the hood, below is my favorite explanation by rjzamora, which I think still holds true.
LGTM
@gpucibot merge
Fixes #2793
Fixes #2794
We are using the assumption that there should be a single row for each vertex id.
This does not yet handle MG. Let's figure out how we want SG to behave first.