[BUG] Critical: Force cudf.concat when passing in a cudf Series to MG Uniform Neighbor Sample #3416
Thanks for adding a test. I have one minor change request related to the docstring, otherwise it LGTM.
…v/cugraph into sampling-fix-concat
        start_list = start_list.to_frame()
        batch_id_list = batch_id_list.to_frame()
        ddf = start_list.merge(
            batch_id_list,
            how="left",
            left_index=True,
            right_index=True,
        )
    else:
        # sg input
        ddf = cudf.concat(
            [
                start_list,
                batch_id_list,
            ],
            axis=1,
        )
else:
    ddf = start_list
    ddf = start_list.to_frame()
Do we really care about the index here? I think not. Does the below work?
start_list = start_list.reset_index(drop=True)
batch_id_list = batch_id_list.reset_index(drop=True)
if isinstance(start_list, dask_cudf.Series):
    ddf = dd.concat([start_list, batch_id_list], ignore_unknown_divisions=True, axis=1)
else:
    ddf = cudf.concat([start_list, batch_id_list], axis=1, ignore_index=True)
If we reset the index, can we still join the batch ids and the start list correctly?
And also, I ran into an issue with dask_cudf.concat where the name of the series was dropped in one of my first attempts at a solution. dask_cudf.merge doesn't have that problem.
I think we should be able to. From the logic you shared, we are merging on the index (left_index=True, right_index=True) in dask, which is the same thing but less efficient.
Edit: Also added ignore_index=True to make it more concrete in cuDF.
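To make the question above concrete, here is a minimal sketch that uses pandas as a CPU stand-in for cudf (an assumption made so the snippet runs without a GPU; cudf and dask_cudf may behave differently, which is the subject of this PR). The series names and values are made up for illustration. It compares merging on a shared index with resetting the index and concatenating by position.

```python
import pandas as pd

# Illustrative data: both series carry the same non-default index.
start = pd.Series([7, 3, 9], index=[10, 11, 12], name="start")
batch = pd.Series([0, 0, 1], index=[10, 11, 12], name="batch")

# Path 1: merge on the index, as in the diff.
merged = start.to_frame().merge(
    batch.to_frame(), how="left", left_index=True, right_index=True
)

# Path 2: drop the index and concatenate column-wise by position.
concatenated = pd.concat(
    [start.reset_index(drop=True), batch.reset_index(drop=True)], axis=1
)

pairs_merged = list(zip(merged["start"].tolist(), merged["batch"].tolist()))
pairs_concat = list(
    zip(concatenated["start"].tolist(), concatenated["batch"].tolist())
)

# Caveat: if the two series are in a different row order, the paths diverge.
# The merge pairs rows by label; reset + concat pairs them by position.
shuffled = pd.Series([0, 0, 1], index=[12, 11, 10], name="batch")
by_label = start.to_frame().merge(
    shuffled.to_frame(), how="left", left_index=True, right_index=True
)
pairs_by_label = list(zip(by_label["start"].tolist(), by_label["batch"].tolist()))
```

In this stand-in, both paths pair the same rows as long as the two series are already in the same row order. Resetting the index only changes the result when the indexes disagree.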
ok, let me try this
@VibhuJawa I just confirmed this is not an issue with dask-cudf, it's an issue with our get_distributed_data function. I will make an issue for cugraph instead.
I'm not sure why calling merge instead of concat before get_distributed_data works, but for some reason the bug completely disappears with merge.
I can take a look too
Thanks for creating an issue.
I should link it here, sorry: #3420
LGTM
/merge
Currently, cudf does not merge series properly when they already share an index. I'm not sure if this is a bug in cudf, or intentional behavior. This issue does not occur with dask_cudf. The resolution is to use cudf.concat when passing a cudf.Series for start vertices and batch ids, and df.to_frame().merge when passing in a dask_cudf.Series for start vertices and batch ids.
This PR also adds an additional test which tests both cudf and dask_cudf inputs to catch these sorts of problems in the future.
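The branching described above could be sketched roughly as follows. This is not the cugraph implementation: pandas stands in for both cudf and dask_cudf so the snippet runs on a CPU, and the `combine_inputs` helper and its `distributed` flag are hypothetical names introduced here. Per the description, the PR itself chooses the path based on whether the input is a cudf.Series or a dask_cudf.Series.

```python
import pandas as pd


def combine_inputs(start_list, batch_id_list=None, distributed=False):
    """Build one frame holding start vertices and, optionally, batch ids.

    Hypothetical helper: `distributed` stands in for the check on the
    input type (cudf.Series vs dask_cudf.Series) described in the PR.
    """
    if batch_id_list is None:
        # No batch ids: promote the series to a one-column frame.
        return start_list.to_frame()
    if distributed:
        # Multi-GPU style path: merge two one-column frames on the index.
        return start_list.to_frame().merge(
            batch_id_list.to_frame(),
            how="left",
            left_index=True,
            right_index=True,
        )
    # Single-GPU style path: column-wise concat of the two series.
    return pd.concat([start_list, batch_id_list], axis=1)


starts = pd.Series([0, 5, 2], name="start")
batches = pd.Series([0, 1, 1], name="batch")

single = combine_inputs(starts, batches)
multi = combine_inputs(starts, batches, distributed=True)
alone = combine_inputs(starts)
```

Under the pandas stand-in, both paths keep the column names and the row pairing. Checking exactly that for both input types is the kind of assertion the added test is meant to make against the real cudf and dask_cudf inputs.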