[FIX] Fix the hang in cuGraph Python Uniform Neighbor Sample, Add Logging to Bulk Sampler #3669
Conversation
Some more minor changes, but I think we should be good to go after that. The only remaining item is what @rlratzel requested around keeping the now-legacy implementation around.
This still seems to be actually hanging (not a perf issue) for me when running 10 times on bulk sampling. :-( Can you try running it and ensure we don't cause hangs?
dataset='ogbn_papers100M'
dataset_root="/datasets/abarghi/"
output_root="/raid/vjawa/"
reverse_edges=True
add_edge_types=False
batch_size=1024
seeds_per_call=524288
fanout=[10,10,10]
replication_factor=4
seed=123
dataset_dir=dataset_root
output_path=output_root
persist=False
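As a quick sanity check on the reproduction settings above, the ratio of `seeds_per_call` to `batch_size` tells you how many batches each call to the sampler covers:

```python
# Derived from the reproduction settings above: with seeds_per_call = 524288
# and batch_size = 1024, every sampling call processes 512 batches.
params = {
    "batch_size": 1024,
    "seeds_per_call": 524288,
    "fanout": [10, 10, 10],
    "replication_factor": 4,
    "seed": 123,
}
batches_per_call = params["seeds_per_call"] // params["batch_size"]
print(batches_per_call)  # 512
```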
Minor asks around raising deprecation warnings.
LGTM
if isinstance(self.__batches, dask_cudf.DataFrame):
    min_batch_id = min_batch_id.compute()
min_batch_id = int(min_batch_id)
Why did we add this compute back?
client.submit doesn't accept it uncomputed. What we had before worked with client.compute; I'm trying to find an alternative that works for submit.
The error we get is this:
AttributeError("'Scalar' object has no attribute '_parent_meta'")
And wrapping with delayed doesn't work either; it gives a similar error.
i.e. delayed(lambda df: df.min())(self.__batches) doesn't work.
I think I found a way around that issue, but we have another problem: the execution hangs because we are setting allow_other_workers=False, so it can't get the min.
I guess we can let it be for now and pursue the other optimizations as a follow-up.
I've since figured out that what we had with client.compute, and not computing the min beforehand, was wrong. It should cause a hang, which we were seeing.
LGTM, and thanks for adding support for the legacy args. I'm filing this issue to remind us to remove it for 23.10 (which gives users one release to migrate).
I have a question and a possible FIXME suggestion which need not hold up approval now.
I think we do want to run this nightly.
/merge
Some Dask operations were not being done correctly, and time was being lost broadcasting the rank and label arrays to all workers. This PR resolves those issues.
Also pulls in the previously-experimental changes that add logging to the bulk sampler.
Credit to @VibhuJawa for isolating and fixing the issues with the column merge in
uniform_neighbor_sample
and for the new sampling notebook and shell script. This PR does modify the sampling APIs, so it is breaking. The API changes are necessary to avoid unnecessary shuffling and, eventually, to improve batch id assignment.
Dataset: ogbn_papers100M x 2; Fanout: [25, 25]; Batch Size: 512; Seeds Per Call: 524288
Current runtime: 2.69 s ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Previous runtime: 4.51 s ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Speedup: 1.7x

Dataset: ogbn_papers100M x 4; Fanout: [25, 25]; Batch Size: 512; Seeds Per Call: 524288
Current runtime: 6.32 s ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Previous runtime: 10.7 s ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Speedup: 1.7x
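The reported speedups check out as the ratio of the previous mean runtime to the current one:

```python
# Speedup = previous runtime / current runtime, from the numbers above.
speedup_2x = 4.51 / 2.69   # ogbn_papers100M x 2
speedup_4x = 10.7 / 6.32   # ogbn_papers100M x 4
print(round(speedup_2x, 1), round(speedup_4x, 1))  # 1.7 1.7
```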