[FIX] Fix the hang in cuGraph Python Uniform Neighbor Sample, Add Logging to Bulk Sampler #3669
Conversation
Some more minor changes, but I think we should be good to go after that. The only remaining item is what @rlratzel requested around keeping the now-legacy implementation around.
This still seems to be actually hanging (not a perf issue) for me when running 10 times on bulk sampling. :-( Can you try running it and ensure we don't cause hangs?
dataset='ogbn_papers100M'
dataset_root="/datasets/abarghi/"
output_root="/raid/vjawa/"
reverse_edges=True
add_edge_types=False
batch_size=1024
seeds_per_call=524288
fanout=[10,10,10]
replication_factor=4
seed=123
dataset_dir=dataset_root
output_path=output_root
persist=False
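As a quick sanity check on the reproduction settings above, the ratio of `seeds_per_call` to `batch_size` tells you how many batches each call to the sampler covers:

```python
# Derived from the reproduction settings above: with seeds_per_call = 524288
# and batch_size = 1024, every sampling call processes 512 batches.
params = {
    "batch_size": 1024,
    "seeds_per_call": 524288,
    "fanout": [10, 10, 10],
    "replication_factor": 4,
    "seed": 123,
}
batches_per_call = params["seeds_per_call"] // params["batch_size"]
print(batches_per_call)  # 512
```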
Minor asks around raising deprecation warnings.
LGTM
if isinstance(self.__batches, dask_cudf.DataFrame):
    min_batch_id = min_batch_id.compute()
min_batch_id = int(min_batch_id)
Why did we add this compute back?
client.submit doesn't accept it uncomputed. What we had before worked with client.compute; I'm trying to find an alternative that works for submit.
The error we get is this:
AttributeError("'Scalar' object has no attribute '_parent_meta'")
And wrapping with delayed doesn't work either; it gives a similar error.
i.e. delayed(lambda df: df.min())(self.__batches) doesn't work.
I think I found a way around that issue, but we have another problem: the execution hangs because we are setting allow_other_workers=False, so it can't get the min.
I guess we can let it be for now and pursue the other optimizations as a follow-up.
I've since figured out that what we had with client.compute, and not computing the min beforehand, was wrong. It should cause a hang, which we were seeing.
LGTM, and thanks for adding support for the legacy args. I'm filing this issue to remind us to remove it for 23.10 (which gives users one release to migrate).
I have a question and a possible FIXME suggestion which need not hold up approval now.
I think we do want to run this nightly.
/merge
Some Dask operations were not being done correctly, and time was being lost broadcasting the rank and label arrays to all workers. This PR resolves those issues.
Also pulls in the previously-experimental changes that add logging to the bulk sampler.
Credit to @VibhuJawa for isolating and fixing the issues with the column merge in
uniform_neighbor_sample
and for the new sampling notebook and shell script. This PR does modify the sampling APIs, so it is breaking. The API changes are necessary to avoid unnecessary shuffling and, eventually, to improve batch id assignment.
Dataset: ogbn_papers100M x 2; Fanout: [25, 25]; Batch Size: 512; Seeds Per Call: 524288
Current runtime: 2.69 s ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Previous runtime: 4.51 s ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Speedup: 1.7x

Dataset: ogbn_papers100M x 4; Fanout: [25, 25]; Batch Size: 512; Seeds Per Call: 524288
Current runtime: 6.32 s ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Previous runtime: 10.7 s ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Speedup: 1.7x
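The reported speedups check out as the ratio of the previous mean runtime to the current one:

```python
# Speedup = previous runtime / current runtime, from the numbers above.
speedup_2x = 4.51 / 2.69   # ogbn_papers100M x 2
speedup_4x = 10.7 / 6.32   # ogbn_papers100M x 4
print(round(speedup_2x, 1), round(speedup_4x, 1))  # 1.7 1.7
```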