
[REVIEW] Optimize dask.uniform_neighbor_sample #2887

Merged

Conversation

@VibhuJawa VibhuJawa commented Nov 3, 2022

This PR closes #2872 and is part of https://github.com/rapidsai/graph_dl/issues/74.

In a preliminary benchmark I am seeing a 3.5x-6x improvement, and I expect more improvement on bigger clusters.

Before PR :

```
%%timeit
df = cugraph.dask.uniform_neighbor_sample(dask_g, start_list[:10_000], [20])

187 ms ± 40.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

After PR

```
%%timeit
df = cugraph.dask.uniform_neighbor_sample(dask_g, start_list[:10_000], [20])

50.1 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Updated benchmark

Before PR:

```
---------------------------------------------------------------------- benchmark: 3 tests ----------------------------------------------------------------------
Name (time in ms, mem in bytes)                            Mean              GPU mem            GPU Leaked mem            Rounds            GPU Rounds
----------------------------------------------------------------------------------------------------------------------------------------------------------------
bench_uniform_neigbour_sample_email_eu_core[1000]      170.2833 (1.0)        300,224 (1.0)                   0 (1.0)           1           1
bench_uniform_neigbour_sample_email_eu_core[5000]      191.7691 (1.13)     1,200,944 (4.00)                  0 (1.0)           1           1
bench_uniform_neigbour_sample_email_eu_core[10000]     194.8658 (1.14)     1,200,944 (4.00)                  0 (1.0)           1           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------
```

After PR:

```
--------------------------------------------------------------------- benchmark: 3 tests --------------------------------------------------------------------
Name (time in ms, mem in bytes)                           Mean            GPU mem            GPU Leaked mem            Rounds            GPU Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------
bench_uniform_neigbour_sample_email_eu_core[1000]      54.0802 (1.0)            0 (1.0)                   0 (1.0)           1           1
bench_uniform_neigbour_sample_email_eu_core[5000]      63.0271 (1.17)           0 (1.0)                   0 (1.0)           1           1
bench_uniform_neigbour_sample_email_eu_core[10000]     70.1445 (1.30)           0 (1.0)                   0 (1.0)           1           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------
```

We should probably apply a similar optimization to other algorithms too.
CC: @jnke2016 , @rlratzel , @alexbarghi-nv .

@VibhuJawa VibhuJawa requested a review from a team as a code owner November 3, 2022 22:24
@VibhuJawa VibhuJawa added this to the 22.12 milestone Nov 3, 2022
@VibhuJawa VibhuJawa self-assigned this Nov 3, 2022
@VibhuJawa VibhuJawa added the "2 - In Progress", "improvement" (Improvement / enhancement to an existing function), and "non-breaking" (Non-breaking change) labels Nov 3, 2022
@VibhuJawa VibhuJawa changed the title [WIP] Optimize dask.uniform_neighbor_sample [REVIEW] Optimize dask.uniform_neighbor_sample Nov 4, 2022

codecov-commenter commented Nov 4, 2022

Codecov Report

Base: 60.43% // Head: 61.71% // Increases project coverage by +1.27% 🎉

Coverage data is based on head (632d584) compared to base (d86c933).
Patch has no changes to coverable lines.

Additional details and impacted files
```
@@               Coverage Diff                @@
##           branch-22.12    #2887      +/-   ##
================================================
+ Coverage         60.43%   61.71%   +1.27%
================================================
  Files               114      123       +9
  Lines              6547     7543     +996
================================================
+ Hits               3957     4655     +698
- Misses             2590     2888     +298
```

| Impacted Files | Coverage Δ |
| --- | --- |
| python/cugraph/cugraph/components/connectivity.py | 94.11% <ø> (-0.89%) ⬇️ |
| python/cugraph/cugraph/dask/__init__.py | 100.00% <ø> (ø) |
| ...on/cugraph/cugraph/dask/components/connectivity.py | 31.03% <ø> (+5.10%) ⬆️ |
| ...thon/cugraph/cugraph/dask/sampling/random_walks.py | 22.91% <ø> (ø) |
| ...h/cugraph/dask/sampling/uniform_neighbor_sample.py | 25.42% <ø> (+4.26%) ⬆️ |
| ...ugraph/cugraph/dask/structure/mg_property_graph.py | 14.12% <ø> (+0.46%) ⬆️ |
| python/cugraph/cugraph/dask/traversal/bfs.py | 25.39% <ø> (-0.84%) ⬇️ |
| ...ph/cugraph/experimental/pyg_extensions/__init__.py | 100.00% <ø> (ø) |
| python/cugraph/cugraph/gnn/__init__.py | 100.00% <ø> (ø) |
| ...h/cugraph/gnn/dgl_extensions/base_cugraph_store.py | 84.21% <ø> (ø) |
| ... and 47 more | |


☔ View full report at Codecov.

```python
wait(cudf_result)

ddf = dask_cudf.from_delayed(cudf_result).persist()
# Send tasks all at once
```
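The idea behind the lines above (submit every per-worker task up front, then block once on the whole batch rather than waiting task by task) can be illustrated with the standard library alone. This is a minimal sketch of the pattern, not cugraph code; `sample_partition` is a made-up stand-in for the per-worker sampling call:

```python
from concurrent.futures import ThreadPoolExecutor, wait

def sample_partition(part):
    # Stand-in for the per-worker sampling task; illustrative only.
    return [x * 2 for x in part]

partitions = [[1, 2], [3, 4], [5, 6]]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Submit every task up front so all workers start immediately...
    futures = [pool.submit(sample_partition, p) for p in partitions]
    # ...then block once on the whole batch instead of per task.
    wait(futures)
    results = [f.result() for f in futures]

print(results)  # [[2, 4], [6, 8], [10, 12]]
```

Waiting once per submitted task serializes the round trips; waiting once on the batch lets the overhead overlap, which is where the per-worker savings come from.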
@alexbarghi-nv (Member) commented Nov 9, 2022
Just one comment; it would be nice to have a helper function for this so we can apply it to more algorithms. But I suggest we leave that to a future PR given how much work we have left before burndown.

To clarify, this was referencing lines 174-180
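A reusable helper along those lines might look like the sketch below. The name `run_per_worker` and the wiring are hypothetical (this is not code from the PR), and actually running it assumes a RAPIDS environment with `distributed` and `dask_cudf` installed:

```python
def run_per_worker(client, func, worker_to_args):
    """Hypothetical helper: submit `func` once per worker all at once,
    wait on every future, and assemble the per-worker cudf results
    into a single persisted dask_cudf DataFrame."""
    from dask.distributed import wait  # lazy import; needs `distributed`
    import dask_cudf                   # needs a RAPIDS environment

    # Send all tasks up front rather than submitting and waiting one by one.
    futures = [
        client.submit(func, *args, workers=[w], pure=False)
        for w, args in worker_to_args.items()
    ]
    wait(futures)

    ddf = dask_cudf.from_delayed(futures).persist()
    wait(ddf)
    return ddf
```

With a helper like this, each algorithm would only need to supply its per-worker function and argument mapping.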

@alexbarghi-nv (Member) left a comment
👍

@jnke2016 (Contributor) left a comment

This implementation experiences a hang on 2, 3, and 4 GPUs, similar to this issue that was closed a few months ago by this PR. To reproduce, just run this implementation of uniform_neighbor_sample several times (fewer than 10) on 2 GPUs.

@VibhuJawa (Member, Author)

> This implementation experiences a hang on 2, 3, and 4 GPUs, similar to this issue that was closed a few months ago by this PR. To reproduce, just run this implementation of uniform_neighbor_sample several times (fewer than 10) on 2 GPUs.

Thanks for testing this thoroughly @jnke2016, I really appreciate it. Can you post a link to the code you ran to produce the hang, as I could not get it to hang in my testing on 8 GPUs.

We can probably add those statements back and give up 10 ms of perf on 8 GPUs, but that is not a real solution, as it will be much worse on a larger cluster (say 128 GPUs).

I am going to try looking into a real fix for the problem today.


jnke2016 commented Nov 9, 2022

You can run this on 2 GPUs.

```python
import cugraph
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask
from cugraph.dask.comms import comms as Comms
import cudf
import dask_cudf
from cugraph.dask.common.mg_utils import get_visible_devices
import cugraph.dask as dcg
from cugraph.generators import rmat

import time

def setup():
    cluster = LocalCUDACluster()
    client = Client(cluster)
    client.wait_for_workers(len(get_visible_devices()))

    Comms.initialize(p2p=True)
    return (client, cluster)

def teardown(client, cluster=None):
    Comms.destroy()
    client.close()
    if cluster:
        cluster.close()

def generate_edgelist(scale,
                      edgefactor,
                      seed=None,
                      unweighted=False,
                     ):

    df = rmat(
        scale,
        (2**scale)*edgefactor,
        0.57,  # from Graph500
        0.19,  # from Graph500
        0.19,  # from Graph500
        seed or 42,
        clip_and_flip=False,
        scramble_vertex_ids=True,
        create_using=None,  # return edgelist instead of Graph instance
        mg=True
    )

    return df

if __name__ == "__main__":

    setup_objs = setup()
    client = setup_objs[0]
    num_workers = len(client.scheduler_info()['workers'])

    df = generate_edgelist(scale=10, edgefactor=16, seed=5)

    m_G = cugraph.Graph()
    m_G.from_dask_cudf_edgelist(
        df, source='src', destination='dst', renumber=True, legacy_renum_only=True)

    vertices = m_G.nodes().persist()
    print("number of nodes is ", len(vertices))
    print("number of edges is ", len(m_G.input_df))
    vertices = vertices.compute().reset_index(drop=True).iloc[:10000]

    time_list = []
    for i in range(10):
        print("*****iteration ", i, "*****")
        t0 = time.time()
        results = cugraph.dask.uniform_neighbor_sample(m_G, vertices, [2])
        t1 = time.time()
        total = t1 - t0
        time_list.append(total)

    import statistics
    print("the mean is ", statistics.mean(time_list))
    print("the standard deviation is ", statistics.stdev(time_list))
    print("number of GPUs is ", num_workers)

    teardown(*setup_objs)
```

@VibhuJawa
Copy link
Member Author

@jnke2016, thanks for this script. Let me investigate more :-)

@VibhuJawa VibhuJawa changed the title [REVIEW] Optimize dask.uniform_neighbor_sample [WIP] Optimize dask.uniform_neighbor_sample Nov 10, 2022

VibhuJawa commented Nov 17, 2022

@jnke2016 ,

Swapping client.submit to client.map runs into issues with work stealing (see internal thread). Swapping it back gets rid of the hang.
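For reference, one way to keep per-worker tasks from being moved by the scheduler when using `client.submit` is to pin each task to its worker. This is a hedged sketch; the helper name `submit_pinned` is made up for illustration (it is not code from this PR), but `workers=` and `allow_other_workers=` are real `distributed.Client.submit` parameters:

```python
def submit_pinned(client, func, per_worker_args):
    """Hypothetical helper: submit one task per Dask worker, pinned so the
    scheduler cannot steal it onto another worker."""
    futures = []
    for worker_addr, args in per_worker_args.items():
        futures.append(
            client.submit(
                func,
                *args,
                workers=[worker_addr],      # run only on this worker...
                allow_other_workers=False,  # ...and forbid work stealing
                pure=False,                 # do not deduplicate identical calls
            )
        )
    return futures
```

Pinning trades load balancing for predictable placement, which matters when each task must run next to the data (and comms handle) already resident on a specific GPU.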

We still see speedups (see the updated benchmarks in the PR description, from an 8 GPU cluster), but we will not be as fast as we should be on large clusters. (We currently lose 5 ms per worker.)

Anyways, I have verified that your script works on 1,2,3,4,5,6,7,8 GPUs with 50 iterations each without hanging.

I will make the cluster-agnostic changes in 23.02, as they are more involved.

Please feel free to review again.

@VibhuJawa VibhuJawa changed the title [WIP] Optimize dask.uniform_neighbor_sample [REVIEW] Optimize dask.uniform_neighbor_sample Nov 17, 2022
@VibhuJawa VibhuJawa requested review from a team as code owners November 17, 2022 03:59

@VibhuJawa VibhuJawa changed the base branch from branch-22.12 to branch-22.10 November 17, 2022 04:00
@VibhuJawa VibhuJawa changed the base branch from branch-22.10 to branch-22.12 November 17, 2022 04:01
@VibhuJawa

@rerun tests

@wangxiaoyunNV (Contributor) left a comment
It looks good.

@wangxiaoyunNV wangxiaoyunNV reopened this Nov 17, 2022
@jnke2016 (Contributor) left a comment
This looks good to me.

@BradReesWork (Member)

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 34e3dd7 into rapidsai:branch-22.12 Nov 18, 2022
Labels
improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA]: Reduce the dask overhead in uniform_neighbor_sample
6 participants