MG get_two_hop_neighbors fails with KeyError when accessing start_vertices #3745

Closed
rlratzel opened this issue Jul 26, 2023 · 0 comments · Fixed by #3778
Labels
bug Something isn't working CRITICAL BUG! BUG that needs to be FIX NOW !!!!

Comments

@rlratzel
Contributor

On a system with more visible GPUs than are needed to hold the distributed start_vertices list, the get_two_hop_neighbors implementation raises a KeyError when it looks up a worker that has no entry in that list:

  File "/home/user/miniconda3/envs/cugraph_dev-23.08/lib/python3.10/site-packages/cugraph/structure/graph_implementation/simpleDistributedGraph.py", line 781, in <listcomp>
    start_vertices[w][0],
KeyError: 'tcp://127.0.0.1:46347'

In this case, the system had 4 GPUs, and the workaround was to restrict the run to 2 GPUs:

(cugraph_dev-23.08) user@machine ~/nvidia/demo> CUDA_VISIBLE_DEVICES=0,1 python get_two_hop_demo.py

The fix is to not assume the start_vertices list is always distributed across every worker in the cluster.
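
The failure mode can be illustrated without cuGraph or Dask. The sketch below is a plain-Python stand-in, not the code from simpleDistributedGraph.py: the worker addresses, the dictionary contents, and both helper function names are hypothetical, and only the `start_vertices[w][0]` lookup pattern comes from the traceback above.

```python
# Plain-Python sketch of the failure; no cuGraph or Dask calls are used.
# All names, addresses, and data below are hypothetical.

# Four workers are visible to the cluster...
workers = [
    "tcp://127.0.0.1:40001",
    "tcp://127.0.0.1:40002",
    "tcp://127.0.0.1:40003",
    "tcp://127.0.0.1:40004",
]

# ...but the start vertices ended up on only two of them.
start_vertices = {
    "tcp://127.0.0.1:40001": [[0, 1]],
    "tcp://127.0.0.1:40002": [[2, 3]],
}


def parts_assuming_all_workers(worker_map, all_workers):
    """Mirror the failing pattern: index the map with every worker."""
    return [worker_map[w][0] for w in all_workers]


def parts_for_present_workers(worker_map, all_workers):
    """Only visit workers that actually hold a partition."""
    return [worker_map[w][0] for w in all_workers if w in worker_map]


try:
    parts_assuming_all_workers(start_vertices, workers)
    raised = False
    missing_worker = None
except KeyError as err:
    # The first worker without an entry triggers the error.
    raised = True
    missing_worker = err.args[0]

safe_parts = parts_for_present_workers(start_vertices, workers)
```

Filtering on membership is only one way to drop the assumption; the alternative is to make sure every worker has an entry in the map in the first place.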

@rlratzel rlratzel added the bug Something isn't working label Jul 26, 2023
@rlratzel rlratzel self-assigned this Jul 26, 2023
@rlratzel rlratzel added the CRITICAL BUG! BUG that needs to be FIX NOW !!!! label Aug 8, 2023
rapids-bot bot pushed a commit that referenced this issue Aug 12, 2023
… start vertices list (#3778)

closes #3745 

This PR replaces the `get_distributed_data()` call with `persist_dask_df_equal_parts_per_worker()` and `get_persisted_df_worker_map()` to avoid a problem where `get_distributed_data()` does not distribute data across all workers. As a result, a `KeyError` was raised when the data was looked up by worker and that worker was not a key in the map.
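
As a rough illustration of why giving every worker a partition avoids the missing-key lookup, here is a plain-Python sketch. It is not the cuGraph implementation, and it makes no claim about how `persist_dask_df_equal_parts_per_worker()` or `get_persisted_df_worker_map()` behave internally; the function `split_equally_per_worker` and the worker names are hypothetical.

```python
# Hypothetical stand-in for an "equal parts per worker" split.
# It guarantees one entry per worker, even when there are fewer rows
# than workers (some entries are then empty lists).


def split_equally_per_worker(rows, workers):
    """Return {worker: slice_of_rows}, with one key for every worker."""
    base, extra = divmod(len(rows), len(workers))
    parts = {}
    start = 0
    for i, worker in enumerate(workers):
        # The first `extra` workers take one additional row.
        size = base + (1 if i < extra else 0)
        parts[worker] = rows[start:start + size]
        start += size
    return parts


workers = ["worker-0", "worker-1", "worker-2", "worker-3"]
worker_map = split_equally_per_worker([10, 11], workers)

# Every worker can now be looked up without a KeyError.
sizes = [len(worker_map[w]) for w in workers]
```

With a map built this way, code that iterates over all workers has to tolerate empty partitions instead of missing keys.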

More details are in the [linked issue](#3745).

This PR also does minor refactoring in `get_two_hop_neighbors()` and reorganizes the imports according to [PEP 8](https://peps.python.org/pep-0008/#imports).

Tested manually on a 4-GPU system: the problem described in #3745 was reproduced, the change in this PR was applied, and on re-running the error no longer occurred.

Authors:
  - Rick Ratzel (https://github.com/rlratzel)

Approvers:
  - Vibhu Jawa (https://github.com/VibhuJawa)
  - Brad Rees (https://github.com/BradReesWork)

URL: #3778