On a system with more visible devices than are needed to hold the distributed `start_vertices` list, a `KeyError` is raised when the `get_two_hop_neighbors()` implementation attempts to access the per-worker data for a worker that received no partition:
File "/home/user/miniconda3/envs/cugraph_dev-23.08/lib/python3.10/site-packages/cugraph/structure/graph_implementation/simpleDistributedGraph.py", line 781, in <listcomp>
start_vertices[w][0],
KeyError: 'tcp://127.0.0.1:46347'
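For illustration, here is a minimal sketch of the failure mode (hypothetical names and values, not the actual cugraph code): when the per-worker map contains entries only for the workers that received data, indexing it with every worker address in the cluster raises the `KeyError` shown above.

```python
# Minimal sketch of the failure mode (hypothetical names, not the cugraph code):
# a worker -> partitions map is built only for the workers that received data,
# but the loop below indexes it with every worker address in the cluster.
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    client = Client(LocalCluster(n_workers=4, threads_per_worker=1))
    all_workers = list(client.scheduler_info()["workers"])  # 4 addresses

    # Suppose the (small) start_vertices data landed on only 2 of the 4 workers.
    worker_to_parts = {w: ["start_vertices part"] for w in all_workers[:2]}

    # Indexing the map by every visible worker reproduces the KeyError:
    for w in all_workers:
        part = worker_to_parts[w][0]  # KeyError: 'tcp://127.0.0.1:...' on the 3rd worker
```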
In this case, the system had 4 GPUs, and the workaround was to restrict the run to 2 GPUs:
… start vertices list (#3778)
Closes #3745
This PR replaces the `get_distributed_data()` call with `persist_dask_df_equal_parts_per_worker()` and `get_persisted_df_worker_map()` to avoid a problem where `get_distributed_data()` does not distribute data properly across all workers, which resulted in a `KeyError` when data was accessed for a worker that was not a key in the map.
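A rough sketch of the replacement pattern described above; the import path and signatures are assumptions inferred from the PR description rather than a verified API reference:

```python
# Sketch only: the import path and signatures below are assumptions inferred
# from the PR description; consult the cugraph source for the authoritative API.
from cugraph.dask.common.part_utils import (
    get_persisted_df_worker_map,
    persist_dask_df_equal_parts_per_worker,
)


def build_worker_map(start_vertices_ddf, client):
    """Return a {worker_address: [futures]} map covering only the workers that
    actually hold a part of start_vertices_ddf (hypothetical helper for illustration)."""
    # Persist roughly equal-sized parts of the DataFrame across the workers ...
    persisted = persist_dask_df_equal_parts_per_worker(start_vertices_ddf, client)
    # ... then map each participating worker to the parts it holds, so callers
    # never index the map with a worker that received no data.
    return get_persisted_df_worker_map(persisted, client)
```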
More details are in the [linked issue](#3745).
This PR also does minor refactoring in `get_two_hop_neighbors()` and reorganizes the imports according to [PEP 8](https://peps.python.org/pep-0008/#imports).
Tested manually on a 4-GPU system: the problem described in #3745 was reproduced, the change in this PR was applied, the run was repeated, and the error no longer occurred.
Authors:
- Rick Ratzel (https://github.com/rlratzel)
Approvers:
- Vibhu Jawa (https://github.com/VibhuJawa)
- Brad Rees (https://github.com/BradReesWork)
URL: #3778
The fix is to not assume the `start_vertices` list is always distributed across every worker in the cluster.
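In code terms, the per-worker task submission should iterate over the entries of the worker map rather than over every worker in the cluster; a hedged sketch of the principle, with hypothetical names such as `task_fn` and `worker_map`:

```python
# Hedged sketch of the principle (hypothetical names, not the cugraph source):
# submit work only to the workers that actually hold a piece of start_vertices.
def submit_per_worker(client, task_fn, worker_map, *task_args):
    # worker_map: {worker_address: [futures of start_vertices parts]}
    return [
        client.submit(task_fn, parts[0], *task_args, workers=[w])
        for w, parts in worker_map.items()  # only workers that hold data
    ]
```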