Support Heterogeneous Renumbering in the C++ API #4412

alexbarghi-nv · 2024-05-10T18:44:39Z

We currently support heterogeneous renumbering in the C++ sampling API. This issue is for extending that to heterogeneous graphs. Fundamentally, the process is the same with a few differences:

Input:

A heterogeneous graph can have multiple node types
- BUT we can always assume they will be renumbered from 0 to N by cuGraph-PyG and cuGraph-DGL
- We can also assume that each node type has a starting and ending offset (i.e. type 0 is nodes 0-30, type 1 is nodes 31-50, etc.) and that the user must provide this in sampling.
A heterogeneous graph can have multiple edge types
- AND each edge type has its own numbering from 0 to #edges_of_type (cuGraph-PyG/cuGraph-DGL don't renumber these)
- So from the libcugraph perspective, there may be multiple edges with id 0, but there can only be one id 0 per edge type
For each edge type, there is a single input node type and output node type
- i.e. (author, writes, paper) has "author" as the input node type and "paper" as the output node type.
- The user will be required to provide a mapping of edge type id to src/dst node type id.

Output:

Every edge type and every node type has its own renumbering for each output minibatch
- i.e. "paper" has node ids from 0 to #papers_in_minibatch - 1 and "author" has node ids from 0 to #authors_in_minibatch - 1
- These node numberings are consistent across edge types. We only renumber the nodes once, globally, or per hop in the DGL case.
- This obviously means that each node type needs a renumber map.
Each edge type corresponds to a separate subgraph in the output minibatch.
- libcugraph needs to output a list of offsets for where each edge type graph starts and ends
- If there is no output for a particular edge type, it should still be noted (i.e. the offsets [0,30,30,70] means that edge type 1 had no output edges in the current minibatch.
- Each edge type needs a renumber map. These renumber maps are created the same way they are for nodes.
- In the case of a separate graph outputted per hop (DGL), every hop would still need a separate renumbering, just as it is now.
We also use the user-provided node offsets array to subtract off the starting offset for each node type in each node's renumber map

Other Notes:

We are probably only going to expose this through pylibcugraph for the time being.
We should make sure the node offsets requirement is well-documented since it requires an extra renumbering step that cuGraph-DGL/cuGraph-PyG does (even though this is trivial and already implemented by both frameworks)
For MG, the node offsets and edge mapping are global. Each worker will get the exact same values for each array. These should be respected globally when doing renumbering. No concatenation should occur. I suppose we could have an expensive check that verifies these arrays are identical across workers.

seunghwak · 2024-05-10T22:22:04Z

We currently support heterogeneous renumbering in the C++ sampling API.

Do we? What does heterogenous sampling mean for non-heterogeneous graph?

A heterogeneous graph can have multiple node types
BUT we can always assume they will be renumbered from 0 to N by cuGraph-PyG and cuGraph-DGL
We can also assume that each node type has a starting and ending offset (i.e. type 0 is nodes 0-30, type 1 is nodes 31-50, etc.) and that the user must provide this in sampling.

So, we assume external vertex IDs are in [0, N) where N is # nodes (i.e. vertices). cuGraph internally renumbers vertex IDs, so we may keep nodes types as node property values to quickly query vertex types. Does this align with your understanding?

Every edge type and every node type has its own renumbering for each output minibatch
i.e. "paper" has node ids from 0 to #papers_in_minibatch - 1 and "author" has node ids from 0 to #authors_in_minibatch - 1
These node numberings are consistent across edge types. We only renumber the nodes once, globally, or per hop in the DGL case.
This obviously means that each node type needs a renumber map.

Are we renumbering just node (vertex) IDs or renumbering both node IDs and edge IDs?
Renumbering for different batches are independent (as before), correct?
We have separate node (vertex) renumber map per node type & batch, correct?
"We only renumber the nodes once, globally, or per hop" means within batch, right?
"per hop in the DGL case", this does not mean we have separate renumber map per hop, right? We currently have compress_per_hop option but does not renumber per hop (but we consider hop # a vertex first appears in renumbering).

Each edge type needs a renumber map. These renumber maps are created the same way they are for nodes.

What do you mean by this? This doesn't mean node renumber map, correct?

In the case of a separate graph outputted per hop (DGL), every hop would still need a separate renumbering, just as it is now.

We currently do not renumber separately per hop, but considers hop number in generating node renumbering map.

seunghwak · 2024-05-10T22:27:03Z

Each edge type corresponds to a separate subgraph in the output minibatch.

And I am not sure how our compress_per_hop option should work with this.

If there are multiple node types, major ranges won't be well separated unless we create separate CSR outputs per major node type.

alexbarghi-nv · 2024-05-23T19:50:24Z

@seunghwak we can start out just supporting COO for now and figure out CSR later. But I believe we would have a separate CSR per edge type.

Every edge type has a source node type and a destination node type. All edges of a type have the same source/destination type. So the CSR for an edge type only ever has one source node type and one destination node type.

seunghwak · 2024-06-05T17:22:14Z

@alexbarghi-nv #4463 needs your review.

alexbarghi-nv added this to the 24.08 milestone May 10, 2024

alexbarghi-nv assigned seunghwak May 10, 2024

alexbarghi-nv mentioned this issue May 10, 2024

Algorithm - Heterogeneous Sampling #4299

Open

alexbarghi-nv added the feature request New feature or request label May 10, 2024

ChuckHastings mentioned this issue Aug 19, 2024

Heterogeneous renumbering implementation #4602

Merged

rapids-bot bot closed this as completed in #4602 Sep 3, 2024

rapids-bot bot closed this as completed in 338e5e0 Sep 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support Heterogeneous Renumbering in the C++ API #4412

Support Heterogeneous Renumbering in the C++ API #4412

alexbarghi-nv commented May 10, 2024

seunghwak commented May 10, 2024

seunghwak commented May 10, 2024

alexbarghi-nv commented May 23, 2024

seunghwak commented Jun 5, 2024

Support Heterogeneous Renumbering in the C++ API #4412

Support Heterogeneous Renumbering in the C++ API #4412

Comments

alexbarghi-nv commented May 10, 2024

seunghwak commented May 10, 2024

seunghwak commented May 10, 2024

alexbarghi-nv commented May 23, 2024

seunghwak commented Jun 5, 2024