-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(data): simplify remote backend num_nodes
computation
#5307
Conversation
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
Codecov Report
@@ Coverage Diff @@
## master #5307 +/- ##
==========================================
+ Coverage 83.33% 83.36% +0.03%
==========================================
Files 337 338 +1
Lines 18641 18633 -8
==========================================
- Hits 15535 15534 -1
+ Misses 3106 3099 -7
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
num_nodes
computation
from torch_geometric.typing import EdgeType, NodeType | ||
|
||
|
||
def num_nodes( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually like the previous implementation for each feature store and graph store better. That is more intuitive to me compare with get all stores and query the nodes with a query (which sounds more heavy weight).
Could we just expose num_nodes and sizes for features and edges ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, we can expose num nodes for features and edges, but there are instances where an edge index is not part of a graph (e.g. link prediction) or we just want to get the number of nodes from a group name, not a full edge index type. I think it improves simplicity if we have num_nodes
as a common interface for this reason. I also don't think it's too heavyweight to query all the edge attributes and node attributes, since these are small data structures and are somewhat bounded.
However, if you think this overhead would be significant, we can expose methods in the feature store and graph store to get a TensorAttr/EdgeAttr from a partial specification, and directly query these (instead of listing all of them and iterating). This can be done as an improvement in a separate PR. Wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After a second look, I think the diverge comes from whether we want to introduce this new remote_backend
or consolidate logic in GraphStore
, i.e we can consider GraphStore
to be a backend that can be remote
. And then have an api like
class GraphStore:
def num_nodes(Union[NodeType, EdgeType]):
We can also hide FeatureStore
behind GraphStore
or FeatureStore
metadata as part of information in GraphStore
.
If we want to introduce remote_backend
, you may want to think more about whether it is a Backend
object or it is a helper provide util functions talk to FeatureStore
and GraphStore
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few clarifications
- Yes, a GraphStore can be remote.
- We do not want to hide FeatureStore behind GraphStore; ideally, a feature store owns features, and a graph store owns the graph. The complications come because there can be edges in the graph store with no corresponding features in the feature store, etc. as mentioned above.
remote_backend
is not aBackend
object; it's really a helper providing utility functions to talk toFeatureStore
andGraphStore
(as in the comment at the top ofremote_backend.py
). However, I agree that the name is confusing. I can change it toremote_backend_util.py
.
|
||
# 1. Check GraphStore: | ||
edge_attrs = graph_store.get_all_edge_attrs() | ||
for edge_attr in edge_attrs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any opinion on avoiding the for-loop here? Isn't this an implementation detail how the GraphStore
/FeatureStore
save the edge attributes? If it is done as part of some hash map, we should be able to leverage this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think graph store and feature store should expose (optional) methods to obtain a tensorattr/edgeattr from the first member of the corresponding dataclass. But wanted to leave that for another PR.
This PR introduces
remote_backend_utils
(currently just a set of functions, open to making it a static class as well) that helps define common utilities to be used across remote backends. The first (and perhaps most useful) function included isnum_nodes
, which allows one to infer the number of nodes in a (node type, edge type) by leveraging attributes in a feature store and graph store. This significantly simplifies internal code and also simplifies some external interfaces as well.