refactor(data): simplify remote backend num_nodes computation #5307

Merged
merged 16 commits into from
Aug 30, 2022

Contributor

@mananshah99 mananshah99 commented Aug 29, 2022

This PR introduces remote_backend_utils (currently just a set of functions; I'm open to making it a static class instead), which defines common utilities to be used across remote backends. The first (and perhaps most useful) function is num_nodes, which infers the number of nodes for a given node type or edge type by leveraging the attributes in a feature store and graph store. This significantly simplifies internal code and some external interfaces.
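
A minimal sketch of the inference idea described above, not the code in this PR. It assumes hypothetical stand-ins (`ToyEdgeAttr`, `ToyTensorAttr`, `infer_num_nodes`) instead of the real PyG classes, assumes edge metadata may carry an optional `(num_src, num_dst)` size, and assumes the feature store is consulted only as a fallback after the graph store:

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

NodeType = str
EdgeType = Tuple[str, str, str]  # (source type, relation, destination type)


@dataclass
class ToyEdgeAttr:
    """Hypothetical stand-in for edge metadata listed by a graph store."""
    edge_type: EdgeType
    size: Optional[Tuple[int, int]] = None  # (num source, num destination)


@dataclass
class ToyTensorAttr:
    """Hypothetical stand-in for feature metadata listed by a feature store."""
    group_name: NodeType
    attr_name: str
    num_rows: int  # first dimension of the stored feature tensor


def infer_num_nodes(
    edge_attrs: Sequence[ToyEdgeAttr],
    tensor_attrs: Sequence[ToyTensorAttr],
    node_type: NodeType,
) -> Optional[int]:
    # 1. Check the graph store metadata: any edge type that touches
    #    `node_type` and records a size tells us the node count.
    for edge_attr in edge_attrs:
        if edge_attr.size is None:
            continue
        src, _, dst = edge_attr.edge_type
        if src == node_type:
            return edge_attr.size[0]
        if dst == node_type:
            return edge_attr.size[1]

    # 2. Fall back to the feature store (assumed order): the number of rows
    #    of any feature tensor stored under this group name.
    for tensor_attr in tensor_attrs:
        if tensor_attr.group_name == node_type:
            return tensor_attr.num_rows

    return None  # nothing known about this node type


edge_attrs = [
    ToyEdgeAttr(("author", "writes", "paper"), size=(3, 5)),
    ToyEdgeAttr(("paper", "cites", "paper")),  # no size recorded
]
tensor_attrs = [ToyTensorAttr("venue", "x", num_rows=2)]

print(infer_num_nodes(edge_attrs, tensor_attrs, "paper"))
print(infer_num_nodes(edge_attrs, tensor_attrs, "venue"))
```

The fallback matters for node types such as `"venue"` here, which have features but no edges in the graph store.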

codecov bot commented Aug 29, 2022

Codecov Report

Merging #5307 (5f8171a) into master (96fbf43) will increase coverage by 0.03%.
The diff coverage is 88.00%.

@@            Coverage Diff             @@
##           master    #5307      +/-   ##
==========================================
+ Coverage   83.33%   83.36%   +0.03%     
==========================================
  Files         337      338       +1     
  Lines       18641    18633       -8     
==========================================
- Hits        15535    15534       -1     
+ Misses       3106     3099       -7     

| Impacted Files | Coverage Δ |
|---|---|
| torch_geometric/data/data.py | 91.98% <ø> (+0.62%) ⬆️ |
| torch_geometric/data/hetero_data.py | 95.96% <ø> (+1.42%) ⬆️ |
| torch_geometric/data/lightning_datamodule.py | 48.82% <ø> (ø) |
| torch_geometric/testing/graph_store.py | 100.00% <ø> (ø) |
| torch_geometric/loader/neighbor_loader.py | 94.77% <75.00%> (-0.17%) ⬇️ |
| torch_geometric/data/remote_backend_utils.py | 84.84% <84.84%> (ø) |
| torch_geometric/data/feature_store.py | 88.81% <100.00%> (+0.65%) ⬆️ |
| torch_geometric/data/graph_store.py | 92.85% <100.00%> (+0.72%) ⬆️ |
| torch_geometric/loader/link_neighbor_loader.py | 94.47% <100.00%> (-0.24%) ⬇️ |

... and 2 more


@mananshah99 mananshah99 changed the title from "draft: remote backend utilities" to "refactor(data): simplify remote backend num_nodes computation" Aug 29, 2022
@mananshah99 mananshah99 marked this pull request as ready for review August 29, 2022 19:28
from torch_geometric.typing import EdgeType, NodeType


def num_nodes(
Contributor

I actually like the previous implementation for each feature store and graph store better. It is more intuitive to me than getting all the stores and querying the nodes with a query (which sounds more heavyweight).
Could we just expose num_nodes and sizes for features and edges?

Contributor Author

@mananshah99 mananshah99 Aug 29, 2022

Sure, we can expose num_nodes for features and edges, but there are cases where an edge index is not part of a graph (e.g. link prediction), or where we just want to get the number of nodes from a group name rather than a full edge index type. For this reason, I think having num_nodes as a common interface keeps things simpler. I also don't think it's too heavyweight to query all the edge attributes and node attributes, since these are small data structures and are somewhat bounded.

However, if you think this overhead would be significant, we can expose methods in the feature store and graph store to get a TensorAttr/EdgeAttr from a partial specification, and directly query these (instead of listing all of them and iterating). This can be done as an improvement in a separate PR. Wdyt?

Contributor

After a second look, I think the divergence comes from whether we want to introduce this new remote_backend or consolidate the logic in GraphStore, i.e., we can consider GraphStore to be a backend that can be remote, and then have an API like

class GraphStore:
  def num_nodes(self, query: Union[NodeType, EdgeType]):
    ...

We could also hide FeatureStore behind GraphStore, or keep FeatureStore metadata as part of the information in GraphStore.

If we want to introduce remote_backend, you may want to think more about whether it is a Backend object or a helper that provides utility functions to talk to FeatureStore and GraphStore.

Contributor Author

A few clarifications:

  • Yes, a GraphStore can be remote.
  • We do not want to hide FeatureStore behind GraphStore; ideally, a feature store owns features, and a graph store owns the graph. The complications come because there can be edges in the graph store with no corresponding features in the feature store, etc. as mentioned above.
  • remote_backend is not a Backend object; it's really a helper providing utility functions to talk to FeatureStore and GraphStore (as in the comment at the top of remote_backend.py). However, I agree that the name is confusing. I can change it to remote_backend_util.py.

# 1. Check GraphStore:
edge_attrs = graph_store.get_all_edge_attrs()
for edge_attr in edge_attrs:
Member

Any opinion on avoiding the for-loop here? Isn't how the GraphStore/FeatureStore saves the edge attributes an implementation detail? If they are stored in some hash map, we should be able to leverage that.

Contributor Author

Yeah, I think the graph store and feature store should expose (optional) methods to obtain a TensorAttr/EdgeAttr from the first member of the corresponding dataclass. But I wanted to leave that for another PR.
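
One way the hash-map lookup discussed in this thread could look, as a rough sketch only. `ToyTensorAttr`, `IndexedAttrRegistry`, and `get_attrs_for_group` are hypothetical names that do not exist in PyG; the sketch assumes a store keeps its attributes in a dictionary keyed by the first dataclass member (the group name), so a partial specification can be resolved without iterating over every attribute:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ToyTensorAttr:
    """Hypothetical stand-in for feature metadata (not the PyG class)."""
    group_name: str  # first member: used as the lookup key
    attr_name: str


class IndexedAttrRegistry:
    """Keeps attributes grouped by their first member for direct lookup."""

    def __init__(self) -> None:
        self._by_group: Dict[str, List[ToyTensorAttr]] = defaultdict(list)

    def register(self, attr: ToyTensorAttr) -> None:
        self._by_group[attr.group_name].append(attr)

    def get_attrs_for_group(self, group_name: str) -> List[ToyTensorAttr]:
        # `.get` avoids inserting an empty list for unknown groups, and the
        # copy keeps callers from mutating the internal index.
        return list(self._by_group.get(group_name, []))

    def get_all_attrs(self) -> List[ToyTensorAttr]:
        # The full listing stays available for callers that need it.
        return [attr for attrs in self._by_group.values() for attr in attrs]


registry = IndexedAttrRegistry()
registry.register(ToyTensorAttr("paper", "x"))
registry.register(ToyTensorAttr("paper", "y"))
registry.register(ToyTensorAttr("author", "x"))

print(registry.get_attrs_for_group("paper"))
```

Whether a real store can offer this depends on how it persists its attributes, which is why such a method would likely need to be optional.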

@mananshah99 mananshah99 merged commit be471ee into master Aug 30, 2022
4 participants