Link-level NeighborLoader #4026

Closed
rusty1s opened this issue Feb 8, 2022 · 32 comments · Fixed by #4396

Comments

@rusty1s
Member

rusty1s commented Feb 8, 2022

🚀 The feature, motivation and pitch

Currently, NeighborLoader is designed to be applied in node-level tasks and there exists no option for mini-batching in link-level tasks.

To achieve this, users currently rely on a simple but hacky workaround, first utilized in ogbl-citation2 in this example.

The idea is straightforward: for input_nodes, we pass in both the source and destination nodes of every link we want to do link prediction on (both positive and negative):

loader = NeighborLoader(data, input_nodes=edge_label_index.view(-1), ...)

Nonetheless, PyG should provide a dedicated class to perform mini-batching for link-level tasks, re-using functionality from NeighborLoader under the hood. An API could look like:

class LinkLevelNeighborLoader(
    data,
    input_edges=...,
    input_edge_labels=...,
    with_negative_sampling=True,
    **kwargs,
)

NOTE: This workaround currently only works for homogeneous graphs!
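
To make the seed layout of the workaround concrete, here is a minimal, dependency-free sketch. Plain Python lists stand in for tensors, and the helper names `flatten_edge_seeds` and `split_seed_outputs` are hypothetical. It assumes that the sampled mini-batch keeps the seed nodes first and in input order, so the per-edge source and destination outputs can be read back from the first 2 * B rows.

```python
def flatten_edge_seeds(edge_label_index):
    """Mimic edge_label_index.view(-1) for a [2, B] index given as two lists."""
    src, dst = edge_label_index
    assert len(src) == len(dst)
    # Row-major flatten: all sources first, then all destinations.
    return list(src) + list(dst)


def split_seed_outputs(node_outputs, num_edges):
    """Recover per-edge (source, destination) outputs from the first 2 * B rows.

    Assumes the seeds occupy the first 2 * B positions of the batch.
    """
    src_out = node_outputs[:num_edges]
    dst_out = node_outputs[num_edges:2 * num_edges]
    return list(zip(src_out, dst_out))


edge_label_index = [[0, 1, 2], [3, 3, 0]]  # three candidate links
seeds = flatten_edge_seeds(edge_label_index)

# Pretend per-node outputs: seeds first, then two extra sampled neighbors.
outputs = [f"h{n}" for n in seeds] + ["h7", "h9"]
pairs = split_seed_outputs(outputs, num_edges=3)
print(seeds)
print(pairs)
```

Note that a node appearing in several links shows up several times among the seeds; nothing in this layout merges such repeats.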

@RexYing @JiaxuanYou

@Jeriousman

How is progress on the LinkLevelNeighborLoader?

@rusty1s
Member Author

rusty1s commented Mar 4, 2022

We will post here once we make progress. Sorry for the delay.

@Padarn
Contributor

Padarn commented Mar 8, 2022

Hey @rusty1s, want a hand with this one?

@rusty1s
Member Author

rusty1s commented Mar 8, 2022

Help is always good, thank you! Let me know how we want to proceed with this. @RexYing might have further thoughts.

@Padarn
Contributor

Padarn commented Mar 8, 2022

If I'm picking it up, I would plan to start with the proposed API at the top of this issue and see how it would look for regular and heterogeneous graphs. I need to play around a bit to understand the requirements. With a working example (even if a bit of a hack), we can align on the rest of the implementation details?

What do you think?

@rusty1s
Member Author

rusty1s commented Mar 9, 2022

We could follow along with that in a similar fashion as in your label masked prop PR, and discuss as we go :)

@Padarn
Contributor

Padarn commented Mar 9, 2022 via email

@Padarn
Contributor

Padarn commented Mar 13, 2022

Hey @rusty1s, I'm trying to understand something in the existing NeighborLoader for heterogeneous graphs: It looks to me that it is set up to only work with a single type of node being sampled. Is this intentional? I couldn't find a clear description of this in the docstring.

As for this change, after reading the code, reading the issue and playing around, here is my rough plan (in order of steps):

  • 1. build a LinkNeighborLoader which works for homogeneous graphs and doesn't have negative sampling, but wraps the hack above
  • 2. introduce negative sampling
  • 3. adapt to work for heterogeneous graphs
  • 4. create example of use
  • 5. any cleaning/optimisation or functional enhancements we need for it to be useful as a first version

There are a couple of random questions in my mind which I may not worry about too much right now but feel free to comment on:

  • Do we want to make sure sampling from both ends of a link always happens? Or are there cases where one might only want to follow the direction of a link?
  • Should 'num neighbours' apply to the link as a whole, or to each node attached to a link?

What do you think?

@Padarn
Contributor

Padarn commented Mar 13, 2022

Sorry, one more question: regarding the hack in PositiveLinkNeighborSampler, am I right in thinking that we don't do deduplication before sampling?

@rusty1s
Member Author

rusty1s commented Mar 13, 2022

It looks to me that it is set up to only work with a single type of node being sampled. Is this intentional?

Yes, this is intentional. There rarely exist use cases where we perform node classification across different node types. The reason we currently restrict it is more due to implementation details though, as we somehow need to map the indices produced by the underlying PyTorch DataLoader to the respective node types once again. The underlying C++/CUDA sampling procedure can handle multiple node types though (it expects a dictionary of node indices for a subset of node types).

I think your roadmap is super useful. Thanks a lot for setting this up. Regarding your questions:

Do we want to make sure sampling from both ends of a link always happens? Or are there cases where one might only want to follow the direction of a link?

I think the user still needs to specify the links to compute embeddings for. That's what I originally meant with the input_edges/input_links argument. If it is not set, the sampler will iterate over all edges present in the data. Alternatively, we make it a required argument.

Should 'num neighbours' apply to the link as a whole, or to each node attached to a link?

I think it should apply to each node. In the end, we simply sample num_neighbors from both the source and the destination node of each edge.
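
As a rough illustration of what sampling from both endpoints implies for batch size, the following sketch computes an upper bound on the number of nodes reached per link. The helper names are hypothetical, and the bound assumes no overlap between neighborhoods; real batches are typically smaller because of shared neighbors and low-degree nodes.

```python
def max_nodes_per_seed(num_neighbors):
    """Upper bound on nodes reached from one seed, counting the seed itself."""
    total, frontier = 1, 1
    for fanout in num_neighbors:
        frontier *= fanout  # each frontier node contributes at most `fanout` neighbors
        total += frontier
    return total


def max_nodes_per_edge(num_neighbors):
    """Both endpoints of a link are expanded independently."""
    return 2 * max_nodes_per_seed(num_neighbors)


print(max_nodes_per_seed([2, 2]))  # 1 + 2 + 4
print(max_nodes_per_edge([2, 2]))
```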

Regarding the hack in PositiveLinkNeighborSampler, am I right in thinking that we don't do deduplication before sampling?

Can you explain what you mean?

@Padarn
Contributor

Padarn commented Mar 13, 2022

Can you explain what you mean?

My understanding from https://github.com/snap-stanford/ogb/blob/master/examples/linkproppred/citation2/sampler.py#L17-L41

batch = torch.cat([row[edge_idx], col[edge_idx]], dim=0)

is that we take the nodes from the start and end of each edge in the batch and then do neighbourhood expansion. But there may be duplicate nodes in the result of this cat.

@rusty1s
Member Author

rusty1s commented Mar 13, 2022

I agree. I think the implementation is easier if we do not merge duplicated nodes, and the gains in efficiency may be negligible. This also aligns with the intuition that each example in a batch is isolated from the others.
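
The duplication discussed here can be seen in a small, dependency-free sketch. Lists stand in for the `row` and `col` tensors, and both helper names are hypothetical; the first function is only a pure-Python analogue of the `torch.cat` line quoted above.

```python
from collections import Counter


def edge_batch_seeds(row, col, edge_idx):
    """Pure-Python analogue of torch.cat([row[edge_idx], col[edge_idx]], dim=0)."""
    return [row[i] for i in edge_idx] + [col[i] for i in edge_idx]


def duplicated_seeds(seeds):
    """Return the seed nodes that occur more than once, with their counts."""
    return {node: count for node, count in Counter(seeds).items() if count > 1}


# A small example graph in COO form.
row = [0, 1, 1, 2, 0, 1, 2]
col = [1, 0, 2, 1, 3, 3, 3]

# Three edges that all point to node 3.
seeds = edge_batch_seeds(row, col, edge_idx=[4, 5, 6])
print(seeds)
print(duplicated_seeds(seeds))
```

Leaving the repeats in place keeps each link's endpoints at fixed positions in the seed list, at the cost of expanding node 3 once per occurrence.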

@Padarn
Contributor

Padarn commented Mar 13, 2022

Okay agreed. Thanks for the thoughts!

@rusty1s
Member Author

rusty1s commented Apr 8, 2022

A first prototype was integrated via #4396, see loader.LinkNeighborLoader (thanks to @Padarn). Any feedback is highly appreciated. It supports homogeneous and heterogeneous link prediction tasks. A current limitation is that it does not support internal negative sampling yet. We are working on it.

@shishixuezi

Hello, thanks for adding this feature!

I have a small question. If I use the result of RandomLinkSplit to generate batches with LinkNeighborLoader, it produces an IndexError.

I think it may be due to the edge_label_index attribute generated by RandomLinkSplit. The shape of edge_label_index is [2, num_edges]. But in LinkNeighborLoader, if the key of the split attribute is an edge attribute, it only selects indices along dimension zero, which causes an IndexError.

Do you have some suggestions for this case? Thank you very much!

@Padarn
Contributor

Padarn commented May 11, 2022

Hi @shishixuezi, thanks for the report. Could you provide a short example of what exactly you're doing so I can take a look?

@shishixuezi

shishixuezi commented May 11, 2022

Hello @Padarn, thank you for your reply. I created a toy case; please check. Thank you very much!

import torch
from torch_geometric.data import Data
from torch_geometric.loader import LinkNeighborLoader
import torch_geometric.transforms as T


def main():
    edge_index = torch.tensor([[0, 1, 1, 2, 0, 1, 2],
                               [1, 0, 2, 1, 3, 3, 3]], dtype=torch.long)
    x = torch.tensor([[-1], [0], [1], [4]], dtype=torch.float)
    edge_attr = torch.tensor([[1.0], [2.0], [1.0], [1.0], [1.0], [1.0], [1.0]], dtype=torch.float)
    data = Data(x=x, edge_index=edge_index, edge_attr=edge_attr)

    transform = T.Compose([
        T.NormalizeFeatures(),
        T.ToDevice('cuda' if torch.cuda.is_available() else 'cpu'),
        T.RandomLinkSplit(num_val=0.1, num_test=0.05, is_undirected=False,
                          add_negative_train_samples=False, neg_sampling_ratio=0.0,
                          key='edge_attr')])

    train_data, val_data, test_data = transform(data)

    # No Problem
    # loader = LinkNeighborLoader(data, num_neighbors=[2]*2)

    # Cause Error
    loader = LinkNeighborLoader(train_data, num_neighbors=[2]*2)

    print(next(iter(loader)))


if __name__ == '__main__':
    main()

@Padarn
Contributor

Padarn commented May 11, 2022 via email

@Padarn
Contributor

Padarn commented May 12, 2022

So I see the problem, but I can't yet think of a clean fix. A workaround you could use for now:

train_data.edge_attr_index = train_data.edge_attr_index.t()
loader = LinkNeighborLoader(train_data, num_neighbors=[2]*2)

I'll raise an MR with a potential fix.
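
Why a transpose helps can be illustrated without PyG. This is only a sketch of the shape mismatch described in the report, not the loader's actual code: nested lists stand in for a [2, num_edges] tensor, and `select_dim0` and `transpose` are hypothetical helpers.

```python
def select_dim0(index, positions):
    """Pick entries along the first dimension, like tensor[positions]."""
    return [index[p] for p in positions]


def transpose(index):
    """Turn a [2, num_edges] layout into [num_edges, 2]."""
    return [list(pair) for pair in zip(*index)]


edge_label_index = [[0, 1, 2, 0], [1, 2, 3, 3]]  # shape [2, 4]
positions = [0, 1, 2, 3]                         # one position per edge

try:
    select_dim0(edge_label_index, positions)     # dimension 0 only has size 2
    failed = False
except IndexError:
    failed = True

# After transposing, dimension 0 has one entry per edge.
per_edge = select_dim0(transpose(edge_label_index), positions)
print(failed)
print(per_edge)
```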

@rusty1s
Member Author

rusty1s commented May 12, 2022

Fixed via #4629.

@shishixuezi

Wow, so cool! Thank you very much! @Padarn @rusty1s

@kamibrumi

kamibrumi commented Jul 18, 2022

Hi, I'm trying to import the LinkNeighborLoader in JupyterLab but I'm getting this error:
---> 35 from torch_geometric.loader import LinkNeighborLoader

ImportError: cannot import name 'LinkNeighborLoader' from 'torch_geometric.loader' (/Users/cbrumar/.local/share/virtualenvs/gnn-TG0lFQrB/lib/python3.9/site-packages/torch_geometric/loader/__init__.py)

I'm using PyTorch Geometric version 2.0.4 and I installed it using pip.

Edit: I am using PyTorch version 1.11.0.

@rusty1s
Member Author

rusty1s commented Jul 18, 2022

You need to install PyG from master or from a nightly build.

@kamibrumi

Thank you, @rusty1s!

@YoavLotem

YoavLotem commented Jul 25, 2022

Hey!
First of all, thanks for this amazing tool. I appreciate your work.

I'm currently working on an edge classification problem in an environment in which I can't use the LinkNeighborLoader due to unrelated constraints.

I didn't fully understand the previous workaround:
Should I use loader = NeighborLoader(data, input_nodes=edge_label_index.view(-1), ...)?
Or should I use the PositiveLinkNeighborSampler from this example?

Thanks in advance.

@rusty1s
Member Author

rusty1s commented Jul 25, 2022

If you cannot use the new LinkNeighborLoader interface, it is recommended to follow the OGB example.

@YoavLotem

YoavLotem commented Jul 27, 2022

Thanks for your response.

How can I use the 'PositiveLinkNeighborSampler' in the OGB example to sample a subgraph for specific edges, similar to 'edge_label_index' in 'LinkNeighborLoader'?

@rusty1s
Member Author

rusty1s commented Jul 27, 2022

You can save edge_label_index in the constructor, initialize edge_idx as torch.arange(edge_label_index.size(1)), and then access subsets of it inside sample.
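
The suggestion above could be sketched as follows. This is a dependency-free stand-in, not the OGB sampler itself: `EdgeSeedSampler` is a hypothetical class, lists replace tensors, and `sample` stops at producing the seed nodes instead of running neighbor expansion.

```python
class EdgeSeedSampler:
    """Hypothetical stand-in for the sampler class in the OGB example."""

    def __init__(self, edge_label_index, batch_size):
        # Save the links of interest in the constructor.
        self.src, self.dst = edge_label_index
        # Analogue of torch.arange(edge_label_index.size(1)).
        self.edge_idx = list(range(len(self.src)))
        self.batch_size = batch_size

    def __iter__(self):
        for start in range(0, len(self.edge_idx), self.batch_size):
            yield self.sample(self.edge_idx[start:start + self.batch_size])

    def sample(self, edge_idx):
        # Seeds for neighbor expansion: sources first, then destinations.
        return [self.src[i] for i in edge_idx] + [self.dst[i] for i in edge_idx]


sampler = EdgeSeedSampler([[0, 1, 2], [3, 3, 0]], batch_size=2)
batches = list(sampler)
print(batches)
```

In a real sampler, the seed list returned by `sample` would be handed to the neighbor sampling routine rather than returned directly.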

@francyya

@rusty1s When I apply the LinkNeighborLoader to training data with negative samples, the target labels turn out to be 0, 1, and 2. I'm trying to understand how target label 2 shows up, given that there are only two class labels in a link prediction task. Thanks.

@rusty1s
Member Author

rusty1s commented Dec 17, 2022

When using LinkNeighborLoader with negative samples, we will automatically add zero labels for these negative edges, and adjust the initial edge_label by incrementing it by one.
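
The label adjustment described above can be mimicked in a few lines. This is a sketch with a hypothetical helper, and it assumes the sampled negatives are appended after the original edges; the actual ordering inside the loader may differ.

```python
def labels_with_negatives(edge_label, num_negatives):
    """Shift the given labels up by one and assign 0 to sampled negatives."""
    shifted = [label + 1 for label in edge_label]
    # Assumption: sampled negative edges come after the original ones.
    return shifted + [0] * num_negatives


# Input labels that already contain 0 (negative) and 1 (positive) entries.
print(labels_with_negatives([0, 1, 1], num_negatives=2))
```

With input labels that already contain 0 and 1, the shifted labels become 1 and 2, which together with the new zeros would explain seeing three distinct values.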

@brovatten
Contributor

Hi,

When using disjoint=True with torch-sparse, I get a recommendation to use pyg-lib. But when I activate pyg-lib through torch_geometric.typing by setting

torch_geometric.typing.WITH_PYG_LIB = True
torch_geometric.typing.WITH_TORCH_SPARSE = False

it throws me this error:
AttributeError: '_OpNamespace' 'pyg' object has no attribute 'neighbor_sample'

reference code:

from torch_geometric.loader import LinkNeighborLoader
LinkNeighborLoader(data=data, num_neighbors=[10], batch_size=1, disjoint=True, shuffle=False)

I'm using torch 2.0.1. What could be the issue?

@rusty1s
Member Author

rusty1s commented Aug 14, 2023

What does

import torch
import pyg_lib
print(pyg_lib.__version__)
print(torch.ops.pyg.neighbor_sample)

return?
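
Before running the snippet above, it may help to confirm that the relevant packages are importable at all. The sketch below uses only the standard library; `is_installed` is a hypothetical helper. It rests on the assumption, not confirmed in this thread, that one possible cause of the error is that pyg_lib is not actually installed and that setting the flags by hand does not load its operators.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


status = {name: is_installed(name) for name in ("torch", "pyg_lib", "torch_sparse")}
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```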
