[Dataset] IGB-HOM dataset wrong number of edges #55

Closed
BowenYao18 opened this issue Sep 6, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@BowenYao18

BowenYao18 commented Sep 6, 2024

Describe the bug
The paper reports 3995777033 edges, but the edge file I downloaded contains only 3727095830.

To Reproduce

Below is the download command:
wget https://igb-public-awsopen.s3.amazonaws.com/IGBH/processed/paper__cites__paper/edge_index.npy
Then run "arr = np.load("/path/to/dataset", mmap_mode='r+')" to load the downloaded file and check "arr.shape".
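The check described above can be sketched as follows. This is a minimal sketch, not code from the IGB repository: the helper name `edge_count` is hypothetical, and the tiny array written in the demo is a stand-in so the snippet runs without the multi-gigabyte download. It uses `mmap_mode="r"` (read-only) rather than `'r+'`, on the assumption that the file only needs to be inspected.

```python
import os
import tempfile

import numpy as np


def edge_count(path):
    """Return the number of rows in an edge_index.npy file without loading it fully."""
    # mmap_mode="r" maps the file read-only, so the data is not copied into RAM.
    edges = np.load(path, mmap_mode="r")
    return edges.shape[0]


if __name__ == "__main__":
    # Stand-in file for illustration; replace `path` with the real download location.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "edge_index.npy")
        np.save(path, np.array([[0, 1], [1, 2], [2, 2]], dtype=np.int64))
        print(edge_count(path))
```

On the real file, the returned value would be compared against the 3995777033 edges reported in the paper.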

Expected behavior
The loaded array has shape (3727095830, 2), but the paper reports 3995777033 edges, so the first dimension should be 3995777033.
This is the link to the paper: https://arxiv.org/pdf/2302.13522

Screenshots
This is the IGB-HOM info table: [image]


BowenYao18 added the bug label Sep 6, 2024
@akhatua2
Contributor

akhatua2 commented Sep 7, 2024

I will update the edge file on the s3 bucket with the local copy soon. It should have the right number of edges (3995777033). Thanks for bringing this to our attention.

@BowenYao18
Author

> I will update the edge file on the s3 bucket with the local copy soon. It should have the right number of edges (3995777033). Thanks for bringing this to our attention.

Thank you. After the dataset is updated, will I be able to download it through the original link?
wget https://igb-public-awsopen.s3.amazonaws.com/IGBH/processed/paper__cites__paper/edge_index.npy

@akhatua2
Contributor

Hi, if it's urgent, please use this file as a temporary solution. It contains the last 268,681,203 edges. I will upload the edge_index.npy file to the S3 bucket as soon as I can.
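For reference, 3727095830 + 268681203 = 3995777033, so appending the temporary file to the downloaded one gives the count reported in the paper. A minimal sketch of that append is below. It assumes the temporary file is a `.npy` array with the same dtype and column layout as `edge_index.npy`, which the thread does not state; the function name `append_edges` and all paths are hypothetical. The copy is done in chunks through memory maps so that neither input has to fit in RAM.

```python
import os
import tempfile

import numpy as np


def append_edges(base_path, tail_path, out_path, chunk_rows=1_000_000):
    """Write the rows of base_path followed by the rows of tail_path to out_path."""
    base = np.load(base_path, mmap_mode="r")
    tail = np.load(tail_path, mmap_mode="r")
    if base.shape[1:] != tail.shape[1:] or base.dtype != tail.dtype:
        raise ValueError("edge files have incompatible shapes or dtypes")

    total = base.shape[0] + tail.shape[0]
    # open_memmap creates a valid .npy file on disk that can be filled in place.
    out = np.lib.format.open_memmap(
        out_path, mode="w+", dtype=base.dtype, shape=(total,) + base.shape[1:]
    )
    offset = 0
    for part in (base, tail):
        for start in range(0, part.shape[0], chunk_rows):
            block = part[start:start + chunk_rows]
            out[offset:offset + block.shape[0]] = block
            offset += block.shape[0]
    out.flush()
    return total
```

The return value is the row count of the combined file, which can be compared against the expected total before the output is used.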

@BowenYao18
Author

Are the edges you are adding simply the result of removing self edges and then adding a self edge for each node?
edges = add_self_edges(remove_self_edge(edges))

@akhatua2
Contributor

No, these should be edges between different nodes. You can run edges = add_self_edges(remove_self_edge(edges)) as a preprocessing step for your use case.
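The thread does not define `add_self_edges` or `remove_self_edge`, so the following is only a sketch of what such a preprocessing step might look like on a plain NumPy edge array of shape (num_edges, 2). The `num_nodes` parameter is an assumption added here so that every node id receives exactly one self edge; the one-argument call in the thread presumably gets the node count from elsewhere.

```python
import numpy as np


def remove_self_edge(edges):
    """Drop rows whose source and destination are the same node."""
    return edges[edges[:, 0] != edges[:, 1]]


def add_self_edges(edges, num_nodes):
    """Append one (i, i) edge for every node id in [0, num_nodes)."""
    ids = np.arange(num_nodes, dtype=edges.dtype)
    loops = np.stack([ids, ids], axis=1)
    return np.concatenate([edges, loops], axis=0)


if __name__ == "__main__":
    edges = np.array([[0, 1], [1, 1], [2, 0]], dtype=np.int64)
    edges = add_self_edges(remove_self_edge(edges), num_nodes=3)
    print(edges.shape)  # (5, 2): two non-self edges plus three self edges
```

Both helpers build full in-memory copies, so at the scale of the full dataset a chunked or graph-library implementation would likely be needed instead.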

@BowenYao18
Author

BowenYao18 commented Sep 14, 2024

Thank you. Also, I assume the het dataset has different paper__cites__paper edges from hom? Will you also update the igb-het paper__cites__paper edges?

@akhatua2
Contributor

Both datasets have the same paper nodes and paper edges. The het dataset just has more node types and edge types.

You can reuse the same edges for both datasets.

@BowenYao18
Author

BowenYao18 commented Sep 15, 2024

[image]
However, the repo states that there are 3995777033 edges for igb-hom and 3996442004 edges for igb-het. Should they be different sets of edges?

@BowenYao18
Author

Also, I don't know if this is a coincidence, but if you run edges = add_self_edges(remove_self_edge(edges)) on the paper__cites__paper edges of igb-hom, the result has 3996442004 edges (the number listed for igb-het in the table above).
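One way to check a count like this without materializing the normalized edge list: if the file has E rows, S of which are self edges, and the graph has N nodes, then removing self edges and adding one per node leaves E - S + N rows. The sketch below computes that in chunks; the symbols E, S, N and the function name are introduced here for illustration and do not come from the thread or the repository.

```python
import numpy as np


def count_after_self_edge_normalization(edges, num_nodes, chunk_rows=1_000_000):
    """Return E - S + N for an (E, 2) edge array, scanning it in chunks."""
    self_edges = 0
    for start in range(0, edges.shape[0], chunk_rows):
        # np.asarray pulls only the current chunk into memory when `edges` is a memmap.
        block = np.asarray(edges[start:start + chunk_rows])
        self_edges += int(np.count_nonzero(block[:, 0] == block[:, 1]))
    return edges.shape[0] - self_edges + num_nodes


if __name__ == "__main__":
    demo = np.array([[0, 1], [1, 1], [2, 0]], dtype=np.int64)
    print(count_after_self_edge_normalization(demo, num_nodes=3))  # 3 - 1 + 3 = 5
```

The `edges` argument can be the memory-mapped array returned by `np.load(..., mmap_mode="r")`, so the scan does not need the whole file in RAM.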

@akhatua2
Contributor

I believe 3996442004 is the total number of edges we finally published (including the self edges). There are some inconsistencies between parts of the repo and the paper because of differences between internal versions of the full dataset.

  • The het and hom datasets should have the same number of paper edges. We used the edges + self loops count for our initial benchmark runs.

Thanks for pointing it out so I could take a second look. You shouldn't need to use the extra edges, as they shouldn't be part of the final dataset. Please use edges = add_self_edges(remove_self_edge(edges)); this will give the expected edges for the full dataset (homogeneous and heterogeneous).
