Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Pagerank Online doesn't update correctly when the graph is changed #510

Open
pippellia-btc opened this issue Sep 12, 2024 · 4 comments
Assignees
Labels
bug bug community community Frequency - Monthly Frequency - Monthly Reach - VeryFew Reach - VeryFew Severity - S1 Severity - S1

Comments

@pippellia-btc
Copy link

Describe the bug
There are various problems with the implementation of pagerank_online (https://memgraph.com/docs/advanced-algorithms/available-algorithms/pagerank_online) :

  1. (small) the parameter epsilon isn't coherent with the non-online version of pagerank (which uses d, the dampening).
    d is supposed to be 1 - epsilon, but the default values don't match. My suggestion is to use either d or epsilon in both algos.

  2. (medium) The CalculatePagerank function (see here ) takes the vector of all the visits, then divides it by a constant and then normalizes it.
    This doesn't make any sense, since normalizing right away would be faster and give the same result.

  3. (urgent) The algorithm simply doesn't return the correct values when the graph gets updates, as shown in the next section

To Reproduce
Steps to reproduce the behavior, starting from an empty database:

  1. Create the graph
    CREATE (a:Person {name: 'A'}), (b:Person {name: 'B'}), (c:Person {name: 'C'}), (a)-[:FOLLOWS]->(b), (b)-[:FOLLOWS]->(c), (c)-[:FOLLOWS]->(a);

  2. Set pagerank_online
    CALL pagerank_online.set(1000, 0.15) YIELD node, rank;

  3. create a trigger so pagerank_online is updated when the graph changes
    CREATE TRIGGER pagerank_trigger BEFORE COMMIT EXECUTE CALL pagerank_online.update(createdVertices, createdEdges, deletedVertices, deletedEdges) YIELD node, rank

  4. Update the graph
    MATCH (c:Person {name: 'C'}) CREATE (d:Person {name: 'D'}) CREATE (c)-[:FOLLOWS]->(d);

  5. Get new pagerank values:
    CALL pagerank_online.get() YIELD node, rank RETURN node, rank;

  6. Compare the values with other implementations of the pagerank algorithm:

import networkx as nx

# Create an empty directed graph
G = nx.DiGraph()

# Add nodes to the graph
nodes = ['A', 'B', 'C', 'D']
G.add_nodes_from(nodes)

# Add edges to the graph
edges = [('A','B'), ('B','C'), ('C','A'), ('C','D')]
G.add_edges_from(edges)

alpha = 0.85
print(nx.pagerank(G, alpha))

or use this simple website tool: http://computerscience.chemeketa.edu/cs160Reader/_static/pageRankApp/index.html

Expected behavior
It should give the correct results

Video
Here is a link to a video where I comment these issues and provide additional context.
It's on your Discord server: https://discord.com/channels/842007348272169002/852201290880385044/1280112013921619999

@katarinasupe
Copy link
Contributor

Hi @pippellia-btc, thank you for opening the issue 🙏

For concerns 1 and 2, I'll let the engineers dive deeper once the issue becomes a priority.

Regarding 3, the fact that the values are not the same might be related to the fact that the online algorithm is only an approximation of PageRank to make it as fast as possible. Still, it carries the same information—the likelihood of a random walk ending in a particular vertex. Does that make sense?

Does this block your further work with Memgraph?

@pippellia-btc
Copy link
Author

Hey katarina, thanks for the fast reply.

Regarding 3, the fact that the values are not the same might be related to the fact that the online algorithm is only an approximation of PageRank to make it as fast as possible. Still, it carries the same information—the likelihood of a random walk ending in a particular vertex. Does that make sense?

No, that's not the reason. As I show in the video, when one re-set it with
CALL pagerank_online.set(1000, 0.15) YIELD node, rank;
the error is small (<1%).

When I change the graph the error is order of magnitudes higher, like ~50%. In the paper it is shown that the error should always be quite small with that many random walks (1000 in the example), so it must be a bug in the implementation.

Does this block your further work with Memgraph?

Yes, because I'll have to rewrite most of the code myself for the use-case I have in mind.

@katarinasupe katarinasupe added Severity - S1 Severity - S1 and removed Severity - S3 Severity - S3 labels Sep 13, 2024
@katarinasupe
Copy link
Contributor

katarinasupe commented Sep 13, 2024

Thank you for the further explanation and a really great video, @pippellia-btc.
Just one small question - what happens if you set epsilon to default 1.0 or 0.85 instead of 0.15? I tried that and it seems that the values are then close to what you expected?

EDIT: Oh sorry, that might be related to me resetting, like you mentioned at the end of the video.

@katarinasupe
Copy link
Contributor

I agree with you that it would be better if it's consistent, having just d in both pagerank and pagerank online. Not sure why there was a choice to implement it differently, but probably a different dev worked on each. Still, not sure if it's good to change API and a priority is definitely to have it working correctly. I read on Discord that you tried changing half epsilon and that didn't work. In case you have some other idea on how to fix it, feel free to open a PR since MAGE is an open source project. I would be really happy to see outside contributions 😄

@katarinasupe katarinasupe added the community community label Sep 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug bug community community Frequency - Monthly Frequency - Monthly Reach - VeryFew Reach - VeryFew Severity - S1 Severity - S1
Projects
Development

No branches or pull requests

3 participants