Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apoc.hashing.fingerprintGraph returns different fingersprints for identical graphs #478

Open
Cyber-Ware opened this issue Aug 17, 2023 · 3 comments

Comments

@Cyber-Ware
Copy link

Expected Behavior

When creating the fingerprints of two graphs, I expect them to be the same if graphs are the same.

Actual Behavior

Nodes with the same label(s) and same (or no) properties have identical hashes.

I observed that whenever there are nodes with identical hashes present in the graph, different fingerprints of the graph can be created even though the graphs are exactly the same.

How to Reproduce the Problem

Cypher 1: identical nodes inserted in the order they are used when creating relationships

CREATE
  (t1:Thing),
  (t2:Thing),
  (t3:Thing),
  (t4:Thing),
  (t1)-[:IS_RELATED_TO]->(t2),
  (t2)-[:IS_RELATED_TO]->(t3),
  (t3)-[:IS_RELATED_TO]->(t4)

Cypher 2: identical nodes inserted in a random order

CREATE
  (t4:Thing),
  (t1:Thing),
  (t3:Thing),
  (t2:Thing),
  (t1)-[:IS_RELATED_TO]->(t2),
  (t2)-[:IS_RELATED_TO]->(t3),
  (t3)-[:IS_RELATED_TO]->(t4)

Steps

  1. Empty database
  2. Execute cypher 1
  3. Execute RETURN apoc.hashing.fingerprintGraph() returns fingerprint: E235B8C8CBFDA8FA5DD97F019CF2CFF0
  4. Empty database
  5. Execute cypher 2
  6. Execute RETURN apoc.hashing.fingerprintGraph() returns fingerprint: DBB9CD233F4EF6F200D1D0A550CEC62F (even though the graph is exactly the same)

Origin of the bug

I looked into the source code of the fingerprintGraph() function and found that:

  1. A TreeMap is used for the idToNodeHash variable which will automatically sort its items in natural order (aka sort by node ID). When building the inverse nodeHashToId map, the idToNodeHash indexes are used which results in sorted Lists as values of the inverse map.
  2. The order of processing nodes with identical hashes affects the fingerprint. Because of (1) the list is sorted by node ID, but the fingerprint should be always the same no matter which order the nodes with identical hashes were processed in.

Solution for the bug

TL;DR: it's complicated so here are a few thoughts of my own

The most obvious solution in my opinion is to loop over the identical nodes in the list, hash every node with their relationships and end nodes, then sort the hashes and only then add them to the actual fingerprint. This trick would solve the issue partially because there is still another edge case: only the directly related nodes are accounted for, not the position of a node in the bigger/entire graph structure. This means that in a graph with a chain of identical nodes (e.g. a chemical molecule), another node could theoretically be almost anywhere in the chain and the algorithm will return the same fingerprint.

The only actual solution I can see is to use a non-recursive depth-first algorithm (example code here in chapter 12.3.2) which traverses through the graph (more complicated when they are disconnected) and uses the hash of the "parent" node as seed for the hash of the current node or relationship. If a node were to have multiple relationships with nodes that have the same hash, multiple fingerprints for the entire graph are generated, sorted and combined to make up the actual fingerprint of the graph.

Specifications

Even though I'm using one of the latest Neo4J APOC releases, I found that this bug has always been present by checking the version history of the core/src/main/java/apoc/hashing/Fingerprint.java file.

Versions

  • OS: Debian 11 (Bullseye) with Docker
  • Neo4j: 5.9.0-community
  • Neo4j-Apoc: 5.9.0-core
@Lojjs
Copy link
Contributor

Lojjs commented Aug 21, 2023

@Cyber-Ware Thanks for your very comprehensive bug report! We will look into this and report back here.

Best regards Louise, Neo4j

@Lojjs Lojjs assigned Lojjs and unassigned Lojjs Aug 21, 2023
@Cyber-Ware
Copy link
Author

Hello,
I would like to know if there is any update regarding this issue I submitted 2 months ago.
Kind regards,
Cyberware

@Lojjs
Copy link
Contributor

Lojjs commented Oct 16, 2023

@Cyber-Ware We currently have a limited engineering capacity for APOC within Neo4j, where we focus on customer support and security issues so we have a bit of a backlog of bugs to work through unfortunately. It is still in our list of things to look into.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants