Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a USING JOIN? #76

Open
cbizon opened this issue May 31, 2023 · 0 comments
Open

Add a USING JOIN? #76

cbizon opened this issue May 31, 2023 · 0 comments

Comments

@cbizon
Copy link
Contributor

cbizon commented May 31, 2023

Here is a graph that takes minutes in an ubergraph or robokopkg

MATCH (`h`)-[`edge_1`:`biolink:has_phenotype`]->(`on`) WHERE  ( (`edge_1`.`biolink:primary_knowledge_source` IS NOT NULL) )  MATCH (`on`)-[`on_subclass_edge`:`biolink:subclass_of`*0..1]->(`on_superclass`:`biolink:Disease` {}) USING INDEX `on_superclass`:`biolink:Disease`(id) WHERE  ( (`on_superclass`.id in ["MONDO:0001982", "MONDO:0018982"]) )  MATCH (`h`)-[`h_subclass_edge`:`biolink:subclass_of`*0..1]->(`h_superclass`:`biolink:Disease` {}) USING INDEX `h_superclass`:`biolink:Disease`(id) WHERE  ( (`h_superclass`.id in ["MONDO:0005144", "MONDO:0007256", "MONDO:0001156", "MONDO:0005550", "MONDO:0017376", "MONDO:0007263", "MONDO:0003019", "MONDO:0002025", "MONDO:0005277", "MONDO:0008433", "MONDO:0004967", "MONDO:0002050", "MONDO:0005798", "MONDO:0001609", "MONDO:0005027", "MONDO:0008947", "MONDO:0019338", "MONDO:0007972", "MONDO:0000598", "MONDO:0006730", "UMLS:C0008066", "MONDO:0005101", "MONDO:0009061", "MONDO:0004058", "MONDO:0001071", "MONDO:0003240", "MONDO:0005364", "MONDO:0002254", "MONDO:0002806", "MONDO:0004992", "MONDO:0007254", "MONDO:0004988", "MONDO:0005071", "MONDO:0015129", "MONDO:0004355", "MONDO:0004609", "MONDO:0005546", "MONDO:0009690", "MONDO:0005059", "MONDO:0005139", "MONDO:0007182", "MONDO:0044903", "MONDO:0007264", "MONDO:0001158", "MONDO:0015667", "MONDO:0007603", "MONDO:0001673", "MONDO:0000249", "MONDO:0002909", "MONDO:0005485", "MONDO:0007863", "MONDO:0005180", "MONDO:0002049", "MONDO:0002009", "MONDO:0001866", "MONDO:0006585", "MONDO:0005260", "MONDO:0007661", "MONDO:0005015", "MONDO:0020320", "MONDO:0042485", "MONDO:0004375", "MONDO:0001475", "MONDO:0005352", "MONDO:0019496", "MONDO:0004979", "MONDO:0015447", "MONDO:0005770", "MONDO:0003785", "MONDO:0005147", "MONDO:0001252", "MONDO:0004425", "MONDO:0002032", "MONDO:0005146", "MONDO:0017181", "MONDO:0009820", "MONDO:0018935", "MONDO:0018874", "MONDO:0005258", "MONDO:0100096", "MONDO:0005070", "MONDO:0004985", "MONDO:0005420", "MONDO:0005492", "MONDO:0004975", "MONDO:0006745", "MONDO:0005618", "MONDO:0021042", "MONDO:0006166", "MONDO:0019091", "MONDO:0000693", "MONDO:0000437", "MONDO:0006664", "MONDO:0005155", "MONDO:0005203", "MONDO:0021178", "MONDO:0002623", "MONDO:0001627", "MONDO:0011918", "UMLS:C2931853", "MONDO:0020128", "MONDO:0005145", "MONDO:0021063", "MONDO:0001153", "MONDO:0005363", "MONDO:0001741", "MONDO:0011122", "MONDO:0003634", "MONDO:0001484", "MONDO:0005466", "MONDO:0010096", "MONDO:0004617", "MONDO:0005371", "MONDO:0005351", "MONDO:0010138", "MONDO:0007739", "MONDO:0010383", "MONDO:0019960", "MONDO:0005570", "MONDO:0002280", "MONDO:0005148", "MONDO:0005354", "MONDO:0004976", "MONDO:0015909", "MONDO:0007452", "MONDO:0004996", "MONDO:0001442", "MONDO:0009807", "UMLS:C0001723", "MONDO:0005395", "MONDO:0005140", "MONDO:0004471", "MONDO:0005090", "MONDO:0016383", "MONDO:0001185", "MONDO:0013600", "MONDO:0017178", "MONDO:0005021", "MONDO:0001566", "MONDO:0005046", "MONDO:0018088", "MONDO:0005072", "MONDO:0005084"]) ) 

This is basically a 1 hop, but pinned at both ends, and with multiple possible values at each end. Of course, transpiler also cares about all the subclasses of these input identifiers. Notice also that there are a couple of id's on one end and many on the other. So the final query that gets made into cypher basically looks like:

(a few Ids)<-subclass_of-(on)-has_phenotype->(h)-subclass_of->(a lot of id's).

The profile looks like:
image

So it gets the subclass nodes on and h. It then has a choice: should it take the phenotype edges from the side where there are lots of ids or the side where there are just a few. Surprisingly, it chooses from the side where there are a lot, so then there are a lot of has phenotype edges that have to be pulled.

In this case, the query can be made much faster by setting where you want to join, like this:

USING JOIN on `h`

Then it forces the has_phenotype to come from the small side. If we wanted to make use of this, we'd need to be able to spot the situation: the input query is a single hop, with both ends pinned, and with more nodes on one side than the other. In that case, it MIGHT make sense to add this Join hint.

There are a couple of caveats that need exploration though before anything gets going. First, it's hard to say that it is much faster, because of the way neo4j caches results. You can look at the # of db hits, but I think that a more careful analysis turning off the cache would be called for. Second, and maybe worse? Is that what happens is that from the small side, none of the nodes have a has_phenotype edge. So the whole query gets abandoned at that point (saves a lot of time). But this might be a special case - perhaps if there were even one edge, you'd have to do enough work on the long side that the savings dissappear, or even worse perhaps neo4j's plan would actually be better in those cases for some reason.

So this will probably require some more careful testing before any implementation. Thoughts @YaphetKG ? @EvanDietzMorris ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant