Neo4j Cypher Lowest Common Ancestor performance issue

1k views Asked by At

I am trying to create an application in spring-neo4j using the graphrepository. One of the requirements is finding a Lowest Common Ancestor (lca) between two child nodes.

Currently I use the following query to achieve this:

 @Query("MATCH path = (c1:Concept)-[r:Relation*{type: 'Is a'}]->(ca:Concept)<-[r:Relation*{type: 'Is a'}]-(c2:Concept) 
      WHERE c1.conceptID = {conceptId1} AND c2.conceptID = {conceptId2}
      RETURN ca ORDER BY length(path) LIMIT 1")

Concept findLowestCommonAncestor(@Param("conceptId1") Long conceptId1, @Param("conceptId2") Long conceptId2);

The problem here is performance. Originally my graph consisted of 330 000 nodes and 2 000 000 relations. The relations I am interested in are the ones having type: "Is a". They only move upwards in the tree (connecting child nodes with parent nodes). The maximum upwards distance is 3.

This is the tree structure:

Tree structure example

Because I have several tree structures like this one, I decided to add a root node connecting all the different tree structures together. So that the lca will always be found. But adding this root node changed my performance drastically: From 558 ms to 562286 ms

I know adding a root node will impact the performance but it should not be this much, right? And if so, is there a better way to calculate the lca?

I would think that my cypher query would only look for nodes upwards in the tree. So in that case, adding an extra root node should not have this much influence on performance.

2

There are 2 answers

0
Bart Praats On BEST ANSWER

First, I want to thank @Supamiu for his help on my question.

Although his answer is probably sufficient for some cases. In my case it was not really possible to change the model.

Surprisingly enough I managed to increase the performance just by changing the cypher query.

My original query:

MATCH path = (c1:Concept)-[r1:Relation*{type: "Is a"}]->(ca:Concept)<-[r2:Relation*{type: "Is a"}]-(c2:Concept)
WHERE c1.conceptID=35104066 AND c2.conceptID=35808913
RETURN ca
ORDER BY length(path)
LIMIT 1

PROFILE: https://i.stack.imgur.com/b3Xa8.png

I changed it to:

MATCH (c1:Concept {conceptID: 35104066})-[:Relation*{type: "Is a"}]->(p1:Concept)
MATCH (:Concept {conceptID: 35808913})-[:Relation*{type: "Is a"}]->(p2:Concept)
WHERE p1.conceptID = p2.conceptID
MATCH path = (c1)-[:Relation*{type: "Is a"}]->(p1)
RETURN p1
ORDER BY length(path)
LIMIT 1

PROFILE: https://i.stack.imgur.com/sc18H.jpg

This gives my the Lowest Common Ancestor in around 100 ms, a significant improvement!

I still don't fully understand why this makes such a difference. But I think one of the reasons is that I use a tree structure. I think somehow in the second query Cypher only searches upwards in the tree, resulting in less relations to search. So in a blob structure this would not really help, I think.

Can someone maybe confirm this?

6
Supamiu On

Your problem is a Relation label problem, let me explain:

First of all, your request is matching with an infinite depth on :Relation relation.

You said that every tree in your database are now related to a root node, I guess it's related to the root node using a :Relation relation.

So, when you match every relationship labeled :Relation with an infinite depth, it will match every relations in your database before filtering on type property. This is why this root node added such a perf issue, because it connects every nodes together with the same relationship.

How to fix that?

Change your relation Labels, use them instead of matching with a property:

(ca:Concept{properties...})-[:IS_A]->(ca2:Concept)

and link your top tree nodes to the root node using another relation label.

Then, when you'll match your nodes using an infinite depth request, you'll be able to match using label, avoiding the problem you actually have.

The other solution would be to add a depth maximum in your query, since your tree is 3 levels maximum, you can limit relationship depth to 3. But the first solution is way better in my opinion, because creating only one type of relationship and filtering using their properties is a bad practice in neo4j.

EDIT

So, here is your first profile: https://i.stack.imgur.com/Pgkqi.png

You can see here that matching [:Relation*] has 31,328 hits before being filtered, which is pretty good.

On the second profile (with root node): https://i.stack.imgur.com/b3Xa8.png

You can see that the same matching got almost 2,000,000 hits, which, after being filtered, are 500,000... This is way too much.

I think the solution is Path, you might take a look at this:

The problem can not be solved with only a "better" cypher query. I think that your issue can be solved with a better data model, but it's hard to find which one without working on the project itself. The only think I can provide is a guide: http://neo4j.com/developer/guide-data-modeling/