I'm using Property Graph and Cypher from Neo4j. As described in the title, I'm trying to delete a number of nodes which can be reached from another node without passing other nodes and only have 1 incoming relationship. Here is the example of this case:

Example graph

Each node has its label (big, bold character) and a property called nodeId, which is unique among nodes. The labels of relationships are omitted because we cannot rely on it for some reasons. The nodeId property is already indexed with a unique constraint.

Now, starting from node A {nodeId: 1}, I want to delete it and all other nodes which:

  • can be reached from A {nodeId: 1} without passing another A-label node.
  • only has 1 incoming relationship

So, the nodes will be deleted are: A {nodeId: 1}, B {nodeId: 3}, C {nodeId: 4}, and C {nodeId: 8}.

Below is my Cypher code:

MATCH p = (s:A {nodeId: 1 }) -[*1..10]-> (e)
WHERE NONE (x in NODES(p) WHERE x:A AND NOT x.nodeId = 1)
WITH s, e
MATCH (e) <-[r]- ()
WITH count(r) AS num_r, s, e
WHERE num_r < 2
DETACH DELETE e
DETACH DELETE s

The code works fine but as my graph grows, it becomes slower and slower. In the beginning, it takes less than 10 ms. But now, when I have around 1 million of nodes and 2 million of relationships, it takes more than 1 second.

What should I do to improve the performance of that code?

Thank you for your help.

2

There are 2 answers

1
InverseFalcon On

Tezra has the right idea for the Cypher version of this query (but without usage of shortestPath()).

An alternate approach that might work better in a more complex graph is to use APOC Procedures, which has path expansion procs that will work well for your use case, finding only single paths to each distinct node, and filtering efficiently on labels.

Here's how you might use this, using apoc.path.subgraphNodes()

MATCH (s:A {nodeId: 1 })
CALL apoc.path.subgraphNodes(s, {maxLevel:10, labelFilter:'-A'}) YIELD node as e
WITH s, e 
WHERE size((e)<--()) = 1
DETACH DELETE e
WITH distinct s
DETACH DELETE s

The labelFilter in the procedure call ensures that no node in the expansion has an :A label (the filter doesn't apply to the starting node of the expansion by default, which works in your case here, though this is configurable).

EDIT

One flaw in this approach, however, is that this expands any relationship, in any direction.

While the relationshipFilter can filter by direction, there's currently a bug which won't allow us to specify only relationship direction without a type.

UPDATE

As of the APOC Summer releases in 2018 (3.3.0.4 along the 3.3.x line, and 3.4.0.2 along the 3.4.x line), you can now specify typeless, direction-only in the relationshipFilter: relationshipFilter:'>'

2
Tezra On

Since you only care if there is A path, you should use shortestPath instead of just (a)-[*]->(b). That way, Cypher just needs to find 1 valid path instead of all possible paths (This can be a life saver in larger sets) Also, you can use TAIL to cut off the first item in a list so that you can (Cypher can) skip that check.

Depending on your Neo4j version, Using MATCH <path> WHERE <stuff> WITH DISTINCT startnode ,endnode may be more effective, as later Cypher Planners can use the WITH DISTINCT hint to do a faster, less exhaustive path matching. On earlier versions, this will hang Neo4j, and you will need to use the APOC neo4j library.

MATCH (s:A {nodeId: 1 })
WITH s MATCH p=shortestPath((s)-[*1..10]->(e))
WHERE NONE (x in TAIL(NODES(p)) WHERE x:A) AND NOT ()-->(e)<--()
WITH DISTINCT s, e
DETACH DELETE e
DETACH DELETE s

You can also change NOT ()-->(e)<--() to SIZE(()-->(e)) < 2 if you need to change that number. The former may perform better in some Cypher Planners though. You may need to change that to "All parents of e are contained in path" if that is a scenario where e can have more than 2 incoming relationships but still need to be deleted.

If your logic gets more complicated than that (where what nodes get deleted can change what other nodes can be deleted