Adding millions of nodes to a Neo4j Spatial layer using Cypher and APOC


I have a data set of 3.8 million nodes and I'm trying to load all of them into Neo4j Spatial. The nodes are going into a simple point layer, so they have the required latitude and longitude fields. I've tried:

MATCH (d:pointnode) 
WITH collect(d) as pn 
CALL spatial.addNodes("point_geom", pn) yield count return count

But this just keeps running without returning anything. I've also tried the following (I ran it all on one line, but have split it up here for ease of reading):

CALL apoc.periodic.iterate("MATCH (d:pointnode) 
WITH collect(d) AS pnodes return pnodes",
"CALL spatial.addNodes('point_geom', pnodes) YIELD count return count", 
{batchSize:10000, parallel:false, listIterate:true})

But again, this spins for a long time and occasionally fails with a Java heap error.

The final approach I tried was to use FME with the HTTP caller. This works, but it is exceptionally slow, so it doesn't scale well to millions of nodes.

Any advice or suggestions would be much appreciated. Would apoc.periodic.commit or apoc.periodic.rock_n_roll be a better choice than apoc.periodic.iterate?


There are 2 answers

SAB (best answer)

After a bit of trial and error, apoc.periodic.commit has led to a relatively quick solution (it is still going to take 2-3 hours):

CALL apoc.periodic.commit("MATCH (n:pointnode) 
WHERE NOT (n)-[:RTREE_REFERENCE]-() WITH n LIMIT $limit 
WITH collect(n) AS pnodes 
CALL spatial.addNodes('point_geom', pnodes) YIELD count RETURN count",
{limit:1000})

It may be quicker with larger batch sizes.

EDIT: With a batch size of 5000, it takes 45 minutes.
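To make the loop above a bit more concrete, here is a rough Python model of what it does, not APOC's actual implementation. It assumes apoc.periodic.commit re-runs the statement in a new transaction until the returned count is 0, and that every node passed to spatial.addNodes stops matching the RTREE_REFERENCE filter afterwards. The function name `run_periodic_commit` is made up for illustration.

```python
def run_periodic_commit(total_nodes, limit):
    """Model the loop: each pass indexes up to `limit` unindexed nodes."""
    unindexed = total_nodes  # nodes still lacking an RTREE_REFERENCE relationship
    passes = 0
    while True:
        # Size of the list that would be handed to spatial.addNodes on this pass.
        count = min(limit, unindexed)
        if count == 0:
            break  # the statement returned 0, so the loop stops
        unindexed -= count  # these nodes no longer match the WHERE filter
        passes += 1
    return passes


TOTAL_NODES = 3_800_000  # node count from the question

passes_1000 = run_periodic_commit(TOTAL_NODES, 1000)
passes_5000 = run_periodic_commit(TOTAL_NODES, 5000)
print(passes_1000, passes_5000)
```

Under these assumptions, a limit of 1000 needs 3800 passes and a limit of 5000 needs 760, each with its own MATCH and commit. The model says nothing about how long a single pass takes, so the timings above are still the thing to go by.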

Tom Geudens

You have 3 800 000 nodes, you collect them into one list ... and then you make a single call to add that whole list to the layer ... that is going to take a while and eat loads of memory. apoc.periodic.iterate makes absolutely no difference here: the first statement returns a single row (the collected list), so there is nothing to split into batches and spatial.addNodes is still called only once ...

It may take a while, but why not add them node by node?

CALL apoc.periodic.iterate(
  "MATCH (d:pointnode) RETURN d",
  "CALL spatial.addNode('point_geom', d) YIELD node RETURN node",
  {batchSize:10000, parallel:false, listIterate:true})
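If it helps to see why the collected list defeats batching, here is a small Python model of the idea. It is only a sketch that assumes apoc.periodic.iterate groups the rows returned by the first statement into batches of batchSize; the helper `batch_rows` and the scaled-down node count are made up for illustration.

```python
from itertools import islice


def batch_rows(rows, batch_size):
    """Group rows from the first statement into lists of at most batch_size rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


NODE_COUNT = 38_000  # scaled-down stand-in for the 3.8 million nodes
BATCH_SIZE = 10_000

# "MATCH (d:pointnode) WITH collect(d) AS pnodes RETURN pnodes":
# a single row whose only value is the full list of nodes.
collected_rows = [list(range(NODE_COUNT))]
collected_batches = list(batch_rows(collected_rows, BATCH_SIZE))

# "MATCH (d:pointnode) RETURN d": one row per node.
per_node_rows = range(NODE_COUNT)
per_node_batches = list(batch_rows(per_node_rows, BATCH_SIZE))

# Number of batches, and the largest amount of work in any one batch.
print(len(collected_batches), len(collected_batches[0][0]))
print(len(per_node_batches), max(len(b) for b in per_node_batches))
```

In this model the collected version yields one batch holding the entire list, whatever batchSize is set to, while the row-per-node version yields batches of at most batchSize rows.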

Hope this helps (or at least explains why you are having issues).

Regards, Tom