LOAD CSV into embedded Neo4j throws java.lang.OutOfMemoryError: GC overhead limit exceeded


I tried to load CSV data into an embedded Neo4j database (v2.1.7, on Windows). The CSV file has 1,000,000 rows (1 million), and the data model is simple, as follows:

CSV data

"num1","num2","datatime"
"13931345724","18409958023","2014-12-31 12:00:00"
"13931345724","13710622859","2014-12-31 12:00:00"
"13931345724","18919875049","2014-12-31 12:00:00"
"13931345724","13460873081","2014-12-31 12:00:00"
...

Loading Cypher statement

USING PERIODIC COMMIT 5000 
LOAD CSV FROM 'file:C:/tmpFiles/calls100w.csv' AS line FIELDTERMINATOR ','
WITH line 
MERGE (n0:Phone:_Phone {phoneNumber : line[0]}) 
MERGE (n1:Phone:_Phone {phoneNumber : line[1]}) 
MERGE (n0)-[:CALL{callAt: line[2]}]->(n1) 

This took a long time and then failed with this exception:

java.lang.OutOfMemoryError: GC overhead limit exceeded

I tried adding a neo4j-wrapper.conf file in the database location folder, but it seemed to have no effect.

neo4j-wrapper.conf

wrapper.java.additional.1=-XX:-UseConcMarkSweepGC
wrapper.java.additional.2=-Xloggc:c:/neo4jdb/log/neo4j-gc.log
wrapper.java.initmemory=4096
wrapper.java.maxmemory=4096

And here is what the messages.log file showed:

2015-06-09 09:55:00.513+0000 INFO  [org.neo4j]: System memory information:
    Total Physical memory: 7.70 GB
    Free Physical memory: 1.99 GB
    Committed virtual memory: 361.05 MB
    Total swap space: 14.48 GB
    Free swap space: 5.76 GB
2015-06-09 09:55:00.519+0000 INFO  [org.neo4j]: JVM memory information:
    Free  memory: 135.42 MB
    Total memory: 148.94 MB
    Max   memory: 1.71 GB
    Garbage Collector: PS Scavenge: [PS Eden Space, PS Survivor Space]
    Garbage Collector: PS MarkSweep: [PS Eden Space, PS Survivor Space, PS Old Gen, PS Perm Gen]
    Memory Pool: Code Cache (Non-heap memory): committed=2.44 MB, used=833.00 kB, max=48.00 MB, threshold=0.00 B
    Memory Pool: PS Eden Space (Heap memory): committed=61.63 MB, used=7.28 MB, max=647.06 MB, threshold=?
    Memory Pool: PS Survivor Space (Heap memory): committed=5.13 MB, used=5.12 MB, max=5.13 MB, threshold=?
    Memory Pool: PS Old Gen (Heap memory): committed=82.19 MB, used=1.12 MB, max=1.28 GB, threshold=0.00 B
    Memory Pool: PS Perm Gen (Non-heap memory): committed=20.75 MB, used=12.78 MB, max=82.00 MB, threshold=0.00 B

On Windows there is no conf/ folder in the database location folder, so I created one and put neo4j-wrapper.conf in it. Is the conf file in the right place?

Database location folder

C:\NEO4JDB
|   index.db
|   messages.log
|   neo4j.properties
|   neostore
|   neostore.id
|   neostore.labeltokenstore.db
|   ....
+---conf
|       neo4j-wrapper.conf
+---index
|       lucene-store.db
|       lucene.log.active
|       ...
\---schema
    +---...

There are 2 answers

Michael Hunger (best answer)

Split it into two imports.

You are running into the problem where Cypher inserts an Eager operator to keep the statement's read and write phases correctly separated, which causes all your CSV lines to be pulled in eagerly and renders PERIODIC COMMIT ineffective. See:

EXPLAIN LOAD CSV FROM 'file:C:/tmpFiles/calls100w.csv' AS line FIELDTERMINATOR ','
WITH line 
MERGE (n0:Phone:_Phone {phoneNumber : line[0]}) 
MERGE (n1:Phone:_Phone {phoneNumber : line[1]}) 
MERGE (n0)-[:CALL{callAt: line[2]}]->(n1);
+--------------+-------------------------+-------------------+
| Operator     | Identifiers             | Other             |
+--------------+-------------------------+-------------------+
| +EmptyResult |                         |                   |
| |            +-------------------------+-------------------+
| +UpdateGraph | anon[195], line, n0, n1 | MergePattern      |
| |            +-------------------------+-------------------+
| +UpdateGraph | line, n0, n1            | MergeNode; :Phone |
| |            +-------------------------+-------------------+
| +Eager       | line, n0                |                   |
| |            +-------------------------+-------------------+
| +UpdateGraph | line, n0                | MergeNode; :Phone |
| |            +-------------------------+-------------------+
| +LoadCSV     | line                    |                   |
+--------------+-------------------------+-------------------+

I presume you have an index or constraint on :Phone(phoneNumber). If you split your query into two parts, it will work:

USING PERIODIC COMMIT 5000 
LOAD CSV FROM 'file:C:/tmpFiles/calls100w.csv' AS line FIELDTERMINATOR ','
WITH line 
MERGE (n0:Phone:_Phone {phoneNumber : line[0]}) 
MERGE (n1:Phone:_Phone {phoneNumber : line[1]});

USING PERIODIC COMMIT 5000 
LOAD CSV FROM 'file:C:/tmpFiles/calls100w.csv' AS line FIELDTERMINATOR ','
WITH line 
MATCH (n0:Phone:_Phone {phoneNumber : line[0]}) 
MATCH (n1:Phone:_Phone {phoneNumber : line[1]}) 
MERGE (n0)-[:CALL{callAt: line[2]}]->(n1);
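
If that index or constraint does not exist yet, create it before running the imports. A uniqueness constraint, for example, also gives MERGE an index to look phone numbers up in:

CREATE CONSTRAINT ON (p:Phone) ASSERT p.phoneNumber IS UNIQUE;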
adam

You may want to try splitting the import into a node import and a relationship import. Instead of MERGE statements you can then just use CREATE, assuming that your node and relationship files are deduplicated.

So, for example, make a "num1.csv" file that only contains the first column (num1) of your "calls100w.csv" file and remove all of the duplicates. Make a "num2.csv" that only contains the second column of "calls100w.csv" and remove duplicates.

Then load your node CSV files:

USING PERIODIC COMMIT 5000 
LOAD CSV FROM 'file:C:/tmpFiles/num1.csv' AS line FIELDTERMINATOR ','
WITH line 
CREATE (n0:Phone:_Phone {phoneNumber : line[0]})

and

USING PERIODIC COMMIT 5000 
LOAD CSV FROM 'file:C:/tmpFiles/num2.csv' AS line FIELDTERMINATOR ','
WITH line 
CREATE (n1:Phone:_Phone {phoneNumber : line[0]}) 

Then create your index:

CREATE INDEX ON :Phone(phoneNumber)

Now load your original CSV to create the relationships:

USING PERIODIC COMMIT 5000 
LOAD CSV FROM 'file:C:/tmpFiles/calls100w.csv' AS line FIELDTERMINATOR ','
WITH line 
MATCH (n0:Phone:_Phone {phoneNumber : line[0]}) 
MATCH (n1:Phone:_Phone {phoneNumber : line[1]}) 
MERGE (n0)-[:CALL{callAt: line[2]}]->(n1)
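
As an aside, if you would rather not build the deduplicated num1.csv and num2.csv files by hand, here is a sketch of doing the dedup inside Cypher in a single pass. Note that DISTINCT is an eager aggregation, so it cannot be combined with USING PERIODIC COMMIT; for a million rows of short phone-number strings the distinct set should still fit comfortably in the heap:

// pull both phone-number columns into one stream
LOAD CSV FROM 'file:C:/tmpFiles/calls100w.csv' AS line FIELDTERMINATOR ','
// one row per phone number, taken from either column
UNWIND [line[0], line[1]] AS num
// DISTINCT aggregates eagerly, hence no USING PERIODIC COMMIT here
WITH DISTINCT num
CREATE (:Phone:_Phone {phoneNumber: num});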