I am trying to load a CSV file into my Neo4j database hosted on GrapheneDB. It worked fine on the first file with 5000 rows. It took about 16 seconds to complete this file.
I am now importing the second file with the same schema and same amount of rows. The data is just different. The Cypher query has been running over 30 minutes and its still not complete. I am not sure what its doing and why its so slow. Here is my cypher:
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'http://example.com/some.csv' AS line
Match (c:Customer {customerID: line.CustomerID})
MERGE (c)<-[r:DEPT_OF]-(dept:Dept { name: line.Category})
ON CREATE
SET dept.name = line.Category, dept.deptID=line.DeptID, dept.createdDTS=1453478149463
MERGE (dept)<-[r1:IN_DEPT]-(pt:ProductType {name: dept.name})
ON CREATE
SET pt.name = dept.name, pt.packQty = line.PackQty, pt.createdDTS = 1453478149463,
pt.productTypeID = line.ProductTypeID
MERGE (pt)<-[r2:OF_TYPE]-(st:Style {name: line.Style})
ON CREATE
SET st.name = line.Style, st.styleID = line.StyleID, st.styleNum = line.StyleNo, st.price = line.Price
MERGE (st)<-[r3:OF_STYLE]-(p:Product {productNum: line.UPC})
ON CREATE
SET p.floorMin = line.MinFloor, p.floorMax = line.FloorMax, p.color = line.Color, p.createdDTS = 1453478149463,
p.size = line.Size, p.productID = line.ProductID;
For rows from my csv:
UPC,Category,Style,StyleNo,Color,Size,MinFloor,MaxFloor,Price,ProductID,CustomerID,ProductType,PackQty,DeptID,StyleID,ProductTypeID,ProductID
33383605005,FRESH VEGETABLES,GREEN ONIONS 24/10 OZ,NA,NA,NA,0,0,1.79,,5f795a69-47cb-49c8-a334-0cf5d67be423,FRESH VEGETABLES,1,538a02c6-b6b7-4d0d-8dca-5ff3a513d59e,3e08dabb-415a-4826-86e8-44efb9813892,cc6a0f3c-1c05-44a0-b603-37cbbb60954e,3324b2b1-954a-4547-a82d-553be66d7b54
52867010005,FRESH VEGETABLES,GREEN ONIONS 24/10 OZ,NA,NA,NA,0,0,1.79,,5f795a69-47cb-49c8-a334-0cf5d67be423,FRESH VEGETABLES,1,edfa998f-3749-4d1f-bd96-3f4a5db0de67,fb11a8e5-de49-44da-924a-9ebc5f7f01d2,d47fd5d8-dbf0-4110-b701-543e6ed0ae40,28a9d206-96c6-4fe4-b528-84e446ba3c16
Update 1:
I have added the following indexes based off the response from Nicole.
- CREATE INDEX ON :Customer(customerID)
- CREATE INDEX ON :Dept(name)
- CREATE INDEX ON :ProductType(name)
- CREATE INDEX ON :Style(name)
- CREATE INDEX ON :Product(productNum)
This helped a lot, but still takes about 20 seconds for 5K rows. Is that normal?
Any help is appreciated.
Update 2:
Based off the response from @michael, I researched a little further and found the following article very useful:
http://graphaware.com/neo4j/2014/07/31/cypher-merge-explained.html
Update 3:
I have updated my cypher to the following to avoid duplicates. I hope this looks right?
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM '...' AS line
Match (c:Customer {customerID: line.CustomerID})
MERGE (c)<-[r:DEPT_OF]-(dept:Dept { name: line.Category })
ON CREATE
SET dept.name = dept.name, dept.deptID=line.categoryID, dept.createdDTS=1453742532269, dept.modifiedDTS = 1453742532269
MERGE (c)<-[r22:DEPT_OF]-(dept)
MERGE (dept)<-[r1:IN_DEPT]-(pt:ProductType {name: dept.name})
ON CREATE
SET pt.name = dept.name, pt.packQty = line.PackQty, pt.createdDTS = 1453742532269, pt.productTypeID = line.ProductTypeID, pt.modifiedDTS = 1453742532269
MERGE (c)<-[r2:DEPT_OF]-(dept)
MERGE (dept)<-[r3:IN_DEPT]-(pt)
MERGE (pt)<-[r4:OF_TYPE]-(st:Style {name: line.Style})
ON CREATE
SET st.name = line.Style, st.styleID = line.StyleID, st.styleNum = line.StyleNo, st.price = line.Price, st.modifiedDTS = 1453742532269, st.createdDTS = 1453742532269
MERGE (c)<-[r5:DEPT_OF]-(dept)
MERGE (dept)<-[r6:IN_DEPT]-(pt)
MERGE (pt)<-[r7:OF_TYPE]-(st)
MERGE (st)<-[r8:OF_STYLE]-(p:Product {productNum: line.UPC})
ON CREATE
SET p.floorMin = line.MinFloor, p.floorMax = line.FloorMax, p.color = line.Color, p.createdDTS = 1453742532269,p.modifiedDTS = 1453742532269, p.size = line.Size, p.productID = line.ProductID;