I'm new to Neo4j. Currently I'm building a dating site as a POC. I have a 4 GB input file in the format below.
Each line contains a viewerId (male/female) and a viewedId column, which is a comma-separated list of the IDs that viewer has viewed. Based on this history file, I need to give recommendations when a user comes online.
Input file:
viewerId viewedId
12345 123456,23456,987653
23456 23456,123456,234567
34567 234567,765678,987653
:
For this task, I tried the following:
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:/home/hadoopuser/Neo-input" AS row
FIELDTERMINATOR '\t'
WITH row, split(row.viewedId, ",") AS viewedIds
UNWIND viewedIds AS viewedId
MERGE (p2:Persons2 {viewerId: row.viewerId})
MERGE (c2:Companies2 {viewedId: viewedId})
MERGE (p2)-[:Friends]->(c2)
MERGE (c2)-[:Sees]->(p2);
And my Cypher query to get the result is:
MATCH (p2:Persons2)-[r*1..3]->(c2:Companies2)
RETURN p2, r, COLLECT(DISTINCT c2) AS friends
Completing this task takes about 3 days.
My system config:
Ubuntu 14.04
RAM: 24 GB
Neo4j Config:
neo4j.properties:
neostore.nodestore.db.mapped_memory=200M
neostore.propertystore.db.mapped_memory=2300M
neostore.propertystore.db.arrays.mapped_memory=5M
neostore.propertystore.db.strings.mapped_memory=3200M
neostore.relationshipstore.db.mapped_memory=800M
neo4j-wrapper.conf
wrapper.java.initmemory=12000
wrapper.java.maxmemory=12000
To reduce the time, I searched online and found the batch importer at the following link: https://github.com/jexp/batch-import
There, they import node.csv and rels.csv files into Neo4j, but I have no idea how they create the node.csv and rels.csv files or which scripts they use.
Can anyone give me a sample script to create node.csv and rels.csv files for my data?
Or can you give any suggestions to make the import and retrieval faster?
Thanks in advance.
You don't need the inverse relationship; one direction is enough!
For the import, configure your heap (neo4j-wrapper.conf) to 12G and the page cache (neo4j.properties) to 10G.
Try this; it should be done in a few minutes.
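The answer's original code block did not survive; here is a sketch of what it likely looked like, reusing the question's labels and file path, with a single relationship direction and unique-property constraints (Neo4j 2.x syntax, matching the question's config files) so the MERGE lookups use an index:

```cypher
// Create constraints first so each MERGE is an index lookup, not a label scan
CREATE CONSTRAINT ON (p:Persons2) ASSERT p.viewerId IS UNIQUE;
CREATE CONSTRAINT ON (c:Companies2) ASSERT c.viewedId IS UNIQUE;

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:/home/hadoopuser/Neo-input" AS row
FIELDTERMINATOR '\t'
WITH row, split(row.viewedId, ",") AS viewedIds
UNWIND viewedIds AS viewedId
MERGE (p2:Persons2 {viewerId: row.viewerId})
MERGE (c2:Companies2 {viewedId: viewedId})
// only one direction; drop the redundant :Sees relationship
MERGE (p2)-[:Friends]->(c2);
```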
For the relationship merge: if some companies have hundreds of thousands up to millions of views, you might want to use this instead:
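The snippet the answer refers to is missing; a common variant of this trick (sketched here with the question's labels, as an assumption about what was suggested) is to avoid MERGE on the dense company node and instead check for an existing relationship from the low-degree person node, then CREATE:

```cypher
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:/home/hadoopuser/Neo-input" AS row
FIELDTERMINATOR '\t'
WITH row, split(row.viewedId, ",") AS viewedIds
UNWIND viewedIds AS viewedId
MATCH (p2:Persons2 {viewerId: row.viewerId})
MATCH (c2:Companies2 {viewedId: viewedId})
// check existence from the person side (few relationships)
// instead of letting MERGE scan the dense company node
WHERE NOT (p2)-[:Friends]->(c2)
CREATE (p2)-[:Friends]->(c2);
```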
Regarding your query:
What do you want to achieve by retrieving the cross product of all people and all companies up to 3 levels deep? That could be trillions of paths.
Usually you want to know this for a single person or company.
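For a single person, the query would look something like this (a sketch; the viewerId value is just an example, and collecting distinct endpoints avoids returning every individual path):

```cypher
MATCH (p2:Persons2 {viewerId: "12345"})-[:Friends*1..3]->(c2:Companies2)
RETURN p2, COLLECT(DISTINCT c2) AS friends;
```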
Update: if you really do want your query to run for all companies, this might help:
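The answer's snippet is missing here; one plausible shape (a sketch under the question's labels) is to iterate person by person and collect only distinct reachable companies, rather than returning the full path cross product:

```cypher
MATCH (p2:Persons2)
MATCH (p2)-[:Friends*1..3]->(c2:Companies2)
RETURN p2, COLLECT(DISTINCT c2) AS friends;
```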