Bulbflow Neo4j Graph Database Slow


I am trying to create 500,000 nodes in a graph database; I plan to add edges between them later. I have a text file with 500,000 lines, each representing the data to be stored in one node.

from bulbs.neo4jserver import Graph, Config, NEO4J_URI
config = Config(NEO4J_URI)
g = Graph(config)

def get_or_create_node(text, crsqid):
    # lookup() returns an iterable of matching vertices, or None if none exist
    matches = g.vertices.index.lookup(crsqid=crsqid)
    if matches is None:
        v = g.vertices.create(crsqid=crsqid)
        print text + " - node created"
    else:
        v = matches.next()
    v.text = text
    v.save()
    return v

I then loop over each line in the text file,

with open('titles-sorted.txt') as f:
    # note: each line keeps its trailing newline unless you strip it
    for count, line in enumerate(f, 1):
        get_or_create_node(line, count)

This is terribly slow: it creates roughly 5,000 nodes in 10 minutes. Can this be improved? Thanks


There are 2 answers

FrobberOfBits (best answer):

I don't see any transaction code in there -- nothing establishing a transaction or signaling its success. You should look into that: if you're doing one transaction for every single node creation, that's going to be slow. You should instead open one transaction, insert thousands of nodes, and commit the whole batch at once.

I'm not familiar with bulbs, so I can't tell you how to do that with this Python framework, but here is a place to start: this page suggests you can use a coding style like this with some python/neo bindings:

with db.transaction:
  foo()
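
For example, here is a minimal sketch of batched commits. It assumes the neo4j-embedded Python bindings (the db.transaction / db.node() API shown above, which runs against an embedded database rather than the REST server bulbs talks to), and a batch size of 1,000 chosen arbitrarily:

from neo4j import GraphDatabase  # neo4j-embedded bindings, not bulbs

db = GraphDatabase('/path/to/db')  # illustrative path
BATCH_SIZE = 1000

def flush(batch):
    # one transaction, and therefore one commit, per batch
    with db.transaction:
        for crsqid, text in batch:
            db.node(crsqid=crsqid, text=text)

with open('titles-sorted.txt') as f:
    batch = []
    for count, line in enumerate(f, 1):
        batch.append((count, line.rstrip('\n')))
        if len(batch) == BATCH_SIZE:
            flush(batch)
            batch = []
    if batch:
        flush(batch)  # commit whatever is left over

db.shutdown()

The point is simply that 500 commits of 1,000 nodes each is far cheaper than 500,000 commits of one node each.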

Also, if you're trying to load mass amounts of data and you need performance, you should check this page for information on bulk importing. Doing it in your own script is unlikely to be the most performant option. You might instead use your script to generate Cypher queries and pipe them to the neo4j-shell.
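
A sketch of that approach, with file names of my own choosing and deliberately naive string escaping (real titles may need more careful quoting, and the exact Cypher syntax depends on your Neo4j version):

# generate one Cypher CREATE statement per input line
with open('titles-sorted.txt') as src, open('create-nodes.cql', 'w') as out:
    for crsqid, line in enumerate(src, 1):
        # naive escaping; assumes titles contain nothing more exotic
        text = line.rstrip('\n').replace('\\', '\\\\').replace('"', '\\"')
        out.write('CREATE (n {crsqid: %d, text: "%s"});\n' % (crsqid, text))

The generated file can then be fed to the shell, e.g. neo4j-shell -file create-nodes.cql (check the shell options for your version).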

Finally, a thing to consider is indexes. It looks like you're indexing on crsqid -- if you get rid of that index, creates may go faster. I don't know how your IDs are distributed, but it might be better to break records up into batches and test whether they exist, rather than using the get_or_create() pattern for every single node; a sketch of that batching follows.
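
Here is a sketch of that batching idea. existing_ids is a hypothetical helper standing in for however you ask the server about many IDs in one round trip (one Cypher or Gremlin query); bulbs itself may not expose this directly:

def create_missing(g, batch, existing_ids):
    # batch is a list of (crsqid, text) pairs
    # existing_ids is a hypothetical callable: given a list of crsqids,
    # it returns the set of those already on the server (one round trip)
    found = existing_ids([crsqid for crsqid, _ in batch])
    for crsqid, text in batch:
        if crsqid not in found:
            g.vertices.create(crsqid=crsqid, text=text)

Even with the index kept, one existence query per few thousand records beats one lookup per record.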

espeed:

Batch loading 500k nodes individually via REST is not ideal. Use Michael's batch loader or the Gremlin shell -- see Marko's movie recommendation blog post for an example of how to do this from the Gremlin shell.
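
If you go the batch-loader route, here is a sketch of generating its node file from the titles. The header syntax (property types, index declarations) varies between versions of Michael Hunger's batch-import tool, so treat the format as illustrative and check the tool's README:

# write a tab-separated nodes file for the batch-import tool
with open('titles-sorted.txt') as src, open('nodes.csv', 'w') as out:
    out.write('crsqid:int\ttext\n')  # header row: property names (format per tool docs)
    for crsqid, line in enumerate(src, 1):
        text = line.rstrip('\n').replace('\t', ' ')  # keep fields tab-safe
        out.write('%d\t%s\n' % (crsqid, text))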