Bulbflow Neo4j Graph Database Slow


I am trying to create 500,000 nodes in a graph database. I plan to add edges as per my requirements later. I have a text file with 500,000 lines representing the data to be stored in each node.

from bulbs.neo4jserver import Graph, Config, NEO4J_URI
config = Config(NEO4J_URI)
g = Graph(config)

def get_or_create_node(text, crsqid):
    # Look up an existing vertex by its indexed crsqid property.
    v = g.vertices.index.lookup(crsqid=crsqid)
    if v is None:
        # No match: create the vertex.
        v = g.vertices.create(crsqid=crsqid)
        print text + " - node created"
    v.text = text
    v.save()
    return v

I then loop over each line in the text file:

with open('titles-sorted.txt') as f:
    # Number the lines from 1 and use that number as the crsqid.
    for count, line in enumerate(f, 1):
        get_or_create_node(line, count)

This is terribly slow: I get only 5,000 nodes in 10 minutes. Can this be improved? Thanks.


There are 2 answers

FrobberOfBits (best answer)

I don't see any transaction code in there: nothing that opens a transaction or signals that it succeeded. You should look into that. If you're doing one transaction for every single node creation, that's going to be slow. You should probably create one transaction, insert thousands of nodes, then commit the whole batch.

I'm not familiar with bulbs, so I can't tell you how to do that with this Python framework, but here is a place to start: this page suggests you can use a coding style like this with some Python/Neo4j bindings:

with db.transaction:
  foo()
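The batching itself can be sketched without any database bindings. This is only a minimal sketch: `batched`, `load_in_batches`, and the `commit_batch` callback are hypothetical names, and `commit_batch` stands in for whatever your bindings offer for "open one transaction, write these records, commit". It is not a bulbs API.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from `iterable`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def load_in_batches(lines, commit_batch, size=1000):
    """Number the lines from 1 and pass them to `commit_batch` in groups.

    `commit_batch` is a hypothetical callback that should write one list of
    (crsqid, text) records inside a single transaction.
    Returns the number of batches handed over.
    """
    records = ((crsqid, line.rstrip("\n")) for crsqid, line in enumerate(lines, 1))
    n_batches = 0
    for batch in batched(records, size):
        commit_batch(batch)
        n_batches += 1
    return n_batches
```

With 500,000 lines and a batch size of 1,000, this calls `commit_batch` 500 times instead of committing 500,000 times. The batch size is a guess and worth tuning.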

Also, if you're trying to load large amounts of data and you need performance, you should check this page for information on bulk importing. It's unlikely that doing it in your own script is going to be the most performant. You might instead consider using your script to generate Cypher queries, which get piped to the neo4j-shell.
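As a rough illustration of the generate-Cypher approach, the sketch below turns each input line into one `CREATE` statement. The function names and the `Title` label are assumptions made for the example (labels need Neo4j 2.0 or later), and the escaping only covers backslashes and double quotes, so check it against your data before relying on it.

```python
def cypher_string(text):
    """Quote `text` as a double-quoted Cypher string literal."""
    # Escape backslashes first, then double quotes.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def cypher_create_statements(lines, label="Title"):
    """Yield one CREATE statement per line, numbering the lines from 1.

    `label` is an assumed node label, not something from the question.
    """
    for crsqid, line in enumerate(lines, 1):
        text = line.rstrip("\r\n")
        yield "CREATE (:%s {crsqid: %d, text: %s});" % (
            label, crsqid, cypher_string(text))
```

Writing these statements to a file, one per line, gives you something you can feed to the shell. Grouping many statements per transaction should still matter here, for the same reason as above.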

Finally, consider indexes. It looks like you're indexing on crsqid; if you get rid of that index, creates may go faster. I don't know how your IDs are distributed, but it might be better to break records up into batches and test whether they exist a batch at a time, rather than using the get_or_create() pattern.
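The batch-wise existence test could look something like the sketch below. `split_batch` and `fetch_existing_ids` are hypothetical names: `fetch_existing_ids` stands for a single query that takes a list of IDs and returns the ones already stored, and how you write that query depends on your bindings.

```python
def split_batch(batch, fetch_existing_ids):
    """Split (crsqid, text) records into (to_create, to_update).

    `fetch_existing_ids` is a hypothetical callable: given a list of crsqid
    values, it returns those that already exist in the database. It is
    called once per batch rather than once per record.
    """
    existing = set(fetch_existing_ids([crsqid for crsqid, _ in batch]))
    to_create = [rec for rec in batch if rec[0] not in existing]
    to_update = [rec for rec in batch if rec[0] in existing]
    return to_create, to_update
```

This replaces one lookup per node with one lookup per batch. If you know the database starts out empty, you can skip the check entirely and create everything.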

espeed

Batch loading 500k nodes individually via REST is not ideal. Use Michael's batch loader or the Gremlin shell -- see Marko's movie recommendation blog post for an example of how to do this from the Gremlin shell.