JanusGraph (Gremlin Server): improving import performance

I'm trying to import 1GB of graph data (~100k vertices, 3.6 million edges) in Gryo format. When I tried to import it through the Gremlin client, I got the following error:

gremlin> graph.io(IoCore.gryo()).readGraph('janusgraph_dump_2020_09_30_local.gryo')
GC overhead limit exceeded
Type ':help' or ':h' for help.
Display stack trace? [yN]y
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.cliffc.high_scale_lib.NonBlockingHashMapLong$CHM.<init>(NonBlockingHashMapLong.java:471)
    at org.cliffc.high_scale_lib.NonBlockingHashMapLong.initialize(NonBlockingHashMapLong.java:241)

The Gremlin Server and Cassandra details are as follows:

Gremlin-Server:

JanusGraph Version: 0.5.2
Gremlin Version: 3.4.6

Heap: JAVA_OPTIONS="-Xms4G -Xmx4G …
// gremlin conf
threadPoolWorker: 8
gremlinPool: 16
scriptEvaluationTimeout: 90000
// cql props
query.batch=true

Cassandra runs as a cluster with 3 nodes

Cassandra version: 3.11.0

Node1: RAM: 8GB, Cassandra Heap: 1GB (-Xms1G -Xmx1G)
Node2: RAM: 8GB, Cassandra Heap: 4GB (-Xms4G -Xmx4G)
Node3: RAM: 8GB, Cassandra Heap: 4GB (-Xms4G -Xmx4G)

Gremlin Server is installed on each node (behind a load balancer for clients), but we are executing the Gremlin queries on Node1.

Can someone help me on the following:

What do I need to do to import the data (any configuration changes)?

>>> What is the best way to export/import huge data into JanusGraph (Gremlin Server)? (This is the question I most need answered.)

Is there any way I can export the data in chunks and import it in chunks?

Thanks in advance.

Edit:

I increased the Gremlin Server heap on Node1 to 2GB, and the import query response is cancelled. Perhaps the RAM allocation is not sufficient for both Gremlin Server and Cassandra. That is why I had kept it at 1GB, so that the query would be executed.

Compared with huge datasets (billions of vertices/edges), this is very little data, so I hope 8GB of RAM and 2/4 cores will be sufficient for each node in the cluster.

There is 1 answer

Answer by stephen mallette:

Graph.io() and the now-preferred Gremlin io() step both use the GryoReader to read your file (unless the graph provider overrides the Gremlin io() step, and I don't think that JanusGraph does). GryoReader typically needs a lot of memory (more than you would expect) because it holds a cache of all vertices to speed up loading. Ultimately, it is not terribly efficient at loading, and the expectation from TinkerPop's perspective has been that providers would optimize loading with their own native bulk loader by intercepting the io() step when it is encountered. In the absence of this optimization, the general recommendation is to use the bulk loading tools of the graph you are using directly. For JanusGraph that likely means parallelizing the load yourself as part of a script or using a Gremlin OLAP method of loading. Some recommendations can be found in the JanusGraph Documentation as well as in these blog posts:

https://medium.com/@nitinpoddar/bulk-loading-data-into-janusgraph-ace7d146af05
https://medium.com/@nitinpoddar/bulk-loading-data-into-janusgraph-part-2-ca946db26582
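As a rough illustration of the kind of settings the JanusGraph bulk loading documentation discusses, a graph properties file might include entries like the ones below. The option names are JanusGraph configuration keys; the values are illustrative assumptions, not tuned recommendations for this cluster.

```properties
# Skips internal consistency checks and locking during the load.
# Only safe when the incoming data is already consistent.
storage.batch-loading=true

# With batch loading enabled, define the schema up front and turn off
# automatic type creation.
schema.default=none

# Larger id blocks mean fewer id-allocation round trips (value is an assumption).
ids.block-size=1000000

# Number of mutations buffered before they are sent to the backend (value is an assumption).
storage.buffer-size=4096
```

These would be set only for the duration of the load and reviewed afterwards, since batch loading trades safety checks for speed.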

You can also consider a custom VertexProgram for bulk loading. TinkerPop has the CloneVertexProgram, which is the more general successor to the BulkLoaderVertexProgram (now deprecated/removed in recent versions). The latter had some popularity with JanusGraph as its generalized bulk loading tool before TinkerPop moved away from trying to supply that sort of functionality.

At your scale of a few million edges, I probably would have written a small Groovy script that would run in Gremlin Console to just load my data directly to the graph and avoid trying to go through an intermediate format like Gryo first. It would probably go much faster and would save you from having to dig too far into bulk loading tactics for JanusGraph. If you choose that route, then that link to the JanusGraph Documentation I supplied above should be of most help to you. You can save worrying about using OLAP, Spark and other options until you have hundreds of millions of edges (or more) to load.
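The core of such a direct-load script is usually a loop that writes edges in fixed-size batches and commits after each batch, so no single transaction grows without bound. The sketch below shows only that pattern. It is written in Python rather than Groovy so it can run without a graph, and the `add_vertex`, `add_edge` and `commit` callables are hypothetical stand-ins for Gremlin calls such as `g.addV(...)`, `g.addE(...).from(...).to(...)` and `g.tx().commit()`. The `(source, target, label)` edge tuple and the batch size are also assumptions.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Edge = Tuple[str, str, str]  # (source id, target id, edge label): an assumed input shape


def chunked(items: Iterable, size: int) -> Iterator[List]:
    """Yield successive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def load_edges(
    edges: Iterable[Edge],
    add_vertex: Callable[[str], object],               # stand-in for g.addV(...).next()
    add_edge: Callable[[object, object, str], None],   # stand-in for g.addE(label).from(a).to(b)
    commit: Callable[[], None],                        # stand-in for g.tx().commit()
    batch_size: int = 10_000,
) -> int:
    """Write edges batch by batch, committing after each batch.

    Returns the number of commits issued. Vertices are created on first use
    and remembered by id; with ~100k vertices this lookup table stays small.
    """
    vertex_by_id: Dict[str, object] = {}
    commits = 0
    for batch in chunked(edges, batch_size):
        for source, target, label in batch:
            for vertex_id in (source, target):
                if vertex_id not in vertex_by_id:
                    vertex_by_id[vertex_id] = add_vertex(vertex_id)
            add_edge(vertex_by_id[source], vertex_by_id[target], label)
        commit()  # one commit per batch keeps each transaction bounded
        commits += 1
    return commits


if __name__ == "__main__":
    # In-memory stand-ins so the sketch runs on its own.
    created, written = [], []
    sample = [("a", "b", "knows"), ("b", "c", "knows"), ("a", "c", "likes")]
    n = load_edges(
        sample,
        add_vertex=lambda vid: created.append(vid) or vid,
        add_edge=lambda s, t, lbl: written.append((s, t, lbl)),
        commit=lambda: None,
        batch_size=2,
    )
    print(f"{len(created)} vertices, {len(written)} edges, {n} commits")
```

The same loop can be split across several workers, each taking its own slice of the edge list, which is one way of parallelizing the load yourself.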