I'm trying to import about 1 GB of graph data (~100k vertices, 3.6 million edges) in Gryo format. When I try to import it through the Gremlin Console, I get the following error:
gremlin> graph.io(IoCore.gryo()).readGraph('janusgraph_dump_2020_09_30_local.gryo')
GC overhead limit exceeded
Type ':help' or ':h' for help.
Display stack trace? [yN]y
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at org.cliffc.high_scale_lib.NonBlockingHashMapLong$CHM.<init>(NonBlockingHashMapLong.java:471)
        at org.cliffc.high_scale_lib.NonBlockingHashMapLong.initialize(NonBlockingHashMapLong.java:241)
Gremlin Server and Cassandra details are as follows:
Gremlin-Server:
JanusGraph version: 0.5.2, Gremlin version: 3.4.6
Heap: JAVA_OPTIONS="-Xms4G -Xmx4G …
// gremlin conf
threadPoolWorker: 8
gremlinPool: 16
scriptEvaluationTimeout: 90000
// cql props
query.batch=true
Cassandra is in a cluster with 3 nodes
Cassandra version: 3.11.0
Node1: RAM: 8GB, Cassandra Heap: 1GB (-Xms1G -Xmx1G)
Node2: RAM: 8GB, Cassandra Heap: 4GB (-Xms4G -Xmx4G)
Node3: RAM: 8GB, Cassandra Heap: 4GB (-Xms4G -Xmx4G)
Each node has Gremlin Server installed (with a load balancer in front for clients), but we are executing the Gremlin queries on Node1.
Can someone help me on the following:
What do I need to do to import this data (any configuration changes)?
What is the best way to export/import huge amounts of data into JanusGraph (Gremlin Server)? (This is the question I most need answered.)
Is there any way I can export the data in chunks and import it in chunks?
Thanks in advance.
Edit:
I've increased the Gremlin Server heap on Node1 to 2 GB, and the import query was still cancelled. Perhaps there isn't enough RAM on that node for both Gremlin Server and Cassandra, which is why I had kept the heap at 1 GB so that queries would at least execute.
Considering really huge data sets (billions of vertices/edges), this hardware is very modest, but I hope 8 GB RAM and 2-4 cores per node would be sufficient for the cluster.
Graph.io() and the now preferred Gremlin io() step use the GryoReader to read your file (unless the graph provider overrides the latter io() step, and I don't think that JanusGraph does). So, if you use GryoReader, you typically end up needing a lot of memory (more than you would expect) because it holds a cache of all vertices to speed loading. Ultimately, it is not terribly efficient at loading, and the expectation from TinkerPop's perspective has been that providers would optimize loading with their own native bulk loader by intercepting the io() step when encountered. In the absence of that optimization, the general recommendation is to use the bulk loading tools of the graph you are using directly. For JanusGraph, that likely means parallelizing the load yourself as part of a script, or using a Gremlin OLAP method of loading. Some recommendations can be found in the JanusGraph documentation as well as in these blog posts:

https://medium.com/@nitinpoddar/bulk-loading-data-into-janusgraph-ace7d146af05
https://medium.com/@nitinpoddar/bulk-loading-data-into-janusgraph-part-2-ca946db26582
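To make the "parallelize the load yourself as part of a script" idea a bit more concrete, here is a minimal sketch in Groovy that could be run from JanusGraph's Gremlin Console. It opens the graph with JanusGraph's storage.batch-loading option enabled and loads vertices in fixed-size chunks, one transaction per chunk, from a small thread pool. The hostnames, file name, label, property keys and batch sizes are all placeholders for whatever your data actually looks like, not something specific to your dump:

```groovy
// Sketch only: chunked, parallel vertex load into JanusGraph.
// Hostnames, file name, label and property keys are placeholders.
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

graph = JanusGraphFactory.build().
        set('storage.backend', 'cql').
        set('storage.hostname', 'node1,node2,node3').
        set('storage.batch-loading', true).   // relaxes consistency checks for bulk loads
        open()

batchSize = 10000
pool = Executors.newFixedThreadPool(4)

// one task per chunk of lines; each task commits its own transaction
new File('vertices.csv').readLines().collate(batchSize).each { chunk ->
    pool.submit {
        def tx = graph.newTransaction()
        chunk.each { line ->
            def (id, name) = line.tokenize(',')
            def v = tx.addVertex('person')
            v.property('personId', id)
            v.property('name', name)
        }
        tx.commit()
    }
}
pool.shutdown()
pool.awaitTermination(1, TimeUnit.HOURS)
```

Edges could be loaded the same way in a second pass by looking up both endpoints by personId; giving that key a composite index first keeps those lookups from turning into full scans.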
You can also consider a custom VertexProgram for bulk loading. TinkerPop has the CloneVertexProgram, the more general successor to the BulkLoaderVertexProgram (now deprecated/removed in recent versions), which had some popularity with JanusGraph as its generalized bulk loading tool before TinkerPop moved away from trying to supply that sort of functionality.

At your scale of a few million edges, I probably would have written a small Groovy script to run in the Gremlin Console that loads the data directly into the graph, avoiding an intermediate format like Gryo altogether. It would probably go much faster and would save you from having to dig too far into bulk loading tactics for JanusGraph. If you choose that route, the link to the JanusGraph documentation I supplied above should be of most help to you. You can save worrying about OLAP, Spark and other options until you have hundreds of millions of edges (or more) to load.
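As a rough sketch of what such a console script might look like, assuming the source data were exported to two CSV files (the file names, labels and property keys below are made up, and the properties file path is just the one shipped with the JanusGraph distribution):

```groovy
// Sketch only: single-threaded direct load with periodic commits.
// File names, labels and property keys are placeholders for your data.
graph = JanusGraphFactory.open('conf/janusgraph-cql.properties')
g = graph.traversal()

count = 0
new File('vertices.csv').eachLine { line ->
    def (id, name) = line.tokenize(',')
    g.addV('person').property('personId', id).property('name', name).iterate()
    if (++count % 10000 == 0) graph.tx().commit()   // commit in batches, not per element
}
graph.tx().commit()

count = 0
new File('edges.csv').eachLine { line ->
    def (fromId, toId) = line.tokenize(',')
    g.V().has('personId', fromId).as('a').
      V().has('personId', toId).
      addE('knows').from('a').iterate()
    if (++count % 10000 == 0) graph.tx().commit()
}
graph.tx().commit()
```

A pattern like this should cope with a few million edges comfortably; it is only at hundreds of millions of edges that the OLAP/Spark routes become worth the extra machinery.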