Gremlin: ConcurrentModificationException and multithreading

1.9k views Asked by At

My application is not live yet, so I'm testing the performance of my Gremlin queries before it gets into production.

To test I'm using a query that adds edges from one vertex to 300 other vertices. It does more things but that's the simple description. I added this mentioned workload of 300 just for testing.

If I run the query 300 times one after the other it takes almost 3 minutes to finish and 90.000 edges are created (300 x 300).

I'm worried because if I have like 60.000 users using my application at the same time they probably are going to be creating 90.000 edges in 2 minutes using this query and 60.000 users at the same time is not much in my case. If I have 1 million users at the same time I'm going to be needing many servers at full capacity, that is out of my budget.

Then I noticed that when my test is executing the CPU doesn't show much activity, I don't know why, I don't know how the DB works internally. So I thought that maybe a more real situation could be to call my queries all at the same time because that is what's going to happen with the real users, when I tried test that I got ConcurrentModificationException.

Is far as I understand this error happens because an edge or vertex is being read or written in 2 queries at the same time, this is something that could happen a lot in my application because all the user vertices are changing connections to the same 4 vertices all the time, this "collisions" are going to be happening all the time.

I'm testing in local using gremlin server 3.4.8 connecting using sockets with Node.js. My plan is to use AWS Neptune as my database when it goes to production.

What can I do to recover hope? There must be very important things about this subject that I don't know because I don't know how graph databases work internally.

Edit

I implemented a logic to retry the queries requests when receiving an error using the "Exponential Backoff" approach. It fixed the ConcurrentModificationException but there are a lot of problems in Gremlin Server when sending multiple queries at the same time that shows how multi-threading is totally unsupported and unstable in Gremlin Server and we should try multi-threading in other Gremlin compatible databases as the response says. I experienced random inconsistencies in the data returned by the queries and errors like NegativeArraySize and other random stuff coming from the database, be warned of this to not waste time thinking that your code could be bugged like it happened to me.

2

There are 2 answers

6
stephen mallette On BEST ANSWER

While TinkerPop and Gremlin try to provide a vendor agnostic experience, they really only do so at their interface level. So while you might be able to run the same query in JanusGraph, Neptune, CosmosDB, etc, you will likely find that there are differences in performance depending on the nature of the query and the degree to which the graph in question is capable of optimizing that query.

For your case, consider TinkerGraph as you are running your tests there locally. TinkerGraph is an in-memory graph without transaction capability and is not proven thread safe for writes. If you apply a heavy write workload to it, I could envision ConcurrentModificationException easy to generate. Now consider JanusGraph. If you had tested your heavy write workload with that, you might have found that you were struck with tons of TemporaryLockingException errors if your schema had required a unique property key and would have to modify your code to do transactional retries with exponential backoff.

The point here is that if your target graph is Neptune and you have a traversal you've tested for correctness and now are concerned about performance, it's probably time to load test on Neptune to see if any problems arise there.

I'm worried because if I have like 60.000 users using my application at the same time they probably are going to be creating 90.000 edges in 2 minutes using this query and 60.000 users at the same time is not much in my case. If I have 1 million users at the same time I'm going to be needing many servers at full capacity, that is out of my budget.

You will want to develop a realistic testing plan. Is 60,000 users all pressing "submit" to trigger this query at the exact same time really what's going to happen? Or is it more likely that you have 100,000 users with some doing reads and perhaps every half second three of them happens to click the "submit".

Your graph growth rate seems fairly high and the expected usage you've described here will put your graph in the category of billions of edges quite quickly (not to mention whatever other writes you might have). Have you tested your read workloads on a graph with billions of edges? Have you tested that explicitly on Neptune? Have you thought about how you will maintain a multi-billion edge graph (e.g. changing the schema when a new feature is needed, ensuring that it is growing correctly, etc.)?

All of these questions are rhetorical and just designed to make you think about your direction. Good luck!

0
Rahul Nimbal On

Although there is already an accepted answer I want to suggest another approach to deal with this problem.

Idea is to figure out whether you really need to do synchronous writes on the Graph. I'm suggesting that when you receive a request, just use the attributes in this request to fetch the sub-graph / neighbors, and continue with the business logic.

Simultaneously put the event in an SQS or something to have the writes taken care of by an Asynchronous system - say AWS Lambda. Because in SQS + Lambda you can decide the write concurrency to your systems' comfort (low enough that your query does NOT cause above exception).

Further suggestion: You have abnormally high write blast radius - your query should NOT touch that many nodes while writing. You can try converting some of the edges to vertices in order to reduce the radius. Then while inserting a node you'll just have to make one edge to the vertex which was previously an edge instead of making hundreds of edges to all the vertices this node is related to. I hope this makes sense.