GridGain transaction issues


In our project we are testing how transactions work in a distributed environment. As part of the project we are testing the open-source edition of GridGain 6.5.5.

We have run into many problems with the following test case:

  1. We are testing a cache without any additional rules.
  2. The cache stores an ID string as the key and a BigDecimal as the value.
  3. We are testing basic operations (addition and subtraction) on values of the first cache from 6, 12 and 18 clients. One operation looks like "subtract X from A, add X to B".
  4. The GridGain application is deployed as a .war file in WildFly.
  5. Clients connect over HTTP to the WildFly instance with GridGain deployed and send a list of operations to perform (we are testing batches of 1, 50, 500, 1000 and 5000 operations).
  6. We are testing clustered multi-node mode with transactions; the configuration files we used are attached below.
  7. We have tested pessimistic and optimistic transactions separately.
  8. We call the resulting values "consistent" if they are equal to those from the dummy mode: one client, batch=1, one node. We have a dummy program for cross-checking (its results in this mode are always equal to those of GridGain in local mode).

The issues are:

  1. If we run the transactions as-is (subtract from one key's value, add to another) we face two problems: deadlocks, and inconsistency when there are no deadlocks. The number of inconsistent values is small, but we can't avoid it: about 12 per 1000 key-values.
  2. If we sort the requests by key in each client (so the order of operations may change) we avoid both deadlocks and inconsistency. But we get other issues: if the batch is at least 500 operations, we get never-ending transaction failures. If the batch is small, GridGain fails completely (it doesn't respond to the current query).
  3. Everything works very slowly (about 6 seconds for batch=1000 operations) while there is almost no CPU load. Is that OK?

Our hardware:

8x Dell M620 blades, 256GB RAM, 2x8 core Xeon E2650v2, 10GbE network.

Attachments:

  1. GridGain optimistic config: https://gist.github.com/al-indigo/a2824aa62a3af8b18932
  2. GridGain pessimistic config: the same but with
  3. GridGain log for second issue: https://gist.github.com/al-indigo/233058772418fba8d341

There are 2 answers

Alexey On

(Moving from the comment)

To avoid deadlocks you need to make sure that every transaction acquires its locks in the same order. This must be done when working with transactions in any system of record, whether it is an Oracle database or a GridGain data grid.
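One way to get a consistent lock order is to collapse each batch into net per-key changes held in a sorted map, so that every client touches keys in ascending order. The following is only a sketch under stated assumptions: the `Transfer` and `OrderedBatch` names are hypothetical, plain Java collections are used instead of the GridGain API, and netting the deltas assumes only the final balances matter, not the intermediate ones.

```java
import java.math.BigDecimal;
import java.util.TreeMap;

/** One "subtract X from A, add X to B" operation (hypothetical type, for illustration only). */
final class Transfer {
    final String from;
    final String to;
    final BigDecimal amount;

    Transfer(String from, String to, BigDecimal amount) {
        this.from = from;
        this.to = to;
        this.amount = amount;
    }
}

final class OrderedBatch {
    /**
     * Collapses a batch into one net delta per key.
     * A TreeMap iterates in ascending key order, so two clients that process
     * overlapping key sets will request locks in the same order.
     */
    static TreeMap<String, BigDecimal> netDeltas(Iterable<Transfer> batch) {
        TreeMap<String, BigDecimal> deltas = new TreeMap<>();
        for (Transfer t : batch) {
            deltas.merge(t.from, t.amount.negate(), BigDecimal::add);
            deltas.merge(t.to, t.amount, BigDecimal::add);
        }
        return deltas;
    }
}
```

Iterating over the returned map inside the transaction then visits each key exactly once and always in the same order, regardless of the order in which the operations arrived.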



As for performance, it should be very fast. Most likely this is a matter of configuration. Could you provide a reproducible example? (You can use pastebin.com to share your code.)

Dmitriy On

I have looked at the logs and see the following JVM parameters:

-Xms64m -Xmx512m -XX:MaxPermSize=256m

Most likely you are running into long GC pauses, and that is the probable reason you are getting lock timeout exceptions. With memory settings like these, the JVM can go into GC pauses of as much as 5 minutes, during which all application threads are stopped and nothing can be done. To confirm this, you can collect GC logs using the following JVM options:

-Xloggc:/opt/server/logs/gc.log \
-verbose:gc \
-XX:+PrintGC \
-XX:+PrintGCTimeStamps \
-XX:+PrintGCDetails

My recommendation is to allocate at most about 10GB per JVM and start more JVM instances. You can also try the off-heap memory feature of GridGain and allocate a large memory space outside of the main Java heap: http://doc.gridgain.org/latest/Off-Heap+Memory. Also, please take a look at the GC tuning parameters here: http://doc.gridgain.org/latest/Performance+Tips#PerformanceTips-TuneGarbageCollection
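As a quick complement to GC logs, the heap limit and the accumulated GC time can also be read from inside the running JVM through the standard `java.lang.management` API. This is a minimal sketch; the `GcSnapshot` class name is hypothetical, and the totals are cumulative since JVM start, so they indicate overall GC cost rather than the length of any single pause.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcSnapshot {
    /** Maximum heap the JVM will try to use, in megabytes (reflects -Xmx). */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** Approximate total time spent in garbage collection so far, in milliseconds. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("max heap (MB): " + maxHeapMb());
        System.out.println("total GC time (ms): " + totalGcMillis());
    }
}
```

Logging these two values before and after a batch gives a rough idea of whether a slow batch coincided with heavy GC activity.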

Another important suggestion is that you should not do individual get(...) operations in your transaction, but make a single getAll(...) call instead:

try (GridCacheTx tx = balanceCache.txStart()) {

    /*
     * ==============================================
     * This while loop calls get(...) many times and acquires one lock at a time.
     * It should be replaced with one getAll(...) call.
     * ===============================================
     */
    while (changes.hasNext()) {
        Map.Entry<String, BigDecimal> ent = changes.next();

        BigDecimal oldBalance = balanceCache.get(ent.getKey());

        balanceCache.putx(ent.getKey(), oldBalance.add(ent.getValue()));
    }

    tx.commit();
} catch (GridException ex) {
    throw new Exception("transaction failed", ex);
}
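The shape of the suggested replacement can be sketched without the GridGain API: read all balances in one bulk call, compute the new values in memory, then write them back together. In this sketch a plain `Map` stands in for the cache, the `BulkBalanceUpdate` class and its method names are hypothetical, and treating a missing key as a zero balance is an assumption (the loop above would fail on a missing key instead).

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class BulkBalanceUpdate {
    /** Computes new balances from one bulk-read snapshot and the per-key deltas. */
    static Map<String, BigDecimal> computeUpdates(Map<String, BigDecimal> current,
                                                  SortedMap<String, BigDecimal> deltas) {
        Map<String, BigDecimal> updated = new TreeMap<>();
        for (Map.Entry<String, BigDecimal> e : deltas.entrySet()) {
            BigDecimal old = current.get(e.getKey());
            if (old == null) {
                old = BigDecimal.ZERO; // assumption: a missing key means a zero balance
            }
            updated.put(e.getKey(), old.add(e.getValue()));
        }
        return updated;
    }

    /** Stand-in for the transaction body: one bulk read, then one bulk write. */
    static void applyBatch(Map<String, BigDecimal> store, SortedMap<String, BigDecimal> deltas) {
        // Stand-in for a single bulk read such as getAll(deltas.keySet()).
        Map<String, BigDecimal> current = new HashMap<>();
        for (String key : deltas.keySet()) {
            BigDecimal value = store.get(key);
            if (value != null) {
                current.put(key, value);
            }
        }
        // Stand-in for a single bulk write of all new balances.
        store.putAll(computeUpdates(current, deltas));
    }
}
```

With a real cache, the body of `applyBatch` would sit between starting the transaction and `tx.commit()`, with the bulk read replaced by the cache's `getAll(...)` and the bulk write by whatever bulk put the cache offers.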