Concurrency logging to sql DB - threads not running parallel

482 views Asked by At

I tested this with 2.14.0 and 2.13.3 I used the JDBC Appender in combination with the DynamicThresholdFilter and tried out a normal Logger and also the AsyncLogger.

In the JDBC Appender i also tried out the PoolingDriver and the ConnectionFactory approach.

It turns out that the Threads are not started parallel because of Log4j2.

Using the AsyncLogger even made it worse since the Output said that the Appender is not started and of 15.000 expected logs only 13.517 are in the DB.

To reproduce the issue i made a github repo see here: https://github.com/stefanwendelmann/Log4j_JDBC_Test

Threads

EDIT

I replaced the mssql-jdbc with a h2db and the threads dont block. JMC auto analysis say that there are locking instances of JdbcDatabaseManager.

Is there any configuration problem in my PoolableConnectionFactory for mssql-jdbc or is there a general problem with dbcp / jdbc driver pooling?

Edit 2

Created Ticket on Apaches LOGJ2 Jira: https://issues.apache.org/jira/browse/LOG4J2-3022

Edit 3

Added a longer flight recording for mssql and h2:file

https://github.com/stefanwendelmann/Log4j_JDBC_Test/blob/main/recording_local_docker_mssql_asynclogger_10000_runs.jfr

https://github.com/stefanwendelmann/Log4j_JDBC_Test/blob/main/recording_local_h2_file_asynclogger_10000_runs.jfr

2

There are 2 answers

6
saver On

I tested on MySQL db and I found lock on following method:

org.apache.logging.log4j.core.appender.db.AbstractDatabaseManager.write(org.apache.logging.log4j.core.LogEvent, java.io.Serializable) (line: 261)

because in the source code you can see synchronization on write method:

    /**
     * This method manages buffering and writing of events.
     *
     * @param event The event to write to the database.
     * @param serializable Serializable event
     */
    public final synchronized void write(final LogEvent event, final Serializable serializable) {
        if (isBuffered()) {
            buffer(event);
        } else {
            writeThrough(event, serializable);
        }
    }

I think if you specify buffer size it will increase throughput, because logs will be collected into batches and synchronization will be pretty low.

After updating log4j2 config file on using AsyncLogger you will see lock on:

org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueue(org.apache.logging.log4j.core.LogEvent, org.apache.logging.log4j.core.async.AsyncLoggerConfig) (line: 375)

and implementation of that method:

private void enqueue(final LogEvent logEvent, final AsyncLoggerConfig asyncLoggerConfig) {
        if (synchronizeEnqueueWhenQueueFull()) {
            synchronized (queueFullEnqueueLock) {
                disruptor.getRingBuffer().publishEvent(translator, logEvent, asyncLoggerConfig);
            }
        } else {
            disruptor.getRingBuffer().publishEvent(translator, logEvent, asyncLoggerConfig);
        }
    }

synchronizeEnqueueWhenQueueFull is true by default, and it produces locks on threads, you can manage these parameters:

    /**
     * LOG4J2-2606: Users encountered excessive CPU utilization with Disruptor v3.4.2 when the application
     * was logging more than the underlying appender could keep up with and the ringbuffer became full,
     * especially when the number of application threads vastly outnumbered the number of cores.
     * CPU utilization is significantly reduced by restricting access to the enqueue operation.
     */
    static final boolean ASYNC_LOGGER_SYNCHRONIZE_ENQUEUE_WHEN_QUEUE_FULL = PropertiesUtil.getProperties()
            .getBooleanProperty("AsyncLogger.SynchronizeEnqueueWhenQueueFull", true);
    static final boolean ASYNC_CONFIG_SYNCHRONIZE_ENQUEUE_WHEN_QUEUE_FULL = PropertiesUtil.getProperties()
            .getBooleanProperty("AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull", true);

But you should know about side effect of using these parameters, as mentioned in code snippet.

Ideas why can a DB become a bottleneck?:

  • remoteness db(vpn and etc.)
  • check what strategy is used for id column (SEQUENCE, TABLE, IDENTITY) to avoid additional db call
  • are there indexes on columns? (it can produce reindex operation on each transaction commit)
0
kolossus On

Thanks for getting the flight recordings up. This is a pretty interesting scenario, but I'm afraid I can't give conclusive answers, mostly because for some reason

  1. The information in your flight recordings is weirdly incomplete. I'll explain a little more shortly
  2. There seems to be other things going on in your system that may be muddying the diagnosis. You might benefit from killing any other running process in your machine.

So, what now/(TL;DR)

  1. You need to be sure that your connection source to the database is pooled
  2. Make sure you start your load test on a calm, clear-headed CPU
  3. Configure your next flight recording to take sufficient, intermittent thread dumps. This is probably the most important next step, if you're interested in figuring out what exactly all these threads are waiting for. Don't post up another flight recording until you're positive it contains multiple thread dumps that feature all the live threads in your JVM.
  4. Maybe 10k threads isn't reasonable for your local machine
  5. I also noticed from the flight recording that you have a heap size maxed at 7GB. If you're not on a 64bit OS, that could actually be harmful. A 32-bit OS can address a max of 4GB.
  6. Make sure there aren't any actual database failures causing the whole thing to thrash. Are you running out of connections? Are there any SQLExceptions blowing up somewhere? Any exceptions at all?

Here's what I could tell from your recordings:

CPU

Both flight recordings show that your CPU was struggling for good chunks of both of your recordings:

The MSSQL recording (46 mins total)

enter image description here

JFR even warns in the MSSQL recording that:

An average CPU load of 42 % was caused by other processes during 1 min 17 s starting at 2/18/21 7:28:58 AM.

The H2 recording (20.3s total)enter image description here

I noticed that your flight recordings are titled XXXX_10000. If this means "10k concurrent requests", it may simply mean that your machine simply can't deal with the load you're putting on it. You may also benefit from first ensuring that your cores don't have a bunch of other things hogging their time before you kick off another test. At any rate, hitting 100% CPU utilization is bound to cause lock contention as a matter of course, due to context switching. Your flight recording shows that you're running on an 8-core machine; but you noted that you're running a dockerized MSSQL. How many cores did you allocate to Docker?

Blocked Threads

There's a tonne of blocking in your setup, and there are smoking guns everywhere. The thread identified by Log4j2-TF-1-AsyncLoggerConfig-1 was blocked a lot by the garbage collector, just as the CPU was thrashing:

The H2 flight recording:

enter image description here

All but the last 3 ticks across that graph were blockings of the log4j2 thread. There was still significant blocking of the other pooled threads by GC (more on that further down)

The MSSQL flight recording had smoother GC, but both flight recordings featured blocking by GC and the consequent super high CPU utilization. One thing was clear from the MSSQL and H2 recording: every other pooled thread was blocked, waiting for a lock on the same object ID

For MSSQL, lock ID: 0x1304EA4F#40; for H2, lock ID: 0x21A1100D7D0

Every thread except the main thread and pool-1-thread-1 (which was blocked by garbage collection) exhibits this behavior.

These 7 threads are all waiting for the same object. There is definitely some blocking or even a deadlock somewhere in your setup.

The small specks of green also corroborate the intermittent transfer of monitor locks between the various threads, confirming that they're sort of deadlocked. The pane that shows the threads at the bottom gives a timeline of each thread's blockage. Red indicates blocked; green indicates running. If you hover over each thread's red portion, it shows you that

  1. The thread is blocked, waiting to acquire a lock (RED)
  2. The ID of the lock that the thread is trying to acquire and is currently unable
  3. The ID of the thread that last held the lock
  4. Green indicates a running, unblocked thread.

When you hover over the red slices in your flight recording, you'll see that they're all waiting to acquire the same lock. That lock is intermittently held between the various pooled threads.

MSSQL (threads blocked waiting for 0x1304EA4F40):

enter image description here

H2 (threads blocked waiting for 0x21A1100D7D0):

enter image description here

In both flight recordings, pool-1-thread-1 is the sole thread that isn't blocked while trying to acquire a lock. That blank row for pool-1-thread-1 is solely due to garbage collection, which I covered earlier.

Dumps

Ideally, your flight recordings should contain a bunch of thread dumps, especially the one that you ran for over 40 mins; never mind the 20s one. Unfortunately, both recordings contain just 2 recordings each; only one of each of them even contains the stacktrace for pool-1-thread-1. Singular thread dumps are worthless. You'll need multiple snapshots over a length of time to make use of them. With a thread dump (or a heap dump), one could identify what objects the IDs 0x1304EA4F40 and 0x21A1100D7D0 refer to. The most I could figure out from the dumps is that they're all waiting for an instance of "Object":

enter image description here

It literally could be anything. Your very first flight recording at least showed that the threads were locked on org.apache.logging.log4j.core.appender.db.jdbc.JdbcDatabaseManager:

enter image description here

That very first recording shows the same pattern in the locks pane, that all the threads were waiting for that single object:

enter image description here

That first recording also shows us what pool-1-thread-1 was up to at that one instant:

enter image description here

From there, I would hazard a guess that that thread was in the middle of closing a database connection? Nothing conclusive can be said until multiple successive thread dumps show the thread activity over a span of time.