We're using ZGC on JDK 17.0.1. The application successfully runs without unexpected GC stalls for a period of time, after which it gets into a state where the ZGC does not free much memory when the GC is being done. This results in multiple periodic Allocation Stalls within the application from then on.
The application itself does a lot of allocation at a periodic interval of around 35 milliseconds. This is roughly 2GB/s allocation when it’s at peak load. On a typical start in the morning, there is about 30 minutes in which the load is low, till a point at which external data starts arriving, and load is high from then on. The load profile does change as the application is running as a result. Due to the predictable allocation during load, we've set the collection at an interval using the ZCollectionInterval.
We cannot consistently reproduce the Allocation Stalls, even between different machines where the load is very similar. Oddly, if the application is restarted while the load is already high, then it does not typically run into issues.
We did play around with the UseDynamicNumberOfGCThreads as well previously, but it seems to come up with a number of threads that is 2. The chart attached uses SoftMaxHeapSize=1G. This is a bit aggressive, but we did see the same issue without setting the SofMaxHeapSize. We did also notice that the Concurrent Mark phase of the GC seems to increase leading up to this issue. We also didn’t see anything of note about ZGC within the Release notes of JDK versions up till 17.0.5.
Has anyone run into similar issues before? We're unsure as to why this happens or if this is expected behavior when setting the flags as such. As much hasn't been set and the rarity at which it occurs, we were hoping some one can give some pointers as to if this is actually a known issue with an existing fix or we're just setting it up wrong.
These are the relevant flags.
-XX:-OmitStackTraceInFastThrow
-XX:CICompilerCount=2
-XX:+DisableExplicitGC
-XX:ParallelGCThreads=4
-XX:ConcGCThreads=4
-Xmx8192m
-Xms8192m
-Xss8m
-Xlog:gc*=info:file=${LT_APP_LOG_PATH_FORMAT}.gc:time,level,tags:filecount=32,filesize=64M
-XX:+UseZGC
-XX:+AlwaysPreTouch
-XX:+UseLargePages
-XX:+UseTransparentHugePages
-XX:ZCollectionInterval=4
-Djdk.nio.maxCachedBufferSize=262144
-XX:+HeapDumpOnOutOfMemoryError
-XX:SoftMaxHeapSize=1G
[2022-11-24T09:21:12.378+0530][info][gc,phases ] GC(669) Concurrent Select Relocation Set 8.713ms
[2022-11-24T09:21:12.378+0530][info][gc,phases ] GC(669) Pause Relocate Start 0.008ms
[2022-11-24T09:21:12.420+0530][info][gc,phases ] GC(669) Concurrent Relocate 41.564ms
[2022-11-24T09:21:12.420+0530][info][gc,load ] GC(669) Load: 36.38/31.86/25.10
[2022-11-24T09:21:12.420+0530][info][gc,mmu ] GC(669) MMU: 2ms/99.1%, 5ms/99.6%, 10ms/99.8%, 20ms/99.9%, 50ms/99.9%, 100ms/100.0%
[2022-11-24T09:21:12.420+0530][info][gc,marking ] GC(669) Mark: 4 stripe(s), 2 proactive flush(es), 1 terminate flush(es), 0 completion(s), 0 continuation(s)
[2022-11-24T09:21:12.420+0530][info][gc,marking ] GC(669) Mark Stack Usage: 32M
[2022-11-24T09:21:12.420+0530][info][gc,nmethod ] GC(669) NMethods: 10353 registered, 328 unregistered
[2022-11-24T09:21:12.420+0530][info][gc,metaspace] GC(669) Metaspace: 44M used, 46M committed, 1072M reserved
[2022-11-24T09:21:12.420+0530][info][gc,ref ] GC(669) Soft: 2311 encountered, 98 discovered, 0 enqueued
[2022-11-24T09:21:12.420+0530][info][gc,ref ] GC(669) Weak: 2559 encountered, 652 discovered, 0 enqueued
[2022-11-24T09:21:12.420+0530][info][gc,ref ] GC(669) Final: 56 encountered, 7 discovered, 0 enqueued
[2022-11-24T09:21:12.420+0530][info][gc,ref ] GC(669) Phantom: 27425 encountered, 27313 discovered, 27030 enqueued
[2022-11-24T09:21:12.420+0530][info][gc,reloc ] GC(669) Small Pages: 5764 / 11528M, Empty: 4752M, Relocated: 58M, In-Place: 0
[2022-11-24T09:21:12.420+0530][info][gc,reloc ] GC(669) Medium Pages: 3 / 96M, Empty: 64M, Relocated: 0M, In-Place: 0
[2022-11-24T09:21:12.420+0530][info][gc,reloc ] GC(669) Large Pages: 0 / 0M, Empty: 0M, Relocated: 0M, In-Place: 0
[2022-11-24T09:21:12.420+0530][info][gc,reloc ] GC(669) Forwarding Usage: 29M
[2022-11-24T09:21:12.420+0530][info][gc,heap ] GC(669) Min Capacity: 16384M(100%)
[2022-11-24T09:21:12.420+0530][info][gc,heap ] GC(669) Max Capacity: 16384M(100%)
[2022-11-24T09:21:12.420+0530][info][gc,heap ] GC(669) Soft Max Capacity: 1024M(6%)
[2022-11-24T09:21:12.420+0530][info][gc,heap ] GC(669) Mark Start Mark End Relocate Start Relocate End High Low
[2022-11-24T09:21:12.420+0530][info][gc,heap ] GC(669) Capacity: 16384M (100%) 16384M (100%) 16384M (100%) 16384M (100%) 16384M (100%) 16384M (100%)
[2022-11-24T09:21:12.420+0530][info][gc,heap ] GC(669) Free: 4760M (29%) 0M (0%) 4788M (29%) 5392M (33%) 5394M (33%) 0M (0%)
[2022-11-24T09:21:12.420+0530][info][gc,heap ] GC(669) Used: 11624M (71%) 16384M (100%) 11596M (71%) 10992M (67%) 16384M (100%) 10990M (67%)
[2022-11-24T09:21:12.420+0530][info][gc,heap ] GC(669) Live: - 5860M (36%) 5860M (36%) 5860M (36%) - -
[2022-11-24T09:21:12.420+0530][info][gc,heap ] GC(669) Allocated: - 4760M (29%) 4788M (29%) 4858M (30%) - -
[2022-11-24T09:21:12.420+0530][info][gc,heap ] GC(669) Garbage: - 5763M (35%) 947M (6%) 273M (2%) - -
[2022-11-24T09:21:12.420+0530][info][gc,heap ] GC(669) Reclaimed: - - 4816M (29%) 5490M (34%) - -
[2022-11-24T09:21:12.420+0530][info][gc ] GC(669) Garbage Collection (Allocation Stall) 11624M(71%)->10992M(67%)
[2022-11-24T09:21:12.479+0530][info][gc,start ] GC(670) Garbage Collection (Allocation Stall)
[2022-11-24T09:21:12.479+0530][info][gc,ref ] GC(670) Clearing All SoftReferences
[2022-11-24T09:21:12.479+0530][info][gc,task ] GC(670) Using 4 workers
[2022-11-24T09:21:12.479+0530][info][gc,phases ] GC(670) Pause Mark Start 0.009ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] === Garbage Collection Statistics =======================================================================================================================
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Last 10s Last 10m Last 10h Total
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Avg / Max Avg / Max Avg / Max Avg / Max
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Collector: Garbage Collection Cycle 3410.686 / 3486.475 844.670 / 3486.475 464.952 / 3486.475 464.952 / 3486.475 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Contention: Mark Segment Reset Contention 7 / 15 3 / 30 1 / 30 1 / 30 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Contention: Mark SeqNum Reset Contention 0 / 1 0 / 3 0 / 3 0 / 3 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Critical: Allocation Stall 1 / 5 0 / 7 0 / 7 0 / 7 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Critical: Allocation Stall 677.602 / 900.858 303.367 / 900.858 303.367 / 900.858 303.367 / 900.858 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Critical: GC Locker Stall 0 / 0 0 / 0 0 / 1 0 / 1 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Critical: GC Locker Stall 0.000 / 0.000 0.000 / 0.000 0.033 / 0.058 0.033 / 0.058 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Critical: Relocation Stall 0 / 0 0 / 0 0 / 0 0 / 0 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Critical: Relocation Stall 0.000 / 0.000 0.000 / 0.000 0.000 / 0.000 0.000 / 0.000 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Memory: Allocation Rate 1479 / 1860 1003 / 2264 329 / 2264 329 / 2264 MB/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Memory: Out Of Memory 0 / 0 0 / 0 0 / 0 0 / 0 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Memory: Page Cache Flush 0 / 0 0 / 32 0 / 32 0 / 32 MB/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Memory: Page Cache Hit L1 198 / 827 177 / 901 59 / 901 59 / 901 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Memory: Page Cache Hit L2 541 / 930 316 / 1052 103 / 1052 103 / 1052 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Memory: Page Cache Hit L3 1 / 15 12 / 657 4 / 657 4 / 657 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Memory: Page Cache Miss 0 / 0 0 / 1 0 / 1 0 / 1 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Memory: Uncommit 0 / 0 0 / 0 0 / 0 0 / 0 MB/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Memory: Undo Object Allocation Failed 0 / 0 0 / 1 0 / 1 0 / 1 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Memory: Undo Object Allocation Succeeded 0 / 3 4 / 855 1 / 855 1 / 855 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Memory: Undo Page Allocation 0 / 3 0 / 4 0 / 4 0 / 4 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Phase: Concurrent Mark 3356.120 / 3437.684 812.574 / 3437.684 445.012 / 3437.684 445.012 / 3437.684 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Phase: Concurrent Mark Continue 0.000 / 0.000 0.000 / 0.000 0.000 / 0.000 0.000 / 0.000 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Phase: Concurrent Mark Free 0.001 / 0.001 0.001 / 0.002 0.001 / 0.002 0.001 / 0.002 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Phase: Concurrent Process Non-Strong References 8.650 / 8.848 12.746 / 47.124 8.067 / 47.124 8.067 / 47.124 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Phase: Concurrent Relocate 36.550 / 41.564 14.436 / 55.995 8.357 / 55.995 8.357 / 55.995 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Phase: Concurrent Reset Relocation Set 0.067 / 0.080 0.023 / 0.080 0.013 / 0.080 0.013 / 0.080 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Phase: Concurrent Select Relocation Set 7.214 / 8.713 3.382 / 29.365 2.579 / 29.365 2.579 / 29.365 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Phase: Pause Mark End 0.011 / 0.011 0.011 / 0.018 0.009 / 0.018 0.009 / 0.018 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Phase: Pause Mark Start 0.008 / 0.009 0.008 / 0.016 0.008 / 0.016 0.008 / 0.016 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Phase: Pause Relocate Start 0.008 / 0.008 0.007 / 0.013 0.006 / 0.013 0.006 / 0.013 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Concurrent Classes Purge 0.266 / 0.497 0.411 / 5.255 0.251 / 5.255 0.251 / 5.255 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Concurrent Classes Unlink 5.018 / 5.358 9.216 / 28.725 5.991 / 28.725 5.991 / 28.725 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Concurrent Mark 3352.212 / 3437.123 855.661 / 3437.123 454.341 / 3437.123 454.341 / 3437.123 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Concurrent Mark Try Flush 0.030 / 0.055 0.760 / 42.368 0.442 / 42.368 0.442 / 42.368 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Concurrent Mark Try Terminate 0.703 / 1.059 1.393 / 27.549 1.126 / 27.549 1.126 / 27.549 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Concurrent References Enqueue 0.003 / 0.004 0.003 / 0.013 0.002 / 0.013 0.002 / 0.013 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Concurrent References Process 3.055 / 3.524 1.183 / 9.454 0.655 / 9.454 0.655 / 9.454 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Concurrent Roots ClassLoaderDataGraph 0.090 / 0.223 0.135 / 12.400 0.105 / 12.400 0.105 / 12.400 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Concurrent Roots CodeCache 0.000 / 0.000 0.000 / 0.000 0.000 / 0.000 0.000 / 0.000 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Concurrent Roots JavaThreads 0.411 / 1.162 0.520 / 12.391 0.407 / 12.391 0.407 / 12.391 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Concurrent Roots OopStorageSet 0.018 / 0.038 0.051 / 12.043 0.030 / 12.043 0.030 / 12.043 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Concurrent Weak Roots OopStorageSet 0.259 / 0.263 0.449 / 4.890 0.331 / 4.890 0.331 / 4.890 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Pause Mark Try Complete 0.000 / 0.000 0.000 / 0.000 0.000 / 0.000 0.000 / 0.000 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] System: Java Threads 28 / 28 26 / 29 21 / 29 21 / 29 threads
[2022-11-24T09:21:13.180+0530][info][gc,stats ] =========================================================================================================================================================
[2022-11-24T09:21:15.944+0530][info][gc,phases ] GC(670) Concurrent Mark 3465.071ms
[2022-11-24T09:21:15.944+0530][info][gc,phases ] GC(670) Pause Mark End 0.011ms
[2022-11-24T09:21:15.944+0530][info][gc,phases ] GC(670) Concurrent Mark Free 0.001ms
[2022-11-24T09:21:15.953+0530][info][gc,phases ] GC(670) Concurrent Process Non-Strong References 8.537ms
[2022-11-24T09:21:15.953+0530][info][gc,phases ] GC(670) Concurrent Reset Relocation Set 0.062ms
[2022-11-24T09:21:15.953+0530][info][gc ] Allocation Stall (lt-primary-pool-2) 704.524ms
[2022-11-24T09:21:15.953+0530][info][gc ] Allocation Stall (lt-primary-pool-0) 704.168ms
[2022-11-24T09:21:15.953+0530][info][gc ] Allocation Stall (lt-secondary-pool-0) 711.668ms
[2022-11-24T09:21:15.955+0530][info][gc ] Allocation Stall (lt-primary-pool-1) 707.109ms
[2022-11-24T09:21:15.955+0530][info][gc ] Allocation Stall (lt-secondary-pool-1) 713.196ms
[2022-11-24T09:21:15.957+0530][info][gc,phases ] GC(670) Concurrent Select Relocation Set 4.082ms
java version "17.0.1" 2021-10-19 LTS
Java(TM) SE Runtime Environment (build 17.0.1+12-LTS-39)
Java HotSpot(TM) 64-Bit Server VM (build 17.0.1+12-LTS-39, mixed mode, sharing)
Graphs created with GCViewer on the GC logs.
We did not get to a reason for the issues, however, we did make some progress in examining the issue closer.
When examing the ZGC code for the OpenJDK from github.
https://github.com/openjdk/zgc/blob/1f8813495e8184b6c38319df1c2cb70de7811a76/src/hotspot/share/gc/z/zDirector.cpp#L184
It has code to work out an allocation rate. We also enabled the debug logging of the ZGC.
The logging that looks like the final lines below, predicts the allocation rate to be 0 for extended periods of time for the application start. We suspect this might be one of the reasons for the issue. However, in looking at the GCGraphs it doesn't fully explain why it gets into some panic state when there is enough memory during normal function.
We set the
-XX:ZAllocationSpikeTolerance=10
to ensure that the estimated memory use would get close to our normal allocation rate of 2000MB/s. The predicted memory usage for the ZGC from that point on appeared to be correct. We've run this settings for a while now on our services without issues so far. We also removed the-XX:SoftMaxHeapSize=
as the GC frequency required seems to be correctly estimated by just theXX:ZAllocationSpikeTolerance
.Example debug logging for Predicted allocation rate.