JDK 17.0.1 ZGC allocation stall on fast allocation rate

1.4k views Asked by At

We're using ZGC on JDK 17.0.1. The application successfully runs without unexpected GC stalls for a period of time, after which it gets into a state where the ZGC does not free much memory when the GC is being done. This results in multiple periodic Allocation Stalls within the application from then on.

The application itself does a lot of allocation at a periodic interval of around 35 milliseconds. This is roughly 2GB/s allocation when it’s at peak load. On a typical start in the morning, there is about 30 minutes in which the load is low, till a point at which external data starts arriving, and load is high from then on. The load profile does change as the application is running as a result. Due to the predictable allocation during load, we've set the collection at an interval using the ZCollectionInterval.

We cannot consistently reproduce the Allocation Stalls, even between different machines where the load is very similar. Oddly, if the application is restarted while the load is already high, then it does not typically run into issues.

We did play around with the UseDynamicNumberOfGCThreads as well previously, but it seems to come up with a number of threads that is 2. The chart attached uses SoftMaxHeapSize=1G. This is a bit aggressive, but we did see the same issue without setting the SofMaxHeapSize. We did also notice that the Concurrent Mark phase of the GC seems to increase leading up to this issue. We also didn’t see anything of note about ZGC within the Release notes of JDK versions up till 17.0.5.

Has anyone run into similar issues before? We're unsure as to why this happens or if this is expected behavior when setting the flags as such. As much hasn't been set and the rarity at which it occurs, we were hoping some one can give some pointers as to if this is actually a known issue with an existing fix or we're just setting it up wrong.

These are the relevant flags.

-XX:-OmitStackTraceInFastThrow
-XX:CICompilerCount=2
-XX:+DisableExplicitGC
-XX:ParallelGCThreads=4
-XX:ConcGCThreads=4
-Xmx8192m
-Xms8192m
-Xss8m
-Xlog:gc*=info:file=${LT_APP_LOG_PATH_FORMAT}.gc:time,level,tags:filecount=32,filesize=64M
-XX:+UseZGC
-XX:+AlwaysPreTouch
-XX:+UseLargePages
-XX:+UseTransparentHugePages
-XX:ZCollectionInterval=4
-Djdk.nio.maxCachedBufferSize=262144
-XX:+HeapDumpOnOutOfMemoryError
-XX:SoftMaxHeapSize=1G
[2022-11-24T09:21:12.378+0530][info][gc,phases   ] GC(669) Concurrent Select Relocation Set 8.713ms
[2022-11-24T09:21:12.378+0530][info][gc,phases   ] GC(669) Pause Relocate Start 0.008ms
[2022-11-24T09:21:12.420+0530][info][gc,phases   ] GC(669) Concurrent Relocate 41.564ms
[2022-11-24T09:21:12.420+0530][info][gc,load     ] GC(669) Load: 36.38/31.86/25.10
[2022-11-24T09:21:12.420+0530][info][gc,mmu      ] GC(669) MMU: 2ms/99.1%, 5ms/99.6%, 10ms/99.8%, 20ms/99.9%, 50ms/99.9%, 100ms/100.0%
[2022-11-24T09:21:12.420+0530][info][gc,marking  ] GC(669) Mark: 4 stripe(s), 2 proactive flush(es), 1 terminate flush(es), 0 completion(s), 0 continuation(s) 
[2022-11-24T09:21:12.420+0530][info][gc,marking  ] GC(669) Mark Stack Usage: 32M
[2022-11-24T09:21:12.420+0530][info][gc,nmethod  ] GC(669) NMethods: 10353 registered, 328 unregistered
[2022-11-24T09:21:12.420+0530][info][gc,metaspace] GC(669) Metaspace: 44M used, 46M committed, 1072M reserved
[2022-11-24T09:21:12.420+0530][info][gc,ref      ] GC(669) Soft: 2311 encountered, 98 discovered, 0 enqueued
[2022-11-24T09:21:12.420+0530][info][gc,ref      ] GC(669) Weak: 2559 encountered, 652 discovered, 0 enqueued
[2022-11-24T09:21:12.420+0530][info][gc,ref      ] GC(669) Final: 56 encountered, 7 discovered, 0 enqueued
[2022-11-24T09:21:12.420+0530][info][gc,ref      ] GC(669) Phantom: 27425 encountered, 27313 discovered, 27030 enqueued
[2022-11-24T09:21:12.420+0530][info][gc,reloc    ] GC(669) Small Pages: 5764 / 11528M, Empty: 4752M, Relocated: 58M, In-Place: 0
[2022-11-24T09:21:12.420+0530][info][gc,reloc    ] GC(669) Medium Pages: 3 / 96M, Empty: 64M, Relocated: 0M, In-Place: 0
[2022-11-24T09:21:12.420+0530][info][gc,reloc    ] GC(669) Large Pages: 0 / 0M, Empty: 0M, Relocated: 0M, In-Place: 0
[2022-11-24T09:21:12.420+0530][info][gc,reloc    ] GC(669) Forwarding Usage: 29M
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669) Min Capacity: 16384M(100%)
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669) Max Capacity: 16384M(100%)
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669) Soft Max Capacity: 1024M(6%)
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669)                Mark Start          Mark End        Relocate Start      Relocate End           High               Low         
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669)  Capacity:    16384M (100%)      16384M (100%)      16384M (100%)      16384M (100%)      16384M (100%)      16384M (100%)   
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669)      Free:     4760M (29%)           0M (0%)         4788M (29%)        5392M (33%)        5394M (33%)           0M (0%)     
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669)      Used:    11624M (71%)       16384M (100%)      11596M (71%)       10992M (67%)       16384M (100%)      10990M (67%)    
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669)      Live:         -              5860M (36%)        5860M (36%)        5860M (36%)            -                  -          
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669) Allocated:         -              4760M (29%)        4788M (29%)        4858M (30%)            -                  -          
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669)   Garbage:         -              5763M (35%)         947M (6%)          273M (2%)             -                  -          
[2022-11-24T09:21:12.420+0530][info][gc,heap     ] GC(669) Reclaimed:         -                  -              4816M (29%)        5490M (34%)            -                  -          
[2022-11-24T09:21:12.420+0530][info][gc          ] GC(669) Garbage Collection (Allocation Stall) 11624M(71%)->10992M(67%)
[2022-11-24T09:21:12.479+0530][info][gc,start    ] GC(670) Garbage Collection (Allocation Stall)
[2022-11-24T09:21:12.479+0530][info][gc,ref      ] GC(670) Clearing All SoftReferences
[2022-11-24T09:21:12.479+0530][info][gc,task     ] GC(670) Using 4 workers
[2022-11-24T09:21:12.479+0530][info][gc,phases   ] GC(670) Pause Mark Start 0.009ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ] === Garbage Collection Statistics =======================================================================================================================
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]                                                              Last 10s              Last 10m              Last 10h                Total
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]                                                              Avg / Max             Avg / Max             Avg / Max             Avg / Max
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]   Collector: Garbage Collection Cycle                   3410.686 / 3486.475    844.670 / 3486.475    464.952 / 3486.475    464.952 / 3486.475    ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]  Contention: Mark Segment Reset Contention                     7 / 15                3 / 30                1 / 30                1 / 30          ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]  Contention: Mark SeqNum Reset Contention                      0 / 1                 0 / 3                 0 / 3                 0 / 3           ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Critical: Allocation Stall                                  1 / 5                 0 / 7                 0 / 7                 0 / 7           ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Critical: Allocation Stall                            677.602 / 900.858     303.367 / 900.858     303.367 / 900.858     303.367 / 900.858     ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Critical: GC Locker Stall                                   0 / 0                 0 / 0                 0 / 1                 0 / 1           ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Critical: GC Locker Stall                               0.000 / 0.000         0.000 / 0.000         0.033 / 0.058         0.033 / 0.058       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Critical: Relocation Stall                                  0 / 0                 0 / 0                 0 / 0                 0 / 0           ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Critical: Relocation Stall                              0.000 / 0.000         0.000 / 0.000         0.000 / 0.000         0.000 / 0.000       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Allocation Rate                                1479 / 1860           1003 / 2264            329 / 2264            329 / 2264        MB/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Out Of Memory                                     0 / 0                 0 / 0                 0 / 0                 0 / 0           ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Page Cache Flush                                  0 / 0                 0 / 32                0 / 32                0 / 32          MB/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Page Cache Hit L1                               198 / 827             177 / 901              59 / 901              59 / 901         ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Page Cache Hit L2                               541 / 930             316 / 1052            103 / 1052            103 / 1052        ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Page Cache Hit L3                                 1 / 15               12 / 657               4 / 657               4 / 657         ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Page Cache Miss                                   0 / 0                 0 / 1                 0 / 1                 0 / 1           ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Uncommit                                          0 / 0                 0 / 0                 0 / 0                 0 / 0           MB/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Undo Object Allocation Failed                     0 / 0                 0 / 1                 0 / 1                 0 / 1           ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Undo Object Allocation Succeeded                  0 / 3                 4 / 855               1 / 855               1 / 855         ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      Memory: Undo Page Allocation                              0 / 3                 0 / 4                 0 / 4                 0 / 4           ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Concurrent Mark                            3356.120 / 3437.684    812.574 / 3437.684    445.012 / 3437.684    445.012 / 3437.684    ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Concurrent Mark Continue                      0.000 / 0.000         0.000 / 0.000         0.000 / 0.000         0.000 / 0.000       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Concurrent Mark Free                          0.001 / 0.001         0.001 / 0.002         0.001 / 0.002         0.001 / 0.002       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Concurrent Process Non-Strong References      8.650 / 8.848        12.746 / 47.124        8.067 / 47.124        8.067 / 47.124      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Concurrent Relocate                          36.550 / 41.564       14.436 / 55.995        8.357 / 55.995        8.357 / 55.995      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Concurrent Reset Relocation Set               0.067 / 0.080         0.023 / 0.080         0.013 / 0.080         0.013 / 0.080       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Concurrent Select Relocation Set              7.214 / 8.713         3.382 / 29.365        2.579 / 29.365        2.579 / 29.365      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Pause Mark End                                0.011 / 0.011         0.011 / 0.018         0.009 / 0.018         0.009 / 0.018       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Pause Mark Start                              0.008 / 0.009         0.008 / 0.016         0.008 / 0.016         0.008 / 0.016       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]       Phase: Pause Relocate Start                          0.008 / 0.008         0.007 / 0.013         0.006 / 0.013         0.006 / 0.013       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Classes Purge                      0.266 / 0.497         0.411 / 5.255         0.251 / 5.255         0.251 / 5.255       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Classes Unlink                     5.018 / 5.358         9.216 / 28.725        5.991 / 28.725        5.991 / 28.725      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Mark                            3352.212 / 3437.123    855.661 / 3437.123    454.341 / 3437.123    454.341 / 3437.123    ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Mark Try Flush                     0.030 / 0.055         0.760 / 42.368        0.442 / 42.368        0.442 / 42.368      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Mark Try Terminate                 0.703 / 1.059         1.393 / 27.549        1.126 / 27.549        1.126 / 27.549      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent References Enqueue                 0.003 / 0.004         0.003 / 0.013         0.002 / 0.013         0.002 / 0.013       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent References Process                 3.055 / 3.524         1.183 / 9.454         0.655 / 9.454         0.655 / 9.454       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Roots ClassLoaderDataGraph         0.090 / 0.223         0.135 / 12.400        0.105 / 12.400        0.105 / 12.400      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Roots CodeCache                    0.000 / 0.000         0.000 / 0.000         0.000 / 0.000         0.000 / 0.000       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Roots JavaThreads                  0.411 / 1.162         0.520 / 12.391        0.407 / 12.391        0.407 / 12.391      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Roots OopStorageSet                0.018 / 0.038         0.051 / 12.043        0.030 / 12.043        0.030 / 12.043      ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Concurrent Weak Roots OopStorageSet           0.259 / 0.263         0.449 / 4.890         0.331 / 4.890         0.331 / 4.890       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]    Subphase: Pause Mark Try Complete                       0.000 / 0.000         0.000 / 0.000         0.000 / 0.000         0.000 / 0.000       ms
[2022-11-24T09:21:13.180+0530][info][gc,stats    ]      System: Java Threads                                     28 / 28               26 / 29               21 / 29               21 / 29          threads
[2022-11-24T09:21:13.180+0530][info][gc,stats    ] =========================================================================================================================================================
[2022-11-24T09:21:15.944+0530][info][gc,phases   ] GC(670) Concurrent Mark 3465.071ms
[2022-11-24T09:21:15.944+0530][info][gc,phases   ] GC(670) Pause Mark End 0.011ms
[2022-11-24T09:21:15.944+0530][info][gc,phases   ] GC(670) Concurrent Mark Free 0.001ms
[2022-11-24T09:21:15.953+0530][info][gc,phases   ] GC(670) Concurrent Process Non-Strong References 8.537ms
[2022-11-24T09:21:15.953+0530][info][gc,phases   ] GC(670) Concurrent Reset Relocation Set 0.062ms
[2022-11-24T09:21:15.953+0530][info][gc          ] Allocation Stall (lt-primary-pool-2) 704.524ms
[2022-11-24T09:21:15.953+0530][info][gc          ] Allocation Stall (lt-primary-pool-0) 704.168ms
[2022-11-24T09:21:15.953+0530][info][gc          ] Allocation Stall (lt-secondary-pool-0) 711.668ms
[2022-11-24T09:21:15.955+0530][info][gc          ] Allocation Stall (lt-primary-pool-1) 707.109ms
[2022-11-24T09:21:15.955+0530][info][gc          ] Allocation Stall (lt-secondary-pool-1) 713.196ms
[2022-11-24T09:21:15.957+0530][info][gc,phases   ] GC(670) Concurrent Select Relocation Set 4.082ms

java version "17.0.1" 2021-10-19 LTS
Java(TM) SE Runtime Environment (build 17.0.1+12-LTS-39)
Java HotSpot(TM) 64-Bit Server VM (build 17.0.1+12-LTS-39, mixed mode, sharing)

Graphs created with GCViewer on the GC logs.

ZGC clean up issues ZGC clean up issues

No issues allocation when everything is fine No issues allocation when everything is fine

1

There are 1 answers

1
0x00 On

We did not get to a reason for the issues, however, we did make some progress in examining the issue closer.

When examing the ZGC code for the OpenJDK from github.

https://github.com/openjdk/zgc/blob/1f8813495e8184b6c38319df1c2cb70de7811a76/src/hotspot/share/gc/z/zDirector.cpp#L184

It has code to work out an allocation rate. We also enabled the debug logging of the ZGC.

-Xlog:gc*=debug:file=gc.log:time,level,tags

The logging that looks like the final lines below, predicts the allocation rate to be 0 for extended periods of time for the application start. We suspect this might be one of the reasons for the issue. However, in looking at the GCGraphs it doesn't fully explain why it gets into some panic state when there is enough memory during normal function.

We set the -XX:ZAllocationSpikeTolerance=10 to ensure that the estimated memory use would get close to our normal allocation rate of 2000MB/s. The predicted memory usage for the ZGC from that point on appeared to be correct. We've run this settings for a while now on our services without issues so far. We also removed the -XX:SoftMaxHeapSize= as the GC frequency required seems to be correctly estimated by just the XX:ZAllocationSpikeTolerance.

Example debug logging for Predicted allocation rate.

[2022-12-02T12:45:29.825+0530][debug][gc,director    ] Rule: Allocation Rate (Dynamic GC Workers), MaxAllocRate: 9624.6MB/s (+/-0.6%), Free: 3412MB, GCCPUTime: 0.136, GCDuration: 0.136s, TimeUntilOOM: 0.352s, TimeUntilGC: 0.116s, GCWorkers: 1 -> 1
[2022-12-02T12:45:29.825+0530][debug][gc,director    ] Rule: High Usage, Free: 3412MB(83.3%)
[2022-12-02T12:45:29.825+0530][debug][gc,director    ] Rule: Proactive, AcceptableGCInterval: 3.394s, TimeSinceLastGC: 0.473s, TimeUntilGC: 2.921s
[2022-12-02T12:45:29.925+0530][debug][gc,alloc       ] Allocation Rate: 958.0MB/s, Predicted: 960.9MB/s, Avg: 954.4(+/-5.7)MB/s
[2022-12-02T12:45:29.925+0530][debug][gc,director    ] Rule: Timer, Interval: 4.000s, TimeUntilGC: 3.427s
[2022-12-02T12:45:29.925+0530][debug][gc,director    ] Select GC Workers (Normal), AvoidLongGCWorkers: 0.013, AvoidOOMGCWorkers: 0.393, LastGCWorkers: 1.000, GCWorkers: 0.393