I'm trying to figure out the bottleneck(s) in a web application on a production server. There are about 80 sites running on ColdFusion 10 with the latest Java 8, Tomcat 7, IIS 8.5 on Windows Server 2012 R2. Requests randomly peak from average 500-1000 ms to up to 30 seconds, meaning 9 out of 10 requests of the same page are completed within 1000 ms and one takes 20-30 seconds. It can happen on pretty much every page/template with code, it seems.
Web app characteristics:
- lots of regular expressions and string manipulation
- lots of data transformation (list to array, recreating arrays, randomizing data order, generating structs)
- lightweight SQL queries (dedicated SQL server in the same network)
- no outgoing connections (
<cfhttp>
)
I'm monitoring using FusionReactor and I observed the following things:
- the server has very low traffic (6 requests per second on average, 50 at best)
- requests are not queued
- average 60 JDBC queries per second between 1-2 ms, highest is 250 JDBC queries per second, worst total time is around 300 ms
- heap is at 10 GB at best (most of it being cache/old gen, eden space being collected at around 2 GB, survivor space less than 200 MB)
- GC young collection runs every 40 seconds around 20 ms on average, worst time is 140 ms
- disk read/write is less than 200 Kb/s on average, highest reads at 2 Mb/s, highest writes at 4 Mb/s
- network IO: 200 b/s TX, 400 b/s RX on average
- CPU runs from 20% to 60% load
- whenever the 20-30 seconds peaks occur, the JIT Time graph shows spikes
Stacktrace of a slow request
Note: This is just an example. Other hanging requests show the gap in other functions. There doesn't seem to be a pattern, sometimes it's a function that contains a query, sometimes it's just a <cfinclude>
, sometimes it's just plain output.
Thread CPU Time is 2500 ms while execution time is 29727 ms. TTFB and TTLB is 29726 ms, stream opened 2 ms before and closed 1 ms after. Query Total Time was 76 ms (19 queries).
Notice how funcCONSTRUCTFOOTER.getAccess
is actually just a single line Java function return 0;
(returning the access
attribute of the <cffunction>
tag), so I really doubt that the execution of this function took 23 seconds. I don't see any lock/wait either though.
JVM (Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
)
-server
-Xms24G
-Xmx48G
-Xss4m
-XX:MaxMetaspaceSize=2G
-XX:+TieredCompilation
-XX:ReservedCodeCacheSize=2G
-XX:+UseCompressedOops
-XX:+UseG1GC
-XX:MaxGCPauseMillis=100
-Xbatch
ColdFusion (10.0.22.283922
, Update 22)
Maximum number of simultaneous Template requests: 40
Maximum number of cached templates: 40000 (~14500 cached according to monitor)
Trusted cache: true
Cache template in request: true
Component cache: true
Save class files: true
Connector
<Connector port="8012" protocol="AJP/1.3" redirectPort="8445" tomcatAuthentication="false" maxThreads="500" connectionTimeout="60000" />
worker.list=cfusion
worker.cfusion.type=ajp13
worker.cfusion.host=localhost
worker.cfusion.port=8012
worker.cfusion.connection_pool_size=500
worker.cfusion.max_reuse_connections=250
worker.cfusion.connection_pool_timeout=60
Average in metrics.log
ranges from 35 to 75 threads. Busy count is about 90% of the value, e.g. Max threads: 500 Current thread count: 70 Current thread busy: 64
.
Hardware
- HyperV VM on Intel Xeon E5-2650
- 4 physical cores assigned, 8 logical cores
- 64 GB RAM assigned
What could be the reason? How can I inspect the issue further?