I am on Vert.x 4.2.4 and using the Vert.x web client (WebClient) to call a remote service, with a connect timeout of 500 ms and a request timeout of 1000 ms. The application deploys 4 instances of a verticle and runs on 4 CPUs with 8 event loop threads.
WebClientOptions inst = new WebClientOptions()
    .setMaxPoolSize(250)
    .setConnectTimeout(500) // milliseconds
    .setIdleTimeout(3600) // seconds
    .setIdleTimeoutUnit(TimeUnit.SECONDS)
    .setKeepAlive(true)
    .setKeepAliveTimeout(60); // seconds
..
..
HttpRequest<Buffer> httpRequest = webClient
    .post(remoteServiceURL)
    .timeout(1000); // milliseconds
httpRequest.sendJson(request);
Whenever a surge of requests begins on the Vert.x server after a low-throughput period, the web client tries to create new connections (because there are not enough in the pool), but some requests time out with the error below.
io.vertx.core.http.impl.NoStackTraceTimeoutException: The timeout of 1000 ms has been exceeded when getting a connection to remoteServiceURL:<port>
The Vert.x web client internally starts a 1000 ms timer when it enqueues the request to get a connection from the pool. Once it has the connection (existing or newly created), it cancels that timer and starts a fresh 1000 ms timer when dispatching the request to the remote service. The error above indicates that the first timer expired, i.e. the client could not obtain a connection from the pool and begin the request in time.
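To make the two timers concrete, here is a simplified model in plain Java. It is only a sketch of the behaviour as I understand it, not Vert.x's actual implementation: the class `TwoPhaseTimeoutModel`, the `Outcome` enum and the `run`/`delayed` helpers are names I made up for illustration, and the delays are simulated rather than real network operations.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model of the two-phase timeout described above (not Vert.x code).
public class TwoPhaseTimeoutModel {

    // Which timer fired, or NONE if both phases finished in time.
    enum Outcome { NONE, ACQUIRE_TIMEOUT, RESPONSE_TIMEOUT }

    // Simulates one request: phase 1 waits for a connection, phase 2 waits for the response.
    // Each phase gets its own fresh timer of timeoutMs.
    static Outcome run(long acquireDelayMs, long responseDelayMs, long timeoutMs) {
        // Phase 1: timer starts when the request is enqueued for a connection.
        if (timedOut(delayed("connection", acquireDelayMs), timeoutMs)) {
            return Outcome.ACQUIRE_TIMEOUT; // corresponds to "... when getting a connection"
        }
        // Phase 2: the first timer is gone; a new one starts when the request is dispatched.
        if (timedOut(delayed("response", responseDelayMs), timeoutMs)) {
            return Outcome.RESPONSE_TIMEOUT;
        }
        return Outcome.NONE;
    }

    // Returns true if the future did not complete within timeoutMs.
    static boolean timedOut(CompletableFuture<String> future, long timeoutMs) {
        try {
            future.orTimeout(timeoutMs, TimeUnit.MILLISECONDS).join();
            return false;
        } catch (CompletionException e) {
            if (e.getCause() instanceof TimeoutException) {
                return true;
            }
            throw e;
        }
    }

    // A future that completes with the given value after delayMs.
    static CompletableFuture<String> delayed(String value, long delayMs) {
        return CompletableFuture.supplyAsync(
            () -> value,
            CompletableFuture.delayedExecutor(delayMs, TimeUnit.MILLISECONDS));
    }

    public static void main(String[] args) {
        // Each phase stays under 1000 ms, so neither timer fires even though the total is 1200 ms.
        System.out.println(run(600, 600, 1000));
        // Getting a connection takes longer than 1000 ms: the first timer fires.
        System.out.println(run(1500, 10, 1000));
    }
}
```

In this model, the error I am seeing corresponds to `ACQUIRE_TIMEOUT`: whatever happens between enqueueing the request and being handed a usable connection takes longer than 1000 ms, regardless of how fast the remote service would have responded.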
The application has plenty of idle CPU, and the pool allows up to 250 connections. The Vert.x Micrometer metrics show that the number of active HTTP client connections is close to zero (nowhere near the maximum of 250), so the timeout is not caused by waiting for an idle connection after the pool limit has been reached. Clearly, the client is trying to create a fresh connection and is unable to dispatch the request within 1000 ms.
There is no connect timeout error (the connect timeout is set to 500 ms), so the client seems able to establish the connection to the remote service, yet it still cannot begin dispatching the request on the newly obtained connection before the timeout. Note that on a few occasions I have seen the connect timeout error when establishing the connection took more than 500 ms, but that error is reported at the 500 ms mark, whereas in the current issue there is no connect timeout at all.
My expectation is that, since new connections should be created well within the 1000 ms timeout, the above error should never occur. Time is evidently being spent waiting somewhere, but it is not clear where. The problem appears only at the beginning of a surge of requests, when the pool needs to create new connections.
I've also tried dedicating event loops to the connection pool with the setting below, but that has not helped either.
inst.setPoolEventLoopSize(4);
Any help or pointers on resolving this issue would be appreciated.