Anyone know how to handle linux-to-linux Jenkins agent issues with JNLP connections?

I'm seeing these hit every 20 days or so. I'm using a centos 7 agent host - 2.223 jenkins with remoting 4.2.

Our jobs run many of their steps in ephemeral docker contains via the docker plugin and we presently use devicemapper as our docker storage driver. When this happens, it does seem that docker is under some load, but I don't yet have detailed stats to back up that theory

docker.image.inside {
...
}

When this happens, the agent host is reported offline by by the controller and both are still in the same AWS VPC with the same security groups so connectivity between them should still be possible (I didn't check it last time because we had other fires to handle). Also, the agent java process was still running.

I've heard that this can be related to plugins and complex pipeline code. I don't see much in the hosts messages and haven't yet seen a pattern when I try to match this up to Jenkins controller's logs.

I'm also wondering if switching to the ssh agent plugin might make the problem less of an issue. I'll probably try enabling more logging on the host to hopefully catch additional details.

Please let me know how you've dealt with this if you've seen it or any suggestions.

01:51:49 agent.host java: INFO: Failed to synchronize IO streams on the channel hudson.remoting.Channel@...:JNLP4-connect connection to jenkins.edgewise.devops/#.#.#.#:50000
01:51:49 agent.host java: java.lang.InterruptedException
01:51:49 agent.host java: at java.lang.Object.wait(Native Method)
01:51:49 agent.host java: at hudson.remoting.Request.call(Request.java:177)
01:51:49 agent.host java: at hudson.remoting.Channel.call(Channel.java:997)
01:51:49 agent.host java: at hudson.remoting.Channel.syncIO(Channel.java:1730)
01:51:49 agent.host java: at hudson.Launcher$RemoteLaunchCallable$1.join(Launcher.java:1328)
01:51:49 agent.host java: at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
01:51:49 agent.host java: at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
01:51:49 agent.host java: at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
01:51:49 agent.host java: at java.lang.reflect.Method.invoke(Method.java:498)
01:51:49 agent.host java: at hudson.remoting.RemoteInvocationHandler$RPCRequest.perform(RemoteInvocationHandler.java:931)
01:51:49 agent.host java: at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:905)
01:51:49 agent.host java: at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:857)
01:51:49 agent.host java: at hudson.remoting.UserRequest.perform(UserRequest.java:211)
01:51:49 agent.host java: at hudson.remoting.UserRequest.perform(UserRequest.java:54)
01:51:49 agent.host java: at hudson.remoting.Request$2.run(Request.java:369)
01:51:49 agent.host java: at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
01:51:49 agent.host java: at java.util.concurrent.FutureTask.run(FutureTask.java:266)
01:51:49 agent.host java: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
01:51:49 agent.host java: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
01:51:49 agent.host java: at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:117)
01:51:49 agent.host java: at java.lang.Thread.run(Thread.java:748)
02:01:01 agent.host systemd: Created slice User Slice of root.
02:01:01 agent.host systemd: Started Session 784 of user root.

02:15:05 agent.host dhclient[1092]: DHCPREQUEST on eth0 to #.#.#.#ort 67 (xid=0x6...)
02:15:05 agent.host dhclient[1092]: DHCPACK from #.#.#.#xid=0x6...)
02:15:07 agent.host dhclient[1092]: bound to #.#.#.# -- renewal in 1587 seconds.

02:32:00 agent.host java: Oct 23, 2020 2:32:00 AM hudson.remoting.Request$2 run
02:32:00 agent.host java: INFO: Failed to send back a reply to the request hudson.remoting.Request$2@...: hudson.remoting.ChannelClosedException: Channel "unknown": Protocol stack cannot write data anymore. It is not open for write
1

There are 1 answers

0
Peter Kahn On BEST ANSWER

Whereas:

  • docker is not using the recommended efficient storage driver (doc)

devicemapper is supported, but requires direct-lvm for production environments, because loopback-lvm, while zero-configuration, has very poor performance. devicemapper was the recommended storage driver for CentOS and RHEL, as their kernel version did not support overlay2. However, current versions of CentOS and RHEL now have support for overlay2, which is now the recommended driver.

  • Jenkins Docker plugin issue #640 describes how InterruptedException can appear when docker load is high

It seems that switching storage driver to overlay2 on the agent hosts is the correct action to reduce the risk of this problem