I've set up some permanent build agents that run as containers for my Jenkins build server, which is currently hosted on Azure virtual machines.
Overall, the agents connect and run builds through to completion. Unfortunately, I'm experiencing unpredictable disconnects from both the Linux and Windows based agents, especially after they've been idle for a while.
I haven't been able to establish any common cause for the disconnects shared by the two of them. Specifically for Azure, I've raised the "Idle Timeout" setting on all the public IP addresses involved (including the Jenkins master) to the maximum value, to no avail. I've also made sure the TCP socket connect timeout is set to 6 on all my Linux-based machines; this hasn't helped either.
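One thing I'm planning to try next, though I can't confirm it's the actual cause, is enabling aggressive kernel-level TCP keepalives on the Linux agents so that an otherwise idle JNLP connection keeps traffic flowing well inside Azure's idle-timeout window (this only helps if the remoting socket actually has SO_KEEPALIVE set, which I'm assuming). Roughly something like this in a sysctl drop-in on the agent hosts (file name and values are just my first guesses, not recommendations):

# /etc/sysctl.d/99-jenkins-keepalive.conf
# start probing an idle connection after 2 minutes instead of the 2-hour default
net.ipv4.tcp_keepalive_time = 120
# then probe every 30 seconds
net.ipv4.tcp_keepalive_intvl = 30
# give up and drop the connection after 8 unanswered probes
net.ipv4.tcp_keepalive_probes = 8

I'll report back if this turns out to make any difference.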
I've been through a lot of the log output from both the master and the agents, but I can't piece together a clear picture of which side is actually failing. One recent disconnect produced this on the Linux agent:
Jan 09, 2018 2:33:40 PM hudson.slaves.ChannelPinger$1 onDead
INFO: Ping failed. Terminating the channel JNLP4-connect connection to 123.123.123.123/234.234.234.234:49187.
java.util.concurrent.TimeoutException: Ping started at 1515508180945 hasn't completed by 1515508420945
    at hudson.remoting.PingThread.ping(PingThread.java:134)
    at hudson.remoting.PingThread.run(PingThread.java:90)
This seems to indicate a ping timeout, but the networking on the machine is fine: if I connect to it and restart the agent container, the agent reconnects right away and seems healthy for a while again. Here's what the Jenkins master reports for the same agent:
java.nio.channels.ClosedChannelException
    at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:208)
    at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
    at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:800)
    at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173)
    at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:313)
    at hudson.remoting.Channel.close(Channel.java:1405)
    at hudson.remoting.Channel.close(Channel.java:1358)
    at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:737)
    at hudson.slaves.SlaveComputer.access$800(SlaveComputer.java:96)
    at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:655)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
This message comes up quite often, but it generally just seems to indicate that the agent vanished and Jenkins doesn't know why, so I'm not sure how much help it is.
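One detail I did notice while staring at the agent-side exception: the two timestamps (1515508180945 and 1515508420945) are exactly 240,000 ms apart, which I believe matches the default remoting ping timeout of 240 seconds (with pings sent every 300 seconds). If it would help narrow things down, I could try lengthening the ping timeout on the master, assuming I have the property names right, with something like:

java -Dhudson.slaves.ChannelPinger.pingIntervalSeconds=300 -Dhudson.slaves.ChannelPinger.pingTimeoutSeconds=480 -jar jenkins.war

though I realize that would probably only delay detection of an already-dead connection rather than fix whatever is killing it.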
I've been researching this issue for a while and have tried quite a few suggestions from existing reports here on this bug tracker. If there's anything I can do to get more conclusive information about the disconnects, let me know and I'll reply with it.
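In the meantime, to capture more detail around the next disconnect, I'm planning to turn up remoting logging on one of the Linux agents with a java.util.logging config along these lines (the logger names are taken from the packages in the traces above, and the file name is just my own placeholder):

# remoting-log.properties
handlers = java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level = ALL
.level = INFO
hudson.remoting.level = FINE
org.jenkinsci.remoting.level = FINE

I'd pass it to the agent JVM in the container with -Djava.util.logging.config.file=remoting-log.properties and attach whatever it captures around the next disconnect.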
I'm pretty much at the end of my rope in trying to figure out what's going on here, so all help is appreciated!
- relates to: JENKINS-48895 "Channels closed exception after upgrade Jenkins version 2.90" (Closed)
I managed to almost pinpoint the issue (at least in its current incarnation). After updating to 2.346.2 everything was fine; then some plugin updates happened and I started having serious problems with agent connections, and the agents that did connect were very slow to update their info on the Nodes page.
First I rolled back to 2.346.1, as that was the easiest thing to try - it didn't help.
After some more investigating, I noticed Jenkins had also stopped sending any emails, and the email plugin(s) were generating a lot of errors in the logs.
So I rolled back:
Mailer Plugin from 435.something to 414.something (excuse me, but could we please move back to actually useful and human friendly numbering?)
Email Extension from 2.90 to 2.89
And the problem was immediately solved. I'm not sure which of the two actually caused the issue, as I rolled them back simultaneously (it was also dependency hell: I had to roll back about 10 more plugins because it was somehow crucial for them to have the latest mail plugins, even though they don't send any mails).
Now I'm stuck with old Blue Ocean/pipeline plugins due to weird dependencies, but at least the agents work fine.
My Jenkins instances are running on Java 11 (the latest Adoptium JRE).