• Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: remoting
    • Environment: Jenkins Master - 2.100, Ubuntu
      Linux Agent - Running inside a container on Ubuntu, 2.100 agent jar
      Windows Agent - Running inside a container on Windows Server 1709

      I've set up some permanent build agents that run as containers for my build server, which is currently hosted on Azure virtual machines.

      Overall, the agents are able to connect and perform builds through to completion. Unfortunately, I am experiencing unpredictable disconnects from both the Linux and Windows based agents, especially after they've been idle for a while.

      I've been unable to establish any common cause for the disconnects on the two platforms. Specifically for Azure, I've adjusted the "Idle Timeout" setting for all IP addresses (including the Jenkins master's) to the maximum value, to no avail. I've also made sure that the TCP socket connect timeout is set to 6 on all my Linux-based machines; this hasn't helped either.
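
      (For reference, a commonly suggested mitigation for load-balancer idle timeouts on Azure is to turn on aggressive kernel TCP keepalives on the agent hosts so the JNLP connection never looks idle. The sketch below is illustrative only; the file name and values are assumptions rather than settings taken from this report, and keepalives only help for sockets opened with SO_KEEPALIVE enabled.)

      # Hypothetical /etc/sysctl.d/99-jnlp-keepalive.conf -- values are assumptions
      net.ipv4.tcp_keepalive_time = 120     # start probing after 2 minutes of idle
      net.ipv4.tcp_keepalive_intvl = 30     # probe every 30 seconds
      net.ipv4.tcp_keepalive_probes = 8     # drop the connection after 8 failed probes

      # apply without a reboot
      sudo sysctl -p /etc/sysctl.d/99-jnlp-keepalive.conf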

      I've been through a lot of the log information from both the master and the agents, but I can't piece together a clear idea of which side is actually failing. One recent disconnect produced this on the Linux agent:

      Jan 09, 2018 2:33:40 PM hudson.slaves.ChannelPinger$1 onDead
      INFO: Ping failed. Terminating the channel JNLP4-connect connection to 123.123.123.123/234.234.234.234:49187.
      java.util.concurrent.TimeoutException: Ping started at 1515508180945 hasn't completed by 1515508420945
          at hudson.remoting.PingThread.ping(PingThread.java:134)
          at hudson.remoting.PingThread.run(PingThread.java:90)

      This seems to indicate a ping timeout, but the networking on the machine is fine. If I log into the machine and restart the agent container, it reconnects right away and seems to stay healthy for a while again. Here's what the Jenkins master reports for the agent:

      java.nio.channels.ClosedChannelException
          at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:208)
          at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222)
          at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
          at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213)
          at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:800)
          at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173)
          at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:313)
          at hudson.remoting.Channel.close(Channel.java:1405)
          at hudson.remoting.Channel.close(Channel.java:1358)
          at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:737)
          at hudson.slaves.SlaveComputer.access$800(SlaveComputer.java:96)
          at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:655)
          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at java.lang.Thread.run(Thread.java:748)

      This message comes up quite often, but it generally just seems to indicate that the agent vanished and Jenkins doesn't know why, so I don't know how much help it is.
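
      (For reference: the two timestamps in the agent-side log above are exactly 240,000 ms apart, which matches remoting's default 4-minute ping timeout with a 5-minute ping interval. A hedged sketch of how those values can be tuned via system properties follows; the property names are the standard ChannelPinger settings, but verify them against your Jenkins/remoting version before relying on them.)

      # master: add to the Jenkins JVM options (mirror the same -D flags on the agent's java command line)
      java -Dhudson.slaves.ChannelPinger.pingIntervalSeconds=60 \
           -Dhudson.slaves.ChannelPinger.pingTimeoutSeconds=120 \
           -jar jenkins.war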

      I've been researching this issue for a while and have tried quite a few of the suggestions from existing bugs on this tracker. If there's anything I can do to get more conclusive information about the disconnects, let me know and I'll reply with it.
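
      (One way to gather more conclusive data, sketched here under the assumption that the agents are launched with a plain java command: run the agent JVM with a java.util.logging configuration that raises the hudson.remoting and org.jenkinsci.remoting loggers to FINE and writes to a rotating file, then capture the log covering a disconnect. The file names below are arbitrary.)

      # remoting-debug.properties (illustrative)
      handlers = java.util.logging.FileHandler
      java.util.logging.FileHandler.pattern = /tmp/remoting-debug.%g.log
      java.util.logging.FileHandler.limit = 10485760
      java.util.logging.FileHandler.count = 5
      java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter
      .level = INFO
      hudson.remoting.level = FINE
      org.jenkinsci.remoting.level = FINE

      # launch the agent with the config (other agent.jar arguments unchanged)
      java -Djava.util.logging.config.file=remoting-debug.properties -jar agent.jar ...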

      I'm pretty much at the end of my rope in trying to figure out what's going on here, so all help is appreciated!

          [JENKINS-48865] JNLP Agents/Slaves Disconnecting Unpredictably

          Alexander Trauzzi created issue -
          Alexander Trauzzi made changes -
          Description edited

          Oleg Nenashev added a comment -

          Yeah, generally these messages appear when a virtualized agent VM/container gets terminated. In JENKINS-48616 I see the same for EC2.

          Any chance it is somehow related to Meltdown restarts? CC gjphilp


          Alexander Trauzzi added a comment -

          This has been happening since before Meltdown, starting around mid-December when I began moving our build infrastructure over.
          Oleg Nenashev made changes -
          Link New: This issue relates to JENKINS-48895

          Oleg Nenashev added a comment -

          It seems to be the same as JENKINS-48865


          Piotr Plenik added a comment -

          oleg_nenashev indeed, JENKINS-48865 and JENKINS-48865 are precisely the same issue.

          I guess that you mean JENKINS-44132, isn't it?


          Jeff Thompson added a comment -

          I suspect Oleg meant JENKINS-48895.

          Ping failures on the agent can occur because of some issue on the master: perhaps a restart, an excessive resource issue causing it to delay in responding to the ping, or some other system or networking issue.


          Jeff Thompson added a comment -

          Closing for lack of sufficient diagnostics and information to reproduce, after no response for quite a while.

          Jeff Thompson made changes -
          Resolution New: Cannot Reproduce
          Status Original: Open New: Closed

            Assignee: Unassigned
            Reporter: Alexander Trauzzi (jomega)
            Votes: 7
            Watchers: 22