• Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: remoting
    • Environment: Jenkins Master - 2.100, Ubuntu
      Linux Agent - Running inside a container on Ubuntu, 2.100 agent jar
      Windows Agent - Running inside a container on Windows Server 1709

      I've set up some permanent build agents that run as containers for my build server, which is currently hosted on Azure virtual machines.

      Overall, the agents are able to connect and perform builds through to completion. Unfortunately, I am experiencing unpredictable disconnects from both the Linux and Windows based agents, especially after they've been idle for a while.

      I've been unable to establish any common cause for the disconnects on the two platforms. Specifically for Azure, I've adjusted the "Idle Timeout" setting for all IP addresses (including the Jenkins master's) to the maximum value, to no avail. I've also made sure that the TCP socket connect timeout is set to 6 on all my Linux-based machines; this hasn't helped either.
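
      (For reference, a commonly suggested mitigation for load-balancer idle timeouts on Azure is to turn on aggressive kernel TCP keepalives on the agent hosts so the JNLP connection never looks idle. The sketch below is illustrative only; the file name and values are assumptions rather than settings taken from this report, and keepalives only help for sockets opened with SO_KEEPALIVE enabled.)

      # Hypothetical /etc/sysctl.d/99-jnlp-keepalive.conf -- values are assumptions
      net.ipv4.tcp_keepalive_time = 120     # start probing after 2 minutes of idle
      net.ipv4.tcp_keepalive_intvl = 30     # probe every 30 seconds
      net.ipv4.tcp_keepalive_probes = 8     # drop the connection after 8 failed probes

      # apply without a reboot
      sudo sysctl -p /etc/sysctl.d/99-jnlp-keepalive.conf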

      I've been through a lot of the log information from both the master and the agents, but I can't piece together a clear idea of which side is actually failing. One recent disconnect produced this on the Linux agent:

      Jan 09, 2018 2:33:40 PM hudson.slaves.ChannelPinger$1 onDead
      INFO: Ping failed. Terminating the channel JNLP4-connect connection to 123.123.123.123/234.234.234.234:49187.
      java.util.concurrent.TimeoutException: Ping started at 1515508180945 hasn't completed by 1515508420945
          at hudson.remoting.PingThread.ping(PingThread.java:134)
          at hudson.remoting.PingThread.run(PingThread.java:90)

      This seems to indicate a ping timeout, but the networking on the machine is fine. If I log into the machine and restart the agent container, it reconnects right away and seems to stay healthy for a while again. Here's what the Jenkins master reports for the agent:

      java.nio.channels.ClosedChannelException
          at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:208)
          at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222)
          at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
          at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213)
          at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:800)
          at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173)
          at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:313)
          at hudson.remoting.Channel.close(Channel.java:1405)
          at hudson.remoting.Channel.close(Channel.java:1358)
          at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:737)
          at hudson.slaves.SlaveComputer.access$800(SlaveComputer.java:96)
          at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:655)
          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at java.lang.Thread.run(Thread.java:748)

      This message comes up quite often, but it generally just seems to indicate that the agent vanished and Jenkins doesn't know why, so I don't know how much help it is.
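
      (For reference: the two timestamps in the agent-side log above are exactly 240,000 ms apart, which matches remoting's default 4-minute ping timeout with a 5-minute ping interval. A hedged sketch of how those values can be tuned via system properties follows; the property names are the standard ChannelPinger settings, but verify them against your Jenkins/remoting version before relying on them.)

      # master: add to the Jenkins JVM options (mirror the same -D flags on the agent's java command line)
      java -Dhudson.slaves.ChannelPinger.pingIntervalSeconds=60 \
           -Dhudson.slaves.ChannelPinger.pingTimeoutSeconds=120 \
           -jar jenkins.war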

      I've been researching this issue for a while and have tried quite a few of the suggestions from existing bugs on this tracker. If there's anything I can do to get more conclusive information about the disconnects, let me know and I'll reply with it.
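
      (One way to gather more conclusive data, sketched here under the assumption that the agents are launched with a plain java command: run the agent JVM with a java.util.logging configuration that raises the hudson.remoting and org.jenkinsci.remoting loggers to FINE and writes to a rotating file, then capture the log covering a disconnect. The file names below are arbitrary.)

      # remoting-debug.properties (illustrative)
      handlers = java.util.logging.FileHandler
      java.util.logging.FileHandler.pattern = /tmp/remoting-debug.%g.log
      java.util.logging.FileHandler.limit = 10485760
      java.util.logging.FileHandler.count = 5
      java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter
      .level = INFO
      hudson.remoting.level = FINE
      org.jenkinsci.remoting.level = FINE

      # launch the agent with the config (other agent.jar arguments unchanged)
      java -Djava.util.logging.config.file=remoting-debug.properties -jar agent.jar ...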

      I'm pretty much at the end of my rope in trying to figure out what's going on here, so all help is appreciated!

          [JENKINS-48865] JNLP Agents/Slaves Disconnecting Unpredictably

          Alexander Trauzzi created issue -
          Alexander Trauzzi made changes -
          Description edited

          Oleg Nenashev added a comment -

          Yeah, generally these messages appear when a virtualized agent VM/container gets terminated. In JENKINS-48616 I see the same for EC2.

          Any chance it is somehow related to Meltdown restarts? CC gjphilp


          Alexander Trauzzi added a comment -

          This has been happening since before Meltdown, starting around mid-December when I began moving our build infrastructure over.
          Oleg Nenashev made changes -
          Link New: This issue relates to JENKINS-48895

          Oleg Nenashev added a comment -

          It seems to be the same as JENKINS-48865


          Piotr Plenik added a comment -

          oleg_nenashev indeed, JENKINS-48865 and JENKINS-48865 are precisely the same issue.

          I guess that you mean JENKINS-44132, isn't it?


          Jeff Thompson added a comment -

          I suspect Oleg meant JENKINS-48895.

          Ping failures on the agent can occur because of some issue on the master: perhaps a restart, an excessive resource issue causing it to delay in responding to the ping, or some other system or networking issue.


          Jeff Thompson added a comment -

          Closing for lack of sufficient diagnostics and information to reproduce, after no response for quite a while.

          Jeff Thompson made changes -
          Resolution New: Cannot Reproduce
          Status Original: Open New: Closed

            Assignee: Unassigned
            Reporter: Alexander Trauzzi (jomega)
            Votes: 7
            Watchers: 22