• Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: remoting
    • Jenkins Master - 2.100, Ubuntu
      Linux Agent - Running inside a container on Ubuntu, 2.100 agent jar
      Windows Agent - Running inside a container on Windows Server 1709

      I've set up some permanent build agents that run as containers for my build server which I've got running on Azure virtual machines at the moment.

      Overall, the agents are able to connect and perform builds through to completion.  Unfortunately, I am experiencing unpredictable disconnects from both the Linux and Windows based agents, especially after they've been idle for a bit.

      I've been unable to establish any common reason for the disconnects between the two. Specifically for Azure, I've adjusted the "Idle Timeout" setting for all IP addresses (including the Jenkins master) to the maximum value, to no avail.  I've also made sure that the TCP socket connect timeout is set to 6 on all my Linux based machines; this hasn't helped either.

      I've been through a lot of the log information from both the master and the agents, but I can't piece together a clear idea of which side is actually failing.  One recent disconnect produced this on the Linux agent:

      Jan 09, 2018 2:33:40 PM hudson.slaves.ChannelPinger$1 onDead
      INFO: Ping failed. Terminating the channel JNLP4-connect connection to 123.123.123.123/234.234.234.234:49187.
      java.util.concurrent.TimeoutException: Ping started at 1515508180945 hasn't completed by 1515508420945
          at hudson.remoting.PingThread.ping(PingThread.java:134)
          at hudson.remoting.PingThread.run(PingThread.java:90)
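
The two epoch-millisecond values in the TimeoutException above can be decoded to see how long the ping was allowed to wait. The following is a minimal Python sketch; the helper name `ping_window` is made up for illustration and is not part of Jenkins. The gap comes out to 240 seconds, which appears to match the default of the `hudson.slaves.ChannelPinger.pingTimeoutSeconds` system property, though that default is an assumption to confirm against the Jenkins version in use.

```python
from datetime import datetime, timedelta, timezone

# Reference point for epoch-millisecond timestamps, as printed by PingThread.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ping_window(started_ms, deadline_ms):
    """Return (start, deadline, elapsed_seconds) for two epoch-millisecond values."""
    # Integer milliseconds are added as a timedelta to avoid float rounding.
    start = EPOCH + timedelta(milliseconds=started_ms)
    deadline = EPOCH + timedelta(milliseconds=deadline_ms)
    return start, deadline, (deadline - start).total_seconds()


# Values copied from the agent log line above.
start, deadline, elapsed = ping_window(1515508180945, 1515508420945)
print(start.isoformat(), deadline.isoformat(), elapsed)
```

The decoded deadline falls at 14:33:40 UTC on 2018-01-09, the same wall-clock time as the log entry. That suggests the channel was closed as soon as the ping deadline passed, so the agent waited the full window without getting a reply.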

      This seems to indicate a ping timeout, but the networking on the machine is fine. If I connect and restart the agent container, it connects right away and seems to be healthy for a while again.  Here's what the Jenkins master reports for the agent:

      java.nio.channels.ClosedChannelException
          at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:208)
          at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222)
          at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
          at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
          at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213)
          at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:800)
          at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173)
          at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:313)
          at hudson.remoting.Channel.close(Channel.java:1405)
          at hudson.remoting.Channel.close(Channel.java:1358)
          at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:737)
          at hudson.slaves.SlaveComputer.access$800(SlaveComputer.java:96)
          at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:655)
          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at java.lang.Thread.run(Thread.java:748)

      This message seems to come up quite often, but generally speaking seems to indicate that the agent vanished and Jenkins doesn't know why? So I don't know if it's any help.

      I've been researching this issue for a while, so I've been trying quite a few suggestions from existing bugs here on this bug tracker.  If there's anything I can do to get more conclusive information about the disconnects, let me know and I'll reply with it.

      I'm pretty much at the end of my rope in trying to figure out what's going on here, so all help is appreciated!

          [JENKINS-48865] JNLP Agents/Slaves Disconnecting Unpredictably

          kredens added a comment - edited

          I managed to almost pinpoint the issue (at least in its current incarnation). After updating to 2.346.2 everything was fine; then some plugin updates happened and I started having serious problems with agent connections, and the agents that did connect were very slow to update their information on the Nodes page.

          First I rolled back to 2.346.1, as it was the easiest solution. It didn't help.

          After some more investigating, I noticed Jenkins also stopped sending any emails, and email plugin(s) generated a lot of errors in the logs:

          WARNING    jenkins.util.Listeners#lambda$notify$0
          java.lang.NoSuchMethodError: 'javax.mail.Session hudson.tasks.Mailer$DescriptorImpl.createSession()'
              at org.jenkinsci.plugins.mailwatcher.MailWatcherMailer.send(MailWatcherMailer.java:116)
              at org.jenkinsci.plugins.mailwatcher.MailWatcherNotification.send(MailWatcherNotification.java:156)
              at org.jenkinsci.plugins.mailwatcher.WatcherComputerListener$Notification$Builder.send(WatcherComputerListener.java:181)
              at org.jenkinsci.plugins.mailwatcher.WatcherComputerListener.onOffline(WatcherComputerListener.java:91)
              at hudson.slaves.SlaveComputer.lambda$closeChannel$1(SlaveComputer.java:927)
              at jenkins.util.Listeners.lambda$notify$0(Listeners.java:59)
              at jenkins.util.Listeners.notify(Listeners.java:67)
              at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:927)
              at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:756)
              at jenkins.slaves.DefaultJnlpSlaveReceiver.afterChannel(DefaultJnlpSlaveReceiver.java:175)
              at org.jenkinsci.remoting.engine.JnlpConnectionState.fire(JnlpConnectionState.java:337)
              at org.jenkinsci.remoting.engine.JnlpConnectionState.fireAfterChannel(JnlpConnectionState.java:428)
              at org.jenkinsci.remoting.engine.JnlpProtocol4Handler$Handler.lambda$onChannel$0(JnlpProtocol4Handler.java:334)
              at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
              at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
              at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
              at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
              at java.base/java.lang.Thread.run(Unknown Source) 

          So I rolled back:

          Mailer Plugin from 435.something to 414.something (excuse me, but could we please move back to actually useful and human friendly numbering?)

          Email Extension from 2.90 to 2.89

          And the problem was immediately solved. Not sure which one of these actually caused the issue, as I rolled them back simultaneously (also it was dependency hell, had to roll back about 10 more plugins because it was somehow crucial for them to have latest mail plugins, even though they don't send any mails).

          Now I'm stuck with old blueocean/pipeline plugins due to weird dependencies, but at least agents work fine.

          My Jenkins instances are running on Java 11 (latest Adoptium jre)


          Mark Waite added a comment -

          kredens could you report that message from the mailer plugin as a separate issue and include steps that will allow someone else to duplicate the issue from a fresh Jenkins installation?

          I'm surprised that a plugin upgrade would have any impact on agent connection reliability. I'd like to investigate further, but your message does not provide enough context to do so.


          Basil Crow added a comment -

          The error in Mail Watcher is JENKINS-69088, which was fixed in jenkinsci/mail-watcher-plugin#11 and released in 1.17.
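
The attribution to Mail Watcher can be read off the trace posted earlier in the thread: a `NoSuchMethodError` is raised in the calling class whose compiled method reference no longer resolves, so the first `at` frame names the code that needs updating, not the plugin that changed. A minimal Python sketch of that reading follows; the helper `first_frame_class` and its regular expression are illustrative assumptions, not an existing Jenkins tool.

```python
import re

# Matches "at <fully.qualified.Class>.<method>(" and captures the class name.
FRAME = re.compile(r"^\s*at\s+([\w.$/]+)\.[\w$<>]+\(")


def first_frame_class(trace):
    """Return the fully qualified class of the first 'at ...' frame, or None."""
    for line in trace.splitlines():
        match = FRAME.match(line)
        if match:
            return match.group(1)
    return None


# First lines of the trace from the earlier comment.
trace = """java.lang.NoSuchMethodError: 'javax.mail.Session hudson.tasks.Mailer$DescriptorImpl.createSession()'
    at org.jenkinsci.plugins.mailwatcher.MailWatcherMailer.send(MailWatcherMailer.java:116)
    at org.jenkinsci.plugins.mailwatcher.MailWatcherNotification.send(MailWatcherNotification.java:156)"""

print(first_frame_class(trace))
```

The package of the returned class (`org.jenkinsci.plugins.mailwatcher`) points at the Mail Watcher plugin as the caller, while the missing method belongs to the Mailer plugin's `Mailer$DescriptorImpl`.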


          kredens added a comment -

          markewaite, the only connection I can think of between agent connection reliability and those email plugins is that on most agents I have enabled email notifications for agent offline/online status. Maybe without a properly working email "subsystem", something goes awry.

          I have yet to try the updated Mail Watcher plugin. I will report back on whether the issue reappears once all three plugins are updated.


          kredens added a comment -

          With the fixed Mail Watcher plugin, everything seems to be back to normal.


          Etienne Weiler added a comment -

          I have a similar issue. Is a fix planned?

          Vishal added a comment -

          Still experiencing "Ping failed. Terminating the channel JNLP4-connect" / "TimeoutException" errors with Jenkins version 2.375.2 (and JDK 11).
          Is there any workaround?


          Vishal added a comment -

          We still experience the issue with Jenkins 2.401.1, even though some workaround was made in the Jenkins 2.387.2 release.
          Here is the changelog link: https://www.jenkins.io/changelog-stable/#v2.387.2

          The issue with our Jenkins server is that, we are blocked to upgrade Jenkins to latest release as server was set up with "docker run" command so when I try to deploy latest release "jenkins.war", agent fails to connect to Jenkins controller. 

          Your inputs / help would be greatly appreciated. 

           


          Vishal added a comment -

          The Jenkins agent disconnects and then reconnects after a few minutes without manual intervention. Is there any workaround for this issue?
          Any input would be greatly appreciated. Thanks!


          Mark Waite added a comment -

          The issue with our Jenkins server is that, we are blocked to upgrade Jenkins to latest release as server was set up with "docker run" command so when I try to deploy latest release "jenkins.war", agent fails to connect to Jenkins controller.

          That suggests that you are probably using the wrong technique to upgrade the container image.

          The Jenkins war file inside the container image should not be upgraded. A new container image should be built with the newer Jenkins version. The new container image can then be tested to confirm it works in your environment. However, that is a question outside this issue. Please use the Jenkins community forum for questions and answers rather than the issue tracker.


            Assignee: Unassigned
            Reporter: Alexander Trauzzi (jomega)
            Votes: 7
            Watchers: 22