JENKINS-24050: All slaves disconnect and no new slaves can connect due to CancelledKeyException in org.jenkinsci.remoting

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: core
    • Environment: Enterprise Linux 5.x master; Windows and Linux slaves of varying releases. Slaves are added and removed reasonably frequently, in a way similar to the EC2 Plugin (although others have reported the issue with snapshot reverting and even with regular slaves).

      We have an issue where we get a CancelledKeyException, 100% of our slaves disconnect, and no new slaves can connect until a restart happens. The issue seems to happen randomly.

      See: https://issues.jenkins-ci.org/browse/JENKINS-22932?focusedCommentId=205983&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-205983#JENKINS-22932 and later for some more context.

      The full error message in the build is:
      FATAL: hudson.remoting.RequestAbortedException: java.io.IOException: Failed to abort
      hudson.remoting.RequestAbortedException: hudson.remoting.RequestAbortedException: java.io.IOException: Failed to abort
      at hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:41)
      at hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:34)
      at hudson.remoting.Request.call(Request.java:174)
      at hudson.remoting.Channel.call(Channel.java:739)
      at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:168)
      at com.sun.proxy.$Proxy83.join(Unknown Source)
      at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:956)
      at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:137)
      at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:97)
      at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:66)
      at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
      at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:772)
      at hudson.model.Build$BuildExecution.build(Build.java:199)
      at hudson.model.Build$BuildExecution.doRun(Build.java:160)
      at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:535)
      at hudson.model.Run.execute(Run.java:1732)
      at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
      at hudson.model.ResourceController.execute(ResourceController.java:88)
      at hudson.model.Executor.run(Executor.java:234)
      Caused by: hudson.remoting.RequestAbortedException: java.io.IOException: Failed to abort
      at hudson.remoting.Request.abort(Request.java:299)
      at hudson.remoting.Channel.terminate(Channel.java:802)
      at hudson.remoting.Channel$2.terminate(Channel.java:483)
      at hudson.remoting.AbstractByteArrayCommandTransport$1.terminate(AbstractByteArrayCommandTransport.java:72)
      at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:195)
      at org.jenkinsci.remoting.nio.NioChannelHub.abortAll(NioChannelHub.java:618)
      at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:592)
      at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      at java.lang.Thread.run(Thread.java:744)
      Caused by: java.io.IOException: Failed to abort
      ... 9 more
      Caused by: java.nio.channels.CancelledKeyException
      at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
      at sun.nio.ch.SelectionKeyImpl.readyOps(SelectionKeyImpl.java:87)
      at java.nio.channels.SelectionKey.isReadable(SelectionKey.java:289)
      at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:513)
      ... 6 more


          Kevin Browder created issue -

          Kevin Browder added a comment -

          JENKINS-24050 was opened since it's actually a different issue than JENKINS-22932 (which probably should be reclosed)

          Kevin Browder made changes -
          Link New: This issue is related to JENKINS-22932 [ JENKINS-22932 ]

          James Noonan added a comment -

          I was going to raise a second defect, but I think this is similar enough.

          When the problem occurs, each slave's console shows 'Connected'; however, the master shows them all disconnected. The only way we have found to recover is to restart Jenkins.
          We run the master on Windows Server 2012, on VMware, with about 70 slaves: a mix of OS X 10.9, Windows 7, and Linux SLED 11 on VMware, plus some other variants. We are running Jenkins 1.563.

          This issue has occurred three times for us. Two cases are independent; one occurred shortly after the first and the JVM was not restarted, so perhaps recovery between the 1st and 2nd time was not complete. We have not identified a trigger cause for this problem.

          The thread count starts to increase linearly once the problem occurs, but we believe this is a symptom. In the JavaMelody Monitoring Plugin, the reported thread number differs between two places: the graph showed 4000 (the instance had been running, though broken, for 30 hours), while the thread count below it showed 400. I believe the first figure may be the JVM's count while the second is Jenkins'. In normal operation we see about 200 threads. (However, we restarted, so I am not 100% sure this is correct.)

          We see the following messages in the error log. The same exception occurs for each of our slaves within a short period of time.

          Jul 31, 2014 5:13:17 AM jenkins.slaves.JnlpSlaveAgentProtocol$Handler$1 onClosed
          WARNING: NioChannelHub keys=86 gen=1625477529: Computer.threadPoolForRemoting 58 for + XXXXXXXX terminated
          java.io.IOException: Failed to abort
          at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:184)
          at org.jenkinsci.remoting.nio.NioChannelHub.abortAll(NioChannelHub.java:599)
          at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:481)
          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
          at java.util.concurrent.FutureTask.run(Unknown Source)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
          at java.lang.Thread.run(Unknown Source)
          Caused by: java.nio.channels.ClosedChannelException
          at sun.nio.ch.SocketChannelImpl.shutdownInput(Unknown Source)
          at sun.nio.ch.SocketAdaptor.shutdownInput(Unknown Source)
          at org.jenkinsci.remoting.nio.Closeables$1.close(Closeables.java:20)
          at org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport.closeR(NioChannelHub.java:289)
          at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport$1.call(NioChannelHub.java:226)
          at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport$1.call(NioChannelHub.java:224)
          at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:474)
          ... 6 more

          In the first case, we also saw ping timeouts at about the same time as the problem; these were not present in the other case. In the latest case, a single slave lost network connectivity and we saw this exception before the 'crash' happened. However, I believe this to be a coincidence: the exception occurs in the logs from time to time without all slaves losing connectivity.

          We see other exceptions in the logs. However, these seem to be related to us shutting down idle machines, or the Disk Usage Util plugin, and seem unrelated.

          Last week, we increased the load on our master from about 40 slaves to 70, and also increased the number of jobs. Before this, we had not seen the problem.

          We are planning to upgrade to take in the (now reopened) fix for JENKINS-22932.


          Kevin Browder added a comment -

          OK, so I think the core issue is that line 513 of org.jenkinsci.remoting.nio.NioChannelHub reads:
          if (key.isReadable()) {
          whereas I think it should be:
          if (key.isValid() && key.isReadable()) {
          I guess this would fix the issue, assuming that selectedKeys().iterator() is thread-safe (I don't really know much about NIO). Actually, it probably makes sense just to add a catch to one of the handlers in the same method (I think the one at http://git.io/VtniaQ).

          Basically, my theory is that isReadable() throws a CancelledKeyException, which ends up getting caught by the RuntimeException handler (at http://git.io/l-5MhA); that kills the loop and attempts to abort everything, including the no-longer-valid selector key (which produces the message in the description).
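          To illustrate the proposed guard, here is a minimal, self-contained sketch (not Jenkins code; the class and method names are made up for illustration). Cancelling a SelectionKey and then calling isReadable() on it throws CancelledKeyException, exactly as in the stack trace in the description, while prefixing the check with isValid() avoids it:

```java
import java.nio.channels.CancelledKeyException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

public class CancelledKeyDemo {
    /** The guarded form proposed for NioChannelHub line 513. */
    static boolean guardedIsReadable(SelectionKey key) {
        return key.isValid() && key.isReadable();
    }

    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        Pipe pipe = Pipe.open();
        pipe.source().configureBlocking(false);
        SelectionKey key = pipe.source().register(selector, SelectionKey.OP_READ);

        // Cancelling the key (as happens when a slave channel closes) invalidates it.
        key.cancel();

        // The unguarded check blows up, just like the check in NioChannelHub.run:
        try {
            key.isReadable();
            System.out.println("no exception");
        } catch (CancelledKeyException e) {
            System.out.println("CancelledKeyException from unguarded check");
        }

        // The guarded check simply reports the cancelled key as not readable.
        System.out.println("guarded check: " + guardedIsReadable(key));

        pipe.sink().close();
        pipe.source().close();
        selector.close();
    }
}
```

          Run as written, this takes the exception branch for the unguarded check and prints `guarded check: false` for the guarded one.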

          Jesse Glick made changes -
          Component/s New: core [ 15593 ]
          Component/s Original: slave-status [ 15981 ]
          Assignee New: Kohsuke Kawaguchi [ kohsuke ]
          Kevin Browder made changes -
          Summary Original: All slaves disconnect and no new slaves can connect CancelledKeyException in org.jenkinsci.remoting New: All slaves disconnect and no new slaves can connect due to CancelledKeyException in org.jenkinsci.remoting

          Kevin Browder added a comment - - edited

          @James: I think the ClosedChannelException is actually closer to JENKINS-22932 (if so, you should reopen it; I had reopened it before realizing I had a different root cause, and then closed it again). However, one could argue that the "selector" loop should catch all NIO errors and try again instead of its current behavior of killing the loop entirely, so the fix may end up being the same.

          I've implemented a patch with the key.isValid() check above:
          https://github.com/kbrowder/remoting/commit/d52cef17a789bac0d1478c561c6696a82eb9ab6a
          I also have another change that catches CancelledKeyExceptions:
          https://github.com/kbrowder/remoting/commit/1dc29075e26c382b593d189a3a04cd1ab859f7c5

          I think that with some minor modification this last approach could be extended to catch a number of other potential pitfalls.
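          As a sketch of that second approach — catching CancelledKeyException inside the loop rather than letting it propagate to the handler that aborts every channel — here is a hypothetical, simplified dispatch pass. The names `dispatchOnce` and `onReadable` are illustrative, not the actual remoting API:

```java
import java.io.IOException;
import java.nio.channels.CancelledKeyException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.Iterator;
import java.util.function.Consumer;

public class DefensiveSelectorLoop {
    /**
     * One pass over the selected-key set. A key cancelled between select()
     * and the readiness check is skipped instead of tearing down the hub.
     * Returns the number of keys actually handled.
     */
    static int dispatchOnce(Selector selector, Consumer<SelectionKey> onReadable) throws IOException {
        int handled = 0;
        selector.selectNow();
        Iterator<SelectionKey> it = selector.selectedKeys().iterator();
        while (it.hasNext()) {
            SelectionKey key = it.next();
            it.remove();
            try {
                if (key.isValid() && key.isReadable()) {
                    onReadable.accept(key);
                    handled++;
                }
            } catch (CancelledKeyException e) {
                // The key was cancelled concurrently: drop just this key and
                // keep serving the remaining channels, instead of abortAll().
            }
        }
        return handled;
    }
}
```

          The point of this shape is that only the offending key is skipped; the other slave channels keep working, rather than all of them being aborted as in the trace in the description.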


          Jesse Glick added a comment -

          Assuming the purported fix in JENKINS-22932 did in fact correct at least some variants of the bug, it should be left closed; if this issue represents some other variants, then fine—a follow-up fix can close this one, and it can be backported separately if marked lts-candidate.


          James Noonan added a comment -

          We updated today to take in the fix for JENKINS-22932.

          If the issue reoccurs for us, I'll raise a new defect.


            Assignee: Kohsuke Kawaguchi (kohsuke)
            Reporter: Kevin Browder (kbrowder)
            Votes: 5
            Watchers: 13