[JENKINS-43038] Intermittent error "Cannot contact node123: java.lang.InterruptedException" in Jenkins

      We intermittently face the connection error below while running jobs on node123.

      The error we see in the build log is: Cannot contact node123: java.lang.InterruptedException

      I don't see any error in the thread dump or in any other logs related to this node.

      I also see that there was no connection drop between the master and the node.

      The slave has been running for more than 24 hours now.


          Puneeth Nanjundaswamy added a comment - edited

          Facing the same issue here. JENKINS-46853

          John Lengeling added a comment -

          Saw this same message in the middle of a build from an elastically provisioned OpenStack node. Using a Pipeline job with 7 parallel stages on 7 elastic nodes. Running Jenkins 2.89.1, OpenStack plugin 2.29, and the latest Pipeline plugins.

          Sorin Sbarnea added a comment - edited

          I got the same error, and I have the impression that Jenkins is unable to recover from it, because in more than 30 minutes I didn't get any progress message on the console. The slave.jar process is still running on that machine and there are no networking issues between the master and the slave. Does anyone know how to debug this further? Maybe we can narrow down the bug.

          It is true that my Jenkins is 2.60.3, which seems a little bit old.

          Tsvi Mostovicz added a comment -

          I see this issue as well during testing, where a single shell script can run for about 10-20 minutes.

          I suppose it happens when the agent gets disconnected for a split second. Is there any way to create a workaround that protects the shell script from this? At the moment I have to manually abort the running test.

          Thanks,

          Tsvi

          mishal shah added a comment -

          vrenjith Did you find a workaround for the x4 slowness? 

          Oleg Nenashev added a comment -

          Unfortunately I have no capacity to work on Remoting in the medium term, so I will unassign it and let others take it. If somebody is interested in submitting a pull request, I will be happy to help get it reviewed and released.

          Sam Van Oort added a comment - edited

          msavlani1 shahmishal tsvi If you update to the latest Pipeline plugins, and especially the support-core plugin, and use the suggested GC settings (https://jenkins.io/blog/2016/11/21/gc-tuning/), you should find that the InterruptedExceptions are pretty much gone. They are generally the result of timeouts in remoting-related operations. I believe the only cases where they should happen now are actual hardware/system/network issues.

          In the last quarter of 2017 we made a big change to the way Pipeline's durable tasks interact with remoting that should avoid many of these issues.

          Edit: There was an additional issue around support-core that caused problems and was recently fixed. Specifically, the support-core plugin in version 2.42 added heap histogram analysis for diagnostics, but this had the unexpected side effect of introducing periodic, catastrophically long GC pauses that made the Jenkins master unresponsive for long periods and triggered timeouts (and thus the InterruptedException here when the timeouts kick in).

          Please see https://issues.jenkins-ci.org/browse/JENKINS-49931 for more details of that.

          For now I'm going to transition this to "closed" because, when working with several users showing this among other symptoms, the suggestions above successfully resolved the issues. I'm happy to re-open this if you all still experience problems after applying the above (please reply to note the same).
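
To illustrate the mechanism described in the comment above (a timeout interrupting an operation that is blocked waiting on something slow), here is a minimal Java sketch. It is not Jenkins or Remoting code: the class name, method name, and the returned message are hypothetical, and `Thread.sleep` merely stands in for a blocking remote call. It only shows how a watchdog timer that interrupts a waiting thread surfaces as `java.lang.InterruptedException`.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not taken from Jenkins or Remoting.
public class TimeoutInterruptDemo {

    /**
     * Runs a blocking operation on the calling thread. A watchdog interrupts
     * the caller if the operation has not finished within timeoutMillis.
     */
    public static String callWithTimeout(long workMillis, long timeoutMillis) {
        ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
        Thread caller = Thread.currentThread();
        ScheduledFuture<?> alarm =
                watchdog.schedule(caller::interrupt, timeoutMillis, TimeUnit.MILLISECONDS);
        try {
            // Stand-in for a blocking remote call that may stall,
            // for example while the other side is stuck in a long GC pause.
            Thread.sleep(workMillis);
            return "completed";
        } catch (InterruptedException e) {
            // The timeout fired while we were still waiting.
            return "Cannot contact node: " + e.getClass().getName();
        } finally {
            alarm.cancel(false);
            watchdog.shutdownNow();
            Thread.interrupted(); // clear the flag in case the alarm fired late
        }
    }

    public static void main(String[] args) {
        System.out.println(callWithTimeout(50, 1000));  // finishes before the timeout
        System.out.println(callWithTimeout(1000, 50));  // interrupted by the timeout
    }
}
```

In the second call the operation itself is healthy but slow, which matches the observation in this thread that the agent process and the network can look fine while the build log still reports the exception.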

          Joe Barber added a comment -

          Hi, I have recently been seeing the same "Cannot contact node123: java.lang.InterruptedException" error, but only during parallel stages in a Pipeline job.

          I have created a brand new Jenkins environment (Jenkins version 2.121.1) with all plugins updated, and have set the GC settings according to the gc-tuning page from the above comment.
          This issue is intermittent (about 1 in every 8 builds or so).

          Support-Core version 2.48
          Pipeline version 2.5

          Any other advice?

          Thanks,
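
Since long GC pauses were named earlier in this thread as one trigger for the timeouts, one way to get a rough picture of GC activity in a JVM is the standard `java.lang.management` API. The sketch below is an assumption-laden illustration, not a Jenkins diagnostic: the class and method names are hypothetical, it reports on the JVM it runs in (not on a Jenkins master unless the same calls are made inside that process), and accumulated collection time is only a rough proxy for pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical illustration only; reports GC statistics for the current JVM.
public class GcPauseReport {

    /** Sums the approximate accumulated collection time (ms) over all collectors. */
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.println("Total GC time (ms): " + totalGcMillis());
    }
}
```

Sampling these counters at two points in time and comparing the difference gives an idea of how much time went to garbage collection in between; GC logs remain the more precise source for individual pause lengths.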

          Sam Van Oort added a comment -

          joebarber What you describe sounds a lot like https://issues.jenkins-ci.org/browse/JENKINS-46507 but we have not had a consistent way to reproduce the issue, so it's very hard to debug. If you can provide a simple, self-contained sample Pipeline in the comments of that ticket that will reproduce the issue, that would be very helpful. Thanks!

          Andrey Babushkin added a comment -

          We're experiencing the same issue when our Java agent gets killed by OOM, or when the machine on which the agent is running is rebooted. Is there any way to reduce the amount of time Jenkins waits before the build is marked as failed?

            svanoort Sam Van Oort
            msavlani1 Manish Sawlani
            Votes: 25
            Watchers: 44