JENKINS-46507: Parallel Pipeline random java.lang.InterruptedException

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • None
    • workflow-durable-task-step 2.29

      In my Pipeline job, I sometimes randomly receive the java.lang.InterruptedException below:

      java.lang.InterruptedException
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1302)
      	at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:275)
      	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:111)
      	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadGroupSynchronously(CpsStepContext.java:248)
      	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadSynchronously(CpsStepContext.java:237)
      	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.doGet(CpsStepContext.java:294)
      	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:61)
      	at org.jenkinsci.plugins.workflow.steps.StepDescriptor.checkContextAvailability(StepDescriptor.java:251)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:179)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:126)
      	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:108)
      	at org.jenkinsci.plugins.workflow.cps.CpsScript.println(CpsScript.java:207)
      	at org.jenkinsci.plugins.workflow.cps.CpsScript.print(CpsScript.java:202)
      	at sun.reflect.GeneratedMethodAccessor103253.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
      ....
      ....
      

      Please refer to the attached files for the full console log and the pipeline's Jenkinsfile code.

        1. workflow-durable-task-step.hpi
          85 kB
        2. jenkins.log
          266 kB
        3. stuff.tgz
          310 kB
        4. hs_err_pid239040.log
          84 kB
        5. Jenkinsfile
          7 kB
        6. Jenkinsfile.txt
          6 kB
        7. consoleText_ERROR.txt
          95 kB

          [JENKINS-46507] Parallel Pipeline random java.lang.InterruptedException

          Andrew Bayer added a comment -

          svanoort - may be of interest to you.


          Kieran Webber added a comment -

          +1 to this issue. Experiencing a hang followed by this error when a large number of pipelines run concurrently.


          Ann Beeskau added a comment -

          +1 we're experiencing this as well.


          Syed Imran added a comment - - edited

          +1 we're experiencing this as well. Expecting a resolution.


          Georg Henzler added a comment -

          +1 we're experiencing this as well.


          Suriyanath added a comment -

          We are also experiencing the same issue, on Jenkins version 2.69.


          Lorenzo Cubero added a comment -

          Same here using Jenkins 2.73.2

          shyam singh added a comment - - edited

          +1 Same here, we are facing this issue very frequently


          Bruno Sousa added a comment -

          +1 same here


          Jarrett B added a comment -

          +1 Same here


          Georg Henzler added a comment -

          We also had this problem again today... reading jglick's comment in https://issues.jenkins-ci.org/browse/JENKINS-35710?focusedCommentId=268398&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-268398 I suspect adding more slaves does not help (as the stack trace clearly points to CPS, which according to that comment always runs on the master)... but I suppose using https://github.com/jenkinsci/throttle-concurrent-builds-plugin to ensure only a maximum number of builds run in parallel globally could help. svanoort, before trying, do you think this is a valid way of getting rid of the problem? (Asking before trying, as this is not easy to reproduce.)


          Sam Van Oort added a comment -

          ghenzler Executors have nothing to do with this, and this issue is generally within the scope of a single build, so limiting the maximum number should have no impact – except when restarting Jenkins, where there might be enough of a bottleneck due to loading FlowExecutions to hit the 1 minute timeout. Maybe if you had 10x as many Pipelines as CPU cores on the master, or really complex Program state – in that case we'd be blocking on the code here: https://github.com/jenkinsci/workflow-job-plugin/blob/master/src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java#L1024

          Do you have a message like this in your logs? WARNING, failed to wait for $someRun to be loaded (and then a stack trace).

          More likely there's something subtly wrong with the concurrency/error-handling model here – some of what I'm doing right now to untangle some nasty issues in these guts may help – and there was a change in the last release of workflow-cps that might help if you don't have it already (it caught more exceptions and prevented threads from potentially waiting forever).


          Alexey Grigorov added a comment - - edited

          Same here. I'm running a @NonCPS-labeled method in a shared library and it fails with this error randomly in the middle:

          import jenkins.*
          import jenkins.model.*
          import hudson.*
          import hudson.model.*
          
          /* This method will find all jobs and verify their disk usage.
          Then it will remove job runs to fit  quota
          
          That will help to save disk space
          
          */
          @NonCPS
          def call() {
              def maxRunSize = 21474836480 //20Gb
              def maxJobSize = 42949672960 //40Gb
              Jenkins.instance.getAllItems(hudson.model.Job).each
                      {
                          println it
                          def runs = it.getBuilds()
                          def job = it
                          def totalSize = 0
                          def jobSizeReached = false
                          runs.each{
                              def currentRun = it
                              def currentRunSize = it.getRootDir().directorySize()
                              println it
                              println currentRunSize
                              switch (currentRunSize) {
                                  case {it > maxRunSize} :
                                      currentRun.delete()
                                      println "deleted " + currentRun + " for having size " + it + " over allowed " +  maxRunSize
                                      break
                                  case {it + totalSize > maxJobSize || jobSizeReached } :
                                      jobSizeReached = true
                                      currentRun.delete()
                                      println "deleted " + currentRun + " for not fitting total  " + job + " jobs size " + (totalSize+currentRunSize) +" to max  allowed " +  maxJobSize
                                      break
                                  default :
                                      totalSize += currentRunSize
                                      println 'leaving '  + currentRun + ". current total = " + totalSize
                                      break
                              }
                              println '--------------------------'
                          }
                          println '=========================='
                      }
          }
          
          
          <<Some output from the loop here, sometimes a few lines, sometimes a few hundred; I guess it fails after some random number of iterations>>
          --------------------------
          [Pipeline] echo
          [Pipeline] End of Pipeline
          java.lang.InterruptedException
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1302)
          	at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:275)
          	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:111)
          	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadGroupSynchronously(CpsStepContext.java:248)
          	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadSynchronously(CpsStepContext.java:237)
          	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.doGet(CpsStepContext.java:298)
          	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:61)
          	at org.jenkinsci.plugins.workflow.steps.StepDescriptor.checkContextAvailability(StepDescriptor.java:258)
          	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:206)
          	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:153)
          	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
          	at org.jenkinsci.plugins.workflow.cps.CpsScript.println(CpsScript.java:221)
          	at sun.reflect.GeneratedMethodAccessor1023.invoke(Unknown Source)
          	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          	at java.lang.reflect.Method.invoke(Method.java:498)
          	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
          	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
          	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1213)
          	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
          	at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:925)
          	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:908)
          	at org.codehaus.groovy.runtime.DefaultGroovyMethods.println(DefaultGroovyMethods.java:678)
          	at org.codehaus.groovy.runtime.dgm$500.doMethodInvoke(Unknown Source)
          	at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:294)
          	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
          	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.callCurrent(PogoMetaClassSite.java:69)
          	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:166)
          

          Jenkins 2.107.3


          Toby Broyles added a comment -

          We are seeing this error quite often lately. Is there anything I can do to help troubleshoot?


          Sam Van Oort added a comment -

          dnusbaum Please, could you take a look?


          Devin Nusbaum added a comment - - edited

          I finally had a chance to take a look at this.

          iceiceice: You are using println inside of an @NonCPS method. It might seem like this should be ok, but behind the scenes println ends up calling the echo step, which means that you are executing Pipeline Steps inside of an @NonCPS method, which should not be done even though it may appear to work correctly in some cases.
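
          A minimal sketch of the distinction (hypothetical method names, not the reporter's actual code): keep the @NonCPS method free of Pipeline steps by only returning data, and do any echo/println from regular CPS-transformed Pipeline code.

          import jenkins.model.Jenkins

          // Hypothetical sketch: no Pipeline steps inside the @NonCPS method.
          @NonCPS
          def collectJobNames() {
              def names = []
              Jenkins.instance.getAllItems(hudson.model.Job).each { names << it.fullName }
              return names // plain Groovy only; no echo/println in here
          }

          // Back in CPS-transformed Pipeline code, steps are fine:
          for (name in collectJobNames()) {
              echo name
          }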

          tobybro If you have a minimal pipeline that can reproduce the issue, or a scenario in which you can reproduce the problem consistently, it would be very helpful if you could post the Pipeline script and the build folder of one of the builds that failed with the exception here. Otherwise, it would be great to see your actual pipeline that is hitting the error so we can check for any similarities with Alexey's or Rick's Pipelines that might give us an idea of what is happening.

          EDIT: My original hypothesis was misleading and incorrect, so I've edited it based on new information.

          Notably, the exception being thrown in all these cases comes from AbstractQueuedSynchronizer#acquireSharedInterruptibly Line 1302, which checks the interrupt status before the thread tries to acquire the lock. This means that the thread was actually interrupted before the step started executing, and CpsStepContext#getThreadGroupSynchronously just happens to be the first piece of code to check the interrupted flag and throw an exception. I got access to the flow nodes of another user's pipeline hitting the same exception, and noticed that the duration between the last successfully executed node and the start of the step that threw the exception was just over 5 minutes, which leads me to believe that the code that is doing the interruption is the Timeout in CpsThread#runNextChunk.

          None of the Pipelines that I've seen that have thrown this error look like they should have taken more than 5 minutes in runNextChunk, so I am suspicious that there is something going wrong inside of that method (maybe it is executing more code than it should), but I do not know how to validate my hypothesis.
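
          As a plain-JVM illustration of that reasoning (a hedged sketch, independent of any Jenkins code): a thread whose interrupt flag is already set keeps running CPU-bound code without error, and the InterruptedException only surfaces at the next interruptible call, such as the acquireSharedInterruptibly frame in the traces above.

          def latch = new java.util.concurrent.CountDownLatch(1)
          def t = Thread.start {
              Thread.currentThread().interrupt()                // simulate an earlier interrupt (e.g. a timeout)
              long x = 0
              for (int i = 0; i < 1_000_000; i++) { x += i }    // CPU-bound work keeps running unaffected
              try {
                  latch.await()                                 // first interruptible call checks the flag and throws
              } catch (InterruptedException e) {
                  println "interrupt only noticed here: ${e}"
              }
          }
          t.join()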


          Andrew a added a comment -

          Hi everyone. A little info on this issue, which we've been hitting as well. We're using Jenkins and its Pipeline plugin to run our regression and CI builds for a SoC (System-on-Chip) we're designing. As a side effect of what we're doing, some tests (under the parallel step) take a very long time to complete and are very CPU-intensive. Here are some statistics I've noticed:

          1. CI builds that have a small number (~10) of short tests under the parallel step almost always complete successfully.
          2. Regression runs that take a few days to complete and have a huge list of longer tests almost always hit this bug.
          3. If Jenkins was restarted fresh, the chances of hitting this bug are somewhat lower. At first I thought it was related to a memory leak in the Jenkins test-results-analyzer plugin, which (due to our huge, sometimes over 500 MiB, log files) crashed Jenkins often, until I started filtering logs to mitigate the issue (https://issues.jenkins-ci.org/browse/JENKINS-34134). Tuning GC options seemed to further improve things.
          4. Since we're running builds on an NFS filesystem to later distribute and execute tests on different servers, thus balancing the load, I first considered filesystem lags due to other tasks running on our servers. Adding the snippet below seemed to improve things a little bit. The disk load is not evenly distributed: it's heavy at the start of a test (when the model is loaded into memory, and during reset due to extensive logging), and later it's usually very low throughout the whole run.

           

          /* Workaround laggy filesystem */
          script {
             System.setProperty("org.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL", "3800");
          }

           

           

          5. Another thing I've noticed: some tests seem to randomly fail. According to the simulation log, the simulation process was just killed (by something other than OOM) rather than getting stuck, and that happens quite fast (e.g. the simulation hasn't even completed the reset sequence, which takes a few seconds of real time but usually spews a lot of info into the log).

          6. In the Blue Ocean UI, all failures in parallel steps seem to be grouped in the very bottom half of the screen, after 150+ successfully executed tests. Since these are mostly in the order they were started (I assume), it makes me believe it might be somehow related to either GC taking a lot of time or some memory leaking here and there.

           

          The grand total is 5-10 tests out of 300+ randomly crashing, and almost always the annoying java.lang.InterruptedException, with the whole run (that took 3-4 days to complete!) just freezing.

          A full regression that takes a few days to complete almost always triggers the bug. I'm willing to help solve this issue, but since it requires some sacred knowledge about Jenkins' guts that I do not have, I can either post my pipeline script (which will most likely be useless, due to the huge number of internal tools being called), or give some of the proposed fixes a try and post the results, since I have the right environment.

           

           


          Devin Nusbaum added a comment -

          ncrmnt Thank you for including all of those details in your comment! It's interesting that you are hitting the issue consistently on multi-day builds. How much of your build is running external processes sh/bat vs Pipeline/Groovy code (i.e. Are you more or less running a multi-day bash script (or many such scripts in a parallel block), or is most of the time spent running Groovy code in your Jenkinsfile, or is it about even)?

          I am not aware of any workarounds for the issue. One thing that would be helpful would be if you could upload the build folder for one of your builds which failed with this issue. In particular, I am interested in the XML files in the workflow/ directory to see if the serialized flow nodes can tell us what was executing when the issue occurred which may give us some insight.


          Andrew a added a comment -

          dnusbaum, my last run was inconsistent due to the fact that the server with Jenkins was out of disk space (sic!). I'll start the next regression run today and post results once it's done or hangs. This will take a few days though, depending on the load. Once it's done I'll try to provide you with as much data as I can.

          > How much of your build is running external processes sh/bat vs Pipeline/Groovy code (i.e. Are you more or less running a multi-day bash script (or many such scripts in a parallel block), or is most of the time spent running Groovy code in your Jenkinsfile, or is it about even)?

          Basically it all comes down to running a single "cmake's ctest" process with one test in the list, which may take anything from a few minutes to a few days with little to no stdout. Here's my pipeline script:

          Jenkinsfile

          Please note that it may seem weird in places due to our internal server config and my attempts to work around this bug. It configures and builds the project on one selected node (the build is done on an NFS share with the exact same path on all nodes), the touchstone simulation step creates simulation checkpoints, and finally the result is executed in parallel during the regression phase, with throttling enabled and the load distributed over our servers.
           

           

           


          Andrew a added a comment - - edited

          dnusbaum, an update. It looks like I got a hit on the first try and now see a handful of
           

          [rumboot-default-rumboot-basis-Debug-basis-stress-test-uart-eth-dma] Cannot contact oribi: java.lang.InterruptedException
          [rumboot-default-rumboot-basis-Debug-simple-iram-mirror-hello-arinc_1] Cannot contact bison: java.lang.InterruptedException
          [rumboot-default-rumboot-basis-Debug-simple-iram-mirror-hello-arinc_6] Cannot contact oribi: java.lang.InterruptedException
          

          in the log. I now have the build in the stuck state and can provide the data. Can you give a bit more detail about where to find the relevant XML files? A quick grep on the slaves didn't reveal any workflow directories. Where should I look for them?

          EDIT: Hm... I've been running tests with a maximum timeout of 260000 s, and it seems it failed once it stumbled upon the first test that actually timed out. Actually, all tests that have an InterruptedException are those that are currently timing out and being worked on! They should've hung for at least 260000 seconds, consuming 100% of one CPU core with no stdout, before being killed internally by the ctest process. The InterruptedException arrives earlier.

          I wonder if wrapping 'sleep 260000' in Groovy will produce the same results, as sketched below.
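
          A minimal sketch of such a reproduction attempt (branch count and labels are placeholders), in case anyone wants to try it:

          // Hypothetical reproduction sketch: long external sleeps in parallel branches.
          def branches = [:]
          for (int i = 1; i <= 3; i++) {
              branches["sleep-${i}"] = {
                  node {
                      sh 'sleep 260000'
                  }
              }
          }
          parallel branches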


          Devin Nusbaum added a comment -

          Can you give a bit more detail about where to find the relevant XML files? A quick grep on the slaves didn't reveal any workflow directories. Where should I look for them?

          Yep, sorry for not specifying earlier. They will be on the master, not on the agents. The build directory will be $JENKINS_HOME/jobs/<path-to-job>/builds/<failing_build_number>/. Inside, the main files of interest are build.xml and everything in workflow/.
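
          If it is easier, the same directory can also be located from the Script Console; a hedged sketch (job name and build number are placeholders):

          import jenkins.model.Jenkins

          // Hypothetical example: print the build directory and the workflow/ flow node files.
          def job = Jenkins.instance.getItemByFullName('my-folder/my-pipeline')
          def build = job.getBuildByNumber(42)
          println build.rootDir                                  // .../jobs/<path-to-job>/builds/42
          new File(build.rootDir, 'workflow').eachFile { println it.name }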

          If it is reproducible with sleep 260000 that would be interesting. At those lengths of time I would start to suspect something going wrong with a network call from the master to the agent causing code somewhere to hang that Pipeline does not handle gracefully. Do your overall Jenkins master logs have any warnings/errors around the time the InterruptedException was thrown?


          Andrew a added a comment -

          stuff.tgz Here is the tarball with all the XML files. Be careful, there's a ton of them in there.

          I do see some exceptions like this in the Jenkins master logs:

          Oct 01, 2018 1:34:10 PM org.jenkinsci.plugins.workflow.cps.CpsStepContext completed
          INFO: new failure
          org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
                  at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.interrupt(CpsFlowExecution.java:1125)
                  at org.jenkinsci.plugins.workflow.job.WorkflowRun$2$1.run(WorkflowRun.java:378)
                  at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
                  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
                  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                  at java.lang.Thread.run(Thread.java:748)
                  Suppressed: java.lang.InterruptedException
                          at java.lang.Object.wait(Native Method)
                          at hudson.remoting.Request.call(Request.java:177)
                          at hudson.remoting.Channel.call(Channel.java:954)
                          at hudson.Launcher$RemoteLauncher.kill(Launcher.java:1084)
                          at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.stop(FileMonitoringTask.java:304)
                          at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.stop(DurableTaskStep.java:258)
                          at org.jenkinsci.plugins.workflow.cps.CpsThread.stop(CpsThread.java:296)
                          at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6.onSuccess(CpsFlowExecution.java:1139)
                          at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6.onSuccess(CpsFlowExecution.java:1128)
                          at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4$1.run(CpsFlowExecution.java:906)
                          at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35)
                          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                          at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
                          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
                          at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
                          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                        ... 3 more
          Oct 01, 2018 1:34:10 PM org.jenkinsci.plugins.workflow.cps.CpsStepContext completed
          INFO: previously delivered here
          java.lang.Throwable
                  at org.jenkinsci.plugins.workflow.cps.CpsStepContext.completed(CpsStepContext.java:341)
                  at org.jenkinsci.plugins.workflow.cps.CpsStepContext.onFailure(CpsStepContext.java:329)
                  at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution$1.run(DurableTaskStep.java:254)
                  at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
                  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
                  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                  at java.lang.Thread.run(Thread.java:748)
          

          But I'm not sure if these are due to the build failing, or to restarting the master earlier today. Our instance has become quite bloated recently.

          Btw, I restarted the build with an insanely huge heartbeat interval that should match the tests' timeout. We'll see how it goes now.


          Christoph Amshoff added a comment -

          Same issue here...

          We have a quite small Jenkinsfile that calls a RESTful service, which takes (depending on parameters) from a few seconds up to an hour or more to complete. The JSON returned by the service should be attached to the build, which is why it's called synchronously. The longer the service call takes, the more likely the InterruptedException occurs. There are no parallel steps in our pipeline, so I'm pretty sure it's related to the long-running step.

          The REST call is done by a self-made Groovy function and classes in our shared library, basically setting up an HttpURLConnection instance. However, that works nicely for fast service calls, so I don't think there is an issue in this code.

          Let me know if I can help with any files/logs.


          Devin Nusbaum added a comment - - edited

          chamshoff Without seeing your Jenkinsfile, it's hard to say, but my guess would be that you are hitting the 5 minute timeout here because your code is waiting for a network response or similar in a method that doesn't respond to being interrupted (so it doesn't fail until the next step is invoked which notices that the thread was interrupted, assuming your stack trace is original to the OP's). I would highly recommend not making direct HTTP calls in your Jenkinsfile/Shared Library even if it seems like it works fine in some cases. You can create a bash script that does whatever you want (curl, java -jar yourRestClient.jar, etc.) and call that script using the sh or bat steps which run asynchronously and are not subject to a 5 minute timeout.
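
          For illustration, a hedged sketch of that suggestion (the URL and file name are placeholders): run the long call as an external process via sh, which is not subject to the 5 minute step timeout, then read the result back into the build.

          node {
              // Hypothetical example: the long-running REST call is done by an external process.
              sh 'curl -sS -o response.json "https://service.example.com/long-running-report"'
              def json = readFile 'response.json'
              echo "service returned ${json.length()} characters"
              archiveArtifacts artifacts: 'response.json'        // attach the JSON to the build
          }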

          ncrmnt Based on those logs, there does appear to be a network issue with the remoting layer involved as well. I will check through the build files you sent to see if they give any indication of a root cause that we can address.


          Andrew a added a comment -

          dnusbaum Thanks a lot for looking into it. Let me know if I can be of any help with this issue.

           

          P.S. Another run with a super-high heartbeat interval ran for a little longer, but ended up with another InterruptedException as well on one of our tests that is timing out.


          Devin Nusbaum added a comment -

          ncrmnt The exception in your most recent build looks like a networking issue after you manually aborted the build. Here is the stack trace:

          org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
              at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.interrupt(CpsFlowExecution.java:1125)
              at org.jenkinsci.plugins.workflow.job.WorkflowRun$2$1.run(WorkflowRun.java:378)
              at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
              Suppressed: java.lang.InterruptedException
                  at java.lang.Object.wait(Native Method)
                  at hudson.remoting.Request.call(Request.java:177)
                  at hudson.remoting.Channel.call(Channel.java:954)
                  at hudson.Launcher$RemoteLauncher.kill(Launcher.java:1084)
                  at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.stop(FileMonitoringTask.java:304)
                  at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.stop(DurableTaskStep.java:258)
                  at org.jenkinsci.plugins.workflow.cps.CpsThread.stop(CpsThread.java:296)
                  at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6.onSuccess(CpsFlowExecution.java:1139)
                  at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6.onSuccess(CpsFlowExecution.java:1128)
                  at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4$1.run(CpsFlowExecution.java:906)
                  at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35)
                  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                  at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
                  at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
                  at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
                  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                  at java.lang.Thread.run(Thread.java:748)
              ... Suppressed exception repeated for each parallel block
          

          And the serialized FlowInterruptedException has:

          <causes class="java.util.Arrays$ArrayList">
            <a class="jenkins.model.CauseOfInterruption-array">
              <jenkins.model.CauseOfInterruption_-UserInterruption>
                <user>andrianov</user>
              </jenkins.model.CauseOfInterruption_-UserInterruption>
            </a>
          </causes>
          

          Have you been manually aborting the builds, or is something else happening? Either way, it seems like you might have a distinct issue, so might be better to move the discussion to a new ticket or to the mailing list to help diagnose your remoting problems.


          Devin Nusbaum added a comment -

          For anyone still seeing the original exception in the description, try adding a custom logger (Manage Jenkins -> System Log -> Add new log recorder) at FINE level for org.jenkinsci.plugins.workflow.support.concurrent.Timeout so that we see these log messages. Then, the next time you see the issue, attach the output of the custom logger here; it will tell us whether my hypothesis is correct and, if so, what code was actually running at the time of the issue.

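          For reference, such a log recorder boils down to standard java.util.logging configuration. The sketch below only illustrates that idea with plain JUL; the class name and handler wiring are illustrative assumptions and are not the plugin's or Jenkins' actual code.

          import java.util.logging.ConsoleHandler;
          import java.util.logging.Level;
          import java.util.logging.Logger;

          // Minimal java.util.logging sketch of a FINE-level recorder for the Timeout class.
          // Illustrative only; a Jenkins log recorder does comparable wiring for you.
          public class TimeoutLoggerSketch {
              public static void main(String[] args) {
                  Logger timeoutLogger =
                          Logger.getLogger("org.jenkinsci.plugins.workflow.support.concurrent.Timeout");
                  timeoutLogger.setLevel(Level.FINE);   // let FINE records through the logger itself

                  ConsoleHandler handler = new ConsoleHandler();
                  handler.setLevel(Level.FINE);         // the handler must also accept FINE, or records are dropped
                  timeoutLogger.addHandler(handler);

                  timeoutLogger.fine("custom FINE logging enabled"); // sample record to verify the setup
              }
          }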

          Andrew a added a comment - - edited

          Yes, I have been manually aborting the builds, but only after they were already stuck with the exception. I'll clean up all the logs and fire up a fresh regression run this weekend with the FINE logging you've mentioned. Let's see if that gives us any better results.


          Devin Nusbaum added a comment -

          Sounds good. When you abort the build, it can change the stored info, so I would collect an archive of the build files prior to aborting it.

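          For reference, capturing such an archive before aborting can be as simple as recursively copying the build directory. The sketch below assumes a conventional $JENKINS_HOME/jobs/<job>/builds/<number> layout; the paths are made-up examples and this is not part of any Jenkins tooling.

          import java.io.IOException;
          import java.nio.file.Files;
          import java.nio.file.Path;
          import java.nio.file.Paths;
          import java.util.stream.Stream;

          // Copies one build directory to a backup location before the build is aborted.
          // Source and target paths are illustrative assumptions about a typical layout.
          public class BuildDirSnapshot {
              public static void main(String[] args) throws IOException {
                  Path source = Paths.get("/var/lib/jenkins/jobs/my-pipeline/builds/123");
                  Path target = Paths.get("/tmp/my-pipeline-123-backup");
                  try (Stream<Path> paths = Files.walk(source)) {
                      for (Path p : (Iterable<Path>) paths::iterator) {
                          Path dest = target.resolve(source.relativize(p).toString());
                          if (Files.isDirectory(p)) {
                              Files.createDirectories(dest);   // recreate the directory structure
                          } else {
                              Files.copy(p, dest);             // copy build.xml, log, flow node files, etc.
                          }
                      }
                  }
              }
          }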

          Andrew a added a comment - - edited

          Before actually starting the job for this weekend, I went to the folks at our IT department and talked about the issue. We looked at the Zabbix logs and could see no problem with network connectivity during the previous runs, and the nodes still had plenty of free resources. The VM running Jenkins was NOT being backed up (a VM backup might break/freeze TCP connections). The only thing that might have interfered is the NFS home-directory backup that runs every evening, since that will introduce some filesystem lag. We'll have a good look at the Zabbix logs once again on Monday. Meanwhile, here are the FINE exception logs:

          Oct 28, 2018 12:58:13 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#222]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=541659 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 12:58:14 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#159]: checking /home/jenkins/basis_rtl_regression/build on oryx / waiting for oryx id=541661 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 1:26:27 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#227]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=546106 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 2:32:40 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#235]: checking /home/jenkins/basis_rtl_regression/build on oryx / waiting for oryx id=556575 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 2:32:40 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#234]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=556578 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 2:32:40 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#225]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=556579 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 2:51:49 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#238]: checking /home/jenkins/basis_rtl_regression/build on oryx / waiting for oryx id=559586 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 2:51:49 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#236]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=559587 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 2:51:49 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#230]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=559588 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 3:00:59 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#232]: checking /home/jenkins/basis_rtl_regression/build on oryx / waiting for oryx id=561098 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 3:00:59 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#225]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=561099 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 3:00:59 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#221]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=561100 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 3:03:24 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#242]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=561455 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 3:03:24 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#245]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=561456 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 3:03:24 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#228]: checking /home/jenkins/basis_rtl_regression/build on oryx / waiting for oryx id=561457 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          

          Here is the jenkins.log, full of exceptions:

          jenkins.log

          P.S. I haven't interrupted the build yet, so I can collect some more info if you need any. I'll fetch files from the nodes later.


          Andrew a added a comment - - edited

          A little follow-up. Carefully inspecting the Zabbix logs, it turned out that the issue occurred while we were performing the routine nightly backups of the NFS home directories. iowait was still very low, but a little higher than usual. We didn't hit this bug every night, but almost always on the second night. My hypothesis: since the builds run on the NFS share, the backup may have caused filesystem lags. I have applied the following patch (600 s is quite an overkill, but let's see if it helps) and restarted the regression.

          +++ b/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java
          @@ -451,7 +451,7 @@ public abstract class DurableTaskStep extends Step {
                           return; // slave not yet ready, wait for another day
                       }
                       TaskListener listener = listener();
          -            try (Timeout timeout = Timeout.limit(10, TimeUnit.SECONDS)) {
          +            try (Timeout timeout = Timeout.limit(600, TimeUnit.SECONDS)) {
                           if (watching) {
                               Integer exitCode = controller.exitStatus(workspace, launcher(), listener);
                               if (exitCode == null) {
          
          

          A question to the other folks experiencing the issue: how much RAM/disk I/O do your processes use? Could it be that a huge memory-hungry process, when terminated, freezes for a while (>10 seconds) while syncing data to disk?

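          As a side note, one way to test the NFS-lag hypothesis is to time the same kind of workspace check that DurableTaskStep performs and flag anything slower than the 10-second window. The probe below is only a sketch: the workspace path is taken from the logs above, while the polling interval and threshold are assumptions, and none of it is part of the plugin.

          import java.nio.file.Files;
          import java.nio.file.Path;
          import java.nio.file.Paths;

          // Standalone probe: repeatedly time a directory check on the NFS-backed workspace
          // and report any call that would have exceeded a 10-second timeout window.
          public class NfsLatencyProbe {
              public static void main(String[] args) throws InterruptedException {
                  Path workspace = Paths.get("/home/jenkins/basis_rtl_regression/build");
                  while (true) {
                      long start = System.nanoTime();
                      boolean isDir = Files.isDirectory(workspace);   // roughly the kind of check the step runs remotely
                      long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
                      if (elapsedMs > 10_000L) {
                          System.out.printf("SLOW: isDirectory=%s took %d ms%n", isDir, elapsedMs);
                      }
                      Thread.sleep(5_000L);                           // poll every 5 seconds
                  }
              }
          }

          Running something like this on the agent that mounts the share during the backup window would show whether individual file operations ever stall past the 10-second limit.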

          Devin Nusbaum added a comment -

          ncrmnt Are you able to upload the content of the custom logger for org.jenkinsci.plugins.workflow.support.concurrent.Timeout (or was it blank?)? Custom loggers are not included in the normal jenkins.log, so you'll have to add it separately. Along with your patch, it could be interesting to add another custom logger at FINE level for org.jenkinsci.plugins.workflow.steps.durable_task.DurableTask.


          Ian Cotton added a comment -

          I have the custom logger turned on. I have just added additional logging for DurableTask, so there are no results from that yet, only from the Timeout class. We are getting many exceptions like the one below. Our Jenkins master runs on an NFS file system, which ought to be reasonably quick but apparently may have issues.

          Interrupting Thread[org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep 78969: checking /var/lib/jenkins/workspace/anches_ert_gw-ert-custard-3-GARHGYOSDUCJ4EHKSORKR3E2RAFWD3A5ZVBAKP6LQ5ZKQZBMWUQA on RedHatSlave3 / waiting for RedHatSlave3 id=14810179,5,main] after 10 SECONDS
          java.lang.Throwable
          at java.lang.Object.wait(Native Method)
          at hudson.remoting.Request.call(Request.java:169)
          at hudson.remoting.Channel.call(Channel.java:904)
          at hudson.FilePath.act(FilePath.java:987)
          at hudson.FilePath.act(FilePath.java:976)
          at hudson.FilePath.isDirectory(FilePath.java:1514)
          at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.getWorkspace(DurableTaskStep.java:193)
          at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:306)
          at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:289)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at java.lang.Thread.run(Thread.java:748)


          Devin Nusbaum added a comment - - edited

          icotton64 Thanks for adding the stack trace! If ncrmnt is seeing identical stack traces, then the patch they are testing will not have any effect, because the timeout that is firing is the one in DurableTaskStep#getWorkspace, before the timeout in DurableTaskStep#check itself. It could be worth increasing that timeout just to see whether it changes the behavior (my initial guess is that if we don't get a response after 10 seconds we will never get one), but given that the stuck operation is a remoting call, my guess is that this is a networking issue; perhaps a problem in Jenkins' networking stack (remoting) rather than the network itself, if your system monitoring shows no network issues. Did you see any network-related issues on the machine running Jenkins around the time of the problem?
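
          To make the mechanism concrete: the traces above are what appears when a helper interrupts the calling thread because a blocking remote call has not returned within the limit. The following is a toy, self-contained sketch of that interrupt-after-limit pattern; the 2-second limit and the sleep are placeholders, and this is not the plugin's actual code.

          import java.util.concurrent.Executors;
          import java.util.concurrent.ScheduledExecutorService;
          import java.util.concurrent.ScheduledFuture;
          import java.util.concurrent.TimeUnit;

          // Toy illustration of the watchdog behaviour: interrupt the caller if a
          // blocking "remote" call is still stuck after the limit.
          public class WatchdogTimeoutDemo {
              public static void main(String[] args) {
                  ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
                  Thread caller = Thread.currentThread();
                  ScheduledFuture<?> interrupter =
                          watchdog.schedule(caller::interrupt, 2, TimeUnit.SECONDS);
                  try {
                      Thread.sleep(10_000); // stands in for a blocked remoting call such as FilePath.isDirectory()
                      System.out.println("remote call returned in time");
                  } catch (InterruptedException x) {
                      System.out.println("remote call interrupted after the limit, as in the traces above");
                  } finally {
                      interrupter.cancel(false);
                      watchdog.shutdownNow();
                  }
              }
          }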


          Ian Cotton added a comment -

          Thanks dnusbaum. No, I didn't observe any networking-related issues. I monitor both machines with Nagios and Prometheus; the Prometheus scrape granularity is one minute, and it didn't record any disconnect.


          Andrew a added a comment -

          dnusbaum, yes, I've just included those in the code section; see my comments above (apparently JIRA has hidden them already). All of them are related to file operations.

          Anyway, I think I have resolved the issue on my setup with this patch (600 seconds is overkill, but I just had to test it):

           

          @@ -311,7 +311,7 @@ public abstract class DurableTaskStep extends Step {
                           }
                       }
                       boolean directory;
          -            try (Timeout timeout = Timeout.limit(10, TimeUnit.SECONDS)) {
          +            try (Timeout timeout = Timeout.limit(600, TimeUnit.SECONDS)) {
                           directory = ws.isDirectory();
                       } catch (Exception x) {
                           getWorkspaceProblem(x);
          @@ -451,7 +451,7 @@ public abstract class DurableTaskStep extends Step {
                           return; // slave not yet ready, wait for another day
                       }
                       TaskListener listener = listener();
          -            try (Timeout timeout = Timeout.limit(10, TimeUnit.SECONDS)) {
          +            try (Timeout timeout = Timeout.limit(600, TimeUnit.SECONDS)) {
                           if (watching) {
                               Integer exitCode = controller.exitStatus(workspace, launcher(), listener);
                               if (exitCode == null) {
          
          

          It seems that the 10-second timeout for file-based operations is what causes the problem under heavy disk/network load. During all three runs with FINE exception logging, the timeouts occurred during routine backups.

          I also found the following workaround: a restart of the master causes the pipeline to resume and correctly reconnect with all the nodes, effectively unfreezing the job.

           


          Ian Cotton added a comment -

          I now get the following exception from the DurableTaskStep log:

          Interrupting Thread[org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep 88150: checking /var/lib/jenkins/workspace/GW_GW_Pipeline_Build_trunk-KY5NRFS7BWLOEMDFRPJGO5IAG4Y6DFVRUT333ENYFD5NOOFQINTA on RedHatSlave10 / waiting for RedHatSlave10 id=16567391,5,main] after 10 SECONDS
          java.lang.Throwable
          at java.lang.Object.wait(Native Method)
          at hudson.remoting.Request.call(Request.java:169)
          at hudson.remoting.Channel.call(Channel.java:904)
          at hudson.FilePath.act(FilePath.java:987)
          at hudson.FilePath.act(FilePath.java:976)
          at hudson.FilePath.isDirectory(FilePath.java:1514)
          at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.getWorkspace(DurableTaskStep.java:193)
          at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:306)
          at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:289)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at java.lang.Thread.run(Thread.java:748)

           

          So it looks like it is timing out while calling the node.


          Andrew a added a comment - - edited

          icotton64, can you give my proposed patch with the increased timeouts a try?

          Here's my pre-compiled version of the plugin for your convenience (I hope sharing a precompiled plugin is allowed here; if not, sorry): workflow-durable-task-step.hpi


          Devin Nusbaum added a comment - - edited

          ncrmnt Thanks for working on the issue! Any chance you are able to play with the timeout to figure out a lower bound that works for you? I would be fine increasing the limit to maybe 30-60 seconds by default and providing a system property to increase it further, but I'm not sure about raising the default beyond that.

           


          Ian Cotton added a comment -

          Sorry ncrmnt, the system we are seeing the issue on is our production system, and I really don't want to risk installing a pre-compiled plugin on it.


          Andrew a added a comment - - edited

          dnusbaum I'll recompile the plugin with a 60-second timeout and fire up the next regression tomorrow; expect results by Friday or Saturday. A facility to override this timeout would be very useful, because (according to Zabbix) the iowait fluctuations were barely noticeable during the backup. At higher loads things will get far worse, so I'd also put a note somewhere in the README/TROUBLESHOOTING section.

          icotton64 No problem, I fully understand. Can you recompile it yourself with a 30-second timeout (see my patch above) and give it a try? That way we'll get dnusbaum some data about a suitable timeout faster.


          Ian Cotton added a comment -

          ncrmnt I am not able to build the plugin at the moment; my Jenkins server doesn't have the required plugins. We are setting up some new servers and hopefully I can use one of them.


          Devin Nusbaum added a comment -

          For now, I went ahead and filed https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/90 to allow the timeout to be configured via a system property. Feel free to pull down the incremental build of that PR for testing once it is complete (if you are already running workflow-durable-task-step 2.26), or to review/comment on the PR. Thanks!


          Ian Cotton added a comment -

          I realised we are running an old version of the plugin (2.17). Unfortunately we are also running Jenkins 2.73.2, and plugin versions beyond 2.22 require 2.73.3. We can try the plugin on our newer replacement Jenkins instances, but we don't see the failing builds there. We are working to improve matters, but I don't think we will be in a position to test this properly for at least a few days.


          Andrew a added a comment - - edited

          I've just run a few regressions with that insanely large timeout, and the bad news is that the problem didn't completely go away. Moreover, two different problems have emerged (I'm not really sure whether they are directly related to this issue or whether I should open a new ticket; posting everything here for now).

          First one:
          I'm now seeing a pipeline freezing AFTER all the tasks under the parallel statement have completed. A restart of Jenkins causes some of the steps under parallel to be rerun with the following warning:

          Queue item for node block in SoC » RTL_REGRESSION #255 is missing (perhaps JENKINS-34281); rescheduling
          

          But the pipeline completes. I'm also seeing runaway simulation processes that have to be killed by hand. They kept running after the pipeline had completed, perhaps due to the master node restart (and thus preventing further builds in that workspace). I'm not yet sure how to debug this one.

           

          Second one:

          In an attempt to mitigate another issue (this time with an old ctest on RHEL not always handling timeouts correctly), I've added a timeout() block inside parallel, and that exposed another filesystem/timeout problem:

           Cancelling nested steps due to timeout
           Sending interrupt signal to process
           Cancelling nested steps due to timeout
           After 10s process did not stop
           java.nio.file.FileSystemException: /home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/.nfs0000000029ee028d00002716: Device or resource busy
           at sun.nio.fs.UnixException.translateToIOException(Unknown Source)
           at sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)
           at sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)
           at sun.nio.fs.UnixFileSystemProvider.implDelete(Unknown Source)
           at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(Unknown Source)
           at java.nio.file.Files.deleteIfExists(Unknown Source)
           at hudson.Util.tryOnceDeleteFile(Util.java:316)
           at hudson.Util.deleteFile(Util.java:272)
           Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to taruca
           at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1741)
           at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
           at hudson.remoting.Channel.call(Channel.java:955)
           at hudson.FilePath.act(FilePath.java:1070)
           at hudson.FilePath.act(FilePath.java:1059)
           at hudson.FilePath.deleteRecursive(FilePath.java:1266)
           at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.cleanup(FileMonitoringTask.java:340)
           at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution$1.run(DurableTaskStep.java:382)
           at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
           at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:748)
           Caused: java.io.IOException: Unable to delete '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/.nfs0000000029ee028d00002716'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
           at hudson.Util.deleteFile(Util.java:277)
           at hudson.FilePath.deleteRecursive(FilePath.java:1303)
           at hudson.FilePath.deleteContentsRecursive(FilePath.java:1312)
           at hudson.FilePath.deleteRecursive(FilePath.java:1302)
           at hudson.FilePath.access$1600(FilePath.java:211)
           at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1272)
           at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1268)
           at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3084)
           at hudson.remoting.UserRequest.perform(UserRequest.java:212)
           at hudson.remoting.UserRequest.perform(UserRequest.java:54)
           at hudson.remoting.Request$2.run(Request.java:369)
           at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
           at java.util.concurrent.FutureTask.run(Unknown Source)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
           at java.lang.Thread.run(Unknown Source)
           Sending interrupt signal to process
           After 10s process did not stop
           java.nio.file.FileSystemException: /home/jenkins/ws/BootromSignoff/build@tmp/durable-e53cb05b/.nfs0000000029ee0a9d00010597: Device or resource busy
           at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
           at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
           at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
           at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:244)
           at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
           at java.nio.file.Files.deleteIfExists(Files.java:1165)
           at hudson.Util.tryOnceDeleteFile(Util.java:316)
           at hudson.Util.deleteFile(Util.java:272)
           Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to oryx
           at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1741)
           at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
           at hudson.remoting.Channel.call(Channel.java:955)
           at hudson.FilePath.act(FilePath.java:1070)
           at hudson.FilePath.act(FilePath.java:1059)
           at hudson.FilePath.deleteRecursive(FilePath.java:1266)
           at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.cleanup(FileMonitoringTask.java:340)
           at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution$1.run(DurableTaskStep.java:382)
           at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
           at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           Caused: java.io.IOException: Unable to delete '/home/jenkins/ws/BootromSignoff/build@tmp/durable-e53cb05b/.nfs0000000029ee0a9d00010597'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
           at hudson.Util.deleteFile(Util.java:277)
           at hudson.FilePath.deleteRecursive(FilePath.java:1303)
           at hudson.FilePath.deleteContentsRecursive(FilePath.java:1312)
           at hudson.FilePath.deleteRecursive(FilePath.java:1302)
           at hudson.FilePath.access$1600(FilePath.java:211)
           at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1272)
           at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1268)
           at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3084)
           at hudson.remoting.UserRequest.perform(UserRequest.java:212)
           at hudson.remoting.UserRequest.perform(UserRequest.java:54)
           at hudson.remoting.Request$2.run(Request.java:369)
           at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
           at java.lang.Thread.run(Thread.java:748)
           [Pipeline] }
           [Pipeline] }
           [Pipeline] // timeout
           [Pipeline] // timeout
           [Pipeline] echo
           EXCEPTION: org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
           [Pipeline] echo
           CTEST BUG: Ctest didn't honor timeout setting?
           [Pipeline] }
           [Pipeline] echo
           EXCEPTION: org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
           [Pipeline] echo
           CTEST BUG: Ctest didn't honor timeout setting?
           [Pipeline] }
           [Pipeline] // dir
           [Pipeline] // dir
           [Pipeline] }
           [Pipeline] }
           [Pipeline] // node
           [Pipeline] // node
           [Pipeline] }
           [Pipeline] }
           sh: line 1: 104849 Terminated sleep 3
           sh: line 1: 163732 Terminated { while [ ( -d /proc/$pid -o ! -d /proc/$$ ) -a -d '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03' -a ! -f '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/jenkins-result.txt' ]; do
           touch '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/jenkins-log.txt'; sleep 3;
           done; }
           sh: line 1: 163733 Terminated JENKINS_SERVER_COOKIE=$jsc '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/script.sh' > '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/jenkins-log.txt' 2>&1
           1/1 Test #56: rumboot-default-rumboot-Production-bootrom-integration-no-selftest-host-easter-egg ...***Failed 20250.70 sec
          
          
          

          It looks like when Jenkins tries to kill off the simulation, it takes much more than 10 seconds (perhaps because the simulator interprets the signal as a crash and starts collecting logs/core dumps, which takes a lot of time). I'll try to patch this timeout as well and see how it goes.

          P.S. I've just updated Jenkins and all plugins, updated workflow-durable-task-step-plugin from git, and applied the following patch. I hope 60s timeouts will do nicely.

          diff --git a/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java b/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java
          index 9b449d7..b338690 100644
          --- a/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java
          +++ b/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java
          @@ -311,7 +311,7 @@ public abstract class DurableTaskStep extends Step {
                           }
                       }
                       boolean directory;
          -            try (Timeout timeout = Timeout.limit(10, TimeUnit.SECONDS)) {
          +            try (Timeout timeout = Timeout.limit(60, TimeUnit.SECONDS)) {
                           directory = ws.isDirectory();
                       } catch (Exception x) {
                           getWorkspaceProblem(x);
          @@ -374,7 +374,7 @@ public abstract class DurableTaskStep extends Step {
                                   stopTask = null;
                                   if (recurrencePeriod > 0) {
                                       recurrencePeriod = 0;
          -                            listener().getLogger().println("After 10s process did not stop");
          +                            listener().getLogger().println("After 60s process did not stop");
                                       getContext().onFailure(cause);
                                       try {
                                           FilePath workspace = getWorkspace();
          @@ -386,7 +386,7 @@ public abstract class DurableTaskStep extends Step {
                                       }
                                   }
                               }
          -                }, 10, TimeUnit.SECONDS);
          +                }, 60, TimeUnit.SECONDS);
                           controller.stop(workspace, launcher());
                       } else {
                           listener().getLogger().println("Could not connect to " + node + " to send interrupt signal to process");
          @@ -451,7 +451,7 @@ public abstract class DurableTaskStep extends Step {
                           return; // slave not yet ready, wait for another day
                       }
                       TaskListener listener = listener();
          -            try (Timeout timeout = Timeout.limit(10, TimeUnit.SECONDS)) {
          +            try (Timeout timeout = Timeout.limit(60, TimeUnit.SECONDS)) {
                           if (watching) {
                               Integer exitCode = controller.exitStatus(workspace, launcher(), listener);
                               if (exitCode == null) {
          
          


          Devin Nusbaum added a comment -

          ncrmnt did those timeouts end up helping? If so, I can roll them up into https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/90 and release that so they can be configured without needing to run custom code.


          Andrew a added a comment -

          dnusbaum Sorry for not reporting earlier. 60 seconds seems to have fixed all the issues for me. The rest of the problems were due to ctest (and our NUMA scheduler, wrapped within it in front of the actual simulator) not dying correctly when Jenkins asked them to.


          Devin Nusbaum added a comment -

          ncrmnt No problem! I will move forward with my PR (adding an additional timeout), thanks so much for interactively debugging the issue!


          Devin Nusbaum added a comment -

          As of version 2.29 of the Pipeline Nodes and Process Plugin, the default timeout for remote calls is 20 seconds, and the value can be configured using the system property org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.REMOTE_TIMEOUT.

          I am marking this ticket as closed, since that addresses the main cause of the issue identified in the discussion in the comments (thanks ncrmnt!). If this issue is still occurring frequently for someone after increasing that value, please comment and we can investigate further.
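
          For administrators who only need to raise the limit, the property is set on the controller's JVM, for example by adding -Dorg.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.REMOTE_TIMEOUT=60 to the Jenkins startup arguments. The snippet below is a minimal, standalone illustration of the usual Long.getLong override pattern using the documented property name and 20-second default; it is not the plugin's exact implementation.

          import java.util.concurrent.TimeUnit;

          // Standalone sketch of the system-property override pattern: the documented
          // property name and the 20-second default come from the comment above; the
          // surrounding class is purely illustrative.
          public class RemoteTimeoutDemo {
              private static final long REMOTE_TIMEOUT_SECONDS = Long.getLong(
                      "org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.REMOTE_TIMEOUT",
                      20L);

              public static void main(String[] args) {
                  System.out.println("Effective remote-call timeout: "
                          + TimeUnit.SECONDS.toMillis(REMOTE_TIMEOUT_SECONDS) + " ms");
              }
          }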


          Tony Poerio added a comment -

          Hi dnusbaum – when will this fix be released? My team needs it (or at least we think we do).

          I see that the message above is from about 9 months back at the time of writing.

          As of right now, the current release is only `2.200` (released 10-14-2019).

          However, the post immediately above references version `2.29`.

          Is it possible that this update is already present in `2.200`?

          If not, when will it become available in a stable release? Many thanks for the help.


            Assignee: Devin Nusbaum (dnusbaum)
            Reporter: Rick Liu (totoroliu)
            Votes: 19
            Watchers: 30