
[JENKINS-46507] Parallel Pipeline random java.lang.InterruptedException

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Released As: workflow-durable-task-step 2.29

      In my pipeline job, I sometimes randomly receive the java.lang.InterruptedException below:

      java.lang.InterruptedException
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1302)
      	at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:275)
      	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:111)
      	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadGroupSynchronously(CpsStepContext.java:248)
      	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadSynchronously(CpsStepContext.java:237)
      	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.doGet(CpsStepContext.java:294)
      	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:61)
      	at org.jenkinsci.plugins.workflow.steps.StepDescriptor.checkContextAvailability(StepDescriptor.java:251)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:179)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:126)
      	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:108)
      	at org.jenkinsci.plugins.workflow.cps.CpsScript.println(CpsScript.java:207)
      	at org.jenkinsci.plugins.workflow.cps.CpsScript.print(CpsScript.java:202)
      	at sun.reflect.GeneratedMethodAccessor103253.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
      ....
      ....
      

      Please refer to the attachments for the full console log and the pipeline Jenkinsfile code. A minimal sketch of the pipeline shape follows below.
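
      For orientation, here is a minimal scripted-pipeline sketch of the general shape involved. This is an illustration only: the branch names, node label, and shell command are hypothetical placeholders, and the real Jenkinsfile is in the attachments.

      // Hedged sketch: names and commands below are placeholders, not the attached Jenkinsfile.
      def branches = [:]
      ['unit', 'integration'].each { name ->        // hypothetical branch names
          branches[name] = {
              node('linux') {                       // hypothetical node label
                  echo "starting ${name}"           // the stack trace above surfaces from a println/echo call
                  sh "./run_tests.sh ${name}"       // placeholder for the real test command
              }
          }
      }
      parallel branches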

        1. consoleText_ERROR.txt
          95 kB
        2. hs_err_pid239040.log
          84 kB
        3. jenkins.log
          266 kB
        4. Jenkinsfile
          7 kB
        5. Jenkinsfile.txt
          6 kB
        6. stuff.tgz
          310 kB
        7. workflow-durable-task-step.hpi
          85 kB

          [JENKINS-46507] Parallel Pipeline random java.lang.InterruptedException

          Andrew a added a comment - edited

          dnusbaum I'll recompile the plugin with a 60 second timeout and fire up the next regression tomorrow; expect results by Friday or Saturday. A facility to override this timeout would be very useful, because (according to Zabbix) the iowait fluctuations were barely noticeable during the backup. At higher loads things will get much worse, so I'd also put a note somewhere in the README/TROUBLESHOOTING section.

          icotton64 No problem, I fully understand. Can you recompile it yourself with a 30 second timeout (see my patch above) and give it a try? That way we'll get dnusbaum some results about a suitable timeout faster.


          Ian Cotton added a comment -

          ncrmnt I am not able to build the plugin at the moment; my Jenkins server doesn't have the required plugins. We are setting up some new servers, and hopefully I can use one of them.


          Devin Nusbaum added a comment -

          For now, I went ahead and filed https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/90 to allow the timeout to be configured by system property. Feel free to pull down the incremental build of that PR once it is complete for testing if you are already running workflow-durable-task-step 2.26, or review/comment on the PR. Thanks!


          Ian Cotton added a comment -

          I realised we are running an old version of the plugin (2.17). Unfortunately we are also running Jenkins 2.73.2, and plugin versions beyond 2.22 require 2.73.3. We can try the plugin on our newer replacement Jenkins instances, but the failing builds don't run there. We are working to improve matters, but I don't think we will be in a position to test this properly for at least a few days.


          Andrew a added a comment - edited

          I've just run a few regressions with that insanely large timeout, and the bad news is that the problem didn't completely go away. Moreover, two different problems have emerged (I'm not really sure whether they are directly related to this issue or whether I should open a new ticket; posting everything here for now).

          First one:
          I'm now seeing a pipeline freezing AFTER all the tasks under the parallel statement have completed. A restart of Jenkins causes some of the steps under parallel to be rerun with the following warning:

          Queue item for node block in SoC » RTL_REGRESSION #255 is missing (perhaps JENKINS-34281); rescheduling
          

          But the pipeline completes. I'm also seeing runaway simulation processes that have to be killed by hand. Those kept running after the pipeline had completed, perhaps due to a master node restart, and thus prevented further builds in that workspace. I'm not yet sure how to debug this one.

           

          Second one:

          In an attempt to mitigate another issue (an old ctest on RHEL not always handling timeouts correctly), I've added a timeout() block inside parallel, and that exposed another filesystem/timeout problem:

           Cancelling nested steps due to timeout
           Sending interrupt signal to process
           Cancelling nested steps due to timeout
           After 10s process did not stop
           java.nio.file.FileSystemException: /home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/.nfs0000000029ee028d00002716: Device or resource busy
           at sun.nio.fs.UnixException.translateToIOException(Unknown Source)
           at sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)
           at sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)
           at sun.nio.fs.UnixFileSystemProvider.implDelete(Unknown Source)
           at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(Unknown Source)
           at java.nio.file.Files.deleteIfExists(Unknown Source)
           at hudson.Util.tryOnceDeleteFile(Util.java:316)
           at hudson.Util.deleteFile(Util.java:272)
           Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to taruca
           at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1741)
           at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
           at hudson.remoting.Channel.call(Channel.java:955)
           at hudson.FilePath.act(FilePath.java:1070)
           at hudson.FilePath.act(FilePath.java:1059)
           at hudson.FilePath.deleteRecursive(FilePath.java:1266)
           at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.cleanup(FileMonitoringTask.java:340)
           at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution$1.run(DurableTaskStep.java:382)
           at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
           at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:748)
           Caused: java.io.IOException: Unable to delete '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/.nfs0000000029ee028d00002716'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
           at hudson.Util.deleteFile(Util.java:277)
           at hudson.FilePath.deleteRecursive(FilePath.java:1303)
           at hudson.FilePath.deleteContentsRecursive(FilePath.java:1312)
           at hudson.FilePath.deleteRecursive(FilePath.java:1302)
           at hudson.FilePath.access$1600(FilePath.java:211)
           at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1272)
           at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1268)
           at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3084)
           at hudson.remoting.UserRequest.perform(UserRequest.java:212)
           at hudson.remoting.UserRequest.perform(UserRequest.java:54)
           at hudson.remoting.Request$2.run(Request.java:369)
           at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
           at java.util.concurrent.FutureTask.run(Unknown Source)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
           at java.lang.Thread.run(Unknown Source)
           Sending interrupt signal to process
           After 10s process did not stop
           java.nio.file.FileSystemException: /home/jenkins/ws/BootromSignoff/build@tmp/durable-e53cb05b/.nfs0000000029ee0a9d00010597: Device or resource busy
           at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
           at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
           at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
           at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:244)
           at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
           at java.nio.file.Files.deleteIfExists(Files.java:1165)
           at hudson.Util.tryOnceDeleteFile(Util.java:316)
           at hudson.Util.deleteFile(Util.java:272)
           Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to oryx
           at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1741)
           at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
           at hudson.remoting.Channel.call(Channel.java:955)
           at hudson.FilePath.act(FilePath.java:1070)
           at hudson.FilePath.act(FilePath.java:1059)
           at hudson.FilePath.deleteRecursive(FilePath.java:1266)
           at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.cleanup(FileMonitoringTask.java:340)
           at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution$1.run(DurableTaskStep.java:382)
           at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
           at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           Caused: java.io.IOException: Unable to delete '/home/jenkins/ws/BootromSignoff/build@tmp/durable-e53cb05b/.nfs0000000029ee0a9d00010597'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
           at hudson.Util.deleteFile(Util.java:277)
           at hudson.FilePath.deleteRecursive(FilePath.java:1303)
           at hudson.FilePath.deleteContentsRecursive(FilePath.java:1312)
           at hudson.FilePath.deleteRecursive(FilePath.java:1302)
           at hudson.FilePath.access$1600(FilePath.java:211)
           at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1272)
           at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1268)
           at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3084)
           at hudson.remoting.UserRequest.perform(UserRequest.java:212)
           at hudson.remoting.UserRequest.perform(UserRequest.java:54)
           at hudson.remoting.Request$2.run(Request.java:369)
           at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
           at java.lang.Thread.run(Thread.java:748)
           [Pipeline] }
           [Pipeline] }
           [Pipeline] // timeout
           [Pipeline] // timeout
           [Pipeline] echo
           EXCEPTION: org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
           [Pipeline] echo
           CTEST BUG: Ctest didn't honor timeout setting?
           [Pipeline] }
           [Pipeline] echo
           EXCEPTION: org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
           [Pipeline] echo
           CTEST BUG: Ctest didn't honor timeout setting?
           [Pipeline] }
           [Pipeline] // dir
           [Pipeline] // dir
           [Pipeline] }
           [Pipeline] }
           [Pipeline] // node
           [Pipeline] // node
           [Pipeline] }
           [Pipeline] }
           sh: line 1: 104849 Terminated sleep 3
           sh: line 1: 163732 Terminated { while [ ( -d /proc/$pid -o ! -d /proc/$$ ) -a -d '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03' -a ! -f '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/jenkins-result.txt' ]; do
             touch '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/jenkins-log.txt'; sleep 3;
           done; }
           sh: line 1: 163733 Terminated JENKINS_SERVER_COOKIE=$jsc '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/script.sh' > '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/jenkins-log.txt' 2>&1
           1/1 Test #56: rumboot-default-rumboot-Production-bootrom-integration-no-selftest-host-easter-egg ...***Failed 20250.70 sec
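
          For reference, the timeout() nesting described above looks roughly like the sketch below. The branch name, node label, timeout value, and ctest invocation are illustrative placeholders, not the actual Jenkinsfile; the point is only that timeout() wraps the shell step inside each parallel branch (matching the // timeout, // dir, // node block ends in the log above).

           // Hedged sketch of the timeout()-inside-parallel structure; all names are placeholders.
           def branches = [:]
           branches['regression'] = {
               node('rhel') {                                // hypothetical node label
                   dir('build') {
                       timeout(time: 30, unit: 'MINUTES') {  // the timeout() block added to work around old ctest
                           sh 'ctest --output-on-failure'    // illustrative; not the real test invocation
                       }
                   }
               }
           }
           parallel branches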
          
          
          

          It looks like when Jenkins tries to kill off the simulation, it takes way more than 10 seconds (perhaps because the simulator interprets the signal as a crash and starts collecting logs/core dumps, which takes a lot of time). I'll try to patch this timeout as well and see how it goes.

          P.S. I've just updated Jenkins and all plugins, built workflow-durable-task-step-plugin from git, and applied the following patch. I hope 60s timeouts will do nicely.

          diff --git a/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java b/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java
          index 9b449d7..b338690 100644
          --- a/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java
          +++ b/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java
          @@ -311,7 +311,7 @@ public abstract class DurableTaskStep extends Step {
                           }
                       }
                       boolean directory;
          -            try (Timeout timeout = Timeout.limit(10, TimeUnit.SECONDS)) {
          +            try (Timeout timeout = Timeout.limit(60, TimeUnit.SECONDS)) {
                           directory = ws.isDirectory();
                       } catch (Exception x) {
                           getWorkspaceProblem(x);
          @@ -374,7 +374,7 @@ public abstract class DurableTaskStep extends Step {
                                   stopTask = null;
                                   if (recurrencePeriod > 0) {
                                       recurrencePeriod = 0;
          -                            listener().getLogger().println("After 10s process did not stop");
          +                            listener().getLogger().println("After 60s process did not stop");
                                       getContext().onFailure(cause);
                                       try {
                                           FilePath workspace = getWorkspace();
          @@ -386,7 +386,7 @@ public abstract class DurableTaskStep extends Step {
                                       }
                                   }
                               }
          -                }, 10, TimeUnit.SECONDS);
          +                }, 60, TimeUnit.SECONDS);
                           controller.stop(workspace, launcher());
                       } else {
                           listener().getLogger().println("Could not connect to " + node + " to send interrupt signal to process");
          @@ -451,7 +451,7 @@ public abstract class DurableTaskStep extends Step {
                           return; // slave not yet ready, wait for another day
                       }
                       TaskListener listener = listener();
          -            try (Timeout timeout = Timeout.limit(10, TimeUnit.SECONDS)) {
          +            try (Timeout timeout = Timeout.limit(60, TimeUnit.SECONDS)) {
                           if (watching) {
                               Integer exitCode = controller.exitStatus(workspace, launcher(), listener);
                               if (exitCode == null) {
          
          


          Devin Nusbaum added a comment -

          ncrmnt did those timeouts end up helping? If so, I can roll them up into https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/90 and release that so they can be configured without needing to run custom code.


          Andrew a added a comment -

          dnusbaum Sorry for not reporting earlier. 60 seconds seems to have fixed all the issues for me. The rest of the problems were due to ctest (and our NUMA scheduler wrapper that sits between it and the actual simulator) not dying correctly when Jenkins asked them to.


          Devin Nusbaum added a comment -

          ncrmnt No problem! I will move forward with my PR (adding an additional timeout), thanks so much for interactively debugging the issue!


          Devin Nusbaum added a comment -

          As of version 2.29 of the Pipeline: Nodes and Processes Plugin, the default timeout for remote calls is 20 seconds, and the value can be configured using the system property org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.REMOTE_TIMEOUT.

          I am marking this ticket as closed, since that was identified in the comments as the main cause of the issue (thanks ncrmnt!). If this issue still occurs frequently for someone after increasing that value, please comment and we can investigate further.
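
          For anyone checking their own Jenkins instance, here is a hedged Script Console sketch for confirming which value the JVM currently sees for that property. It assumes the plugin reads the property at startup, so it is normally passed as a JVM argument such as -Dorg.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.REMOTE_TIMEOUT=60 rather than changed at runtime.

          // Script Console sketch: prints the current value of the REMOTE_TIMEOUT system property.
          def key = 'org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.REMOTE_TIMEOUT'
          println "${key} = ${System.getProperty(key) ?: '(unset; the plugin default of 20 seconds applies)'}"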


          Tony Poerio added a comment -

          Hi dnusbaum – when will this fix be released? My team needs it (or at least we think we do).

          I see that the message above is from about 9 months back at the time of writing. As of right now, the current release is only `2.200` (released 10-14-2019), yet the post immediately above references version `2.29`. Is it possible that this update is already present in `2.200`? If not, when will it become available in a stable release? Many thanks for the help.


            Assignee: Devin Nusbaum (dnusbaum)
            Reporter: Rick Liu (totoroliu)
            Votes: 19
            Watchers: 30
