JENKINS-53652: InterruptedException during slave workspace groovy clean up pipeline

      We use a declarative pipeline to wipe out all slaves' workspaces (attached here - Jenkinsfile).
      It does some processing and attempts to delete each workspace like this:

      FilePath fp = node.getRootPath().child("workspace");
      fp.deleteRecursive();

      It starts out working, but fails with an InterruptedException at some point:

      Started by timer
      Lightweight checkout support not available, falling back to full checkout.
      Checking out svn http://svndae.apama.com/um/branches/dev/jenkins2_2/build/change-management/jenkinsfiles/wipeout_slave_workspace into /FS/fslocal/jenkinsWorkspace/jobs/admin_wipeout_workspaces/workspace@script to read Jenkinsfile
      Updating http://svndae.apama.com/um/branches/dev/jenkins2_2/build/change-management/jenkinsfiles/wipeout_slave_workspace at revision '2018-09-19T01:47:00.338 +0200' --quiet
      At revision 112786

      No changes for http://svndae.apama.com/um/branches/dev/jenkins2_2/build/change-management/jenkinsfiles/wipeout_slave_workspace since the previous build
      Running in Durability level: MAX_SURVIVABILITY
      Loading library templates@branches/dev/jenkins2_2/build/change-management/jenkinsfiles/templates
      Opening connection to http://svndae.apama.com/um/
      Updating http://svndae.apama.com/um/branches/dev/jenkins2_2/build/change-management/jenkinsfiles/templates@112786 at revision 112786
      At revision 112786

      No changes for http://svndae.apama.com/um/branches/dev/jenkins2_2/build/change-management/jenkinsfiles/templates since the previous build
      [Pipeline] node
      Running on Jenkins in /FS/fslocal/jenkinsWorkspace/jobs/admin_wipeout_workspaces/workspace
      [Pipeline] {
      [Pipeline] stage
      [Pipeline] { (Declarative: Checkout SCM)
      [Pipeline] checkout
      Updating http://svndae.apama.com/um/branches/dev/jenkins2_2/build/change-management/jenkinsfiles/wipeout_slave_workspace at revision '2018-09-19T01:47:00.338 +0200' --quiet
      At revision 112786
      No changes for http://svndae.apama.com/um/branches/dev/jenkins2_2/build/change-management/jenkinsfiles/wipeout_slave_workspace since the previous build
      [Pipeline] }

      [Pipeline] // stage
      [Pipeline] withEnv
      [Pipeline] {
      [Pipeline] stage
      [Pipeline] { (Wipe Out Slave Workspaces)
      [Pipeline] script
      [Pipeline] {
      [Pipeline] echo
      Processing: daeosx109v04.eur.ad.sag
      [Pipeline] echo
      Wiping out: /Users/nirdevadm/workspace/workspace
      [Pipeline] echo
      ------------------
      [Pipeline] echo
      Processing: daeosx109v05.eur.ad.sag
      [Pipeline] echo
      Wiping out: /Users/nirdevadm/jenkinsWorkspace/workspace
      [Pipeline] echo
      ------------------
      [Pipeline] echo
      Processing: daeosx109v11.eur.ad.sag
      [Pipeline] echo
      Wiping out: /Users/nirdevadm/workspace/workspace
      [Pipeline] echo
      ------------------
      [Pipeline] echo
      Processing: sofum01.eur.ad.sag
      [Pipeline] echo
      Wiping out: /home/nirdevadm/jenkins/workspace
      [Pipeline] echo
      ------------------
      [Pipeline] echo
      Processing: umlinuxbuild10.eur.ad.sag
      [Pipeline] echo
      Wiping out: /FS/fslocal/jenkins_slave/workspace
      [Pipeline] echo
      ------------------
      [Pipeline] echo
      Processing: umlinuxbuild11.eur.ad.sag
      [Pipeline] echo
      Wiping out: /FS/fslocal/jenkins_slave/workspace
      [Pipeline] echo
      ------------------
      [Pipeline] echo
      Processing: umlinuxbuild12.eur.ad.sag
      [Pipeline] echo
      Wiping out: /FS/fslocal/jenkins_slave/workspace
      [Pipeline] echo
      ------------------
      [Pipeline] echo
      Processing: umlinuxbuild13.eur.ad.sag
      [Pipeline] echo
      Wiping out: /FS/fslocal/jenkins_slave/workspace
      [Pipeline] echo
      ------------------
      [Pipeline] echo
      Processing: umlinuxbuild14.eur.ad.sag
      [Pipeline] echo
      Wiping out: /FS/fslocal/jenkins_slave/workspace
      [Pipeline] echo
      ------------------
      [Pipeline] echo
      Processing: umlinuxbuild15.eur.ad.sag
      [Pipeline] echo
      Wiping out: /FS/fslocal/jenkins_slave/workspace
      [Pipeline] echo
      ------------------
      [Pipeline] echo
      Processing: umlinuxqa18.eur.ad.sag
      [Pipeline] echo
      Wiping out: /FS/fslocal/jenkins_slave/workspace
      [Pipeline] echo
      ------------------
      [Pipeline] echo
      Processing: umlinuxqa19.eur.ad.sag
      [Pipeline] echo
      Wiping out: /FS/fslocal/jenkins_slave/workspace
      [Pipeline] echo
      ------------------
      [Pipeline] echo
      Processing: umlinuxqa20.eur.ad.sag
      [Pipeline] echo
      Wiping out: /FS/fslocal/jenkins_slave/workspace
      [Pipeline] echo
      ------------------
      [Pipeline] echo
      Processing: umlinuxqa21.eur.ad.sag
      [Pipeline] echo
      Wiping out: /FS/fslocal/jenkins_slave/workspace
      [Pipeline] echo
      ------------------
      [Pipeline] echo
      Processing: umlinuxqa22.eur.ad.sag
      [Pipeline] echo
      Wiping out: /FS/fslocal/jenkins_slave/workspace
      [Pipeline] echo
      ------------------
      [Pipeline] echo
      Processing: umlinuxqa23.eur.ad.sag
      [Pipeline] echo
      Wiping out: /FS/fslocal/jenkins_slave/workspace
      [Pipeline] echo
      ------------------
      [Pipeline] echo
      Processing: umlinuxqa24.eur.ad.sag
      [Pipeline] echo
      Wiping out: /FS/fslocal/jenkins_slave/workspace
      [Pipeline] echo
      ------------------
      [Pipeline] echo
      Processing: umlinuxqa25.eur.ad.sag
      [Pipeline] echo
      Wiping out: /FS/fslocal/jenkins_slave/workspace
      [Pipeline] echo
      ------------------
      [Pipeline] echo
      Processing: umsuse03.eur.ad.sag
      [Pipeline] echo
      Wiping out: /FS/fslocal/jenkins/workspace
      [Pipeline] echo
      ------------------
      [Pipeline] echo
      Processing: umsuse04.eur.ad.sag
      [Pipeline] echo
      Wiping out: /FS/fslocal/jenkins/workspace
      [Pipeline] echo
      ------------------
      [Pipeline] echo
      Processing: umsuse05.eur.ad.sag
      [Pipeline] echo
      Wiping out: /FS/fslocal/jenkins/workspace
      [Pipeline] echo
      ------------------
      [Pipeline] echo
      Processing: umsuse06.eur.ad.sag
      [Pipeline] echo
      Wiping out: /FS/fslocal/jenkins/workspace
      [Pipeline] echo
      ------------------
      [Pipeline] echo
      Processing: umwindowstest01.eur.ad.sag
      [Pipeline] echo
      Wiping out: c:\users\nirdevadm\jenkins\workspace
      [Pipeline] echo
      ------------------
      [Pipeline] }

      [Pipeline] // script
      [Pipeline] }
      [Pipeline] // stage
      [Pipeline] }
      [Pipeline] // withEnv
      [Pipeline] }
      [Pipeline] // node
      [Pipeline] End of Pipeline
      java.lang.InterruptedException
      at java.lang.Object.wait(Native Method)
      at hudson.remoting.Request.call(Request.java:177)
      at hudson.remoting.Channel.call(Channel.java:954)
      at hudson.FilePath.act(FilePath.java:1070)
      at hudson.FilePath.act(FilePath.java:1059)
      at hudson.FilePath.deleteRecursive(FilePath.java:1266)
      at sun.reflect.GeneratedMethodAccessor2361.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
      at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
      at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1213)
      at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
      at org.codehaus.groovy.runtime.callsite.PojoMetaClassSite.call(PojoMetaClassSite.java:47)
      at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
      at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
      at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:157)
      at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:23)
      at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:133)
      at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:155)
      at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:159)
      at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:17)
      at WorkflowScript.run(WorkflowScript:32)
      at __cps.transform__(Native Method)
      at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:57)
      at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:109)
      at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixName(FunctionCallBlock.java:77)
      at sun.reflect.GeneratedMethodAccessor416.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
      at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
      at com.cloudbees.groovy.cps.Next.step(Next.java:83)
      at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
      at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
      at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:122)
      at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:261)
      at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
      at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$101(SandboxContinuable.java:34)
      at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.lambda$run0$0(SandboxContinuable.java:59)
      at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.GroovySandbox.runInSandbox(GroovySandbox.java:108)
      at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:58)
      at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:174)
      at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:332)
      at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$200(CpsThreadGroup.java:83)
      at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:244)
      at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:232)
      at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
      at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)
      Finished: FAILURE


          Vassilena Treneva added a comment -

          Attached a list of the active threads during the job execution (https://issues.jenkins-ci.org/secure/attachment/44535/ThreadsDuringTheRun.txt)

          Jeff Thompson added a comment -

          It's hard to say. Some thread sends an interrupt to the thread waiting on the request. It doesn't look like it comes from Remoting so it must be one of the many other threads in that list.

          Does this job complete? Or does it terminate part way through? Any indication of how long different parts of it take?


          Vassilena Treneva added a comment -

          Hey,

          The job does not complete. It fails.

          I cannot see a pattern in when it tends to fail. Sometimes it takes ~10 minutes, sometimes less, and it happens when working with different slaves.

          I have added timestamps to the execution and I am attaching the console output with timestamps (https://issues.jenkins-ci.org/secure/attachment/44537/ConsoleOutputWithTimestamps.txt) as well as the threads I see immediately after the job has failed (https://issues.jenkins-ci.org/secure/attachment/44536/ThreadsAfterJobFailure.txt).

          I was also wondering whether it makes sense to log a bit more data about the threads from the pipeline itself. I am not sure, however, what the execution context is or how to access the current thread...

          Jeff Thompson added a comment -

          Hmm ... I've really got no idea.

          If you have access to the agent systems, you could try looking there manually to see if there is anything interesting in the remoting logs. I doubt you'll find anything.

          It looks like something else is interrupting the call. Something running on the master. Given your log with timestamps, it's not like this job is taking a long time. None of the threads look suspicious to me but I don't have much familiarity with them.

          Maybe someone with more experience in other areas, such as pipeline, may have other ideas but I don't know who to suggest.


          Michael Mrozek added a comment -

          I have the same problem while doing the same kind of operation. I ended up wrapping the deleteRecursive call in a try/catch block to catch and ignore the InterruptedException, with a loop around that to keep trying until it works. It makes progress before getting interrupted each time, so eventually everything gets deleted, but for large directories it can get interrupted dozens of times.

          Vassilena Treneva added a comment -

          mrozekma, many thanks for this hint! I will try to adjust my script in the same way.

          I will attempt something like this:

          int counter = 0;
          while (counter < 10) {
              try {
                  fp.deleteRecursive();
                  break; // stop retrying once the delete succeeds
              } catch (InterruptedException e) {
                  counter++;
                  e.printStackTrace();
              }
          }

          I am not clear on what the exit condition of the while loop should be. What is yours? Just some counter, or some other condition? If you can share a snippet with me, that would be helpful.

          Vassilena Treneva added a comment -

          Okay, so I think I do not need the loop at all!
          This script works:

          import hudson.model.Node
          import hudson.model.Slave
          import jenkins.model.Jenkins
          import hudson.FilePath

          pipeline {
              triggers {
                  cron('H 1 * * *')
              }
              agent {
                  node {
                      label 'master'
                  }
              }
              options {
                  timeout(time: 3, unit: 'HOURS')
                  buildDiscarder(logRotator(daysToKeepStr: '30', numToKeepStr: '30', artifactDaysToKeepStr: '1', artifactNumToKeepStr: '30'))
                  timestamps()
              }
              stages {
                  stage('Wipe Out Slave Workspaces') {
                      steps {
                          script {
                              Jenkins jenkins = Jenkins.instance
                              def jenkinsNodes = jenkins.nodes
                              for (Node node in jenkinsNodes) {
                                  if (!node.getComputer().isOffline()) {
                                      if (node.getComputer().countBusy() == 0) {
                                          FilePath fp = node.getRootPath().child("workspace")
                                          println("Processing: " + node.getDisplayName())
                                          println("Wiping out: " + fp)
                                          def now = new Date()
                                          println now.format("yyyy-MM-dd HH:mm:ss", TimeZone.getTimeZone('GMT+2'))
                                          println("------------------")
                                          try {
                                              fp.deleteRecursive()
                                          } catch (InterruptedException e) {
                                              println "InterruptedException caught!"
                                              e.printStackTrace()
                                          }
                                      }
                                  }
                              }
                          }
                      }
                  }
              }
          }

          mrozekma Many thanks again!

          jthompson Thank you for your detailed answers and ideas and, most of all, for not ignoring my problem

          Vassilena Treneva added a comment -

          A side question (hope I am not spoiling the issue) – is there a way to check for and kill all remaining processes in the workspace without using OS commands, instead using the Jenkins Groovy API? Something like an alternative to kill -9 for all processes in the workspace…

          The check (!node.getComputer().isOffline()), together with countBusy() == 0, is good for figuring out whether the node is currently building, but if by any chance some process has been left alive, the script would eventually fail.
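          [Editorial note] Jenkins' built-in ProcessTreeKiller does something close to what is asked here: after a build it walks the agent's process table via hudson.util.ProcessTree and kills every process whose environment carries the build's marker variables. The sketch below shows one way this API could be driven from system Groovy; it is illustrative only — the KillByEnv class name and the choice of marker variable are assumptions, it needs administrator (script console) privileges, and killAll matches processes by environment variables, not by working directory, so it cannot target "all processes in a workspace" by path alone.

          ```groovy
          import hudson.model.Node
          import hudson.util.ProcessTree
          import jenkins.security.MasterToSlaveCallable

          // Hypothetical sketch: kill every process on the agent whose environment
          // contains all of the given marker variables. This is the same mechanism
          // Jenkins core uses (with BUILD_ID) to clean up after a build. The callable
          // runs on the agent's JVM, reached through the node's remoting channel.
          class KillByEnv extends MasterToSlaveCallable<Void, Exception> {
              final Map<String, String> marker
              KillByEnv(Map<String, String> marker) { this.marker = marker }
              Void call() throws Exception {
                  ProcessTree.get().killAll(marker)   // match by env var, not by path
                  return null
              }
          }

          // Usage, for some online hudson.model.Node 'node' (build id is illustrative):
          // node.getChannel().call(new KillByEnv([BUILD_ID: 'some-build-id']))
          ```

          Because matching is by inherited environment variables, this only finds processes that were launched with the marker set — which is exactly why Jenkins tags build processes with BUILD_ID in the first place.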

          Jeff Thompson added a comment -

          I'm glad you were able to figure something out. I would feel a lot more comfortable if we understood what is going on, what is causing the InterruptedException. If you figure anything out on that, please share. But, at least you figured out how to get things working well.

          As for your question on killing processes, you would likely get better results by asking that in the Jenkins users forums.

          It looks like you've got something working so I'm going to go ahead and close this issue.


          Michael Mrozek added a comment -

          vassilena: Without the loop you're preventing the exception from failing your build, but not actually deleting the directory you tried to delete with fp.deleteRecursive() – the operation failed. I didn't have any cap on the number of retries; it seems like it deletes some files each time before getting interrupted (it takes about 5 seconds to get interrupted according to the step timer), so eventually it succeeds given enough attempts. I did this:

          for (;;) {
              try {
                  dir.deleteRecursive()
                  break
              } catch (InterruptedException e) {
                  echo "Delete interrupted; retrying"
              } catch (e) {
                  echo "Failed to remove directory: $e"
                  break
              }
          }

          You could add a maximum number of attempts to the for loop if you wanted.

            Assignee: Jeff Thompson (jthompson)
            Reporter: Vassilena Treneva (vassilena)
            Votes: 1
            Watchers: 3