durable-task's BourneShellScript.launchWithCookie trips workflow-cps-plugin's 5-minute timeout

    • Type: Bug
    • Resolution: Not A Defect
    • Priority: Minor

      I ran a pipeline job which failed in this way:

      // [...] Start a VM and have the VM connect back as a swarm slave named "example-eff65ede"
      [Pipeline] node
      22:44:19 Running on example-eff65ede in /var/tmp/jenkins/workspace/devops-gate/master/install-os-usher
      [Pipeline] {
      [Pipeline] stage (Unset publishers)
      22:44:19 Entering stage Unset publishers
      22:44:19 Proceeding
      [Pipeline] sh
      [Pipeline] stage (Send Notifications)
      22:49:19 Using the ‘stage’ step without a block argument is deprecated
      22:49:19 Entering stage Send Notifications
      22:49:19 Proceeding
      22:49:19 Sending email to: example@example.com
      [Pipeline] emailext
      [Pipeline] }
      [Pipeline] // timestamps
      [Pipeline] End of Pipeline
      java.lang.InterruptedException
       at java.lang.Object.wait(Native Method)
       at hudson.remoting.Request.call(Request.java:147)
       at hudson.remoting.Channel.call(Channel.java:829)
       at hudson.FilePath.act(FilePath.java:987)
       at hudson.FilePath.act(FilePath.java:976)
       at hudson.FilePath.chmod(FilePath.java:1592)
       at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:101)
       at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:64)
       at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:167)
       at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:224)
       at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:150)
       at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:108)
       at sun.reflect.GeneratedMethodAccessor1275.invoke(Unknown Source)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
       at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
       at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1218)
       at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1027)
       at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42)
       at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
       at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
       at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:155)
       at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:23)
       at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:133)
       at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:153)
       at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:157)
       at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:127)
       at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:127)
       at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:127)
       at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:127)
       at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:127)
       at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:17)
       at WorkflowScript.run(WorkflowScript:65)
       at ___cps.transform___(Native Method)
       at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:57)
       at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:109)
       at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:82)
       at sun.reflect.GeneratedMethodAccessor609.invoke(Unknown Source)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
       at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:103)
       at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:82)
       at sun.reflect.GeneratedMethodAccessor609.invoke(Unknown Source)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
       at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:60)
       at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:109)
       at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:82)
       at sun.reflect.GeneratedMethodAccessor609.invoke(Unknown Source)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
       at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
       at com.cloudbees.groovy.cps.Next.step(Next.java:83)
       at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
       at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
       at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:122)
       at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:261)
       at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
       at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:19)
       at org.jenkinsci.plugins.workflow.cps.SandboxContinuable$1.call(SandboxContinuable.java:35)
       at org.jenkinsci.plugins.workflow.cps.SandboxContinuable$1.call(SandboxContinuable.java:32)
       at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.GroovySandbox.runInSandbox(GroovySandbox.java:108)
       at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:32)
       at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:174)
       at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:330)
       at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$100(CpsThreadGroup.java:82)
       at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:242)
       at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:230)
       at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
       at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
       at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)
       Finished: FAILURE
      

      This is interesting. First, the swarm plugin connected back to the master at 22:44:19, so we successfully enter the node block and start running on the slave. Then we enter an sh step. This step seems to take a very long time, and exactly 5 minutes later, at 22:49:19, we get an interrupt.

      At the time that we get the interrupt, we're in hudson.FilePath.chmod(FilePath.java:1592) via BourneShellScript.launchWithCookie via DurableTaskStep$Execution.start. Essentially, we are trying to start execution of the DurableTaskStep for a BourneShellScript, and as part of its launch method it tries to chmod the script to 0755. This seems to hang, or at the very least to take quite a long time. Five minutes later we get an interrupt.

      Where could this interrupt have come from? Further up in the stack trace, we see org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:174). Looking at the source of that method, there is a 5-minute timer on each instruction in the CPS VM thread introduced by JENKINS-32986. This seems likely to be what is causing the interrupt.
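
      To make the suspected mechanism concrete, here is a minimal, self-contained sketch of a watchdog that interrupts a thread when one unit of work runs too long. It is not the workflow-cps source: the class ChunkWatchdog, its methods, and the millisecond timeouts are hypothetical names chosen for illustration, and Thread.sleep stands in for the blocking Object.wait inside Request.call.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration of a per-chunk watchdog; not the actual CpsThread code. */
public class ChunkWatchdog {
    // Single daemon timer thread so the JVM can exit when the demo finishes.
    private static final ScheduledExecutorService TIMER =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "chunk-watchdog");
                t.setDaemon(true);
                return t;
            });

    /**
     * Runs body on the calling thread and interrupts that thread if body is
     * still running when the timeout expires.
     */
    public static void runWithWatchdog(Runnable body, long timeout, TimeUnit unit) {
        final Thread worker = Thread.currentThread();
        Runnable interrupter = worker::interrupt;
        ScheduledFuture<?> alarm = TIMER.schedule(interrupter, timeout, unit);
        try {
            body.run();
        } finally {
            // Disarm the timer once the work is done.
            alarm.cancel(false);
        }
    }

    /**
     * Simulates a blocking remote call that takes blockMillis and reports
     * whether the watchdog interrupted it.
     */
    public static boolean blockingCallWasInterrupted(long blockMillis, long timeoutMillis) {
        final boolean[] interrupted = {false};
        runWithWatchdog(() -> {
            try {
                Thread.sleep(blockMillis); // stand-in for Object.wait in Request.call
            } catch (InterruptedException e) {
                interrupted[0] = true;     // what the stack trace above shows
            }
        }, timeoutMillis, TimeUnit.MILLISECONDS);
        return interrupted[0];
    }
}
```

      In this sketch a call that outlasts the timeout surfaces as an InterruptedException inside the blocked call, which matches the shape of the trace above: the exception appears in whatever happened to be blocking when the timer fired, not in the code that owns the timer.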

      I read JENKINS-32986 as well as related bug JENKINS-42561. It seems that the suggestion there was to use DurableStep. But the thing is, I am using a DurableStep here already, in particular the durable-task-plugin and workflow-durable-task-step-plugin. The advice given in JENKINS-42561 was to implement Step directly, but it appears that is already done in BourneShellScript.

      Unfortunately, I don't have the logs on the swarm client side (I am trying to get those), but the VM was running on a network quite far away from the Jenkins master. It was also still booting up at the time we started the swarm client, so it's possible that performance was bad either due to the machine still booting up or the network being very poor (or both). Either way, it sounds like the Durable Task Plugin needs to account for the possibility that this work may take a long time and do it outside the CPS VM thread.

          [JENKINS-47006] durable-task's BourneShellScript.launchWithCookie trips workflow-cps-plugin's 5-minute timeout

          Basil Crow added a comment -

          jglick, since you introduced this timeout, would you mind commenting on what you think the correct course of action is here? The timeout is not customizable, so presumably one of the plugins I'm using (either the workflow-durable-task-step-plugin or durable-task-plugin) needs to be updated to do this slow operation outside of the CPS VM thread.

          Jesse Glick added a comment -

          FileMonitoringTask.launch is supposed to be quick under normal cases (this just starts the process, not waiting for it to do anything) so DurableTaskStep.Execution.start ought to apply an aggressive timeout on that and fail with a more meaningful error.

          As to why Request.call is hanging for >5m in your case, I have no idea—the Remoting channel is presumably badly broken.
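
          The "aggressive timeout" idea could look roughly like the sketch below: run the launch on a helper thread, wait a bounded time, and turn a timeout into an error that names the operation. This is only an illustration under assumptions. LaunchTimeout, launchWithTimeout, describeLaunch, and the error wording are hypothetical and are not part of durable-task or workflow-durable-task-step.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: bound a launch call and fail with a descriptive error. */
public class LaunchTimeout {
    // Daemon threads so a stuck launch cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "launch-helper");
        t.setDaemon(true);
        return t;
    });

    /** Runs launch on a helper thread and waits at most the given timeout for it. */
    public static <T> T launchWithTimeout(Callable<T> launch, long timeout, TimeUnit unit, String what)
            throws IOException {
        Future<T> future = POOL.submit(launch);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the helper thread
            throw new IOException(what + " did not complete within " + timeout + " " + unit
                    + "; the agent connection may be unresponsive", e);
        } catch (ExecutionException e) {
            throw new IOException(what + " failed", e.getCause());
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            throw new IOException(what + " was interrupted", e);
        }
    }

    /** Demo helper: simulates a launch taking workMillis and returns the result or the error text. */
    public static String describeLaunch(long workMillis, long timeoutMillis) {
        try {
            return launchWithTimeout(() -> {
                Thread.sleep(workMillis); // stand-in for a slow remote chmod
                return "launched";
            }, timeoutMillis, TimeUnit.MILLISECONDS, "launch");
        } catch (IOException e) {
            return e.getMessage();
        }
    }
}
```

          With a short timeout the caller gets an IOException that names the slow operation, instead of a bare InterruptedException raised five minutes later from somewhere unrelated.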

          Basil Crow added a comment -

          I did some more investigation and it appears the slave agent is taking a long time to launch because it has to write about 10MB of JARs to the user's home directory, which is mounted on a slow NFS server and performing pathologically. Remoting Work Directories might fix this, but the swarm plugin has not been updated yet.

          Jesse Glick added a comment -

          So FilePath.chmod was blocking on that download? Hmm. I suppose DurableTaskStep$Execution.start could be reworked to run asynchronously. (Cannot compatibly extend GeneralNonBlockingStepExecution, unfortunately.)
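
          A rough sketch of what "run asynchronously" could mean here: start hands the slow launch to a background thread and returns at once, and the outcome is reported later through a callback. Everything in it is an assumption made for illustration. AsyncStart, Completion, and demo are hypothetical names, and in the real plugin the outcome would presumably be reported through the step's own context rather than an ad hoc interface.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch: a start method that does not block its caller. */
public class AsyncStart {
    /** Hypothetical completion callback standing in for the step's context. */
    public interface Completion {
        void onSuccess(String result);
        void onFailure(Throwable cause);
    }

    private static final ExecutorService BACKGROUND = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "async-launch");
        t.setDaemon(true);
        return t;
    });

    /** Schedules launch on a background thread and returns immediately. */
    public static void start(Callable<String> launch, Completion completion) {
        BACKGROUND.execute(() -> {
            String result;
            try {
                result = launch.call(); // the slow part runs off the caller's thread
            } catch (Exception e) {
                completion.onFailure(e);
                return;
            }
            completion.onSuccess(result);
        });
    }

    /** Demo helper: reports whether start returned before the simulated launch finished. */
    public static String demo(long launchMillis) {
        CountDownLatch done = new CountDownLatch(1);
        AtomicReference<String> outcome = new AtomicReference<>();
        start(() -> {
            Thread.sleep(launchMillis); // stand-in for a slow FilePath.chmod
            return "launched";
        }, new Completion() {
            @Override public void onSuccess(String result) {
                outcome.set(result);
                done.countDown();
            }
            @Override public void onFailure(Throwable cause) {
                outcome.set("failed: " + cause);
                done.countDown();
            }
        });
        // If the latch is still at 1, start came back while the launch was running.
        boolean returnedEarly = done.getCount() == 1;
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
        return (returnedEarly ? "async:" : "sync:") + outcome.get();
    }
}
```

          The point of the design is that the calling thread only pays for scheduling the work, so a slow agent would delay the step's completion without tripping a timer on the caller.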

            Assignee: Unassigned
            Reporter: Basil Crow
            Votes: 0
            Watchers: 2
