-
Bug
-
Resolution: Cannot Reproduce
-
Critical
-
Jenkins version 2.190
Java: OpenJDK 1.8.0_222
Within pipeline builds, shell steps randomly fail with an unspecific java.lang.InterruptedException, a full stack trace is listed below.
Unfortunately, this happens often enough to be a major issue within our development process since negative build results cannot be trusted and builds of multi-hour length might have to be retriggered multiple times.
Since we cannot reliably trigger the issue, I cannot provide an minimal example for reproduction. This is especially painful since all debugging has to happen in production.
Background information:
- Our slaves are started dynamically using the swarm plugin
- The orchestration of these slaves is handled by a shared library, the respective step is [available on github]
- We've only seen the exception occur on shell steps, other steps do not seem to throw (although not many were tested)
- Only the first shell step might throw, if it succeeds the others will be fine
- master can catch the Exception and continue with error handling
Complete stack trace:
java.lang.InterruptedException at java.lang.Object.wait(Native Method) at hudson.remoting.Request.call(Request.java:177) at hudson.remoting.Channel.call(Channel.java:956) at hudson.Launcher$RemoteLauncher.launch(Launcher.java:1060) at hudson.Launcher$ProcStarter.start(Launcher.java:455) at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:194) at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:99) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:317) at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:286) at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:179) at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122) at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:48) at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48) at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113) at com.cloudbees.groovy.cps.sandbox.DefaultInvoker.methodCall(DefaultInvoker.java:20) Caused: java.io.InterruptedIOException at hudson.Launcher$RemoteLauncher.launch(Launcher.java:1062) at hudson.Launcher$ProcStarter.start(Launcher.java:455) at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:194) at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:99) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:317) at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:286) at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:179) at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122) at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:48) at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48) at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113) at com.cloudbees.groovy.cps.sandbox.DefaultInvoker.methodCall(DefaultInvoker.java:20) at jesh.call(jesh.groovy:28) at withModules.call(withModules.groovy:45) at WorkflowScript.run(WorkflowScript:57) at onSlurmResource.call(onSlurmResource.groovy:46) at runOnSlave.call(runOnSlave.groovy:37) at ___cps.transform___(Native Method) at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:84) at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113) at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83) at sun.reflect.GeneratedMethodAccessor395.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72) at com.cloudbees.groovy.cps.impl.LocalVariableBlock$LocalVariable.get(LocalVariableBlock.java:39) at com.cloudbees.groovy.cps.LValueBlock$GetAdapter.receive(LValueBlock.java:30) at com.cloudbees.groovy.cps.impl.LocalVariableBlock.evalLValue(LocalVariableBlock.java:28) at com.cloudbees.groovy.cps.LValueBlock$BlockImpl.eval(LValueBlock.java:55) at com.cloudbees.groovy.cps.LValueBlock.eval(LValueBlock.java:16) at com.cloudbees.groovy.cps.Next.step(Next.java:83) at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174) at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163) at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:129) at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:268) at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163) at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18) at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:51) at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:186) at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:370) at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$200(CpsThreadGroup.java:93) at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:282) at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:270) at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:66) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131) at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28) at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)
I happened to come across this bug while triaging the Swarm component. Unclear what the cause of your problem is so far. It might be related to Swarm, or it might be related to durable-task or Remoting. It would be helpful to know what versions of the Swarm plugin, Swarm client, durable-task, and workflow-durable-task you are running. From these we would be able to tell what version of Remoting you are using on each side of the connection. If these plugins aren't already up-to-date, try updating them first.
The proximate cause of your problem given the above stack trace is hudson.remoting.Request.call(Request.java:177):
Here the Jenkins master is timing out after waiting for 30 seconds for some type of response from the agent over Remoting. It then throws an InterruptedException which causes the job to fail. You should try to look into the other side of the connection (the agent side) to see why it stopped responding to the master. Try turning up the logging as high as possible on the Swarm client side and see if anything suspicious is present there.