
java.lang.InterruptedException when executing within docker on remote worker

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Component: docker-workflow-plugin

      Background - we have a Jenkins environment that is hosted on kubernetes. In our deployment we have a main controller pod and then an arbitrary number of remote worker nodes executing on separate kubernetes pods (in most cases limited to 2 or 4) connected via JNLP4.

      The remote workers themselves execute commands inside various docker containers that are specialized for whatever task is being executed. This setup has been in place for 4+ years, and over that time we have upgraded to the latest LTS Jenkins a few times and upgraded plugins along with it.

      What has changed - we are upgrading to a more recent Jenkins LTS (FROM Jenkins 2.263.2 TO 2.426.3) and with it are upgrading associated plugins and underlying linux software.

      What is the problem - after the upgrade we started executing our unit tests to ensure our testing platform still operates properly and that no regressions were introduced by the upgrade of Jenkins. Our testing suite consists of a number of pipelines (each representing one test) which are automatically generated and then executed concurrently.

      During test execution we are hitting a case where it appears that docker connectivity is breaking down on the worker and the pipeline fails with an InterruptedException. If we run a small set of tests our jobs pass reliably. If we ramp up and run the full suite (something we were able to do reliably before the upgrade), most tests fail with the error mentioned below.

      What have we tried - we attempted to prevent this issue by setting the system property org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.REMOTE_TIMEOUT to 60. That appears to have solved an unrelated issue; this issue remains.
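      For reference, the following is a minimal Script Console sketch (Manage Jenkins > Script Console) of how the current value of that property can be checked. It only reads/writes the raw system property; we assume it is consumed on the controller JVM, and that the reliable way to change it is still a -D flag at controller startup rather than a runtime change.

      // Jenkins Script Console sketch - inspects only the raw system property.
      // Assumption: the property is read on the controller JVM, and a runtime change
      // may not take effect until restart, so prefer -D...REMOTE_TIMEOUT=60 at startup.
      def key = 'org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.REMOTE_TIMEOUT'
      println "current value: ${System.getProperty(key) ?: '(unset, plugin default)'}"
      // System.setProperty(key, '60')   // uncomment to experiment at runtime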

      Logs - included below are the logs of our pipeline execution - I have removed everything between when the pipeline started and when it failed within withDockerContainer. The dockerHelper.groovy frame in the stack trace is our Groovy helper; its runOnContainer function wraps the calls that execute commands within the container (a simplified sketch of that pattern follows, just before the logs).
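      The helper is essentially a thin wrapper around docker.image(...).inside, which is what the Docker$Image.inside frames in the trace correspond to. The sketch below is a simplified, hypothetical reconstruction of that pattern; the function signature and argument names are illustrative rather than our exact code.

      // dockerHelper.groovy - simplified, hypothetical sketch of the pattern we use.
      // The real helper resolves the image, docker arguments and environment per test.
      def runOnContainer(String image, String dockerArgs = '', Closure body) {
          // docker.image(...).inside keeps the container alive with `cat`
          // (the `docker run -t -d ... cat` line in the log) and runs `body` inside it.
          docker.image(image).inside(dockerArgs) {
              body()
          }
      }
      return this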

      14:05:01  [Pipeline] Start of Pipeline
      ...
      14:07:59  [Pipeline] withDockerContainer
      14:07:59  persistent-docker-worker-0 seems to be running inside container <our-container-id>
      14:07:59  but /home/jenkins/agent/workspace/integration-test/helmTest could not be found among []
      14:07:59  but /home/jenkins/agent/workspace/integration-test/helmTest@tmp could not be found among []
      14:07:59  $ docker run -t -d -u 1000:1000 -w /home/jenkins/agent/workspace/integration-test/helmTest -v /home/jenkins/agent/workspace/integration-test/helmTest:/home/jenkins/agent/workspace/integration-test/helmTest:rw,z -v /home/jenkins/agent/workspace/integration-test/helmTest@tmp:/home/jenkins/agent/workspace/integration-test/helmTest@tmp:rw,z -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** artifactory.local/jpm/helm-client:3 cat
      14:12:59  [Pipeline] // withDockerContainer
      ...
      14:13:01  Also:   org.jenkinsci.plugins.workflow.actions.ErrorAction$ErrorId: be8f9f02-a5fc-4aae-a610-64176b73c3e1
      14:13:01  java.lang.InterruptedException
      14:13:01        at java.base/java.lang.Object.wait(Native Method)
      14:13:01        at hudson.remoting.Request.call(Request.java:177)
      14:13:01        at hudson.remoting.Channel.call(Channel.java:1002)
      14:13:01        at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:285)
      14:13:01        at com.sun.proxy.$Proxy272.join(Unknown Source)
      14:13:01        at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:1198)
      14:13:01        at hudson.Proc.joinWithTimeout(Proc.java:172)
      14:13:01        at org.jenkinsci.plugins.docker.workflow.client.DockerClient.launch(DockerClient.java:314)
      14:13:01        at org.jenkinsci.plugins.docker.workflow.client.DockerClient.run(DockerClient.java:144)
      14:13:01        at org.jenkinsci.plugins.docker.workflow.WithContainerStep$Execution.start(WithContainerStep.java:200)
      14:13:01        at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:323)
      14:13:01        at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:196)
      14:13:01        at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:124)
      14:13:01        at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:47)
      14:13:01        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
      14:13:01        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:116)
      14:13:01        at com.cloudbees.groovy.cps.sandbox.DefaultInvoker.methodCall(DefaultInvoker.java:20)
      14:13:01        at org.jenkinsci.plugins.workflow.cps.LoggingInvoker.methodCall(LoggingInvoker.java:105)
      14:13:01        at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(Docker.groovy:140)
      14:13:01        at org.jenkinsci.plugins.docker.workflow.Docker.node(Docker.groovy:66)
      14:13:01        at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(Docker.groovy:125)
      14:13:01        at dockerHelper.runOnContainer(dockerHelper:256)
      14:13:01        at ___cps.transform___(Native Method)
      14:13:01        at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:90)
      14:13:01        at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:116)
      14:13:01        at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:85)
      14:13:01        at jdk.internal.reflect.GeneratedMethodAccessor639.invoke(Unknown Source)
      14:13:01        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
      14:13:01        at java.base/java.lang.reflect.Method.invoke(Unknown Source)
      14:13:01        at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
      14:13:01        at com.cloudbees.groovy.cps.impl.ClosureBlock.eval(ClosureBlock.java:46)
      14:13:01        at com.cloudbees.groovy.cps.Next.step(Next.java:83)
      14:13:01        at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:152)
      14:13:01        at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:146)
      14:13:01        at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:136)
      14:13:01        at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:275)
      14:13:01        at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:146)
      14:13:01        at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18)
      14:13:01        at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:51)
      14:13:01        at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:187)
      14:13:01        at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:423)
      14:13:01        at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:331)
      14:13:01        at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:295)
      14:13:01        at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:97)
      14:13:01        at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
      14:13:01        at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
      14:13:01        at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      14:13:01        at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
      14:13:01        at jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(ErrorLoggingExecutorService.java:51)
      14:13:01        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
      14:13:01        at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
      14:13:01        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
      14:13:01        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
      14:13:01        at java.base/java.lang.Thread.run(Unknown Source)
      14:13:01  Finished: FAILURE
      

          [JENKINS-72729] java.lang.InterruptedException when executing within docker on remote worker

          gmv added a comment (edited)

          After some digging through the source code I found the following configuration option:

          org.jenkinsci.plugins.docker.workflow.client.DockerClient.CLIENT_TIMEOUT
          

          This seemed to be set to 300 by default (at least in our environment), which lines up with the exact five-minute gap between the docker run line (14:07:59) and the withDockerContainer block closing (14:12:59) in the log above. Increasing this value to 600 had two observed improvements:

          1. the total execution time of all pipelines was reduced (it is not yet clear how or why a longer timeout would also result in better performance)
          2. fewer of the downstream pipelines failed due to this timeout condition

          Increasing it further, to a very large value such as 1200, also improved performance, and no tests failed due to the timeout condition. We are unlikely to use such a large value in production, as we are not clear on other impacts it might have (such as jobs encountering some other problem and then stalling for 20 minutes). However, it seems clear that there is an issue here.
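          For anyone wanting to reproduce the experiment, the snippet below is a rough Script Console sketch of how we raised the value. It assumes CLIENT_TIMEOUT is a plain public static field on DockerClient that the controller JVM reads; if a runtime change does not stick, the equivalent -D system property on the controller JVM at startup is the safer route.

          // Script Console sketch (experiment only) - assumes DockerClient.CLIENT_TIMEOUT is a
          // mutable public static int read on the controller; otherwise set
          //   -Dorg.jenkinsci.plugins.docker.workflow.client.DockerClient.CLIENT_TIMEOUT=600
          // on the controller JVM and restart.
          import org.jenkinsci.plugins.docker.workflow.client.DockerClient

          println "current CLIENT_TIMEOUT: ${DockerClient.CLIENT_TIMEOUT}s"
          DockerClient.CLIENT_TIMEOUT = 600   // revert once the root cause is understood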


            Assignee: Unassigned
            Reporter: gmv
            Votes: 0
            Watchers: 1
