Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-37719

Build cannot be interrupted if `docker stop` hangs

      cleclerc found a problem in a build doing basically

      docker.image('...').inside {
        sh '...'
      }
      

      The shell command appeared to complete, but then the build seemed to hang, and did not respond to anything short of a hard kill.

      A virtual thread dump just showed WorkflowScript on the last line of the closure. A physical thread dump of the master showed the problem:

      at ...
      at hudson.Launcher$ProcStarter.join(Launcher.java:388)
      at org.jenkinsci.plugins.docker.workflow.client.DockerClient.launch(DockerClient.java:260)
      at org.jenkinsci.plugins.docker.workflow.client.DockerClient.launch(DockerClient.java:241)
      at org.jenkinsci.plugins.docker.workflow.client.DockerClient.launch(DockerClient.java:238)
      at org.jenkinsci.plugins.docker.workflow.client.DockerClient.stop(DockerClient.java:133)
      at org.jenkinsci.plugins.docker.workflow.WithContainerStep.destroy(WithContainerStep.java:104)
      at ...
      

      On the agent, stream copiers were still active for the docker exec call associated with the shell step, as well as a docker stop call on the container.

      Indeed process inspection showed that both docker commands were running, and docker ps just hanged: for reasons TBD, the Docker daemon was not responding to requests (though docker version worked). Docker did not work again until sudo service docker restart was run.

      Was not clear if killing the docker stop process would have resumed the build, since by that point a sequence of kill escalations had already been attempted.

      The most straightforward fix in the plugin would be to use .start().joinWithTimeout(...) rather than .join() in DockerClient. Thus, a failure to clean up a container due to an outage in the Docker daemon would at worst result in a few minutes' delay followed by a (more or less) comprehensible error.

      (All Docker commands in the plugin that are expected to take a long time, because they might contact a registry, are run from sh steps in Groovy code so they are durable and cleanly interruptible. Docker commands which we do not need to wait for, like docker exec to run processes in the container, are just launched and the process handle discarded. DockerClient with its blocking join call is only used for commands which under normal conditions should be close to instantaneous: docker run -d after checking for local availability of the image, docker inspect, docker stop, etc. But clearly this assumption is not completely reliable.)

          [JENKINS-37719] Build cannot be interrupted if `docker stop` hangs

          Jesse Glick added a comment -

          Implementing the recommendations in JENKINS-32986 would probably also have solved this: the CPS VM was blocked in an interruptible call, so Thread.interrupt ought to have caused Callback.finished to throw InterruptedException or some wrapper thereof, which should have caused the withDockerContainer step to fail cleanly.

          Jesse Glick added a comment - Implementing the recommendations in JENKINS-32986 would probably also have solved this: the CPS VM was blocked in an interruptible call, so Thread.interrupt ought to have caused Callback.finished to throw InterruptedException or some wrapper thereof, which should have caused the withDockerContainer step to fail cleanly.

          Jesse Glick added a comment -

          Related problem in JENKINS-37720 getting diagnostics.

          Jesse Glick added a comment - Related problem in JENKINS-37720 getting diagnostics.

          Jesse Glick added a comment -

          As a side note, after a hard kill, the entire build directory abruptly disappears except for build.xml, so the log disappears. Unclear why; perhaps LogRotator is trying to delete the build after it exits, but then something asynchronously saves build.xml?

          Jesse Glick added a comment - As a side note, after a hard kill, the entire build directory abruptly disappears except for build.xml , so the log disappears. Unclear why; perhaps LogRotator is trying to delete the build after it exits, but then something asynchronously saves build.xml ?

          Cyrille Le Clerc added a comment - We regularly reproduce see https://gist.github.com/cyrille-leclerc/20e36ba0926b429b6ed1e485f0339334

          R. Tyler Croy added a comment -

          Adding a thread dump from a Pipeline I have hung at the moment:

          Thread #12
          	at WorkflowScript.run(WorkflowScript:64)
          	at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(jar:file:/var/jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:123)
          	at DSL.withDockerContainer(Native Method)
          	at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(jar:file:/var/jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:122)
          	at org.jenkinsci.plugins.docker.workflow.Docker.node(jar:file:/var/jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:63)
          	at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(jar:file:/var/jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:116)
          	at WorkflowScript.run(WorkflowScript:60)
          	at DSL.timeout(killer task reported done)
          	at WorkflowScript.run(WorkflowScript:56)
          	at DSL.stage(Native Method)
          	at WorkflowScript.run(WorkflowScript:52)
          	at DSL.node(running on trusted-agent-1)
          	at WorkflowScript.run(WorkflowScript:21)
          

          A resulting effect of this is that the timeout block will not properly work either leading to never-ending-Pipelines.

          R. Tyler Croy added a comment - Adding a thread dump from a Pipeline I have hung at the moment: Thread #12 at WorkflowScript.run(WorkflowScript:64) at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(jar:file:/ var /jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:123) at DSL.withDockerContainer(Native Method) at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(jar:file:/ var /jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:122) at org.jenkinsci.plugins.docker.workflow.Docker.node(jar:file:/ var /jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:63) at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(jar:file:/ var /jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:116) at WorkflowScript.run(WorkflowScript:60) at DSL.timeout(killer task reported done) at WorkflowScript.run(WorkflowScript:56) at DSL.stage(Native Method) at WorkflowScript.run(WorkflowScript:52) at DSL.node(running on trusted-agent-1) at WorkflowScript.run(WorkflowScript:21) A resulting effect of this is that the timeout block will not properly work either leading to never-ending-Pipelines.

          Code changed in jenkins
          User: Jesse Glick
          Path:
          pom.xml
          src/main/java/org/jenkinsci/plugins/docker/workflow/WithContainerStep.java
          src/main/java/org/jenkinsci/plugins/docker/workflow/client/DockerClient.java
          src/test/java/org/jenkinsci/plugins/docker/workflow/DockerTestUtil.java
          src/test/java/org/jenkinsci/plugins/docker/workflow/WithContainerStepTest.java
          http://jenkins-ci.org/commit/docker-workflow-plugin/3f8a83fbacdce1e3d286cf1790182a89d9f6cd11
          Log:
          [FIXED JENKINS-37719] Apply a timeout to all Docker CLI operations not already run from durable steps.

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Jesse Glick Path: pom.xml src/main/java/org/jenkinsci/plugins/docker/workflow/WithContainerStep.java src/main/java/org/jenkinsci/plugins/docker/workflow/client/DockerClient.java src/test/java/org/jenkinsci/plugins/docker/workflow/DockerTestUtil.java src/test/java/org/jenkinsci/plugins/docker/workflow/WithContainerStepTest.java http://jenkins-ci.org/commit/docker-workflow-plugin/3f8a83fbacdce1e3d286cf1790182a89d9f6cd11 Log: [FIXED JENKINS-37719] Apply a timeout to all Docker CLI operations not already run from durable steps.

            jglick Jesse Glick
            jglick Jesse Glick
            Votes:
            1 Vote for this issue
            Watchers:
            6 Start watching this issue

              Created:
              Updated:
              Resolved: