
Build cannot be interrupted if `docker stop` hangs

      cleclerc found a problem in a build doing basically:

      docker.image('...').inside {
        sh '...'
      }
      

      The shell command appeared to complete, but then the build seemed to hang, and did not respond to anything short of a hard kill.

      A virtual thread dump just showed WorkflowScript on the last line of the closure. A physical thread dump of the master showed the problem:

      at ...
      at hudson.Launcher$ProcStarter.join(Launcher.java:388)
      at org.jenkinsci.plugins.docker.workflow.client.DockerClient.launch(DockerClient.java:260)
      at org.jenkinsci.plugins.docker.workflow.client.DockerClient.launch(DockerClient.java:241)
      at org.jenkinsci.plugins.docker.workflow.client.DockerClient.launch(DockerClient.java:238)
      at org.jenkinsci.plugins.docker.workflow.client.DockerClient.stop(DockerClient.java:133)
      at org.jenkinsci.plugins.docker.workflow.WithContainerStep.destroy(WithContainerStep.java:104)
      at ...
      

      On the agent, stream copiers were still active for the docker exec call associated with the shell step, as well as a docker stop call on the container.

      Indeed, process inspection showed that both docker commands were running, and docker ps simply hung: for reasons TBD, the Docker daemon was not responding to requests (though docker version worked). Docker did not work again until sudo service docker restart was run.

      It was not clear whether killing the docker stop process would have resumed the build, since by that point a sequence of kill escalations had already been attempted.

      The most straightforward fix in the plugin would be to use .start().joinWithTimeout(...) rather than .join() in DockerClient. Thus, a failure to clean up a container due to an outage in the Docker daemon would at worst result in a few minutes' delay followed by a (more or less) comprehensible error.
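
      To illustrate the idea of a bounded wait, here is a minimal sketch using only the JDK's Process API. It is not the plugin's code and does not use Jenkins' Launcher; the class name BoundedJoin and the method name joinWithTimeout are hypothetical, chosen only to mirror the proposal above. If the child process does not exit in time, it is killed and a TimeoutException is thrown instead of blocking forever.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: a plain-JDK stand-in for "join with a timeout".
public class BoundedJoin {

    /**
     * Runs the command and waits at most the given time for it to exit.
     * Returns the exit code, or throws TimeoutException after killing the process.
     */
    static int joinWithTimeout(List<String> command, long timeout, TimeUnit unit)
            throws IOException, InterruptedException, TimeoutException {
        Process proc = new ProcessBuilder(command)
                .redirectErrorStream(true)
                // Discard output so a full pipe buffer cannot stall the child.
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();
        try {
            if (!proc.waitFor(timeout, unit)) {
                // The command is stuck (for example, the daemon is unresponsive): give up.
                proc.destroyForcibly();
                throw new TimeoutException(String.join(" ", command)
                        + " did not finish within " + timeout + " " + unit);
            }
            return proc.exitValue();
        } catch (InterruptedException e) {
            // Stay interruptible: kill the child and propagate the interrupt.
            proc.destroyForcibly();
            throw e;
        }
    }
}
```

      On a Unix-like system, calling BoundedJoin.joinWithTimeout(List.of("sleep", "30"), 1, TimeUnit.SECONDS) should fail with a TimeoutException after about a second rather than hanging for the full duration.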

      (All Docker commands in the plugin that are expected to take a long time, because they might contact a registry, are run from sh steps in Groovy code so they are durable and cleanly interruptible. Docker commands which we do not need to wait for, like docker exec to run processes in the container, are just launched and the process handle discarded. DockerClient with its blocking join call is only used for commands which under normal conditions should be close to instantaneous: docker run -d after checking for local availability of the image, docker inspect, docker stop, etc. But clearly this assumption is not completely reliable.)

          [JENKINS-37719] Build cannot be interrupted if `docker stop` hangs

          Jesse Glick created issue -
          Jesse Glick made changes -
          Epic Link New: JENKINS-35399 [ 171192 ]

          Jesse Glick added a comment -

          Implementing the recommendations in JENKINS-32986 would probably also have solved this: the CPS VM was blocked in an interruptible call, so Thread.interrupt ought to have caused Callback.finished to throw InterruptedException or some wrapper thereof, which should have caused the withDockerContainer step to fail cleanly.
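
          As a generic illustration of what "blocked in an interruptible call" means, here is a small JDK-only sketch. It is not the CPS VM or Callback code; the class InterruptibleWait and its method are hypothetical. A worker thread blocks on a latch that is never released, and interrupting it makes the blocking call throw InterruptedException so the worker can fail cleanly.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class, unrelated to the Jenkins code base.
public class InterruptibleWait {

    /** Blocks a worker on a latch that is never released, interrupts it, and reports how it ended. */
    static String blockThenInterrupt() throws InterruptedException {
        CountDownLatch neverReleased = new CountDownLatch(1);
        AtomicReference<String> outcome = new AtomicReference<>("still blocked");

        Thread worker = new Thread(() -> {
            try {
                neverReleased.await(); // interruptible blocking call
                outcome.set("completed");
            } catch (InterruptedException e) {
                // The interrupt surfaces as an exception instead of a permanent hang.
                outcome.set("failed cleanly: interrupted");
            }
        });

        worker.start();
        worker.interrupt();   // request cancellation
        worker.join(5000);    // the worker should exit promptly
        return outcome.get();
    }
}
```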

          Jesse Glick made changes -
          Link New: This issue is related to JENKINS-32986 [ JENKINS-32986 ]

          Jesse Glick added a comment -

          A related problem with getting diagnostics is tracked in JENKINS-37720.

          Jesse Glick made changes -
          Link New: This issue is related to JENKINS-37720 [ JENKINS-37720 ]

          Jesse Glick added a comment -

          As a side note, after a hard kill, the entire build directory abruptly disappears except for build.xml, so the log disappears. Unclear why; perhaps LogRotator is trying to delete the build after it exits, but then something asynchronously saves build.xml?


          Cyrille Le Clerc added a comment - We regularly reproduce this; see https://gist.github.com/cyrille-leclerc/20e36ba0926b429b6ed1e485f0339334

          R. Tyler Croy added a comment -

          Adding a thread dump from a Pipeline I have hung at the moment:

          Thread #12
          	at WorkflowScript.run(WorkflowScript:64)
          	at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(jar:file:/var/jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:123)
          	at DSL.withDockerContainer(Native Method)
          	at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(jar:file:/var/jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:122)
          	at org.jenkinsci.plugins.docker.workflow.Docker.node(jar:file:/var/jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:63)
          	at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(jar:file:/var/jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:116)
          	at WorkflowScript.run(WorkflowScript:60)
          	at DSL.timeout(killer task reported done)
          	at WorkflowScript.run(WorkflowScript:56)
          	at DSL.stage(Native Method)
          	at WorkflowScript.run(WorkflowScript:52)
          	at DSL.node(running on trusted-agent-1)
          	at WorkflowScript.run(WorkflowScript:21)
          

          As a result, the timeout block does not work properly either, leading to never-ending Pipelines.

          Jesse Glick made changes -
          Status Original: Open [ 1 ] New: In Progress [ 3 ]
