[JENKINS-37719] Build cannot be interrupted if `docker stop` hangs - Jenkins Jira

Type: Bug
Resolution: Fixed
Priority: Major
Component/s: docker-workflow-plugin
Labels:
- robustness
- threads

Epic Link:
Pipeline Durability

cleclerc found a problem in a build doing basically:

docker.image('...').inside {
  sh '...'
}

The shell command appeared to complete, but then the build seemed to hang, and did not respond to anything short of a hard kill.

A virtual thread dump just showed WorkflowScript on the last line of the closure. A physical thread dump of the master showed the problem:

at ...
at hudson.Launcher$ProcStarter.join(Launcher.java:388)
at org.jenkinsci.plugins.docker.workflow.client.DockerClient.launch(DockerClient.java:260)
at org.jenkinsci.plugins.docker.workflow.client.DockerClient.launch(DockerClient.java:241)
at org.jenkinsci.plugins.docker.workflow.client.DockerClient.launch(DockerClient.java:238)
at org.jenkinsci.plugins.docker.workflow.client.DockerClient.stop(DockerClient.java:133)
at org.jenkinsci.plugins.docker.workflow.WithContainerStep.destroy(WithContainerStep.java:104)
at ...

On the agent, stream copiers were still active for the docker exec call associated with the shell step, as well as a docker stop call on the container.

Indeed, process inspection showed that both docker commands were still running, and docker ps simply hung: for reasons yet to be determined, the Docker daemon was not responding to requests (though docker version worked). Docker did not work again until sudo service docker restart was run.

It was not clear whether killing the docker stop process would have resumed the build, since by that point a sequence of kill escalations had already been attempted.

The most straightforward fix in the plugin would be to use .start().joinWithTimeout(...) rather than .join() in DockerClient. Thus, a failure to clean up a container due to an outage in the Docker daemon would at worst result in a few minutes' delay followed by a (more or less) comprehensible error.
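As a rough illustration of the bounded-wait idea, the sketch below uses only the plain JDK `Process` API rather than the Jenkins `Launcher`/`Proc` classes the plugin actually uses, so it is an analogy and not the plugin's code. The class name `BoundedCommand`, the method `runWithTimeout`, and the timeout values are hypothetical, and the demo assumes a Unix-like system with `sleep` on the PATH.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of docker-workflow-plugin.
class BoundedCommand {

    /**
     * Starts the command and waits at most timeoutSeconds for it to exit.
     * Returns the exit code, or throws IOException if the deadline passes.
     */
    static int runWithTimeout(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        // inheritIO() avoids blocking on an unread output pipe.
        Process process = new ProcessBuilder(command).inheritIO().start();
        // Bounded wait instead of an unconditional join.
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            // Give up on the child so the caller is not stuck forever.
            process.destroyForcibly();
            throw new IOException(
                    "Timed out after " + timeoutSeconds + "s running " + command);
        }
        return process.exitValue();
    }

    public static void main(String[] args) throws Exception {
        try {
            // "sleep 30" stands in for a command that hangs (assumption for the demo).
            runWithTimeout(Arrays.asList("sleep", "30"), 1);
        } catch (IOException e) {
            System.out.println("Gave up: " + e.getMessage());
        }
    }
}
```

The point of the design is that a hung child process turns into an ordinary exception after a known delay, which the caller can report, instead of a thread blocked indefinitely in a join.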

(All Docker commands in the plugin that are expected to take a long time, because they might contact a registry, are run from sh steps in Groovy code so they are durable and cleanly interruptible. Docker commands which we do not need to wait for, like docker exec to run processes in the container, are just launched and the process handle discarded. DockerClient with its blocking join call is only used for commands which under normal conditions should be close to instantaneous: docker run -d after checking for local availability of the image, docker inspect, docker stop, etc. But clearly this assumption is not completely reliable.)

is related to
- JENKINS-37720 Virtual thread dump hangs waiting for ProcessLiveness (Resolved)
- JENKINS-32986 hard killing a pipeline leaves the JVM CPS thread running. (Open)
- JENKINS-42322 Docker rm/stop/... commands killed by the timeout, failing builds (Resolved)

relates to
- JENKINS-44785 Add Built-in Request timeout support in Remoting (Open)

links to
- CloudBees Internal OSS-1374
- docker-workflow PR 86
- workflow-durable-task-step PR 30

Assignee: Jesse Glick
Reporter: Jesse Glick
Votes: 1
Watchers: 6
Created: 2016-08-26 15:15
Updated: 2017-12-05 05:46
Resolved: 2017-02-13 13:48
