Cyrille Le Clerc found a problem in a build doing basically
The shell command appeared to complete, but then the build seemed to hang, and did not respond to anything short of a hard kill.
A virtual thread dump just showed WorkflowScript on the last line of the closure. A physical thread dump of the master showed the problem:
On the agent, stream copiers were still active for the docker exec call associated with the shell step, as well as a docker stop call on the container.
Indeed, process inspection showed that both docker commands were still running, and docker ps simply hung: for reasons TBD, the Docker daemon was not responding to requests (though docker version worked). Docker did not work again until sudo service docker restart was run.
It was not clear whether killing the docker stop process would have resumed the build, since by that point a sequence of kill escalations had already been attempted.
The most straightforward fix in the plugin would be to use .start().joinWithTimeout(...) rather than .join() in DockerClient. Thus, a failure to clean up a container due to an outage in the Docker daemon would at worst result in a few minutes' delay followed by a (more or less) comprehensible error.
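The idea of bounding the wait can be sketched roughly as follows. This is not the plugin's code: it uses the plain JDK Process.waitFor(timeout, unit) as a stand-in for the Jenkins process API, and the class name BoundedWait and method name runWithTimeout are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of the plugin: illustrates a bounded wait
// on an external command instead of an unbounded blocking join.
class BoundedWait {

    /**
     * Runs the command and waits at most the given timeout.
     * Returns the exit code, or throws IOException if the process
     * has not finished in time (after forcibly destroying it).
     */
    static int runWithTimeout(List<String> command, long timeout, TimeUnit unit)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                // Discard output so a full pipe buffer cannot block the child.
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();
        if (!process.waitFor(timeout, unit)) {
            // The command is stuck (e.g. an unresponsive daemon): give up on it
            // and report a readable error rather than hanging the caller.
            process.destroyForcibly();
            throw new IOException("Timed out after " + timeout + " "
                    + unit + " waiting for " + command);
        }
        return process.exitValue();
    }
}
```

With a helper of this shape, a call such as BoundedWait.runWithTimeout(List.of("docker", "stop", containerId), 3, TimeUnit.MINUTES), where containerId is a placeholder, would either return an exit code or fail with an IOException after three minutes, instead of blocking indefinitely.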
(All Docker commands in the plugin that are expected to take a long time, because they might contact a registry, are run from sh steps in Groovy code so they are durable and cleanly interruptible. Docker commands which we do not need to wait for, like docker exec to run processes in the container, are just launched and the process handle discarded. DockerClient with its blocking join call is only used for commands which under normal conditions should be close to instantaneous: docker run -d after checking for local availability of the image, docker inspect, docker stop, etc. But clearly this assumption is not completely reliable.)