Bug
Resolution: Unresolved
Major
None
Jenkins 2.164.3
Docker Plugin 1.1.6
We have multiple Docker cloud hosts set up, with multiple templates per host. Some templates are duplicated across clouds; others are unique to a host. I'm not yet sure how to reproduce this or how we get into this state, but I can help diagnose on my end with some guidance.
After a period of successfully provisioning Docker agents, we get into a state where a job waits for a container that never arrives. If another job is then launched that requests a container matching the criteria of the already queued job, the new container is provisioned and the first job takes it. This leaves the new job waiting until yet another job requests a matching container.
If another job requests a unique container, it too waits indefinitely until another job requests the same container.
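To make the symptom easier to reason about, here is a toy Python model of what we observe. It is only a sketch of the visible behaviour, not the plugin's actual logic: the class name `StuckProvisionerModel` and its methods are hypothetical, and it assumes that, once in the bad state, the first request for a given container produces nothing usable while every later request produces exactly one container that goes to the longest-waiting job.

```python
from collections import defaultdict, deque


class StuckProvisionerModel:
    """Toy model of the observed symptom; NOT the Docker plugin's real logic."""

    def __init__(self):
        # label -> jobs currently waiting for a container with that label
        self.waiting = defaultdict(deque)
        # labels whose first request has already been "lost" (assumption)
        self.lost = set()

    def request(self, job, label):
        """Queue `job` for `label`; return the job that gets a container, if any."""
        self.waiting[label].append(job)
        if label not in self.lost:
            # Assumed stuck state: the first request yields no usable container.
            self.lost.add(label)
            return None
        # A later request provisions one container, taken by the oldest waiter.
        return self.waiting[label].popleft()


if __name__ == "__main__":
    model = StuckProvisionerModel()
    for job, label in [("job-1", "container-A"),
                       ("job-2", "container-A"),
                       ("job-3", "unique-B")]:
        print(job, "requested", label, "-> container taken by", model.request(job, label))
```

In this model every job only gets its container when the next matching request arrives, which matches what we see: the queue is always one request behind.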
Restarting the master resolves the issue temporarily (for a few days).
Once we have hit this state, if I launch a job requesting container-A, the log shows the following, and the container does not start until a second job requests a container.
Jul 08, 2019 12:05:24 PM INFO io.jenkins.docker.DockerTransientNode$1 println Disconnected computer for node 'docker-003jf7ab2t864'.
Jul 08, 2019 12:05:24 PM INFO hudson.remoting.Request$2 run Failed to send back a reply to the request hudson.remoting.Request$2@17f6e1b9: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@3448814f:docker-003jf7ab2t864": channel is already closed
Jul 08, 2019 12:05:24 PM INFO io.jenkins.docker.DockerTransientNode$1 println Removed Node for node 'docker-003jf7ab2t864'.
Jul 08, 2019 12:05:24 PM INFO io.jenkins.docker.DockerTransientNode$1 println Stopped container '736b8e5ffa4e2a0a06965fc5768bbf241654efe3193e72a54a9f72fcf400e417' for node 'docker-003jf7ab2t864'.
Jul 08, 2019 12:05:24 PM INFO io.jenkins.docker.DockerTransientNode$1 println Removed container '736b8e5ffa4e2a0a06965fc5768bbf241654efe3193e72a54a9f72fcf400e417' for node 'docker-003jf7ab2t864'.
If I launch a second job requesting container-A, the log looks more normal, but now the second job is stuck waiting.
Jul 08, 2019 12:07:20 PM INFO hudson.slaves.NodeProvisioner$2 run Image of CONTAINER-A:latest provisioning successfully completed. We have now 146 computer(s)
Jul 08, 2019 12:07:20 PM INFO io.jenkins.docker.DockerTransientNode$1 println Disconnected computer for node 'docker-003jf7ab2t864'.
Jul 08, 2019 12:07:20 PM INFO com.nirima.jenkins.plugins.docker.DockerCloud provision Asked to provision 1 slave(s) for: merge
Jul 08, 2019 12:07:20 PM INFO com.nirima.jenkins.plugins.docker.DockerCloud canAddProvisionedSlave Provisioning 'CONTAINER-A:latest' on 'docker-cloud-1'
Jul 08, 2019 12:07:20 PM INFO com.nirima.jenkins.plugins.docker.DockerCloud provision Will provision 'CONTAINER-A:latest', for label: 'merge', in cloud: 'docker-cloud-1'
Jul 08, 2019 12:07:20 PM INFO hudson.slaves.NodeProvisioner$StandardStrategyImpl apply Started provisioning Image of CONTAINER-A:latest from docker-cloud-1 with 1 executors. Remaining excess workload: 0
Jul 08, 2019 12:07:20 PM INFO com.nirima.jenkins.plugins.docker.DockerTemplate pullImage Pulling image 'CONTAINER-A:latest'. This may take awhile...
Jul 08, 2019 12:07:21 PM INFO io.jenkins.docker.DockerTransientNode$1 println Removed Node for node 'docker-003jf7ab2t864'.
Jul 08, 2019 12:07:22 PM INFO com.nirima.jenkins.plugins.docker.DockerTemplate pullImage Finished pulling image 'CONTAINER-A:latest', took 2002 ms
Jul 08, 2019 12:07:22 PM INFO com.nirima.jenkins.plugins.docker.DockerTemplate doProvisionNode Trying to run container for CONTAINER-A:latest
Jul 08, 2019 12:07:22 PM INFO com.nirima.jenkins.plugins.docker.DockerTemplate doProvisionNode Trying to run container for node docker-003jf9zomwkv9 from image: CONTAINER-A:latest
Jul 08, 2019 12:07:22 PM INFO com.nirima.jenkins.plugins.docker.DockerTemplate doProvisionNode Started container ID fd2202c7c9db3549e82351926694a9e2965d9c9d83f133049af7f99f4f6e94da for node docker-003jf9zomwkv9 from image: CONTAINER-A:latest
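In case it helps with diagnosis, below is a rough Python sketch for grouping log entries like the ones above by node name, so it is easier to see which node was removed and which one was started around the same time. The function names `split_entries` and `events_by_node` are my own, and the regular expressions only assume the timestamp and `node 'docker-...'` shapes that appear in these excerpts; other log formats would need adjusting.

```python
import re
from collections import defaultdict

# Timestamp shape seen in the excerpts, e.g. "Jul 08, 2019 12:05:24 PM".
TIMESTAMP = re.compile(r"[A-Z][a-z]{2} \d{2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M")
# Node names as they appear in the excerpts, with or without quotes.
NODE = re.compile(r"node '?(docker-[0-9a-z]+)'?")


def split_entries(text):
    """Split a log dump into entries, each starting at a timestamp."""
    starts = [m.start() for m in TIMESTAMP.finditer(text)]
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def events_by_node(text):
    """Map each node name to a list of (timestamp, rest of entry) tuples."""
    events = defaultdict(list)
    for entry in split_entries(text):
        stamp = TIMESTAMP.match(entry).group(0)
        message = entry[len(stamp):].strip()
        # sorted(set(...)) so an entry naming a node twice is recorded once.
        for node in sorted(set(NODE.findall(entry))):
            events[node].append((stamp, message))
    return dict(events)
```

Calling `events_by_node(log_text)` on the second excerpt should list the `Disconnected computer` and `Removed Node` entries under `docker-003jf7ab2t864` and the `Trying to run container` and `Started container ID` entries under `docker-003jf9zomwkv9`.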