Jenkins / JENKINS-58390

Docker cloud provisioning 1 slave behind

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: docker-plugin
    • Labels:
      None
    • Environment:
      Jenkins 2.164.3
      Docker Plugin 1.1.6

      Description

      We have multiple Docker cloud hosts set up, with multiple templates per host. Some templates are duplicated across clouds; others are unique to a host. I'm not yet sure how to reproduce this or how we get into this state, but I can help diagnose on my end with some guidance.

      After a period of successfully provisioning Docker agents, we get into a state where a job waits for a container that never arrives. If another job is then launched that requests a container matching the criteria of the already queued job, the new container is provisioned and the first job takes it. This leaves the new job waiting until yet another job requests a matching container.

      If another job requests a unique container, it too waits indefinitely until some other job requests the same container.

      Restarting the master resolves the issue temporarily (for a few days).

      Once we have hit this state, if I launch a job requesting container-A, the log shows the following, and the container does not start until a second job requests a container.

      Jul 08, 2019 12:05:24 PM INFO io.jenkins.docker.DockerTransientNode$1 println
      Disconnected computer for node 'docker-003jf7ab2t864'.
      Jul 08, 2019 12:05:24 PM INFO hudson.remoting.Request$2 run
      Failed to send back a reply to the request hudson.remoting.Request$2@17f6e1b9: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@3448814f:docker-003jf7ab2t864": channel is already closed
      Jul 08, 2019 12:05:24 PM INFO io.jenkins.docker.DockerTransientNode$1 println
      Removed Node for node 'docker-003jf7ab2t864'.
      Jul 08, 2019 12:05:24 PM INFO io.jenkins.docker.DockerTransientNode$1 println
      Stopped container '736b8e5ffa4e2a0a06965fc5768bbf241654efe3193e72a54a9f72fcf400e417' for node 'docker-003jf7ab2t864'.
      Jul 08, 2019 12:05:24 PM INFO io.jenkins.docker.DockerTransientNode$1 println
      Removed container '736b8e5ffa4e2a0a06965fc5768bbf241654efe3193e72a54a9f72fcf400e417' for node 'docker-003jf7ab2t864'.
      

      If I launch a second job requesting container-A, the log looks more normal, but now the second job is stuck waiting.

      Jul 08, 2019 12:07:20 PM INFO hudson.slaves.NodeProvisioner$2 run
      Image of CONTAINER-A:latest provisioning successfully completed. We have now 146 computer(s)
      Jul 08, 2019 12:07:20 PM INFO io.jenkins.docker.DockerTransientNode$1 println
      Disconnected computer for node 'docker-003jf7ab2t864'.
      Jul 08, 2019 12:07:20 PM INFO com.nirima.jenkins.plugins.docker.DockerCloud provision
      Asked to provision 1 slave(s) for: merge
      Jul 08, 2019 12:07:20 PM INFO com.nirima.jenkins.plugins.docker.DockerCloud canAddProvisionedSlave
      Provisioning 'CONTAINER-A:latest' on 'docker-cloud-1'
      Jul 08, 2019 12:07:20 PM INFO com.nirima.jenkins.plugins.docker.DockerCloud provision
      Will provision 'CONTAINER-A:latest', for label: 'merge', in cloud: 'docker-cloud-1'
      Jul 08, 2019 12:07:20 PM INFO hudson.slaves.NodeProvisioner$StandardStrategyImpl apply
      Started provisioning Image of CONTAINER-A:latest from docker-cloud-1 with 1 executors. Remaining excess workload: 0
      Jul 08, 2019 12:07:20 PM INFO com.nirima.jenkins.plugins.docker.DockerTemplate pullImage
      Pulling image 'CONTAINER-A:latest'. This may take awhile...
      Jul 08, 2019 12:07:21 PM INFO io.jenkins.docker.DockerTransientNode$1 println
      Removed Node for node 'docker-003jf7ab2t864'.
      Jul 08, 2019 12:07:22 PM INFO com.nirima.jenkins.plugins.docker.DockerTemplate pullImage
      Finished pulling image 'CONTAINER-A:latest', took 2002 ms
      Jul 08, 2019 12:07:22 PM INFO com.nirima.jenkins.plugins.docker.DockerTemplate doProvisionNode
      Trying to run container for CONTAINER-A:latest
      Jul 08, 2019 12:07:22 PM INFO com.nirima.jenkins.plugins.docker.DockerTemplate doProvisionNode
      Trying to run container for node docker-003jf9zomwkv9 from image: CONTAINER-A:latest
      Jul 08, 2019 12:07:22 PM INFO com.nirima.jenkins.plugins.docker.DockerTemplate doProvisionNode
      Started container ID fd2202c7c9db3549e82351926694a9e2965d9c9d83f133049af7f99f4f6e94da for node docker-003jf9zomwkv9 from image: CONTAINER-A:latest
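To compare excerpts like the two above, it can help to group the messages by node name so that each node's lifecycle reads as one sequence. The following is a minimal sketch, not part of the plugin: it assumes the log layout shown here (a timestamp/logger line followed by a message line) and node names of the form `docker-<id>`. The function name `events_by_node` is hypothetical.

```python
import re
from collections import defaultdict

# Assumption: node names look like docker-003jf7ab2t864 and follow the word
# "node", with or without single quotes, as in the excerpts above.
NODE_RE = re.compile(r"node '?(docker-[0-9a-z]+)'?")

# Assumption: header lines start with a timestamp such as
# "Jul 08, 2019 12:05:24 PM".
TS_RE = re.compile(r"^([A-Z][a-z]{2} \d{2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M) ")


def events_by_node(log_text):
    """Group message lines by node name, paired with the preceding timestamp."""
    events = defaultdict(list)
    timestamp = None
    for raw in log_text.splitlines():
        line = raw.strip()
        header = TS_RE.match(line)
        if header:
            # Header line: remember its timestamp for the message that follows.
            timestamp = header.group(1)
            continue
        node = NODE_RE.search(line)
        if node:
            events[node.group(1)].append((timestamp, line))
    return dict(events)
```

Passing the combined log text to `events_by_node` returns a dictionary keyed by node name. Messages that do not name a node in this form, such as the `ChannelClosedException` line, are skipped.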
      


          Activity

          broussar Adam Brousseau added a comment -

          As a workaround, we wrote a separate job that runs after the first job has started, asks for the same container, and then times out.

          mwilson Matt Wilson added a comment -

          I had a similar issue when I first started using this plugin. After some debugging, I think I realized that I had a Docker agent "hidden" on my system: if I looked in my Jenkins folder under nodes, I could see one of the container agents there, but that agent wasn't active. I can't remember the exact commands now, but I used the console to list all agents, and sure enough the machine was listed there. I then used the console to remove the agent. That got me back to a state where things worked fine again.

          I believe I caused this issue by accidentally setting a node to delete after use.
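The check described above (a node entry present under the Jenkins folder but not active) can be sketched as a small script. This is only an illustration under stated assumptions: it assumes node definitions live in subdirectories of `$JENKINS_HOME/nodes`, and that you already have the names of the currently active agents from wherever you list them (the nodes page or the console). The function name `stale_node_dirs` is hypothetical, and the sketch only reports candidates; it removes nothing.

```python
from pathlib import Path


def stale_node_dirs(jenkins_home, live_agent_names):
    """Return node directory names under <jenkins_home>/nodes with no live agent."""
    nodes_dir = Path(jenkins_home) / "nodes"
    if not nodes_dir.is_dir():
        # No nodes directory at all: nothing to report.
        return []
    # Each subdirectory is assumed to correspond to one configured node.
    on_disk = {entry.name for entry in nodes_dir.iterdir() if entry.is_dir()}
    return sorted(on_disk - set(live_agent_names))
```

Any name this returns is a candidate for the kind of "hidden" agent described in this comment. Whether removing it is safe still needs to be confirmed on the Jenkins side.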

          broussar Adam Brousseau added a comment -

          Thanks Matt,

          I'm not sure if that was our problem, but it sounds like it could have been. I don't think we've seen this in a while, so I can't really test it.


            People

            Assignee:
            Unassigned
            Reporter:
            broussar Adam Brousseau
            Votes:
            1
            Watchers:
            3
