JENKINS-48490: Intermittently slow docker provisioning with no errors

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Component: docker-plugin
    • Labels: None
    • Environment: Jenkins 2.93
      Docker Plugin 1.1.1
      Containers are using JNLP

      I have a large Docker swarm (old style docker swarm API in a container).  There is plenty of capacity (multi-TB of RAM, etc.).

      When jobs (multibranch pipeline jobs in this case) allocate a docker node (by label), one of these things happens:

      1. Node is allocated immediately
      2. Node is not allocated, and the Jenkins logs indicate why (e.g. the swarm is full per the maximums configured in Jenkins)
      3. Node is allocated with a significant delay (minutes).  Logs do not indicate why; there is no Docker Plugin log activity until the node is allocated.
      4. Node is allocated with a ridiculous delay (I just had one take 77 minutes).  Logs do not indicate any activity from the Docker plugin until it is allocated.  Other jobs have gotten containers allocated since (and those events are in the logs).  An interesting thing I noticed is that the job sometimes gets its container only once a later build of this job requests one (they run in parallel), and then the later build waits (forever?).

      How can I troubleshoot this behavior, especially #4?

      Because it is intermittent, I can't be sure, but it seems as if it has gotten worse after the Docker Plugin 1.0.x to 1.1.x upgrade (possibly also the Jenkins 2.92 to 2.93 upgrade).

      In fact, I have two Jenkins instances, one upgraded to plugin 1.1.1 and the other on 1.1, and the one running 1.1 is currently not exhibiting these issues (but it's also under less load).

          [JENKINS-48490] Intermittently slow docker provisioning with no errors

          Nicolas De Loof added a comment -

          Should be fixed by https://github.com/jenkinsci/docker-plugin/commit/b906c37f0add6a4a44fb4858aa8ff9de0ebeeabe#diff-0a438d5fa3c18711a9f82de9b72440c0R277

          I will need to revisit the Provisioner implementation at some point.

          Alexander Komarov added a comment - edited

          I'm now on docker-plugin 1.1.2 and Jenkins 2.98 (ndeloof, it looks like the commit you referenced should be included in 1.1.2).  Note that I'm not using the SSH launcher.

          I'm experiencing the same behavior as before.

          I'll try to describe it better:

          • I have 5 main docker templates with unique labels.  
          • I have bitbucket-branch-source based jobs that occasionally launch a lot of jobs rapidly (one per PR, for example, about 40 is typical)
          • These jobs use two of the docker templates by labels
          • Sometimes one or more will get stuck with "waiting for..." forever. 
          • The logs will mention nothing about provisioning (that I noticed)

          The interesting part:

          • If I then create and run a new freestyle job that allocates the same label (and does nothing), the blocked (real) job will immediately get a node.  The new freestyle job will then wait for a node until another job requests one.
          • In other words, the NodeProvisioner appears to be one allocation behind.  Time is not a factor (the job can wait for 10 hours).
          • Currently I have the aforementioned job running frequently as a workaround, and that seems to be working (I changed it to a pipeline job that allocates each of the 5 docker templates and times out quickly; see the sketch below).  This seems to keep the real jobs running, but it obviously adds some overhead and wear to the cloud.
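
          A rough sketch of what that workaround "canary" job can look like (just the shape of it, not the exact job; the label names are examples matching my templates):

           // Canary pipeline: request one node per docker template label and give up
           // quickly. If provisioning is stuck "one allocation behind", this extra
           // request lets the blocked real job take the container; the canary itself
           // simply times out and is discarded.
           def flavors = ['centos6', 'centos7', 'sles11', 'sles12', 'ubuntu']
           def branches = [:]

           flavors.each { flavor ->
             branches[flavor] = {
               try {
                 timeout(time: 2, unit: 'MINUTES') {
                   node("${flavor}&&dockerswarm") {
                     echo "canary got a ${flavor} node"
                   }
                 }
               } catch (ignored) {
                 echo "no ${flavor} node within the timeout"
               }
             }
           }

           parallel branches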

          ndeloof, would you suggest playing with system properties like these?

          -Dhudson.slaves.NodeProvisioner.initialDelay=X
          -Dhudson.slaves.NodeProvisioner.recurrencePeriod=X
          -Dhudson.slaves.NodeProvisioner.MARGIN=X

          This feels like the wrong approach because I did not need to do this with Jenkins <=2.93 and docker-plugin <=1.0.x, so it would seem that something changed...
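
          For anyone experimenting with these, a minimal Script Console sketch to confirm which of those properties (if any) the running JVM actually received; they are typically read once at startup, so changing them means passing -D options on the Jenkins java command line and restarting:

           // Print whichever NodeProvisioner tuning properties were passed to this JVM.
           // A null value means the built-in default is in effect.
           ['initialDelay', 'recurrencePeriod', 'MARGIN'].each { name ->
             def key = "hudson.slaves.NodeProvisioner.${name}"
             println "${key} = ${System.getProperty(key)}"
           }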

          David van Laatum added a comment -

          I think I may be seeing the same thing, but my builds have a timeout, so they never wait around for the 10-hour mark.

          I have 2 docker servers with the same template configured and a pipeline job that starts two parallel branches against the label on the docker containers, but it frequently creates a slave on only one of them, or neither (a rough sketch of that pipeline shape is below).  I do get the odd exception like:

          SEVERE: I/O error in channel docker-d5fcd775ff9f
          java.io.IOException: Expected 8 bytes application/vnd.docker.raw-stream header, got 4
          at io.jenkins.docker.client.DockerMultiplexedInputStream.readInternal(DockerMultiplexedInputStream.java:45)
          at io.jenkins.docker.client.DockerMultiplexedInputStream.read(DockerMultiplexedInputStream.java:25)
          at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:91)
          at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
          at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
          at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
          at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
          at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
          

          But it's not a 1-to-1 correlation with when it fails (i.e. sometimes it doesn't have that exception but still fails; maybe it does disrupt something for a while?).
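
          A minimal sketch of the pipeline shape described above, two parallel branches requesting the same label (the 'dockerswarm' label here is only a placeholder):

           // Two parallel branches that each request an agent with the same label.
           // Substitute the label from your docker template for 'dockerswarm'.
           def branches = [:]
           ['first', 'second'].each { name ->
             branches[name] = {
               node('dockerswarm') {
                 sh 'hostname'   // any trivial step; the point is the node() allocation
               }
             }
           }
           parallel branches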

          Alexander Komarov added a comment - edited

          I was finally able to reproduce this in an isolated Jenkins installation running the following versions:

          • Jenkins 2.98
          • Docker Plugin 1.1.2

          Here are the steps to reproduce:

          • Create a Cloud configuration with several docker image template definitions.  Mine are custom-built images, but the random nature of this issue suggests that it should happen with any images.  Note: I am using JNLP, but the problem also occurs with the "Attached" method.  Ensure that there is enough capacity (instance limits, etc.) so that this is not a bottleneck.
          • Create a Pipeline job with the following code (adjust for image labels if needed): 
           // Reproduction pipeline: request one node per docker template label in parallel.
           // timeout(1) is one minute, so a branch fails fast if no node is allocated.
           def flavors = ['centos6', 'centos7', 'sles11', 'sles12', 'ubuntu']
           def steps = [:]
           
           flavors.each { flavor ->
             steps[flavor] = {
               stage(flavor) {
                 timeout(1) {
                   echo "Allocating ${flavor}"
                   node("${flavor}&&dockerswarm") {
                     sh "date"
                   }
                 }
               }
             }
           }
           
           timestamps {
             parallel steps
           }
          
          • Run this job a few times.  It will succeed at first, but after a few successful runs, one or more of the image flavors will start to time out.
          • Restart Jenkins.
          • The job will succeed again for a little while.  (UPDATE: a restart does not always help; the issue sometimes begins again immediately after.)

          Jenkins logs are attached:

          Job run when provisioning hangs:  jenkins-log-provisioning-fail.log

          Job run when provisioning succeeds: jenkins-log-provisioning-success.log


          Note that while Jenkins is blocked trying to allocate a node, I can manually allocate one using docker command-line, proving that the actual docker infrastructure is not the problem.

          David van Laatum added a comment -

          Same thing for me

          Jason Swager added a comment -

          We are seeing similar symptoms after upgrading the Docker plugin.  Our Jenkins masters started seeing this problem after upgrading to Docker Plugin v1.1.1 and even v1.1.2.  One big difference: we're using SSH to connect rather than JNLP.

          The larger and busier the Jenkins master, the faster this problem occurs.  On our largest one, we had to downgrade the Docker plugin to its prior version, 0.16.2.  The smaller Jenkins instances don't suffer from the problem immediately, and a restart clears it - at least for another couple of days.

          Matthew Ludlum added a comment -

          I've been trying to reproduce this issue on my local box using SSH executors with both 1.1.2 and 1.1.3, to no avail.  We are still seeing it on other instances.

          In the short term, I've thrown together a quick and dirty script job to "unstick" the jobs:

          https://gist.github.com/MattLud/1f8a56fcce933f7e97c366de54c85ba9
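
          For anyone who can't reach the gist, a hypothetical sketch of what such an "unstick" system Groovy script might look like (not the gist's actual contents; it simply asks each capable cloud directly for an agent whenever a buildable queue item's label has no matching node):

           // Hypothetical "unstick" sketch: for every buildable queue item whose label no
           // existing node satisfies, ask each cloud that can provision that label for one
           // agent, wait for it, and register it with Jenkins.
           import hudson.model.Label
           import jenkins.model.Jenkins

           def jenkins = Jenkins.instance

           def stuckLabels = jenkins.queue.buildableItems
                   .collect { it.assignedLabel }
                   .findAll { it != null && it.nodes.empty }
                   .unique()

           stuckLabels.each { Label label ->
             jenkins.clouds.findAll { it.canProvision(label) }.each { cloud ->
               cloud.provision(label, 1).each { planned ->
                 def node = planned.future.get()   // blocks until the agent is created
                 jenkins.addNode(node)
                 println "provisioned ${node.nodeName} for label ${label}"
               }
             }
           }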

          pjdarton added a comment -

          We used to encounter these kinds of issues.  We eventually traced it to cloud plugins doing blocking operations within the main "provisioning" thread, and to node-disposal operations (which happen while a Jenkins core object is locked) also sometimes taking a long time.  Together, these severely impeded Jenkins' ability to create and destroy slaves.  This only happened when the servers that the plugins were communicating with weren't responding swiftly, but as Docker is prone to total lockups (and vSphere can take 4 hours to fail a 40-millisecond operation), when all is not well, Jenkins also becomes unwell.

          It took a fair bit of work, but I made enhancements to both the vSphere plugin and the Docker plugin to reduce the amount of remote-API-calls made during the "provisioning" and "termination" process threads, and to ensure that everything had a timeout (so nothing would lock up forever). The vSphere plugin (version 2.16 onwards) contains my changes, but you'd have to get the bleeding-edge build of the docker-plugin from here for my changes to that (as we've not done a release of that yet).
          Note: If you take the docker plugin, make sure that you set a non-zero connection timeout, read timeout and (in the templates) pull timeout. Also, for pure speed, remove the instance caps (if you don't specify an instance cap, the plugin no longer counts the instances). That should ensure that nothing can cause the core Jenkins cloud resources to stay "locked" for a long period of time.
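
          A Script Console sketch to review those settings across docker clouds; the property names (dockerApi connectTimeout/readTimeout, template pullTimeout/instanceCap) are assumptions based on the configuration form, so verify them against your plugin version:

           // List each docker cloud's API timeouts and each template's pull timeout and
           // instance cap, so zero timeouts and unnecessary caps are easy to spot.
           // Property names assumed; adjust if your plugin version differs.
           import com.nirima.jenkins.plugins.docker.DockerCloud
           import jenkins.model.Jenkins

           Jenkins.instance.clouds.findAll { it instanceof DockerCloud }.each { DockerCloud cloud ->
             def api = cloud.dockerApi
             println "cloud '${cloud.name}': connectTimeout=${api.connectTimeout}s, readTimeout=${api.readTimeout}s"
             cloud.templates.each { t ->
               println "  template '${t.labelString}': pullTimeout=${t.pullTimeout}s, instanceCap=${t.instanceCap}"
             }
           }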

          See also: JENKINS-49235 as that has the potential to cause problems for a busy Jenkins server.

          pjdarton added a comment -

          I believe that, through careful use of "read timeout" and "pull timeout" on docker clouds & templates, coupled with the "avoid using broken clouds/templates" feature (all introduced in 1.1.4), this issue should now be fixed.

          We have a fairly busy Jenkins server with dozens of static nodes, lots of docker clouds, hundreds of jobs, and many dozens of builds running at any one time and, since adding this functionality, it all seems stable now (other than JENKINS-53621 which is a separate issue).


          TL;DR: I think it's fixed; re-open if it's still there in 1.1.6.

          Dennis Keitzel added a comment -

          For reference: We hit a similar issue where the cloud-stats-plugin was the cause.  See JENKINS-56863.

            Assignee: pjdarton
            Reporter: Alexander Komarov (akom)
            Votes: 2
            Watchers: 11