Jenkins / JENKINS-48490

Intermittently slow docker provisioning with no errors

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Component: docker-plugin
    • Labels: None
    • Environment: Jenkins 2.93
      Docker Plugin 1.1.1
      Containers are using JNLP

      I have a large Docker swarm (old-style Docker swarm API in a container). There is plenty of capacity (multi-TB of RAM, etc.).

      When jobs (a multibranch pipeline job in this case) allocate a Docker node (by labels), one of these things happens:

      1. The node is allocated immediately.
      2. The node is not allocated, and the Jenkins logs indicate why (e.g. the swarm is full per the maximums in my Jenkins configuration).
      3. The node is allocated with a significant delay (minutes). The logs do not indicate why; there is no Docker Plugin log activity until the node is allocated.
      4. The node is allocated with a ridiculous delay (I just had one take 77 minutes). The logs do not indicate any activity from the Docker plugin until it is allocated. Other jobs have gotten containers allocated since (and those events are in the logs). Interestingly, the job sometimes gets its container only once a later build of the same job requests one (they run in parallel), and then that later build waits (forever?).

      How can I troubleshoot this behavior, especially #4?

      Because it is intermittent I can't be sure, but it seems to have gotten worse after the Docker Plugin 1.0.x → 1.1.x upgrade (possibly also the Jenkins 2.92 → 2.93 upgrade).

      In fact, I have two Jenkins instances, one upgraded to plugin 1.1.1 and the other on 1.1; the one running 1.1 is currently not exhibiting these issues (but it is also under less load).
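
      One way to get more visibility into the silent delays (cases #3 and #4) is to raise the log level of the provisioning-related loggers; in Jenkins this is normally done via Manage Jenkins → System Log, but the equivalent java.util.logging call is sketched below. `hudson.slaves.NodeProvisioner` is the core Jenkins class that decides when agents are provisioned; the exact logger names used by the docker plugin vary by plugin version, so treat any plugin logger name as an assumption to verify.

      ```java
      import java.util.logging.Level;
      import java.util.logging.Logger;

      public class EnableProvisioningLogs {
          public static void main(String[] args) {
              // Raise verbosity for the Jenkins class that drives agent provisioning.
              // In a real Jenkins instance you would add this logger (at FINEST)
              // under Manage Jenkins -> System Log instead.
              Logger np = Logger.getLogger("hudson.slaves.NodeProvisioner");
              np.setLevel(Level.FINEST);
              System.out.println(np.getName() + " -> " + np.getLevel());
          }
      }
      ```

      With this logger captured, the provisioning decisions (or the absence of any decision during the delay) should show up in the system log.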

          Alexander Komarov created issue
          Nicolas De Loof added a comment -

          Node allocation is controlled by NodeProvisioner, which is a terrible beast. I have never been able to fully understand how it works or how to tweak it for consistent results.

          I'd like docker-plugin to rely on One-Shot-Executor so we can get rid of this stuff, but that is a long-term effort.
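
          For reference, NodeProvisioner's conservatism can be tuned with system properties at Jenkins startup. The property names below are standard Jenkins knobs; the values are illustrative only, not recommendations for this issue:

          ```shell
          # Sketch: tune NodeProvisioner's provisioning heuristics at startup.
          # initialDelay - ms to wait before the first provisioning decision
          # MARGIN/MARGIN0 - bias the load estimate toward provisioning sooner
          java \
            -Dhudson.slaves.NodeProvisioner.initialDelay=0 \
            -Dhudson.slaves.NodeProvisioner.MARGIN=50 \
            -Dhudson.slaves.NodeProvisioner.MARGIN0=0.85 \
            -jar jenkins.war
          ```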

          Alexander Komarov added a comment - edited

          As of this morning, of the 20+ jobs (from a Bitbucket Branch Source org project), only one PR job got a container, 18 hours later (meaning it took 18 hours for it to get a node). The swarm was not being used at all otherwise.

          I downgraded the Docker plugin to 1.0.4 and it's working better right now. I had to re-enter the Docker URL in the cloud config (it was blank after the downgrade, along with the timeout options).

          Nicolas De Loof added a comment -

          Interesting feedback. The node provisioning decision logic hasn't changed between 1.0.4 and 1.1, so there may be some unexpected side effect from another change. Will need to investigate in more detail.

          Thanks for reporting.

          Alexander Komarov added a comment -

          Forgot to say that I downgraded from 1.1.1 to 1.0.4. My other installation has 1.1 and seems to be working, but it's too small to be a reliable indicator. I'll upgrade it and see if it breaks.

            Assignee: pjdarton
            Reporter: Alexander Komarov (akom)
            Votes: 2
            Watchers: 11
            Created:
            Updated:
            Resolved: