Jenkins / JENKINS-48490

Intermittently slow docker provisioning with no errors

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Component: docker-plugin
    • Labels: None
    • Environment: Jenkins 2.93
      Docker Plugin 1.1.1
      Containers are using JNLP

      I have a large Docker swarm (old-style Docker swarm API in a container). There is plenty of capacity (multi-TB of RAM, etc.).

      When jobs (a multibranch pipeline job in this case) allocate a Docker node (by labels), one of these things happens:

      1. The node is allocated immediately.
      2. The node is not allocated, and the Jenkins logs indicate why (e.g. the swarm is full per the maximums in my Jenkins configuration).
      3. The node is allocated with a significant delay (minutes). The logs do not indicate why; there is no Docker Plugin log activity until the node is allocated.
      4. The node is allocated with a ridiculous delay (I just had one take 77 minutes). The logs do not indicate any activity from the Docker plugin until it is allocated. Other jobs have gotten containers allocated since (and those events are in the logs). Interestingly, the job sometimes gets its container only once a later build of the same job requests one (they run in parallel), and then that later build waits (forever?).

      How can I troubleshoot this behavior, especially #4?

      Because it is intermittent I can't be sure, but it seems to have gotten worse after the Docker Plugin 1.0.x → 1.1.x upgrade (possibly also the Jenkins 2.92 → 2.93 upgrade).

      In fact, I have two Jenkins instances, one upgraded to plugin 1.1.1 and the other on 1.1; the one running 1.1 is currently not exhibiting these issues (but it is also under less load).
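
      One way to get more visibility into the silent delays (cases #3 and #4) is to raise the log level of the provisioning-related loggers; in Jenkins this is normally done via Manage Jenkins → System Log, but the equivalent java.util.logging call is sketched below. `hudson.slaves.NodeProvisioner` is the core Jenkins class that decides when agents are provisioned; the exact logger names used by the docker plugin vary by plugin version, so treat any plugin logger name as an assumption to verify.

      ```java
      import java.util.logging.Level;
      import java.util.logging.Logger;

      public class EnableProvisioningLogs {
          public static void main(String[] args) {
              // Raise verbosity for the Jenkins class that drives agent provisioning.
              // In a real Jenkins instance you would add this logger (at FINEST)
              // under Manage Jenkins -> System Log instead.
              Logger np = Logger.getLogger("hudson.slaves.NodeProvisioner");
              np.setLevel(Level.FINEST);
              System.out.println(np.getName() + " -> " + np.getLevel());
          }
      }
      ```

      With this logger captured, the provisioning decisions (or the absence of any decision during the delay) should show up in the system log.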

          Alexander Komarov created issue
          Nicolas De Loof added a comment -

          Node allocation is controlled by NodeProvisioner, which is a terrible beast. I have never been able to fully understand how it works or how to tweak it for consistent results.

          I'd like docker-plugin to rely on One-Shot-Executor so we can get rid of this stuff, but that is a long-term effort.
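
          For reference, NodeProvisioner's conservatism can be tuned with system properties at Jenkins startup. The property names below are standard Jenkins knobs; the values are illustrative only, not recommendations for this issue:

          ```shell
          # Sketch: tune NodeProvisioner's provisioning heuristics at startup.
          # initialDelay - ms to wait before the first provisioning decision
          # MARGIN/MARGIN0 - bias the load estimate toward provisioning sooner
          java \
            -Dhudson.slaves.NodeProvisioner.initialDelay=0 \
            -Dhudson.slaves.NodeProvisioner.MARGIN=50 \
            -Dhudson.slaves.NodeProvisioner.MARGIN0=0.85 \
            -jar jenkins.war
          ```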

          Alexander Komarov added a comment - edited

          As of this morning, of the 20+ jobs (from a Bitbucket Branch Source org project), only one PR job got a container, 18 hours later (meaning it took 18 hours for it to get a node). The swarm was not being used at all otherwise.

          I downgraded the Docker plugin to 1.0.4 and it's working better right now. I had to re-enter the Docker URL in the cloud config (it was blank after the downgrade, along with the timeout options).

          Nicolas De Loof added a comment -

          Interesting feedback. The node provisioning decision logic hasn't changed between 1.0.4 and 1.1, so there may be some unexpected side effect from another change. Will need to investigate in more detail.

          Thanks for reporting.

          Alexander Komarov added a comment -

          Forgot to say that I downgraded from 1.1.1 to 1.0.4. My other installation has 1.1 and seems to be working, but it's too small to be a reliable indicator. I'll upgrade it and see if it breaks.

            Assignee: pjdarton
            Reporter: Alexander Komarov (akom)
            Votes: 2
            Watchers: 11
            Created:
            Updated:
            Resolved: