Type: Bug
Resolution: Fixed
Priority: Minor
Labels: None
Environment:
Jenkins 2.93
Docker Plugin 1.1.1
Containers are using JNLP
I have a large Docker swarm (old-style Docker Swarm API in a container). There is plenty of capacity (multi-TB of RAM, etc.).
When jobs (multibranch pipeline job in this case) allocate a docker node (by labels), one of these things happens:
- Node is allocated immediately
- Node is not allocated and the Jenkins logs indicate why (e.g. the swarm is full, per the maximums set in my Jenkins configuration)
- Node is allocated with a significant delay (minutes). Logs do not indicate why, there is no Docker Plugin log activity until the node is allocated.
- Node is allocated with a ridiculous delay (I just had one take 77 minutes). Logs do not indicate any activity from the Docker plugin until it is allocated. Other jobs have gotten containers allocated since (and those events are in the logs). An interesting thing I noticed is that the job sometimes gets its container only once a later build of this job requests one (they run in parallel), and then the later build waits (forever?).
How can I troubleshoot this behavior, especially #4?
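One way to get more visibility is a sketch like the following, run from the Script Console (or set up as an equivalent custom log recorder under Manage Jenkins). The logger names are assumptions: 'io.jenkins.docker' appears in the plugin's stack traces, 'com.nirima.jenkins.plugins.docker' is assumed to be the plugin's classic package, and 'hudson.slaves.NodeProvisioner' is Jenkins core's provisioning logic.

import java.util.logging.Level
import java.util.logging.Logger

// Raise logging for docker-plugin and NodeProvisioner activity to FINE.
// Adjust the package names to whatever shows up in your own stack traces.
['io.jenkins.docker',
 'com.nirima.jenkins.plugins.docker',
 'hudson.slaves.NodeProvisioner'].each { name ->
    Logger.getLogger(name).level = Level.FINE
}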
Because it is intermittent I can't be sure, but it seems as if it has gotten worse after the Docker Plugin 1.0.x to 1.1.x upgrade (possibly also the Jenkins 2.92 → 2.93 upgrade).
In fact, I have two Jenkins instances, one upgraded to plugin 1.1.1 and the other on 1.1, and the one running 1.1 is currently not exhibiting these issues (but it's also under less load).
[JENKINS-48490] Intermittently slow docker provisioning with no errors
As of this morning, of the 20+ jobs (from a Bitbucket Branch Source organization project), only one PR job got a container, 18 hours later (meaning it took 18 hours for it to get a node). The swarm was not being used at all otherwise.
I downgraded Docker plugin to 1.0.4 and it's working better right now. I had to re-enter the docker URL in the Cloud config (it was blank after downgrade along with the timeout options).
Interesting feedback. The node provisioning decision logic hasn't changed between 1.0.4 and 1.1, so there may be some unexpected side effect from another change. Will need to investigate in more detail.
Thanks for reporting.
Forgot to say that I downgraded 1.1.1 to 1.0.4. My other installation has 1.1 and seems to be working, but it's too small to be a reliable indicator. I'll upgrade it and see if it breaks.
My smaller Jenkins does not experience provisioning issues with 1.1.1, so I can't provide anything useful.
Should be fixed by https://github.com/jenkinsci/docker-plugin/commit/b906c37f0add6a4a44fb4858aa8ff9de0ebeeabe#diff-0a438d5fa3c18711a9f82de9b72440c0R277
I will need to revisit Provisioner implementation at some point.
I'm now on docker-plugin 1.1.2 and Jenkins 2.98 (ndeloof, it looks like the commit you referenced should be included in 1.1.2). Note that I'm not using the SSH launcher.
I'm experiencing the same behavior as before.
I'll try to describe it better:
- I have 5 main docker templates with unique labels.
- I have bitbucket-branch-source based jobs that occasionally launch many jobs rapidly (one per PR, for example; about 40 is typical)
- These jobs use two of the docker templates by labels
- Sometimes one or more will get stuck with "waiting for..." forever.
- The logs will mention nothing about provisioning (that I noticed)
The interesting part:
- If I then create and run a new freestyle job that allocates the same label (and does nothing), the blocked (real) job will immediately get a node. The new freestyle job will then wait for a node until another job requests one.
- In other words, the NodeProvisioner appears to be one allocation behind. Time is not a factor (the job can wait for 10 hours).
- Currently I have the aforementioned job running frequently as a workaround, and that seems to be working (I changed it to a pipeline job that allocates each of the 5 docker templates and times out quickly; a sketch follows below). This seems to keep the real jobs running, but it obviously adds some overhead and wear to the cloud.
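A minimal sketch of that workaround job, assuming the five template labels and the dockerswarm label used elsewhere in this report (the 2-minute timeout is illustrative):

def flavors = ['centos6', 'centos7', 'sles11', 'sles12', 'ubuntu']  // assumed template labels
def steps = [:]
flavors.each { def flavor ->
    steps[flavor] = {
        try {
            // Request a node briefly, then give up; the point is only to make
            // the provisioner re-evaluate pending demand for this label.
            timeout(time: 2, unit: 'MINUTES') {
                node("${flavor}&&dockerswarm") {
                    echo "poked ${flavor}"
                }
            }
        } catch (err) {
            echo "No ${flavor} node within the timeout; ignoring (${err})"
        }
    }
}
parallel steps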
ndeloof, would you suggest playing with system properties like these?
-Dhudson.slaves.NodeProvisioner.initialDelay=X
-Dhudson.slaves.NodeProvisioner.recurrencePeriod=X
-Dhudson.slaves.NodeProvisioner.MARGIN=X
This feels like the wrong approach because I did not need to do this with Jenkins <=2.93 and docker-plugin <=1.0.x, so it would seem that something changed...
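For reference, a quick Script Console sketch (using the stock property names listed above) shows whether any of them are currently set on the running master:

// Print the current values of the NodeProvisioner tuning properties.
// Unset properties fall back to Jenkins' built-in defaults.
['hudson.slaves.NodeProvisioner.initialDelay',
 'hudson.slaves.NodeProvisioner.recurrencePeriod',
 'hudson.slaves.NodeProvisioner.MARGIN'].each { key ->
    println "${key} = ${System.getProperty(key) ?: '(not set, default applies)'}"
}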
I think I may be seeing the same thing but my builds have a timeout so they never wait around for the 10 hour mark.
I have 2 docker servers with the same template configured, and a pipeline job that starts two parallel jobs against the label on the docker containers, but frequently it creates a slave on only one of them, or neither. I do get the odd exception like:
SEVERE: I/O error in channel docker-d5fcd775ff9f
java.io.IOException: Expected 8 bytes application/vnd.docker.raw-stream header, got 4
    at io.jenkins.docker.client.DockerMultiplexedInputStream.readInternal(DockerMultiplexedInputStream.java:45)
    at io.jenkins.docker.client.DockerMultiplexedInputStream.read(DockerMultiplexedInputStream.java:25)
    at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:91)
    at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
    at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
    at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
But it's not a one-to-one correlation with the failures (i.e. sometimes there is no exception and it still fails), though maybe the exception does disrupt something for a while?
I was finally able to reproduce this in an isolated Jenkins installation running the following versions:
- Jenkins 2.98
- Docker Plugin 1.1.2
Here are the steps to reproduce:
- Create a Cloud configuration with several docker image template definitions. Mine are custom-built images, but the random nature of this issue suggests that it should happen with any images. Note: I am using JNLP, but the problem also occurs with the "Attached" method. Ensure that there is enough capacity (instance limits, etc.) so that this is not a bottleneck.
- Create a Pipeline job with the following code (adjust for image labels if needed):
def flavors = ['centos6', 'centos7', 'sles11', 'sles12', 'ubuntu']
def steps = [:]
flavors.each { def flavor ->
    steps[flavor] = {
        stage(flavor) {
            timeout(1) {
                echo "Allocating ${flavor}"
                node("${flavor}&&dockerswarm") {
                    sh "date"
                }
            }
        }
    }
}
timestamps {
    parallel steps
}
- Run this job a few times. Job will succeed at first, but after a few successful runs one or more of the image flavors will start to time out.
- Restart Jenkins.
- Job will succeed again for a little while. (UPDATE: a restart does not always help; the issue sometimes begins again immediately afterwards.)
Jenkins logs are attached:
Job run when provisioning hangs: jenkins-log-provisioning-fail.log
Job run when provisioning succeeds: jenkins-log-provisioning-success.log
Note that while Jenkins is blocked trying to allocate a node, I can manually allocate one using docker command-line, proving that the actual docker infrastructure is not the problem.
We are seeing similar symptoms after upgrading the Docker plugin. Our Jenkins masters started seeing this problem after upgrading to Docker Plugin v1.1.1, and it persists with v1.1.2. One big difference: we're using SSH to connect rather than JNLP.
The larger and busier the Jenkins master, the faster this problem occurs. On our largest one we had to downgrade the Docker plugin to its prior version, 0.16.2. The smaller Jenkins instances don't suffer from the problem immediately, and restarting them clears it, at least for another couple of days.
I've been trying to reproduce this issue on my local box using SSH executors with both 1.1.2 and 1.1.3 to no avail. We are still seeing it on other instances.
In the short term, I've thrown together a quick and dirty script job to "unstick" the jobs.
https://gist.github.com/MattLud/1f8a56fcce933f7e97c366de54c85ba9
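The gist itself isn't reproduced here, but a simpler Script Console sketch in the same spirit (illustrative only, not the contents of the gist) can at least surface builds that have been stuck waiting on a label for too long:

import jenkins.model.Jenkins

// List queue items that have been waiting longer than 10 minutes,
// together with Jenkins' own explanation of why they are still queued.
long threshold = 10 * 60 * 1000L
Jenkins.instance.queue.items.each { item ->
    long waitedMs = System.currentTimeMillis() - item.inQueueSince
    if (waitedMs > threshold) {
        println "${item.task.fullDisplayName} waiting ${waitedMs.intdiv(60000)} min: ${item.why}"
    }
}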
We used to encounter these kinds of issues. We eventually traced them to cloud plugins doing blocking operations within the main "provisioning" thread, while node-disposal operations (which happen while a Jenkins core object is locked) were also sometimes taking a long time. Combined, these severely impeded Jenkins' ability to create and destroy slaves. This only happened when the servers the plugins were communicating with weren't responding swiftly, but as Docker is prone to total lockups (and vSphere can take 4 hours to fail a 40-millisecond operation), when all is not well, Jenkins also becomes unwell.
It took a fair bit of work, but I made enhancements to both the vSphere plugin and the Docker plugin to reduce the number of remote API calls made during the "provisioning" and "termination" process threads, and to ensure that everything had a timeout (so nothing could lock up forever). The vSphere plugin (version 2.16 onwards) contains my changes, but you'd have to get the bleeding-edge build of the docker-plugin from here for my changes to that (as we've not done a release yet).
Note: If you take the docker plugin, make sure that you set a non-zero connection timeout, read timeout and (in the templates) pull timeout. Also, for pure speed, remove the instance caps (if you don't specify an instance cap, the plugin no longer counts the instances). That should ensure that nothing can cause the core Jenkins cloud resources to stay "locked" for a long period of time.
See also: JENKINS-49235 as that has the potential to cause problems for a busy Jenkins server.
I believe that, through careful use of "read timeout" and "pull timeout" on docker clouds & templates, coupled with the "avoid using broken clouds/templates" feature (all introduced in 1.1.4), this issue should now be fixed.
We have a fairly busy Jenkins server with dozens of static nodes, lots of docker clouds, hundreds of jobs, and many dozens of builds running at any one time and, since adding this functionality, it all seems stable now (other than JENKINS-53621 which is a separate issue).
TL;DR: I think it's fixed; re-open if it's still there in 1.1.6.
For reference: We hit a similar issue where the cloud-stats-plugin was the cause. See JENKINS-56863.
Node allocation is controlled by NodeProvisioner, which is a terrible beast. I have never been able to fully understand how it works or how to tweak it for consistent results.
I'd like to make docker-plugin rely on One-Shot-Executor so we can get rid of this stuff, but that is a long-term effort.