Type: Bug
Resolution: Fixed
Priority: Critical
Component/s: credentials-plugin, ssh-slaves-plugin
Labels:
Environment:
Jenkins v2.89.2
ssh-slaves v1.25
docker-plugin v1.1.2

Released As:
credentials-2.2.1

On our Jenkins instance we recently upgraded the docker-plugin to the latest version (1.1.2), which required us to install a newer version of the ssh-slaves plugin (we were on 1.20). Since then we have had a severe performance problem when creating new dynamic slaves using the docker-plugin. (We run our own Swarm, but the target environment isn't a key part of this problem.)

We run around 50000 jobs per day, and each gets its own fresh container. The same image template is used for each, and therefore the same credentials object (this is pertinent to the problem).

After a while, the slaves would remain in 'offline' state for a long time before the node log indicated that an SSH connection was being attempted. This wait might take 2 or 3 minutes. The launch would work eventually, but meanwhile the queue would build up with unserviced work.

The thread dump showed a thread for each offline agent stuck waiting for a lock like this:
...
at hudson.XmlFile.write(XmlFile.java:186)
at hudson.model.Fingerprint.save(Fingerprint.java:1301)
at hudson.model.Fingerprint.save(Fingerprint.java:1245)

locked hudson.model.Fingerprint@40e3a4f1
at hudson.BulkChange.commit(BulkChange.java:98)
at com.cloudbees.plugins.credentials.CredentialsProvider.trackAll(CredentialsProvider.java:1533)
at com.cloudbees.plugins.credentials.CredentialsProvider.track(CredentialsProvider.java:1478)
at hudson.plugins.sshslaves.SSHLauncher.launch(SSHLauncher.java:856)
locked hudson.plugins.sshslaves.SSHLauncher@57dc4a8a
...

Each thread was waiting on the hudson.model.Fingerprint@40e3a4f1 lock.
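
The contention pattern in the dump can be illustrated with a minimal Java sketch. This is not Jenkins code: SharedFingerprint is a hypothetical stand-in for the single hudson.model.Fingerprint instance that every launch shares, and save() only counts calls instead of writing a file. It shows that, however many launch threads are started, at most one of them can be inside the synchronized save at a time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class SharedFingerprintLockSketch {

    /** Hypothetical stand-in for the one fingerprint shared by every launch. */
    static final class SharedFingerprint {
        private final AtomicInteger inside = new AtomicInteger();
        private final AtomicInteger maxInside = new AtomicInteger();
        private final AtomicInteger saves = new AtomicInteger();

        /** Only one thread at a time can hold this object's monitor. */
        synchronized void save() {
            int now = inside.incrementAndGet();
            maxInside.accumulateAndGet(now, Math::max);
            saves.incrementAndGet(); // stands in for rewriting the whole XML file
            inside.decrementAndGet();
        }
    }

    /** Starts one thread per launch; returns {saves completed, max threads inside save()}. */
    static int[] runLaunches(int launches) {
        SharedFingerprint fingerprint = new SharedFingerprint();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < launches; i++) {
            Thread t = new Thread(fingerprint::save, "launch-" + i);
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return new int[] { fingerprint.saves.get(), fingerprint.maxInside.get() };
    }

    public static void main(String[] args) {
        int[] result = runLaunches(8);
        System.out.println("saves=" + result[0] + ", max concurrent=" + result[1]);
    }
}
```

Because every launch uses the same credential, every launch ends up queuing on the same monitor, so the time each save takes directly limits how fast agents can come online.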

It turned out that the XML file the fingerprint was writing had grown to about 18 MB, with 50000 entries (around 250000 lines), roughly a day after the ssh-slaves plugin upgrade.
I found this file in the fingerprints/ directory:
18685851 ./67/9f/2885cf5cc9831735a38ab418a233.xml
This fingerprint contains the ID for our docker image credential.

It transpired that our disk was being saturated with writes to this file. It was being persisted to disk for every slave launch, so slaves could only be connected to and brought online as fast as this file could be written.
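
To get a feel for the write volume this implies, here is a rough back-of-the-envelope sketch in Java. It assumes the file grows linearly from empty to its final size over the day and is rewritten in full on every launch. Both are simplifications, and the class and method names are illustrative only, not part of Jenkins.

```java
/** Back-of-the-envelope estimate only; names are illustrative, not Jenkins APIs. */
final class FingerprintWriteEstimate {

    /** Average size of one usage entry, in whole bytes. */
    static long bytesPerEntry(long fileSizeBytes, long entries) {
        return fileSizeBytes / entries;
    }

    /**
     * Total bytes written if the file grows linearly to fileSizeBytes (S) over
     * `launches` (N) launches and is rewritten in full each time:
     * sum over n = 1..N of n * (S / N) = S * (N + 1) / 2.
     */
    static long cumulativeBytesWritten(long fileSizeBytes, long launches) {
        return fileSizeBytes * (launches + 1) / 2;
    }

    public static void main(String[] args) {
        long fileSize = 18_685_851L; // size of the fingerprint XML reported above
        long launches = 50_000L;     // roughly one day of jobs
        System.out.println("bytes per entry: " + bytesPerEntry(fileSize, launches));
        System.out.println("total bytes written: " + cumulativeBytesWritten(fileSize, launches));
    }
}
```

With the figures from this report, that works out to roughly 373 bytes per entry and about 467 GB written to this one file over the day, under those assumptions.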

Fingerprinting credentials when used via the SSH Launcher is a relatively new addition, added in version 1.21 of ssh-slaves.

This is the line to blame: https://github.com/jenkinsci/ssh-slaves-plugin/blob/master/src/main/java/hudson/plugins/sshslaves/SSHLauncher.java#L856

Which was added with this commit:
https://github.com/jenkinsci/ssh-slaves-plugin/commit/cfd1b329153ae7e1270d16bf2644c9587c3942fb

This ‘feature’ is in ssh-slaves version 1.21 and above (latest is 1.25).

Feature added through this issue: https://issues.jenkins-ci.org/browse/JENKINS-38832
and this PR: https://github.com/jenkinsci/ssh-slaves-plugin/pull/35

The tracking of credentials seems to have been added via https://issues.jenkins-ci.org/plugins/servlet/mobile#issue/JENKINS-20139

As each slave gets a new name (e.g. docker-xxxxxxxx, where xxxxxxxx is the short container ID), a new entry is added to the fingerprint for every launch, and entries are never removed.

The crux of the problem appears to be what should happen to fingerprinted artifacts (in this case credentials) that are no longer available. Should the fingerprint entries be removed by the docker-plugin code and other similar cloud plugins, even though they don't know the fingerprinting has happened? Or should the fingerprinting library have a timeout on the items it contains, e.g. remove them if they aren't used in x amount of time? In our case that wouldn't really help unless the time was very short (1 hr).
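
The timeout idea could look something like the following Java sketch. It is purely illustrative: it keeps a hypothetical in-memory map from node name to last-use time and does not use the real Fingerprint API. The class name, method names and example node names are all assumptions made for this example.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: a time-based expiry for usage entries, not Jenkins code. */
final class UsageExpirySketch {

    // Hypothetical record: node name -> last time the credential was used there.
    private final Map<String, Instant> lastUsedByNode = new LinkedHashMap<>();
    private final Duration maxAge;

    UsageExpirySketch(Duration maxAge) {
        this.maxAge = maxAge;
    }

    void recordUse(String nodeName, Instant when) {
        lastUsedByNode.put(nodeName, when);
    }

    /** Removes entries last used before (now - maxAge); returns how many were removed. */
    int prune(Instant now) {
        Instant cutoff = now.minus(maxAge);
        int before = lastUsedByNode.size();
        lastUsedByNode.values().removeIf(lastUsed -> lastUsed.isBefore(cutoff));
        return before - lastUsedByNode.size();
    }

    int size() {
        return lastUsedByNode.size();
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2018-01-01T00:00:00Z");
        UsageExpirySketch usages = new UsageExpirySketch(Duration.ofHours(1));
        usages.recordUse("docker-aaaa1111", start);                            // made-up node name
        usages.recordUse("docker-bbbb2222", start.plus(Duration.ofMinutes(90))); // made-up node name
        int removed = usages.prune(start.plus(Duration.ofHours(2)));
        System.out.println("removed=" + removed + ", remaining=" + usages.size());
    }
}
```

Even with such pruning, the number of entries kept is roughly the launch rate multiplied by the timeout, which is why a long timeout would not help at our job volume.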

A faster disk means a longer time to failure, but the fingerprint collection grows without bound in memory as more nodes are created. At some point even fast disks can't write 50 MB for every slave creation, maybe once per second, along with everything else happening on the server.

I have raised this against core, rather than a specific plugin, as it is probably for the project maintainers to decide who needs to fix their plugin, and/or to determine what is expected of the fingerprinting.

is duplicated by

JENKINS-50984 SSHLauncher/Fingerprint Thread Locking Stopping Dynamic Slave Launch