Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-49235

Fingerprinting added in ssh-slaves causes memory-leak and performance issue with dynamic slaves

    XMLWordPrintable

    Details

    • Similar Issues:
    • Released As:
      credentials-2.2.1

      Description

      On our Jenkins instance we have recently upgraded the docker-plugin to the latest version (1.1.2), which brought with it a requirement to installer a newer version of the ssh-slaves plugin (we were on 1.20). After doing this, we have been subject to quite a severe performance problem when creating new dynamic slaves using the docker-plugin. (We run our own Swarm, but the target environment isn't a key part of this problem)

      We run around 50000 jobs per day, and each gets their own fresh container - the same image template is used for each, and therefore the same credentials object (this is pertinent to the problem)

      After a while, the slaves would remain in 'offline' state for a long time before the node log would indicate that an SSH connection was being attempted. This might take 2 or 3 minutes, but would work eventually, but the queue would build up with unserviced work.

      The thread dump showed a thread for each offline agent stuck waiting for a lock like this:
      ...
      at hudson.XmlFile.write(XmlFile.java:186)
      at hudson.model.Fingerprint.save(Fingerprint.java:1301)
      at hudson.model.Fingerprint.save(Fingerprint.java:1245)

      • locked hudson.model.Fingerprint@40e3a4f1
        at hudson.BulkChange.commit(BulkChange.java:98)
        at com.cloudbees.plugins.credentials.CredentialsProvider.trackAll(CredentialsProvider.java:1533)
        at com.cloudbees.plugins.credentials.CredentialsProvider.track(CredentialsProvider.java:1478)
        at hudson.plugins.sshslaves.SSHLauncher.launch(SSHLauncher.java:856)
      • locked hudson.plugins.sshslaves.SSHLauncher@57dc4a8a
        ...

      Each thread was waiting on the hudson.model.Fingerprint@40e3a4f1 lock. 

      It turned out that the XML file the fingerprint is writing was up to 18Mb, with 50000 entries (around 250000 lines) after about a day after the ssh-slaves plugin upgrade.
      I found file file in the fingerprints/ directory: 
      18685851 ./67/9f/2885cf5cc9831735a38ab418a233.xml
      This fingerprint contains the ID for our docker image credential.

      It transpired that our disk was being saturated with writes to this file, and slaves could only be connected to and be brought online as fast as this file was written - and it was being persisted to disk for every slave launch.

      Fingerprinting credentials when used via the SSH Launcher is a relatively new addition - added last in version 1.21 of ssh-slaves

      This is the line to blame: https://github.com/jenkinsci/ssh-slaves-plugin/blob/master/src/main/java/hudson/plugins/sshslaves/SSHLauncher.java#L856

      Which was added with this commit:
      https://github.com/jenkinsci/ssh-slaves-plugin/commit/cfd1b329153ae7e1270d16bf2644c9587c3942fb

      This ‘feature’ is in ssh-slaves version1.21 and above (latest is 1.25)

      Feature added through this issue: https://issues.jenkins-ci.org/browse/JENKINS-38832
      and this PR: https://github.com/jenkinsci/ssh-slaves-plugin/pull/35

      The tracking of credentials seems to have been added via https://issues.jenkins-ci.org/plugins/servlet/mobile#issue/JENKINS-20139

      As each slave gets a new name (e.g. docker-xxxxxxxx, where xx = short container id), a new entry is added and they are never removed.

      The crux of the problem appears to be what should happen to fingerprinted artifacts (in this case credentials) that are no longer available, should fingerprinted credentials be removed by the docker-plugin code and other similar cloud plugins, even though they don't know this has happened? Should the fingerprinting library have a timeout on the items contained, e.g. if they aren't used in x amount of time then remove them? - although this doesn't really help unless the time is very short (1hr) in our case.

      A faster disk means a longer time to failure, but the fingerprint collection grows without bounds in memory as more nodes are created, and at some point even fast disks can't write 50Mb for every slave creation, maybe once per second, along with everything else happening on the server.

      I have raised this against core, rather than a specific plugin, as it is probably for the project maintainers to decide who needs to fix their plugin, and/or to determine the expectation of the fingerprinting

        Attachments

          Issue Links

            Activity

            fraz3alpha Andy Taylor created issue -
            oleg_nenashev Oleg Nenashev made changes -
            Field Original Value New Value
            Component/s credentials-plugin [ 16523 ]
            Component/s ssh-slaves-plugin [ 15578 ]
            Component/s core [ 15593 ]
            Hide
            oleg_nenashev Oleg Nenashev added a comment -

            So this is caused by JENKINS-38832 in SSH Slaves 1.21

            Show
            oleg_nenashev Oleg Nenashev added a comment - So this is caused by JENKINS-38832 in SSH Slaves 1.21
            oleg_nenashev Oleg Nenashev made changes -
            Link This issue relates to JENKINS-38832 [ JENKINS-38832 ]
            Hide
            oleg_nenashev Oleg Nenashev added a comment -

            I think we need an advanced option to SSH Slaves, which allows disabling credentials tracking for particular launchers.
            Unfortunately it won't be enough since there are downstream usages of old APIs, and the plugins may need to be updated to support the new option: https://github.com/search?l=Java&q=org%3Ajenkinsci+SSHLauncher&type=Code
            Maybe a global setting is a better workaround for now.

            WDYT Jesse Glick?

            Show
            oleg_nenashev Oleg Nenashev added a comment - I think we need an advanced option to SSH Slaves, which allows disabling credentials tracking for particular launchers. Unfortunately it won't be enough since there are downstream usages of old APIs, and the plugins may need to be updated to support the new option: https://github.com/search?l=Java&q=org%3Ajenkinsci+SSHLauncher&type=Code Maybe a global setting is a better workaround for now. WDYT Jesse Glick ?
            Hide
            jglick Jesse Glick added a comment -

            Stephen Connolly could comment but I do not think introducing advanced user options is a good solution—probably only the reporter of this issue will ever know to turn it on. Perhaps better to introduce an expiry on credentials fingerprints of some kind, for example discarding all but the last thousand.

            Show
            jglick Jesse Glick added a comment - Stephen Connolly could comment but I do not think introducing advanced user options is a good solution—probably only the reporter of this issue will ever know to turn it on. Perhaps better to introduce an expiry on credentials fingerprints of some kind, for example discarding all but the last thousand.
            jglick Jesse Glick made changes -
            Labels cloud docker-slaves fingerprints ssh-slave cloud docker-slaves fingerprints performance ssh-slave
            Hide
            fraz3alpha Andy Taylor added a comment -

            Thank you both for taking a look at this so quickly. I like the idea of getting the Credentials plugin to handle the removal as it simplifies the solution down to just one plugin.

            If an expiry is being considered, we should also consider a removing the Fingerprint Facets for nodes which no longer exist in Jenkins. The code to add a new usage facet already loops through all usages of that fingerprint to remove any existing instance for the current node name, how about we also remove any nodes that are no-longer configured on the server - see https://github.com/jenkinsci/credentials-plugin/blob/master/src/main/java/com/cloudbees/plugins/credentials/CredentialsProvider.java#L1510 . In particular an extra if statement could be added similar to https://github.com/jenkinsci/credentials-plugin/blob/master/src/main/java/com/cloudbees/plugins/credentials/CredentialsProvider.java#L1524 to remove anything not in the list returned by http://javadoc.jenkins-ci.org/jenkins/model/Jenkins.html#getNodes-- ?

            This would keep the fingerprint contents trimmed down to the minimum, and I don't believe it would be that costly in terms of extra computation given that the iterator loop already exists and loops through every item.

            Show
            fraz3alpha Andy Taylor added a comment - Thank you both for taking a look at this so quickly. I like the idea of getting the Credentials plugin to handle the removal as it simplifies the solution down to just one plugin. If an expiry is being considered, we should also consider a removing the Fingerprint Facets for nodes which no longer exist in Jenkins. The code to add a new usage facet already loops through all usages of that fingerprint to remove any existing instance for the current node name, how about we also remove any nodes that are no-longer configured on the server - see https://github.com/jenkinsci/credentials-plugin/blob/master/src/main/java/com/cloudbees/plugins/credentials/CredentialsProvider.java#L1510  . In particular an extra if statement could be added similar to https://github.com/jenkinsci/credentials-plugin/blob/master/src/main/java/com/cloudbees/plugins/credentials/CredentialsProvider.java#L1524  to remove anything not in the list returned by http://javadoc.jenkins-ci.org/jenkins/model/Jenkins.html#getNodes--  ? This would keep the fingerprint contents trimmed down to the minimum, and I don't believe it would be that costly in terms of extra computation given that the iterator loop already exists and loops through every item.
            Hide
            jglick Jesse Glick added a comment -

            Andy Taylor from a quick glance your proposal seems to make sense (cf. Fingerprint.trim for Run-based records), and proper cleanup might remove the need to expire fingerprints at some arbitrary cutoff. Do you care to try to file a PR? Bonus points for a test—current behavior would I guess be that NodeCredentialsFingerprintFacet exists yet getNode() == null; desired behavior is that the NodeCredentialsFingerprintFacet is removed after the Node itself is removed.

            Show
            jglick Jesse Glick added a comment - Andy Taylor from a quick glance your proposal seems to make sense (cf. Fingerprint.trim for Run -based records), and proper cleanup might remove the need to expire fingerprints at some arbitrary cutoff. Do you care to try to file a PR? Bonus points for a test—current behavior would I guess be that NodeCredentialsFingerprintFacet exists yet getNode() == null ; desired behavior is that the NodeCredentialsFingerprintFacet is removed after the Node itself is removed.
            Hide
            fraz3alpha Andy Taylor added a comment -

            Thanks Jesse, I will create a PR and will try and work out a way to create a test for it

            Show
            fraz3alpha Andy Taylor added a comment - Thanks Jesse, I will create a PR and will try and work out a way to create a test for it
            Hide
            fraz3alpha Andy Taylor added a comment -

            I have created https://github.com/jenkinsci/credentials-plugin/pull/97 for this, and would welcome feedback. I wrote the test first, based on what I expected the behaviour to be, and would like to know if my assumptions here are compatible with your thinking. 

            Although my tests pass I have not yet tried it on our production server - I am going to deploy the fixed version on my server in the morning to check that it works well in our scenario (it's approaching the end of the working day for me, and upgrading and running away is never a good thing...)

            Show
            fraz3alpha Andy Taylor added a comment - I have created https://github.com/jenkinsci/credentials-plugin/pull/97  for this, and would welcome feedback. I wrote the test first, based on what I expected the behaviour to be, and would like to know if my assumptions here are compatible with your thinking.  Although my tests pass I have not yet tried it on our production server - I am going to deploy the fixed version on my server in the morning to check that it works well in our scenario (it's approaching the end of the working day for me, and upgrading and running away is never a good thing...)
            jglick Jesse Glick made changes -
            Assignee Andy Taylor [ fraz3alpha ]
            jglick Jesse Glick made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            jglick Jesse Glick made changes -
            Status In Progress [ 3 ] In Review [ 10005 ]
            Hide
            fraz3alpha Andy Taylor added a comment -

            I have been running a version of the credentials-plugin code with my changes for the past 4 days and our server is performing much better. The rate of job execution is consistently high, and hasn't shown the dramatic slowdown (and eventual paralysis) that existed before.

             

            Show
            fraz3alpha Andy Taylor added a comment - I have been running a version of the credentials-plugin code with my changes for the past 4 days and our server is performing much better. The rate of job execution is consistently high, and hasn't shown the dramatic slowdown (and eventual paralysis) that existed before.  
            dnusbaum Devin Nusbaum made changes -
            Remote Link This issue links to "jenkinsci/credentials-plugin#97 (Web Link)" [ 20489 ]
            jamesdumay James Dumay made changes -
            Remote Link This issue links to "CloudBees Internal OSS-2693 (Web Link)" [ 20515 ]
            cloudbees CloudBees Inc. made changes -
            Remote Link This issue links to "CloudBees Internal FNDN-229 (Web Link)" [ 20736 ]
            dnusbaum Devin Nusbaum made changes -
            Link This issue relates to JENKINS-51694 [ JENKINS-51694 ]
            ifernandezcalvo Ivan Fernandez Calvo made changes -
            Remote Link This issue links to "PR allow enable/disable credentials tracking (Web Link)" [ 21251 ]
            ifernandezcalvo Ivan Fernandez Calvo made changes -
            Link This issue is duplicated by JENKINS-50984 [ JENKINS-50984 ]
            Hide
            fraz3alpha Andy Taylor added a comment -

            Some additional updates to this as discussions have been had in various PRs.

            Devin Nusbaum created https://github.com/jenkinsci/credentials-plugin/pull/103 which improved upon my changes in https://github.com/jenkinsci/credentials-plugin/pull/97 and included all the requested changes in that PR. Would anyone be able to look at that and see if that ticks all the boxes to merge please?

            Additionally I have just found out about https://issues.jenkins-ci.org/browse/JENKINS-50984 which led to a PR being merged against the ssh-slaves plugin to be able to selectively disable the tracking of credentials for slave launching: https://github.com/jenkinsci/ssh-slaves-plugin/pull/94 . This would possibly be able to eliminate the issue I raised here, but is a very broad brush and appears to remove all of the tracking of the credentials for slave launching, i.e. not limited to cloud connections where the slaves are not re-used. I am going to test that and see if it helps my situation. That update hasn't made it's way to the download site yet though, so I'll have to grab the latest built version from the Jenkins build server.

            Show
            fraz3alpha Andy Taylor added a comment - Some additional updates to this as discussions have been had in various PRs. Devin Nusbaum created https://github.com/jenkinsci/credentials-plugin/pull/103  which improved upon my changes in https://github.com/jenkinsci/credentials-plugin/pull/97  and included all the requested changes in that PR. Would anyone be able to look at that and see if that ticks all the boxes to merge please? Additionally I have just found out about https://issues.jenkins-ci.org/browse/JENKINS-50984  which led to a PR being merged against the ssh-slaves plugin to be able to selectively disable the tracking of credentials for slave launching: https://github.com/jenkinsci/ssh-slaves-plugin/pull/94  . This would possibly be able to eliminate the issue I raised here, but is a very broad brush and appears to remove all of the tracking of the credentials for slave launching, i.e. not limited to cloud connections where the slaves are not re-used. I am going to test that and see if it helps my situation. That update hasn't made it's way to the download site yet though, so I'll have to grab the latest built version from the Jenkins build server.
            Hide
            ifernandezcalvo Ivan Fernandez Calvo added a comment -

            there is a change to try to mitigate this issue in ssh-slave-1.29.0

            Show
            ifernandezcalvo Ivan Fernandez Calvo added a comment - there is a change to try to mitigate this issue in ssh-slave-1.29.0
            pjdarton pjdarton made changes -
            Remote Link This issue links to "jenkinsci/credentials-plugin PR#103 (Web Link)" [ 23007 ]
            Hide
            jvz Matt Sicker added a comment -

            Merged to master.

            Show
            jvz Matt Sicker added a comment - Merged to master.
            jvz Matt Sicker made changes -
            Resolution Fixed [ 1 ]
            Status In Review [ 10005 ] Fixed but Unreleased [ 10203 ]
            Hide
            jvz Matt Sicker added a comment -

            Released in credentials-2.2.1.

            Show
            jvz Matt Sicker added a comment - Released in credentials-2.2.1.
            jvz Matt Sicker made changes -
            Released As credentials-2.2.1
            Status Fixed but Unreleased [ 10203 ] Resolved [ 5 ]
            jlundy Jeremy Lundy made changes -
            Link This issue relates to JENKINS-60898 [ JENKINS-60898 ]

              People

              Assignee:
              fraz3alpha Andy Taylor
              Reporter:
              fraz3alpha Andy Taylor
              Votes:
              5 Vote for this issue
              Watchers:
              10 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved: