• Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • saml-plugin
    • Jenkins 2.401.1-lts (jenkins/jenkins:2.401.1-lts docker image)
      SAML Plugin 4.418.vdfa_7489a_b_a_2d
      Ubuntu 20.04.6 LTS
    • 4.514.vfd5088cc4ed7

      My Jenkins installation uses the SAML plugin to delegate authentication to my IDP (Okta).

      I've noticed that 5 months ago the jvm_threads_current metric on my system suddenly ballooned to 1400+ threads from a previous average of ~400 threads.

      I took a thread dump to see what was causing the issue and found nearly 1200 threads named

      Timer for org.opensaml.saml.metadata.resolver.impl.FilesystemMetadataResolver
      

      in TIMED_WAITING state.
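
      For reference, here is a minimal Java sketch (illustrative only, not taken from the plugin) that counts these timer threads in a running controller JVM using only the JDK's Thread API; it could be adapted to track the count over time:

      // Count live threads whose name matches the timer seen in the thread dump.
      int resolverTimers = 0;
      for (Thread t : Thread.getAllStackTraces().keySet()) {
          if (t.getName().startsWith("Timer for org.opensaml.saml.metadata.resolver.impl.FilesystemMetadataResolver")) {
              resolverTimers++;
          }
      }
      System.out.println("FilesystemMetadataResolver timer threads: " + resolverTimers);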

       

      I'm also seeing a lot of log entries regarding the FilesystemMetadataResolver:

      2024-04-01 17:54:32.557+0000 [id=39251126]	INFO	o.o.s.m.r.i.AbstractReloadingMetadataResolver#processNonExpiredMetadata: Metadata Resolver FilesystemMetadataResolver org.opensaml.saml.metadata.resolver.impl.FilesystemMetadataResolver: New metadata successfully loaded for '/var/jenkins_home/saml-sp-metadata.xml'
      2024-04-01 17:54:32.557+0000 [id=39251126]	INFO	o.o.s.m.r.i.AbstractReloadingMetadataResolver#refresh: Metadata Resolver FilesystemMetadataResolver org.opensaml.saml.metadata.resolver.impl.FilesystemMetadataResolver: Next refresh cycle for metadata provider '/var/jenkins_home/saml-sp-metadata.xml' will occur on '2024-04-01T20:54:32.555055Z' ('2024-04-01T20:54:32.555055Z[Etc/UTC]' local time)
      2024-04-01 17:54:32.641+0000 [id=145389598]	INFO	o.o.s.m.r.i.AbstractReloadingMetadataResolver#processNonExpiredMetadata: Metadata Resolver FilesystemMetadataResolver org.opensaml.saml.metadata.resolver.impl.FilesystemMetadataResolver: New metadata successfully loaded for '/var/jenkins_home/saml-sp-metadata.xml'
      2024-04-01 17:54:32.641+0000 [id=145389598]	INFO	o.o.s.m.r.i.AbstractReloadingMetadataResolver#refresh: Metadata Resolver FilesystemMetadataResolver org.opensaml.saml.metadata.resolver.impl.FilesystemMetadataResolver: Next refresh cycle for metadata provider '/var/jenkins_home/saml-sp-metadata.xml' will occur on '2024-04-01T20:54:32.639186Z' ('2024-04-01T20:54:32.639186Z[Etc/UTC]' local time)
      2024-04-01 17:54:39.014+0000 [id=260446046]	INFO	o.o.s.m.r.i.AbstractReloadingMetadataResolver#processNonExpiredMetadata: Metadata Resolver FilesystemMetadataResolver org.opensaml.saml.metadata.resolver.impl.FilesystemMetadataResolver: New metadata successfully loaded for '/var/jenkins_home/saml-sp-metadata.xml'
      2024-04-01 17:54:39.014+0000 [id=260446046]	INFO	o.o.s.m.r.i.AbstractReloadingMetadataResolver#refresh: Metadata Resolver FilesystemMetadataResolver org.opensaml.saml.metadata.resolver.impl.FilesystemMetadataResolver: Next refresh cycle for metadata provider '/var/jenkins_home/saml-sp-metadata.xml' will occur on '2024-04-01T20:54:39.012191Z' ('2024-04-01T20:54:39.012191Z[Etc/UTC]' local time)
      

      Is this normal? It is not currently having any visible effect on the system, but I am afraid it could lead to a larger problem in the future.

      I'm posting this as a bug in case there is actually a thread leak in the SAML plugin.

      Please let me know if I can provide any other information.

      Thanks!

          [JENKINS-72946] Possible thread leak from saml plugin

          Ivan Fernandez Calvo added a comment -

          Do you use an NFS or other network filesystem for your JENKINS_HOME folder?

          Ivan Fernandez Calvo added a comment -

          It is strange that the configuration is refreshed every 5 seconds. I would need more details about your configuration. Could you provide some screenshots of it? Feel free to mask URLs and keys; I am interested in the settings you have.

          bread added a comment -

          Yes, the JENKINS_HOME folder is stored on a network filesystem (an AWS EFS instance using the NFSv4 protocol). Would this be a problem?

          I have attached a screenshot of config.xml.

          Please let me know if I can provide any other information.

          Thanks!

          Ivan Fernandez Calvo added a comment -

          Yes, NFS filesystems are always a headache, and they do not play well with frequent small IO operations, which are probably related to the bunch of threads you have. There is a cache option to avoid those IO operations; you can enable it at Advanced Configuration / Use cache for configuration files.

          NFS for the Jenkins home usually causes issues. The NFS filesystem must perform as close as possible to a local hard drive: it should read/write at 100-150 MB/s. With less speed than that, you will face a bunch of weird behaviors.

          https://docs.cloudbees.com/docs/cloudbees-ci-kb/latest/client-and-managed-controllers/nfs-guide

          Devin Nusbaum added a comment - edited

          FWIW, I recently saw a case where a user had ~400 of these same threads, and the number of threads grew consistently. This recurred after multiple restarts of Jenkins, so it seems like a persistent problem. The securityRealm configuration looks mostly uninteresting:

          <securityRealm class="org.jenkinsci.plugins.saml.SamlSecurityRealm" plugin="saml@4.429.v9a_781a_61f1da_">
              <displayNameAttributeName>{lastName}, {firstName}</displayNameAttributeName>
              <groupsAttributeName>groupname</groupsAttributeName>
              <maximumAuthenticationLifetime>86400</maximumAuthenticationLifetime>
              <emailAttributeName>mail</emailAttributeName>
              <usernameCaseConversion>none</usernameCaseConversion>
              <usernameAttributeName>uid</usernameAttributeName>
              <binding>urn:oasis:names:tc:SAML:2.0:bindings:HTTP-Redirect</binding>
              <idpMetadataConfiguration>
                  <xml>... redacted ...</xml>
                  <url/>
                  <period>0</period>
              </idpMetadataConfiguration>
          </securityRealm>
          

          I see the same repeated log messages regarding `FilesystemMetadataResolver` as in the description of this issue.

          This user is also having intermittent issues where users cannot authenticate for a short period of time, but I have no idea whether it is related. I do not think this user is using NFS.


          Ivan Fernandez Calvo added a comment -

          dnusbaum Do you remember if they use the disk cache setting? It reduces access to the disk.

          Ivan Fernandez Calvo added a comment - edited

          I think I found the reason in the pac4j documentation:

          Note: after use {{SAML2Client}} must be explicitly destroyed with {{destroy}} method call. The importance of this step is justified by the underlying implementation. {{FilesystemMetadataResolver}} is using a daemon thread to watch the changes to metadata file. Without destroying {{SAML2Client}} this thread will keep running, thus there is a risk to get a threads leak problem.
          

          I have reviewed the code, and after use the client is always destroyed by calling the .destroy() method. For some reason, that is not cleaning up those clients. We recently upgraded pac4j to 6.x; I wonder if that version resolves the issue.
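
          For context, here is a minimal sketch (illustrative only; plain pac4j usage rather than the plugin's actual wrapper classes) of the lifecycle the documentation above requires. Each SAML2Client sets up its own metadata resolver, and that resolver's refresh timer only stops when destroy() is called on that client instance:

          import org.pac4j.saml.client.SAML2Client;
          import org.pac4j.saml.config.SAML2Configuration;

          class Saml2LifecycleSketch {
              static void handleRequest(SAML2Configuration config) {
                  SAML2Client client = new SAML2Client(config);
                  // ... use the client: build a redirect action, validate a response, etc. ...
                  client.destroy(); // without this call, the "Timer for ...FilesystemMetadataResolver" thread lives on
              }
          }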


          Allan BURDAJEWICZ added a comment -

          > I have reviewed the code, and after using it, the client is always destroyed by calling the method .destroy()

          Actually, if anything happens before the explicit client.destroy(), then the client may live on:

          https://github.com/jenkinsci/saml-plugin/blob/4.501.v4313a_01e3a_18/src/main/java/org/jenkinsci/plugins/saml/SamlSPMetadataWrapper.java#L47
          https://github.com/jenkinsci/saml-plugin/blob/4.501.v4313a_01e3a_18/src/main/java/org/jenkinsci/plugins/saml/SamlRedirectActionWrapper.java#L49-L52
          https://github.com/jenkinsci/saml-plugin/blob/4.501.v4313a_01e3a_18/src/main/java/org/jenkinsci/plugins/saml/SamlProfileWrapper.java#L57-L62

          Not sure how plausible that would be, but it is a possibility. If that was the case, note that version 6.0.4 of pac4j makes the SAML2Client closeable:

          https://github.com/pac4j/pac4j/commit/2aa27122ae8c3a942ead36f8a74199ddc727a826#diff-f89c296deeaa74c8249039e441ccd9c5fb9bdbc586b3ec4a080ab19193c42213R12
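
          To illustrate the point, here is a hedged sketch (not the plugin's real code; SamlOperation is a hypothetical callback interface): moving destroy() into a finally block guarantees cleanup even when the wrapped operation throws, and on pac4j >= 6.0.4 the same idea could be expressed with try-with-resources once SAML2Client is closeable.

          import org.pac4j.saml.client.SAML2Client;

          // Hypothetical helper: ensures the client is destroyed even if the operation throws,
          // so no FilesystemMetadataResolver timer thread is left behind.
          class Saml2ClientGuard {
              interface SamlOperation<T> {
                  T run(SAML2Client client) throws Exception;
              }

              static <T> T withClient(SAML2Client client, SamlOperation<T> operation) throws Exception {
                  try {
                      return operation.run(client);
                  } finally {
                      client.destroy(); // runs on both the success and failure paths
                  }
              }
          }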

          Drew added a comment -

          Curious if anyone who was having performance issues saw improvements from caching. Are there any potential side effects?


            Assignee: Ivan Fernandez Calvo
            Reporter: bread
            Votes: 0
            Watchers: 5