JENKINS-69534: File descriptor leak on slave.log when using cloud agents (regression in 2.294)

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: core
    • Environment:
      Jenkins: 2.332.2 (previously 2.249.1)
      Jenkins kubernetes plugin: 3581.0.0 (previously 1.27.7)
      OS: Red Hat Enterprise Linux Server 7.6
      Java: jre1.8.0_211
    • Released As: 2.369

      Description

      Recently we upgraded our Jenkins and kubernetes plugin versions and discovered a file descriptor leak. With this leak our Jenkins runs out of file descriptors during normal operations, requiring reboots several times a day.

      Our setup is that agents are launched via the kubernetes plugin on a kubernetes cluster of static size that applies resource quotas. When builds enter the queue we want agent pods to be launched as soon as possible on the kubernetes cluster. We regularly have more builds in queue than there is room for agents on the cluster, so it is normal that many requests to create pods are answered with 403 exceeded quota. When this happens there’s a chance that the file descriptor for that agent’s slave.log file is never closed: the Computer holding it gets garbage collected, but the descriptor stays open.

      We have NodeProvisioner configured with a "very aggressive" configuration as described in this thread:
      https://groups.google.com/g/jenkinsci-dev/c/SYerQ-8E-10

      We use:
      hudson.slaves.NodeProvisioner.MARGIN 50
      hudson.slaves.NodeProvisioner.MARGIN0 0.85

      We have not configured hudson.model.LoadStatistics.decay, so it has the default value of 0.9.
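
      For reference, these are ordinary JVM system properties on the Jenkins controller. The snippet below (the ProvisionerTuning class is purely illustrative, not the actual NodeProvisioner or LoadStatistics source; the fallback values are simply the ones from our setup plus the documented 0.9 decay default) shows how such values can be supplied and read:

      public class ProvisionerTuning {
          public static void main(String[] args) {
              // Supplied on the controller command line, e.g.
              //   -Dhudson.slaves.NodeProvisioner.MARGIN=50 -Dhudson.slaves.NodeProvisioner.MARGIN0=0.85
              float margin  = Integer.parseInt(System.getProperty("hudson.slaves.NodeProvisioner.MARGIN", "50")) / 100f;
              float margin0 = Float.parseFloat(System.getProperty("hudson.slaves.NodeProvisioner.MARGIN0", "0.85"));
              float decay   = Float.parseFloat(System.getProperty("hudson.model.LoadStatistics.decay", "0.9"));
              System.out.printf("margin=%.2f margin0=%.2f decay=%.2f%n", margin, margin0, decay);
          }
      }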

      Analysis

      There seem to be (at least) two paths into AbstractCIBase.createNewComputerForNode(Node n, boolean automaticAgentLaunch):
      •    From NodeProvisioner.update() via AbstractCIBase.updateNewComputer(Node n, boolean automaticAgentLaunch)
      •    From AbstractCloudSlave.terminate() via Jenkins.removeNode(Node n) and AbstractCIBase.updateComputerList(boolean automaticAgentLaunch)

      The first is protected by NodeProvisioner.provisioningLock. It used to be protected by Queue#lock before https://github.com/jenkinsci/jenkins/pull/5450.
      The second is protected by Queue#lock.

      Let's say there are two planned Nodes: NodeA and NodeB.

      In thread 1, a ComputerA1 for NodeA fails to launch, so KubernetesLauncher.launch(SlaveComputer computer, TaskListener listener) invokes AbstractCloudSlave.terminate().
      In thread 2, at the same time, a timer thread invokes NodeProvisioner.suggestReviewNow().

      Both these threads enter AbstractCIBase.createNewComputerForNode(Node n, boolean automaticAgentLaunch) for NodeB. Both create an instance of Computer, so now there are both ComputerB1 and ComputerB2. They then compete to associate their Computer instance with NodeB in Jenkins.computers.

      One of them will be the last one to write to Jenkins.computers, overwriting the other one.

      If the launches of ComputerB1 and ComputerB2 both write to SlaveComputer.log, then two file descriptors are open to jenkinsdata/logs/slaves/<agent name>/slave.log.

      When it’s time to kill these Computers, AbstractCIBase.updateComputerList will only find one of these Computers and close its file descriptor. The other Computer will eventually be garbage collected but its file descriptor is never closed.
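
      To make the race concrete, here is a minimal, self-contained sketch (ComputerMapRace and FakeComputer are invented stand-ins, not the actual Jenkins classes): even with a map that is thread-safe per operation, as hudson.util.CopyOnWriteMap is, the compound check-then-put is not atomic, so two threads can each create a Computer that opens its own slave.log, while cleanup only ever closes the instance left in the map.

      import java.util.Collections;
      import java.util.HashMap;
      import java.util.Map;
      import java.util.concurrent.atomic.AtomicInteger;

      public class ComputerMapRace {
          // Thread-safe per operation (like CopyOnWriteMap), but the compound
          // "check, then put" below is still not atomic.
          static final Map<String, FakeComputer> computers =
                  Collections.synchronizedMap(new HashMap<>());
          static final AtomicInteger openLogDescriptors = new AtomicInteger();

          // Stand-in for a Computer that opens <agent name>/slave.log when created.
          static class FakeComputer {
              FakeComputer() { openLogDescriptors.incrementAndGet(); }
              void kill()    { openLogDescriptors.decrementAndGet(); }
          }

          // Mimics the unsafe create-if-missing logic around createNewComputerForNode.
          static void createNewComputerForNode(String node) {
              if (computers.get(node) == null) {            // both threads may observe null...
                  computers.put(node, new FakeComputer());  // ...and each creates its own Computer
              }
          }

          public static void main(String[] args) throws InterruptedException {
              Thread t1 = new Thread(() -> createNewComputerForNode("NodeB"));
              Thread t2 = new Thread(() -> createNewComputerForNode("NodeB"));
              t1.start(); t2.start();
              t1.join(); t2.join();

              // Cleanup only ever sees the instance that won the race for the map slot.
              computers.get("NodeB").kill();
              // If both threads created a FakeComputer, one "slave.log" descriptor stays open.
              System.out.println("descriptors still open: " + openLogDescriptors.get());
          }
      }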

      Suggested Fix

      I’ve locally forked 2.332.2 and added back the Queue#lock in NodeProvisioner.update(). This change solves the problem for us.

      diff --git a/core/src/main/java/hudson/slaves/NodeProvisioner.java b/core/src/main/java/hudson/slaves/NodeProvisioner.java
      index 87a0d7d45c..4672007162 100644
      --- a/core/src/main/java/hudson/slaves/NodeProvisioner.java
      +++ b/core/src/main/java/hudson/slaves/NodeProvisioner.java
      @@ -216,6 +216,7 @@ public class NodeProvisioner {
               try {
                   lastSuggestedReview = System.currentTimeMillis();
                   queuedReview = false;
      +            Queue.withLock(() -> {
                   Jenkins jenkins = Jenkins.get();
                   // clean up the cancelled launch activity, then count the # of executors that we are about to
                   // bring up.
      @@ -315,6 +316,7 @@ public class NodeProvisioner {
                   } else {
                       provisioningState = new StrategyState(snapshot, label, plannedCapacitySnapshot);
                   }
      +            });
      
                   if (provisioningState != null) {
                       List<Strategy> strategies = Jenkins.get().getExtensionList(Strategy.class);
      

      Comments

          Basil Crow added a comment -

          Suggested Fix

          But that would effectively undo jenkinsci/jenkins#5450 and reintroduce the performance pathology that jenkinsci/jenkins#5450 was meant to solve. And grabbing the high-level Queue lock just to avoid a lower-level collision seems excessive. The real problem is that createNewComputerForNode is not thread-safe. Since hudson.model.Node#createComputer is cheap, synchronization could be achieved by making the computer map a ConcurrentMap and calling hudson.model.Node#createComputer from ConcurrentMap#computeIfAbsent to ensure that at most one Computer is created per Node.
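
          As a rough illustration of that direction (Node, Computer, and ComputerRegistry below are simplified, hypothetical stand-ins, not the actual hudson.model classes or the eventual fix), ConcurrentMap#computeIfAbsent makes the create-if-missing step atomic:

          import java.util.concurrent.ConcurrentHashMap;
          import java.util.concurrent.ConcurrentMap;

          // Hypothetical, simplified stand-ins; not the real hudson.model.Node/Computer.
          class Node {
              Computer createComputer() { return new Computer(this); }
          }

          class Computer {
              Computer(Node n) { /* in real Jenkins this is roughly where slave.log ends up being opened */ }
          }

          class ComputerRegistry {
              // A ConcurrentMap instead of CopyOnWriteMap, per the suggestion above.
              private final ConcurrentMap<Node, Computer> computers = new ConcurrentHashMap<>();

              // computeIfAbsent applies the mapping function at most once per key, so two
              // concurrent callers can never end up with two Computers for the same Node.
              Computer createNewComputerForNode(Node n) {
                  return computers.computeIfAbsent(n, Node::createComputer);
              }
          }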


          Jonas Lind added a comment -

          Thanks basil for the speedy feedback and input!

          I agree that adding back the Queue#lock in NodeProvisioner.update() is not the most performant approach and that a more elegant solution would be preferable.

          At a quick glance it seems that the read characteristics of a java.util.concurrent.ConcurrentHashMap are no worse than those of a hudson.util.CopyOnWriteMap, so switching map implementations shouldn't have any significant negative side effects. Do you have any thoughts on the matter?

          Additionally, it looks like the Queue#lock can be removed from AbstractCIBase.removeComputer() as well, unless I'm missing something there?

          I'll have another go at making a change according to your suggestions and will update here once I have something to share.


          Basil Crow added a comment -

          At a quick glance it seems that the read characteristics of a java.util.concurrent.ConcurrentHashMap are no worse than those of a hudson.util.CopyOnWriteMap, so switching map implementations shouldn't have any significant negative side effects. Do you have any thoughts on the matter?

          Yeah, that was my conclusion as well. And ConcurrentHashMap#computeIfAbsent has the atomicity we desire:

          The entire method invocation is performed atomically, so the function is applied at most once per key.

          An implementation of createNewComputerForNode based on ConcurrentHashMap#computeIfAbsent seems like it would be both simple and correct. I could imagine other implementations based on synchronizing on the Node, but in the absence of a compelling reason to write custom code a simple implementation seems more maintainable.

          Additionally, it looks like the Queue#lock can be removed from AbstractCIBase.removeComputer() as well, unless I'm missing something there?

          Possibly. I am not too familiar with this so I am not sure offhand.


          Jonas Lind added a comment -

          I've created a pull request with an attempt at fixing this bug:

          https://github.com/jenkinsci/jenkins/pull/7087


          Basil Crow added a comment - edited

          I am applying the lts-candidate label with the following justification: this is a fix for a regression in 2.294 that, while impacting a low number of users, is high in severity for the impacted users. While the fix carries some degree of risk, we are early in the LTS development cycle, and there is plenty of time for these changes to be deployed and tested by users of weekly releases. If a regression is discovered between now and the next LTS backporting window, this justification can be reconsidered.


            Assignee: Jonas Lind (jonaslind)
            Reporter: Jonas Lind (jonaslind)
            Votes: 0
            Watchers: 4
