Type: Bug
Resolution: Fixed
Priority: Major
Environment:
Jenkins: 2.332.2 (previously 2.249.1)
Jenkins kubernetes plugin: 3581.0.0 (previously 1.27.7)
OS: Red Hat Enterprise Linux Server 7.6
Java: jre1.8.0_211
Released As: 2.369
Description
We recently upgraded our Jenkins and kubernetes plugin versions and discovered a file descriptor leak. The leak makes Jenkins run out of file descriptors during normal operation, forcing us to restart it several times a day.
Our setup launches agents via the kubernetes plugin on a kubernetes cluster of static size that applies resource quotas. When builds enter the queue we want agent pods to be launched as soon as possible. We regularly have more builds in the queue than there is room for agents on the cluster, so it is normal for many pod-creation requests to be answered with 403 exceeded quota. When this happens there is a chance that the file descriptor for that agent's slave.log is leaked: the Computer holding it is garbage collected without the descriptor ever being closed.
We have NodeProvisioner configured with a "very aggressive" configuration as described in this thread:
https://groups.google.com/g/jenkinsci-dev/c/SYerQ-8E-10
We use:
hudson.slaves.NodeProvisioner.MARGIN 50
hudson.slaves.NodeProvisioner.MARGIN0 0.85
We have not configured hudson.model.LoadStatistics.decay, so it has its default value of 0.9.
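For reference, these knobs are plain JVM system properties read by the controller. A minimal sketch of setting them programmatically, assuming it runs before NodeProvisioner is first class-loaded (we actually pass the equivalent -D flags on the controller's command line):

class ProvisionerTuning {
    public static void main(String[] args) {
        // Equivalent of -Dhudson.slaves.NodeProvisioner.MARGIN=50 and
        // -Dhudson.slaves.NodeProvisioner.MARGIN0=0.85 on the command line;
        // must take effect before NodeProvisioner is first loaded.
        System.setProperty("hudson.slaves.NodeProvisioner.MARGIN", "50");
        System.setProperty("hudson.slaves.NodeProvisioner.MARGIN0", "0.85");
        // hudson.model.LoadStatistics.decay is left at its default of 0.9.
    }
}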
Analysis
There seem to be (at least) two paths into AbstractCIBase.createNewComputerForNode(Node n, boolean automaticAgentLaunch):
• From NodeProvisioner.update() via AbstractCIBase.updateNewComputer(Node n, boolean automaticAgentLaunch)
• From AbstractCloudSlave.terminate() via Jenkins.removeNode(Node n) and AbstractCIBase.updateComputerList(boolean automaticAgentLaunch)
The first is protected by NodeProvisioner.provisioningLock. It used to be protected by Queue#lock before https://github.com/jenkinsci/jenkins/pull/5450.
The second is protected by Queue#lock.
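In other words, each entry point is serialized against itself but not against the other. A minimal sketch (hypothetical class and method names, not Jenkins code) of why two paths guarded by different locks can interleave:

import java.util.concurrent.locks.ReentrantLock;

class TwoLocksSketch {
    private final ReentrantLock provisioningLock = new ReentrantLock(); // stands in for NodeProvisioner.provisioningLock
    private final Object queueLock = new Object();                      // stands in for Queue#lock

    // Path 1: NodeProvisioner.update() holds only provisioningLock.
    void provisionerPath() {
        provisioningLock.lock();
        try {
            createNewComputerForNode(); // can run concurrently with terminatePath()
        } finally {
            provisioningLock.unlock();
        }
    }

    // Path 2: AbstractCloudSlave.terminate() -> Jenkins.removeNode() holds only Queue#lock.
    void terminatePath() {
        synchronized (queueLock) {
            createNewComputerForNode(); // no mutual exclusion with provisionerPath()
        }
    }

    void createNewComputerForNode() { /* creates and registers a Computer */ }
}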
Let's say there are two planned Nodes: NodeA and NodeB.
In thread 1, a ComputerA1 for NodeA fails to launch, so KubernetesLauncher.launch(SlaveComputer computer, TaskListener listener) invokes AbstractCloudSlave.terminate().
In thread 2, at the same time, a timer thread invokes NodeProvisioner.suggestReviewNow().
Both threads enter AbstractCIBase.createNewComputerForNode(Node n, boolean automaticAgentLaunch) for NodeB. Each creates an instance of Computer, so now there are both ComputerB1 and ComputerB2, and they compete to associate their Computer instance with NodeB in Jenkins.computers.
One of them will be the last one to write to Jenkins.computers, overwriting the other one.
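The overwrite itself is a plain last-write-wins race on the map; a small sketch with a stand-in Computer type:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class LastWriteWinsSketch {
    static class StubComputer { } // stand-in for hudson.model.Computer

    public static void main(String[] args) {
        Map<String, StubComputer> computers = new ConcurrentHashMap<>();
        StubComputer computerB1 = new StubComputer(); // created by thread 1
        StubComputer computerB2 = new StubComputer(); // created by thread 2
        computers.put("NodeB", computerB1);
        computers.put("NodeB", computerB2); // last write wins; computerB1 is dropped
        // computerB1 is now unreachable through the map, but any log stream it
        // opened is still open.
    }
}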
If the launches of ComputerB1 and ComputerB2 both write to SlaveComputer.log, two file descriptors are opened to jenkinsdata/logs/slaves/<agent name>/slave.log.
When it is time to kill these Computers, AbstractCIBase.updateComputerList will find only one of them and close its file descriptor. The other Computer will eventually be garbage collected, but its file descriptor is never closed.
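That last step is the leak: dropping the final reference to a stream does not reliably release its file descriptor. A plain-JDK sketch (not Jenkins code) of the effect:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

class OrphanedStreamSketch {
    public static void main(String[] args) throws IOException {
        File log = new File("slave.log");
        FileOutputStream survivor = new FileOutputStream(log, true); // the Computer updateComputerList finds
        FileOutputStream orphan = new FileOutputStream(log, true);   // the overwritten Computer's stream
        survivor.close(); // this descriptor is released explicitly
        orphan = null;    // this one is merely dereferenced; finalization may
        System.gc();      // never run, so its descriptor can stay open until the JVM exits
    }
}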
Suggested Fix
I’ve locally forked 2.332.2 and added back the Queue#lock in NodeProvisioner.update(). This change solves the problem for us.
diff --git a/core/src/main/java/hudson/slaves/NodeProvisioner.java b/core/src/main/java/hudson/slaves/NodeProvisioner.java
index 87a0d7d45c..4672007162 100644
--- a/core/src/main/java/hudson/slaves/NodeProvisioner.java
+++ b/core/src/main/java/hudson/slaves/NodeProvisioner.java
@@ -216,6 +216,7 @@ public class NodeProvisioner {
         try {
             lastSuggestedReview = System.currentTimeMillis();
             queuedReview = false;
+            Queue.withLock(() -> {
             Jenkins jenkins = Jenkins.get();
             // clean up the cancelled launch activity, then count the # of executors that we are about to
             // bring up.
@@ -315,6 +316,7 @@ public class NodeProvisioner {
             } else {
                 provisioningState = new StrategyState(snapshot, label, plannedCapacitySnapshot);
             }
+            });
 
             if (provisioningState != null) {
                 List<Strategy> strategies = Jenkins.get().getExtensionList(Strategy.class);
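Queue.withLock(Runnable) executes the wrapped section while holding the global Queue#lock, so the provisioning pass is again serialized against Jenkins.removeNode() and thus against the terminate path, restoring the mutual exclusion that existed before the PR linked above. The trade-off is presumably reintroducing some of the queue-lock contention that PR set out to reduce, so upstream may prefer a narrower synchronization around createNewComputerForNode.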