Bug
Resolution: Unresolved
Minor
None
kubernetes-plugin 1.29.2
We are trying to use the kubernetes-plugin with Windows workers and large container images. Pulling the images takes ~16 minutes. During this time, Jenkins/kubernetes-plugin times out the pod and destroys and recreates it every ~6 minutes.
When we run this on Google Kubernetes Engine, its node autoscaler gets confused by the pod destruction/creation and starts up additional nodes. This, in turn, causes problems with PVCs and mounts when multiple nodes try to mount the same PV (GKE only allows a PV to be mounted in read/write mode to one pod at a time).
The core problem seems to be that Jenkins core misinterprets a pod starting up as being an idle Computer. If the startup takes too long, Jenkins decides to kill the Computer (and try again).
Root cause analysis:
When a pod is created, a KubernetesSlave gets created. It will immediately create one Executor (https://github.com/jenkinsci/kubernetes-plugin/blob/kubernetes-1.29.2/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesSlave.java#L155). Kubernetes will start pulling down all container images in the background around now.
This Executor will report as being idle during the time that the image pull is occurring (https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/Executor.java#L606). Therefore, the Computer will also report as being idle during this time (https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/Computer.java#L1023).
The KubernetesSlave will use one of two retention strategies: OnceRetentionStrategy or CloudRetentionStrategy. Both time out after a machine has been idle for a given number of minutes (https://github.com/jenkinsci/durable-task-plugin/blob/durable-task-1.35/src/main/java/org/jenkinsci/plugins/durabletask/executors/OnceRetentionStrategy.java#L67, https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/slaves/CloudRetentionStrategy.java#L54). If either of them triggers, the Computer gets deleted.
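The chain of events above can be illustrated with a small stand-alone model. This is only a sketch and not the Jenkins or plugin code: the class and method names are made up for illustration, and the 5-minute idle timeout is an assumed value chosen because it reproduces the ~6 minute kill interval we observe.

```java
import java.util.concurrent.TimeUnit;

// Simplified, hypothetical model of an idle-based retention check.
// It does NOT use any Jenkins classes; it only mirrors the reasoning in this report.
class IdleTimeoutModel {

    /** Returns true if a node that has been idle since idleStartMillis should be terminated. */
    static boolean shouldTerminate(boolean idle, long idleStartMillis, long nowMillis, int idleMinutes) {
        if (!idle) {
            return false;
        }
        long idleMillis = nowMillis - idleStartMillis;
        return idleMillis > TimeUnit.MINUTES.toMillis(idleMinutes);
    }

    /**
     * Checks once per minute while the image pull is running.
     * Returns the first minute at which the check fires, or -1 if the pull finishes first.
     */
    static int firstTerminationMinute(int pullMinutes, int idleMinutes) {
        for (int minute = 0; minute < pullMinutes; minute++) {
            long now = TimeUnit.MINUTES.toMillis(minute);
            // During the image pull the executor has no work, so it reports as idle
            // from the moment the node was created (minute 0 in this model).
            if (shouldTerminate(true, 0L, now, idleMinutes)) {
                return minute;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Assumed values: 16 minute pull (from this report), 5 minute idle timeout (assumption).
        System.out.println("idleMinutes=5:  " + firstTerminationMinute(16, 5));
        // A timeout longer than the pull never fires during startup.
        System.out.println("idleMinutes=20: " + firstTerminationMinute(16, 20));
    }
}
```

In this model, a once-per-minute check with a 5-minute timeout first fires at minute 6, long before the 16-minute pull completes, while a 20-minute timeout never fires during the pull.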
Workarounds:
Disable OnceRetentionStrategy/CloudRetentionStrategy altogether.
Specify a high podTemplate.idleTimeout (but this will also make the 'real' idle timeouts occur a lot later).
Solutions:
No good solution at the moment.
If it were possible to override Computer.isIdle() (https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/Computer.java#L1023), KubernetesSlave could provide its own implementation with special handling of the initial image pull period, so that the node isn't considered idle while it is bootstrapping. However, that would require changes in both Jenkins core and the Kubernetes plugin.
It would be possible to create a new RetentionStrategy, specific to Kubernetes. There would be no need to make changes to core Jenkins then... but that seems like a patchy solution.
Regardless of strategy, we'd need to somehow understand when a KubernetesSlave is in this warmup state. I'm not sure how to determine that. Can a KubernetesSlave be re-used for different pods? (If so, there may be long periods between jobs when the node is pulling images again.)
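To make the Kubernetes-specific retention idea more concrete, here is a rough stand-alone sketch. It assumes that "warmup" can be defined as "the agent has never connected yet", which is only one possible answer to the question above and does not cover the re-use case. All types and names (NodeState, agentConnected, maxWarmupMinutes, and so on) are hypothetical and are not part of Jenkins core or the kubernetes-plugin.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a warmup-aware idle check. None of these types exist in Jenkins.
class WarmupAwareRetention {

    enum Decision { KEEP, TERMINATE }

    /** Minimal stand-in for the state a retention strategy would need to track per node. */
    static final class NodeState {
        final long createdMillis;
        long firstConnectedMillis = -1; // -1 means the agent has never connected
        long idleStartMillis;
        boolean idle = true;

        NodeState(long createdMillis) {
            this.createdMillis = createdMillis;
            this.idleStartMillis = createdMillis;
        }

        /** Assumed hook: called when the agent inside the pod first connects. */
        void agentConnected(long nowMillis) {
            if (firstConnectedMillis < 0) {
                firstConnectedMillis = nowMillis;
                // Idle time only starts counting once the node is actually usable.
                idleStartMillis = nowMillis;
            }
        }

        boolean warmingUp() {
            return firstConnectedMillis < 0;
        }
    }

    static Decision check(NodeState node, long nowMillis, int idleMinutes, int maxWarmupMinutes) {
        if (node.warmingUp()) {
            // Separate, longer limit so a pod that never comes up is still cleaned up.
            long warmupMillis = nowMillis - node.createdMillis;
            return warmupMillis > TimeUnit.MINUTES.toMillis(maxWarmupMinutes)
                    ? Decision.TERMINATE : Decision.KEEP;
        }
        if (!node.idle) {
            return Decision.KEEP;
        }
        long idleMillis = nowMillis - node.idleStartMillis;
        return idleMillis > TimeUnit.MINUTES.toMillis(idleMinutes)
                ? Decision.TERMINATE : Decision.KEEP;
    }

    public static void main(String[] args) {
        long minute = TimeUnit.MINUTES.toMillis(1);
        NodeState node = new NodeState(0L);
        System.out.println("minute 10, still pulling: " + check(node, 10 * minute, 5, 30));
        node.agentConnected(16 * minute);
        System.out.println("minute 20, connected:     " + check(node, 20 * minute, 5, 30));
        System.out.println("minute 22, connected:     " + check(node, 22 * minute, 5, 30));
    }
}
```

The point of the separate maxWarmupMinutes limit is that the 'real' idle timeout can stay short, which avoids the drawback of the high idleTimeout workaround, while a pod that never finishes starting is still eventually removed.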