- Bug
- Resolution: Fixed
- Major
- None
- kubernetes-plugin:1.17.2
  workflow-durable-task-step-plugin:2.33
  core:2.176.3.2
- kubernetes 1.25.5
When an agent pod gets terminated (for example, OOMKilled by Kubernetes) during a pipeline build in a shell step:
- the node remains in Jenkins, as disconnected
- the pipeline hangs forever
- the pod remains in Kubernetes, in Terminated state, with OOMKilled status
Manual intervention is necessary to fix this situation:
- Aborting the pipeline manually causes the node to be removed and the pod to eventually be deleted as well
- Deleting the pod manually causes the node to be removed (after about 5 minutes for some reason) and eventually the pipeline to be aborted
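The second workaround (deleting the pod manually) can be scripted. The following is only a minimal sketch, assuming `kubectl` is already configured for the cluster and namespace where the agent pods run. The function name `delete_oomkilled_pod` is hypothetical and not part of Jenkins or the kubernetes-plugin.

```shell
#!/bin/sh
# Hypothetical helper: delete a pod only if its first container was OOMKilled.
# Assumes kubectl points at the cluster/namespace where the agent pods run.
delete_oomkilled_pod() {
  pod="$1"
  # Read the termination reason of the first container from the pod status.
  reason=$(kubectl get pod "$pod" \
    -o jsonpath='{.status.containerStatuses[0].state.terminated.reason}')
  if [ "$reason" = "OOMKilled" ]; then
    kubectl delete pod "$pod"
  else
    echo "skipping $pod: termination reason is '${reason:-none}'" >&2
    return 1
  fi
}
```

Calling `delete_oomkilled_pod <pod-name>` deletes the pod only when the reported reason is `OOMKilled`. As described above, Jenkins may still take several minutes to remove the node afterwards.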
Expected Behavior
The pipeline should abort automatically and the node should be removed automatically.
How to Reproduce
We need to simulate a pod failure while the agent is connected and building a pipeline. To reproduce this, I am using a JNLP agent with stress-ng: [dohbedoh/jnlp-stress-agent:alpine](https://hub.docker.com/r/dohbedoh/jnlp-stress-agent)
- Create a pipeline that simulates a Kubernetes `OOMKilled` during the build:

    pipeline {
      agent {
        kubernetes {
          yaml """
    metadata:
      labels:
        cloudbees.com/master: "dse-team-apac"
        jenkins: "slave"
        jenkins/stress: "true"
    spec:
      containers:
      - name: "jnlp"
        image: "dohbedoh/jnlp-stress-agent:alpine"
        imagePullPolicy: "Always"
        resources:
          limits:
            memory: "128Mi"
            cpu: "0.2"
          requests:
            memory: "100Mi"
            cpu: "0.2"
        securityContext:
          privileged: true
        tty: true
    """
        }
      }
      stages {
        stage('stress') {
          steps {
            sh "stress-ng --vm 2 --vm-bytes 1G --timeout 30s -v"
          }
        }
      }
    }

The pod should get OOMKilled by Kubernetes:

    $ kubectl get pod dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj
    NAME                                                          READY   STATUS      RESTARTS   AGE
    dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj   0/1     OOMKilled   0          3m21s

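When the pod name is not known in advance, the same check can be run across all agent pods. This is a sketch under assumptions: it relies on the `jenkins=slave` label set in the pod template of the reproduction pipeline, and the function name `list_oomkilled_agents` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list agent pods whose STATUS column reads OOMKilled.
# The label selector jenkins=slave is taken from the reproduction pod template.
list_oomkilled_agents() {
  kubectl get pods -l jenkins=slave --no-headers |
    awk '$3 == "OOMKilled" { print $1 }'
}
```

The function prints one pod name per line, so its output can be fed to other commands for inspection or cleanup.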
The pipeline job log shows the disconnection, and the build hangs forever:

    Running on dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj in /home/jenkins/workspace/dse-team-apac/aburdajewicz/testScenario
    [Pipeline] {
    [Pipeline] stage
    [Pipeline] { (stress)
    [Pipeline] sh
    + stress-ng --vm 2 --vm-bytes 1G --timeout 30s -v
    stress-ng: debug: [86] 2 processors online, 2 processors configured
    stress-ng: info: [86] dispatching hogs: 2 vm
    stress-ng: debug: [86] cache allocate: default cache size: 46080K
    stress-ng: debug: [86] starting stressors
    stress-ng: debug: [86] 2 stressors spawned
    stress-ng: debug: [89] stress-ng-vm: started [89] (instance 1)
    stress-ng: debug: [89] stress-ng-vm using method 'all'
    stress-ng: debug: [88] stress-ng-vm: started [88] (instance 0)
    stress-ng: debug: [88] stress-ng-vm using method 'all'
    Cannot contact dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj: hudson.remoting.RequestAbortedException: java.nio.channels.ClosedChannelException

- relates to JENKINS-49707 Auto retry for elastic agents after channel closure (Resolved)