[JENKINS-59340] Pipeline hangs when Agent pod is Terminated

Environment:

• kubernetes 1.25.5

      When an agent pod gets terminated (for example, OOMKilled by Kubernetes) while a pipeline build is running a shell step:

      • the node remains in Jenkins, marked as disconnected
      • the pipeline hangs forever
      • the pod remains in Kubernetes, in the Terminated state, with reason OOMKilled

      Manual intervention is necessary to fix this situation:

      • Aborting the pipeline manually causes the node to be removed and the pod to eventually be deleted as well
      • Deleting the pod manually causes the node to be removed (after about 5 minutes, for some reason) and eventually the pipeline to be aborted; see the kubectl example below
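
      For reference, the manual cleanup is a plain pod deletion (the pod name here is taken from the reproduction run shown below and will differ in other environments):

      $ kubectl delete pod dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj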

      Expected Behavior

      The pipeline should abort automatically and the node should be removed automatically.

      How to Reproduce

      We need to simulate a pod failure while the agent is connected and running a build. To reproduce this, I am using a jnlp agent with stress-ng: [dohbedoh/jnlp-stress-agent:alpine](https://hub.docker.com/r/dohbedoh/jnlp-stress-agent)

      • Create a pipeline that simulates a Kubernetes `OOMKilled` during the build:
      pipeline {
        agent {
          kubernetes {
            yaml """
      metadata:
        labels:
          cloudbees.com/master: "dse-team-apac"
          jenkins: "slave"
          jenkins/stress: "true"
      spec:
        containers:
        - name: "jnlp"
          image: "dohbedoh/jnlp-stress-agent:alpine"
          imagePullPolicy: "Always"
          resources:
            limits:
              memory: "128Mi"
              cpu: "0.2"
            requests:
              memory: "100Mi"
              cpu: "0.2"
          securityContext:
            privileged: true
          tty: true
      """
          }
        }
        stages {
          stage('stress') {
            steps {
              sh "stress-ng --vm 2 --vm-bytes 1G  --timeout 30s -v"
            }
          }
        }
      }
      

      The pod should get OOMKilled by Kubernetes, since the two stress-ng vm workers each try to allocate 1G against the container's 128Mi memory limit:

      $ kubectl get pod dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj
      NAME                                                          READY   STATUS      RESTARTS   AGE
      dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj   0/1     OOMKilled   0          3m21s
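
      To confirm the termination reason, the container status can be inspected directly (the jsonpath below assumes the container was not restarted; the pod name is from this run):

      $ kubectl get pod dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj \
          -o jsonpath='{.status.containerStatuses[0].state.terminated.reason}'
      OOMKilled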
      

      And the pipeline log shows the disconnection; the build hangs forever:

      Running on dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj in /home/jenkins/workspace/dse-team-apac/aburdajewicz/testScenario
      [Pipeline] {
      [Pipeline] stage
      [Pipeline] { (stress)
      [Pipeline] sh
      + stress-ng --vm 2 --vm-bytes 1G --timeout 30s -v
      stress-ng: debug: [86] 2 processors online, 2 processors configured
      stress-ng: info:  [86] dispatching hogs: 2 vm
      stress-ng: debug: [86] cache allocate: default cache size: 46080K
      stress-ng: debug: [86] starting stressors
      stress-ng: debug: [86] 2 stressors spawned
      stress-ng: debug: [89] stress-ng-vm: started [89] (instance 1)
      stress-ng: debug: [89] stress-ng-vm using method 'all'
      stress-ng: debug: [88] stress-ng-vm: started [88] (instance 0)
      stress-ng: debug: [88] stress-ng-vm using method 'all'
      Cannot contact dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj: hudson.remoting.RequestAbortedException: java.nio.channels.ClosedChannelException
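
      Until the plugin detects the dead pod and aborts the build automatically, one possible mitigation (a sketch only, not a fix for the underlying issue; the 10-minute bound is an arbitrary choice) is to wrap the affected step in Pipeline's timeout step so the hang cannot persist indefinitely. With the reproduction pipeline above, the stages section would become:

      stages {
        stage('stress') {
          steps {
            // If the agent pod dies mid-step, the remoting channel never
            // reports back; this bounds the wait and aborts the build instead.
            timeout(time: 10, unit: 'MINUTES') {
              sh "stress-ng --vm 2 --vm-bytes 1G --timeout 30s -v"
            }
          }
        }
      }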
      

      Attachments: support-bundle_2019-09-13_00.50.40.zip, kubernetes-plugin-fine.log, durabletask-and-workflowdurabletask-fine.log, build.log, agent-oom-killed-description.txt
      Relates to: JENKINS-49707

      Allan BURDAJEWICZ added a comment - Another way to reproduce this issue is to use activeDeadlineSeconds to kill the pod after some time. The pod fails with DeadlineExceeded but is not deleted, and the pipeline hangs until manual action is taken. A minimal sketch of this variant follows.
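
      A sketch reusing the agent image from the description above; activeDeadlineSeconds is a standard Kubernetes pod-spec field, and the 30-second deadline and sleep duration are arbitrary choices:

      pipeline {
        agent {
          kubernetes {
            yaml """
      spec:
        # Kubernetes kills the pod once this deadline passes; the pod is
        # marked Failed with reason DeadlineExceeded but is not deleted.
        activeDeadlineSeconds: 30
        containers:
        - name: "jnlp"
          image: "dohbedoh/jnlp-stress-agent:alpine"
      """
          }
        }
        stages {
          stage('outlive-deadline') {
            steps {
              // Sleep well past the deadline so the pod dies mid-step.
              sh "sleep 300"
            }
          }
        }
      }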
      Status: In Progress
      Assignee: Vincent Latombe [vlatombe]
      Reporter: Allan BURDAJEWICZ [allan_burdajewicz]
      Votes: 2
      Watchers: 4