[JENKINS-59340] Pipeline hangs when Agent pod is Terminated

Environment:

• kubernetes 1.25.5

      When an agent pod gets terminated (for example, OOMKilled by Kubernetes) while a pipeline build is running a shell step:

      • the node remains in Jenkins, marked as disconnected
      • the pipeline hangs forever
      • the pod remains in Kubernetes, in the Terminated state, with reason OOMKilled

      Manual intervention is necessary to fix this situation:

      • Aborting the pipeline manually causes the node to be removed and the pod to eventually be deleted as well
      • Deleting the pod manually causes the node to be removed (after about 5 minutes, for some reason) and eventually the pipeline to be aborted; see the kubectl example below
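
      For reference, the manual cleanup is a plain pod deletion (the pod name here is taken from the reproduction run shown below and will differ in other environments):

      $ kubectl delete pod dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj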

      Expected Behavior

      The pipeline should abort automatically and the node should be removed automatically.

      How to Reproduce

      We need to simulate a pod failure while the agent is connected and running a build. To reproduce this, I am using a jnlp agent with stress-ng: [dohbedoh/jnlp-stress-agent:alpine](https://hub.docker.com/r/dohbedoh/jnlp-stress-agent)

      • Create a pipeline that simulates a Kubernetes `OOMKilled` during the build:
      pipeline {
        agent {
          kubernetes {
            yaml """
      metadata:
        labels:
          cloudbees.com/master: "dse-team-apac"
          jenkins: "slave"
          jenkins/stress: "true"
      spec:
        containers:
        - name: "jnlp"
          image: "dohbedoh/jnlp-stress-agent:alpine"
          imagePullPolicy: "Always"
          resources:
            limits:
              memory: "128Mi"
              cpu: "0.2"
            requests:
              memory: "100Mi"
              cpu: "0.2"
          securityContext:
            privileged: true
          tty: true
      """
          }
        }
        stages {
          stage('stress') {
            steps {
              sh "stress-ng --vm 2 --vm-bytes 1G  --timeout 30s -v"
            }
          }
        }
      }
      

      The pod should get OOMKilled by Kubernetes, since the two stress-ng vm workers each try to allocate 1G against the container's 128Mi memory limit:

      $ kubectl get pod dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj
      NAME                                                          READY   STATUS      RESTARTS   AGE
      dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj   0/1     OOMKilled   0          3m21s
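
      To confirm the termination reason, the container status can be inspected directly (the jsonpath below assumes the container was not restarted; the pod name is from this run):

      $ kubectl get pod dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj \
          -o jsonpath='{.status.containerStatuses[0].state.terminated.reason}'
      OOMKilled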
      

      And the pipeline log shows the disconnection; the build hangs forever:

      Running on dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj in /home/jenkins/workspace/dse-team-apac/aburdajewicz/testScenario
      [Pipeline] {
      [Pipeline] stage
      [Pipeline] { (stress)
      [Pipeline] sh
      + stress-ng --vm 2 --vm-bytes 1G --timeout 30s -v
      stress-ng: debug: [86] 2 processors online, 2 processors configured
      stress-ng: info:  [86] dispatching hogs: 2 vm
      stress-ng: debug: [86] cache allocate: default cache size: 46080K
      stress-ng: debug: [86] starting stressors
      stress-ng: debug: [86] 2 stressors spawned
      stress-ng: debug: [89] stress-ng-vm: started [89] (instance 1)
      stress-ng: debug: [89] stress-ng-vm using method 'all'
      stress-ng: debug: [88] stress-ng-vm: started [88] (instance 0)
      stress-ng: debug: [88] stress-ng-vm using method 'all'
      Cannot contact dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj: hudson.remoting.RequestAbortedException: java.nio.channels.ClosedChannelException
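
      Until the plugin detects the dead pod and aborts the build automatically, one possible mitigation (a sketch only, not a fix for the underlying issue; the 10-minute bound is an arbitrary choice) is to wrap the affected step in Pipeline's timeout step so the hang cannot persist indefinitely. With the reproduction pipeline above, the stages section would become:

      stages {
        stage('stress') {
          steps {
            // If the agent pod dies mid-step, the remoting channel never
            // reports back; this bounds the wait and aborts the build instead.
            timeout(time: 10, unit: 'MINUTES') {
              sh "stress-ng --vm 2 --vm-bytes 1G --timeout 30s -v"
            }
          }
        }
      }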
      

      Attachments: support-bundle_2019-09-13_00.50.40.zip, kubernetes-plugin-fine.log, durabletask-and-workflowdurabletask-fine.log, build.log, agent-oom-killed-description.txt
      Relates to: JENKINS-49707

      Allan BURDAJEWICZ added a comment - Another way to reproduce this issue is to use activeDeadlineSeconds to kill the pod after some time. The pod fails with DeadlineExceeded but is not deleted, and the pipeline hangs until manual action is taken. A minimal sketch of this variant follows.
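
      A sketch reusing the agent image from the description above; activeDeadlineSeconds is a standard Kubernetes pod-spec field, and the 30-second deadline and sleep duration are arbitrary choices:

      pipeline {
        agent {
          kubernetes {
            yaml """
      spec:
        # Kubernetes kills the pod once this deadline passes; the pod is
        # marked Failed with reason DeadlineExceeded but is not deleted.
        activeDeadlineSeconds: 30
        containers:
        - name: "jnlp"
          image: "dohbedoh/jnlp-stress-agent:alpine"
      """
          }
        }
        stages {
          stage('outlive-deadline') {
            steps {
              // Sleep well past the deadline so the pod dies mid-step.
              sh "sleep 300"
            }
          }
        }
      }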
      Status: In Progress
      Assignee: Vincent Latombe [vlatombe]
      Reporter: Allan BURDAJEWICZ [allan_burdajewicz]
      Votes: 2
      Watchers: 4