JENKINS-59340

Pipeline hangs when Agent pod is Terminated


    • kubernetes 1.25.5

      When an agent pod gets terminated (for example OOMKilled by Kubernetes) while a pipeline build is running a shell step:

      • the node remains in Jenkins, as disconnected
      • the pipeline hangs forever
      • the pod remains in Kubernetes in the Terminated state, with an OOMKilled status

      Manual intervention is necessary to fix this situation (see the command sketch after this list):

      • Aborting the pipeline manually causes the node to be removed and the pod to eventually be deleted as well
      • Deleting the pod manually causes the node to be removed (after about 5 minutes for some reason) and the pipeline to eventually be aborted
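
      For reference, the manual cleanup can also be done from the command line. This is only a sketch: the job path, build number, credentials and pod name are placeholders, not values from this report.

      # Abort the hung build through the Jenkins REST API (same effect as clicking "Abort" in the UI)
      $ curl -X POST -u "$JENKINS_USER:$JENKINS_API_TOKEN" \
          "$JENKINS_URL/job/my-folder/job/my-pipeline/42/stop"

      # Or delete the terminated agent pod; the node is then removed and the build
      # is eventually aborted (after about 5 minutes, as noted above)
      $ kubectl delete pod <agent-pod-name>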

      Expected Behavior

      The pipeline should abort automatically and the node should be removed automatically.

      How to Reproduce

      We need to simulate a pod failure while the agent is connected and building a pipeline. To reproduce this, I am using a JNLP agent image with stress-ng: dohbedoh/jnlp-stress-agent:alpine (https://hub.docker.com/r/dohbedoh/jnlp-stress-agent)

      • Create a pipeline that simulates a Kubernetes `OOMKilled` during the build:
      pipeline {
        agent {
          kubernetes {
            yaml """
      metadata:
        labels:
          cloudbees.com/master: "dse-team-apac"
          jenkins: "slave"
          jenkins/stress: "true"
      spec:
        containers:
        - name: "jnlp"
          image: "dohbedoh/jnlp-stress-agent:alpine"
          imagePullPolicy: "Always"
          resources:
            limits:
              memory: "128Mi"
              cpu: "0.2"
            requests:
              memory: "100Mi"
              cpu: "0.2"
          securityContext:
            privileged: true
          tty: true
      """
          }
        }
        stages {
          stage('stress') {
            steps {
              sh "stress-ng --vm 2 --vm-bytes 1G  --timeout 30s -v"
            }
          }
        }
      }
      

      The pod should get OOMKilled by Kubernetes:

      $ kubectl get pod dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj
      NAME                                                          READY   STATUS      RESTARTS   AGE
      dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj   0/1     OOMKilled   0          3m21s
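
      To confirm how the container was terminated, describing the pod shows the last state reported by the kubelet (pod name taken from the listing above; exact output varies by kubectl version):

      # The jnlp container's last state should show a reason of OOMKilled (exit code 137)
      $ kubectl describe pod dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj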
      

      And the pipeline job shows the disconnection and hangs forever:

      Running on dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj in /home/jenkins/workspace/dse-team-apac/aburdajewicz/testScenario
      [Pipeline] {
      [Pipeline] stage
      [Pipeline] { (stress)
      [Pipeline] sh
      + stress-ng --vm 2 --vm-bytes 1G --timeout 30s -v
      stress-ng: debug: [86] 2 processors online, 2 processors configured
      stress-ng: info:  [86] dispatching hogs: 2 vm
      stress-ng: debug: [86] cache allocate: default cache size: 46080K
      stress-ng: debug: [86] starting stressors
      stress-ng: debug: [86] 2 stressors spawned
      stress-ng: debug: [89] stress-ng-vm: started [89] (instance 1)
      stress-ng: debug: [89] stress-ng-vm using method 'all'
      stress-ng: debug: [88] stress-ng-vm: started [88] (instance 0)
      stress-ng: debug: [88] stress-ng-vm using method 'all'
      Cannot contact dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj: hudson.remoting.RequestAbortedException: java.nio.channels.ClosedChannelException
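
      Until this is fixed, a possible stopgap, assuming the reproduction pipeline above, is to wrap the sh step in a timeout so the build does not hang indefinitely. This is only a workaround sketch; it does not address the root cause, and the disconnected node still needs to be cleaned up:

      stage('stress') {
        steps {
          // Workaround sketch: the lost agent is still not detected immediately,
          // but the step should be interrupted once the timeout expires instead
          // of hanging forever.
          timeout(time: 10, unit: 'MINUTES') {
            sh "stress-ng --vm 2 --vm-bytes 1G --timeout 30s -v"
          }
        }
      }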
      

  Attachments:
  1. agent-oom-killed-description.txt (4 kB, Allan BURDAJEWICZ)
  2. build.log (3 kB, Allan BURDAJEWICZ)
  3. durabletask-and-workflowdurabletask-fine.log (27 kB, Allan BURDAJEWICZ)
  4. kubernetes-plugin-fine.log (112 kB, Allan BURDAJEWICZ)

            Assignee: vlatombe (Vincent Latombe)
            Reporter: allan_burdajewicz (Allan BURDAJEWICZ)
            Votes: 2
            Watchers: 4
