
[kubernetes plugin] Protect Jenkins agent pods from eviction

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor
    • Component: kubernetes-plugin
    • Labels: None
    • Environment:
      GKE cluster master and node pools version: 1.14
      Cluster autoscaler activated
      Jenkins master LTS installed with official Helm chart (1.1.24)
      Kubernetes plugin: 1.19.0

      A sporadic bug has been occurring on my Jenkins installation for months now:

      java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal Server Error'
      at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
      at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
      at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
      at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)
      io.fabric8.kubernetes.client.KubernetesClientException: error dialing backend: EOF
      

      I believe this was already reported in other threads, and I understand that it is caused by an HTTP 500 returned by the Kubernetes API.

      However, after further investigation, I am now sure that the bug occurs only when the cluster autoscaler is on, and more precisely when the autoscaler scales down while a Jenkins build is running. It may be an edge case.

       

      To fix this, I set the following annotation on all my pods in the podTemplate YAML:

      cluster-autoscaler.kubernetes.io/safe-to-evict: "false" 
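
      For reference, a minimal sketch of where that annotation sits in the pod template YAML; the container name and image below are placeholders, not my actual values:

      apiVersion: v1
      kind: Pod
      metadata:
        annotations:
          cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
      spec:
        containers:
        - name: build
          image: alpine:latest
          command:
          - cat
          tty: true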

      However, it didn't protect them. So I am now trying to set up a PodDisruptionBudget for each of my agent pods to protect them from eviction.

      But when I pass the PDB into the podTemplate YAML, it is simply ignored. How can I protect my Jenkins agent pods from eviction?
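
      For the PodDisruptionBudget approach, a minimal sketch of such a manifest (the apiVersion matches a 1.14 cluster, and the label selector is an assumption that must match whatever labels the podTemplate applies to the agent pods):

      apiVersion: policy/v1beta1
      kind: PodDisruptionBudget
      metadata:
        name: jenkins-agents-pdb
      spec:
        maxUnavailable: 0
        selector:
          matchLabels:
            jenkins: slave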

          [JENKINS-59652] [kubernetes plugin] Protect Jenkins agent pods from eviction

          J Knurek added a comment -

          I found that we were experiencing similar problems. This happens to us when scaling up and running more than 40 nested slave pods. I looked into that Google issue and made an attempt to reproduce (wasn't able to).

          I've been investigating a little further and found that the nodes themselves are crashing with `docker daemon exited`. This doesn't evict the Jenkins pods, but it does put them in a non-Running state, and Jenkins loses connection to them and fails the job.

          In summary, I don't yet know how to address this and keep our builds from failing, but I also don't think it's specific to GKE's autoscaling. 


          Allan BURDAJEWICZ added a comment - edited

          We have seen an environment recently (in GKE) where disabling the node-problem-detector helped. Supposedly because the node-problem-detector may restart kubelet which could cause intermittent disconnection.


          Krystan added a comment -

          We have now encountered this issue on EKS unfortunately.


          Berker added a comment -

          Is there any plan to deal with this issue?


          Jonathan Rogers added a comment -

          My Jenkins jobs have often failed as a result of HTTP 500 replies from the Kubernetes API server, as described in this issue. I have configured my cluster to scale up to run pods created by the Jenkins Kubernetes plugin. Those pods only run on non-preemptible nodes and are not subject to eviction. After reading the Google issue, I don't know whether to blame the Docker daemon, the Kubernetes cluster autoscaler, some other component of Kubernetes, or something specific to the way Google runs Kubernetes clusters.

          Since AFAICT, the 500s are intermittent and subsequent exec calls to the running pod can succeed, I have worked around the problem by adding a step which wraps the built-in "sh" pipeline step. Regardless of whether the root cause is ever dealt with, I think it would be a good idea to incorporate similar retry logic into the Kubernetes plugin. The file is "shwithRetry.groovy" in the "vars" directory of my global pipeline library:

           

          // vars/shwithRetry.groovy: wraps the built-in "sh" step and retries it
          // whenever the Kubernetes client loses the exec connection to the pod.
          def call(String script) {
            while (true) {
              try {
                return sh(script: script)
              } catch (io.fabric8.kubernetes.client.KubernetesClientException e) {
                echo "Retrying after catching ${e}"
              }
            }
          }

          // Dispatch on the supported "sh" options so return values are preserved.
          def sh_with_retry_inner(Map args) {
            if (args.returnStatus) {
              return sh(script: args.script, returnStatus: true)
            } else if (args.returnStdout) {
              return sh(script: args.script, returnStdout: true)
            } else {
              sh args.script
            }
          }

          def call(Map args) {
            while (true) {
              try {
                return sh_with_retry_inner(args)
              } catch (io.fabric8.kubernetes.client.KubernetesClientException e) {
                echo "Retrying after catching ${e}"
              }
            }
          }
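
          For illustration, a hypothetical call of that step from a Pipeline; the container name and commands below are placeholders, not taken from my actual jobs:

          container('build') {
            // plain String form, analogous to: sh 'printenv | sort'
            shwithRetry('printenv | sort')
            // Map form, mirroring: sh(script: ..., returnStatus: true)
            def rc = shwithRetry(script: 'make test', returnStatus: true)
            echo "make test exited with ${rc}"
          }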

           


          Pirx Danford added a comment -

          We are facing the same issue with https://issues.jenkins.io/browse/JENKINS-64848 and a retry function works quite OK to catch it. I like the global approach by jrogers and might try that out, but of course a general fix within the plugin would be very welcome.
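
          For anyone without a shared library, a coarser sketch of such a retry is the built-in retry step wrapped around the container block (the retry count and container name are arbitrary examples). Note that, unlike the exception-specific loop above, a plain retry also re-runs genuine script failures:

          retry(3) {
            container('git') {
              sh 'printenv | sort'
            }
          }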


          Jonathan Rogers added a comment -

          Something seems to have changed recently in one of the Jenkins plugins that resulted in my custom step no longer retrying. I think exceptions of class io.fabric8.kubernetes.client.KubernetesClientException are now caught somewhere else. I changed my step to also catch java.net.ProtocolException.
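
          A minimal sketch of the adjusted loop, assuming the same shwithRetry.groovy layout shown earlier:

          def call(Map args) {
            while (true) {
              try {
                return sh_with_retry_inner(args)
              } catch (io.fabric8.kubernetes.client.KubernetesClientException e) {
                echo "Retrying after catching ${e}"
              } catch (java.net.ProtocolException e) {
                echo "Retrying after catching ${e}"
              }
            }
          }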


          Jesse Glick added a comment -

          The Google bug has been closed.

          Are people observing this problem using the container step (shown in JENKINS-64848 as ContainerExecDecorator but not in any stack trace or Pipeline snippet here)? Currently this step routes control flow through the API server, which usually works but is inefficient and fragile; it should be rewritten to use, say, a named pipe shared with the jnlp (agent JVM) container which would stream messages over the Remoting channel.


          Niklas Grebe added a comment -

          jglick, indeed we encounter the very same issue and are using containers in our Pipeline.

          Error with stacktrace:

           

          Executing sh script inside container foo of pod bar-1234-abcde-1a2b3-123a4
          11:16:19  java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal Server Error'
          11:16:19  	at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
          11:16:19  	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
          11:16:19  	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
          11:16:19  	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
          11:16:19  	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          11:16:19  	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          11:16:19  	at java.lang.Thread.run(Thread.java:748)
          [...]
          Also:   hudson.remoting.ProxyException: java.lang.Throwable: waiting here
          		at io.fabric8.kubernetes.client.utils.Utils.waitUntilReady(Utils.java:151)
          		at io.fabric8.kubernetes.client.dsl.internal.ExecWebSocketListener.waitUntilReady(ExecWebSocketListener.java:188)
          		at io.fabric8.kubernetes.client.dsl.internal.core.v1.PodOperationsImpl.exec(PodOperationsImpl.java:331)
          		at io.fabric8.kubernetes.client.dsl.internal.core.v1.PodOperationsImpl.exec(PodOperationsImpl.java:86)
          		at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.doLaunch(ContainerExecDecorator.java:421)
          		at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.launch(ContainerExecDecorator.java:338)
          		at hudson.Launcher$ProcStarter.start(Launcher.java:508)

          We use a pipeline that looks like this:

          pipeline {
              agent {
                  kubernetes {
                      cloud 'ci-cloud'
                      yaml """
          apiVersion: v1
          kind: Pod
          metadata:
            annotations:
              cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
          spec:
            containers:
            - name: git
              image: alpine/git:latest
              command:
              - cat
              tty: true
            resources:
              requests:
                memory: 1Gi
                cpu: 1
          """
                  }
              }
              stages {
                  stage('Checkout') {
                      steps {
                          container('git') {
                              script {
                                  println "Environment variables of this Jenkins job:"
                                  sh 'printenv | sort'
                              }
                          }
                      }
                  }
              }
          }

          and we run on GKE (without Autopilot enabled).

           

           


          Allan BURDAJEWICZ added a comment - edited

          There is an open PR that should help alleviate the problem by implementing a retry mechanism: https://github.com/jenkinsci/kubernetes-plugin/pull/1212.
          At least until the ContainerExecDecorator is rewritten as Jesse mentioned.


            Assignee: Unassigned
            Reporter: Jonathan Pigrée (jpigree)
            Votes: 5
            Watchers: 19