Type: Improvement
Resolution: Unresolved
Priority: Minor
Labels: None
GKE cluster master and node pools version: 1.14
Cluster autoscaler enabled
Jenkins master LTS installed with the official Helm chart (1.1.24)
Kubernetes plugin: 1.19.0
I have had a sporadic bug occurring on my Jenkins installation for months now:
java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal Server Error'
    at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
    at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
io.fabric8.kubernetes.client.KubernetesClientException: error dialing backend: EOF
I believe it was already reported in these threads, and I understand that it is caused by an HTTP 500 returned by the Kubernetes API:
- https://issues.jenkins-ci.org/browse/JENKINS-39844
- https://stackoverflow.com/questions/50949718/kubernetes-gke-error-dialing-backend-eof-on-random-exec-command
However, after further investigation, I am now sure that the bug occurs only when the cluster autoscaler is on, and more precisely when the autoscaler scales down while a Jenkins build is running. It may be an edge case.
To fix this, I set the following annotation on all my pods in the podTemplate YAML:
cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
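For reference, this is roughly how that annotation lands in the pod template (a minimal sketch only; the container name and image are illustrative, not necessarily my exact setup):

apiVersion: v1
kind: Pod
metadata:
  annotations:
    # Ask the cluster autoscaler not to evict this pod on scale-down
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
    - name: jnlp
      image: jenkins/jnlp-slave:latest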
However, it didn't protect them. So I am now trying to set up a PodDisruptionBudget for each of my slave pods to protect them from eviction.
But when I pass the PDB into the podTemplate YAML, it is just totally ignored. How can I protect my Jenkins slave pods from eviction?
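For context, this is a sketch of the kind of standalone PDB that should block voluntary eviction of the agents (the jenkins: slave label is an assumption about how the plugin labels agent pods; policy/v1beta1 matches Kubernetes 1.14):

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: jenkins-slave-pdb
spec:
  # Disallow any voluntary eviction of matching pods
  maxUnavailable: 0
  selector:
    matchLabels:
      jenkins: slave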
is duplicated by:
- JENKINS-67167: in a kubernetes pod sh steps inside container() are failing sporadically (Open)

relates to:
- JENKINS-64848: Shell step failing randomly (Open)
- JENKINS-67474: Pipeline is failing due to io.fabric8.kubernetes.client.KubernetesClientException: not ready after n milliseconds (Closed)
We are also running Jenkins on GKE. At least we don't have the issue of a running Jenkins slave being 'moved' when the cluster scales down, but we have purposely created a node pool only for Jenkins slaves and sized it so that one Jenkins slave uses one node. With autoscaling it is relatively quick, but you can and should also keep one node running idle.
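To sketch how agent pods can be pinned to such a dedicated pool (the taint key, values, and pool name here are illustrative assumptions, not our exact setup; cloud.google.com/gke-nodepool is the node label GKE sets per pool):

apiVersion: v1
kind: Pod
spec:
  # Schedule only onto the dedicated Jenkins node pool
  nodeSelector:
    cloud.google.com/gke-nodepool: jenkins-slaves
  # Tolerate the taint that keeps other workloads off that pool
  tolerations:
    - key: dedicated
      operator: Equal
      value: jenkins
      effect: NoSchedule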
One thing to be aware of: we added the PDB to make sure Jenkins was not killed/moved, but we removed it again. When GKE is doing maintenance, a PDB only delays the eviction of a pod by one hour, which makes the whole process much slower, as GKE will wait up to an hour for every pod with a PDB.