Type: Improvement
Resolution: Unresolved
Priority: Minor
Labels: None
GKE cluster master and node pools version: 1.14
Cluster autoscaler enabled
Jenkins master LTS installed with the official Helm chart (1.1.24)
Kubernetes plugin: 1.19.0
I have had a sporadic bug occurring on my Jenkins installation for months now:
java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal Server Error'
    at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
    at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
io.fabric8.kubernetes.client.KubernetesClientException: error dialing backend: EOF
I believe it was already reported in these threads, and I understand that it is caused by an HTTP 500 returned by the Kubernetes API:
- https://issues.jenkins-ci.org/browse/JENKINS-39844
- https://stackoverflow.com/questions/50949718/kubernetes-gke-error-dialing-backend-eof-on-random-exec-command
However, after further investigation, I am now sure that the bug occurs only when the cluster autoscaler is on, and more precisely when the autoscaler scales down while a Jenkins build is running. It may be an edge case.
To fix this, I set the following annotation on all my pods in the podTemplate YAML:
cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
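For reference, this is roughly how that annotation lands in the pod template (a minimal sketch only; the container name and image are illustrative, not necessarily my exact setup):

apiVersion: v1
kind: Pod
metadata:
  annotations:
    # Ask the cluster autoscaler not to evict this pod on scale-down
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
    - name: jnlp
      image: jenkins/jnlp-slave:latest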
However, it didn't protect them. So I am now trying to set up a PodDisruptionBudget for each of my slave pods to protect them from eviction.
But when I pass the PDB into the podTemplate YAML, it is just totally ignored. How can I protect my Jenkins slave pods from eviction?
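For context, this is a sketch of the kind of standalone PDB that should block voluntary eviction of the agents (the jenkins: slave label is an assumption about how the plugin labels agent pods; policy/v1beta1 matches Kubernetes 1.14):

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: jenkins-slave-pdb
spec:
  # Disallow any voluntary eviction of matching pods
  maxUnavailable: 0
  selector:
    matchLabels:
      jenkins: slave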
is duplicated by:
- JENKINS-67167: in a kubernetes pod sh steps inside container() are failing sporadically (Open)

relates to:
- JENKINS-64848: Shell step failing randomly (Open)
- JENKINS-67474: Pipeline is failing due to io.fabric8.kubernetes.client.KubernetesClientException: not ready after n milliseconds (Closed)
We are also running Jenkins on GKE. At least we don't have the issue of a running Jenkins slave being 'moved' when the cluster scales down, but we have purposely created a node pool only for Jenkins slaves and sized it so that one Jenkins slave uses one node. With autoscaling it is relatively quick, but you can and should also keep one node running idle.
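To sketch how agent pods can be pinned to such a dedicated pool (the taint key, values, and pool name here are illustrative assumptions, not our exact setup; cloud.google.com/gke-nodepool is the node label GKE sets per pool):

apiVersion: v1
kind: Pod
spec:
  # Schedule only onto the dedicated Jenkins node pool
  nodeSelector:
    cloud.google.com/gke-nodepool: jenkins-slaves
  # Tolerate the taint that keeps other workloads off that pool
  tolerations:
    - key: dedicated
      operator: Equal
      value: jenkins
      effect: NoSchedule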
One thing to be aware of: we added the PDB to make sure Jenkins was not killed/moved, but we removed it again. When GKE is doing maintenance, a PDB only delays the eviction of a pod by one hour, which makes the whole process much slower, as GKE will wait up to an hour for every pod with a PDB.