Bug
Resolution: Unresolved
Blocker
None
The plugin is able to create the slave pod on GKE, but the watch that follows fails with the error below:
Jul 31, 2024 5:52:57 PM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher launch
Created Pod: kubernetes-gke-gcp default/jenkins-slave-gcp-gke-default-0f1x4
Jul 31, 2024 5:52:57 PM SEVERE io.fabric8.kubernetes.client.informers.impl.cache.Reflector onException
listSyncAndWatch failed for v1/namespaces/default/pods, will stop
Also: java.lang.Throwable: waiting here
    at io.fabric8.kubernetes.client.utils.Utils.waitUntilReady(Utils.java:174)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilCondition(BaseOperation.java:933)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilReady(BaseOperation.java:921)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilReady(BaseOperation.java:97)
    at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:222)
    at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:297)
    at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
    at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
    at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
io.fabric8.kubernetes.client.KubernetesClientException: Received 400 on websocket. Failure executing: GET at: https://gke-xxx.us-central1.gke.goog/api/v1/namespaces/default/pods?allowWatchBookmarks=true&fieldSelector=metadata.name%3Djenkins-slave-gcp-gke-default-0f1x4&resourceVersion=27395937&timeoutSeconds=600&watch=true. Message: Bad Request.
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:660)
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.lambda$start$3(WatchConnectionManager.java:86)
Caused: java.util.concurrent.CompletionException
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
    at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:141)
    at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$buildWebSocket$17(StandardHttpClient.java:244)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
    at io.fabric8.kubernetes.client.utils.AsyncUtils.lambda$retryWithExponentialBackoff$3(AsyncUtils.java:90)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
    at io.fabric8.kubernetes.client.okhttp.OkHttpWebSocketImpl$1.onFailure(OkHttpWebSocketImpl.java:88)
    at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.kt:592)
    at okhttp3.internal.ws.RealWebSocket$connect$1.onResponse(RealWebSocket.kt:174)
    at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:519)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Jul 31, 2024 5:52:57 PM WARNING org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher launch
Error in provisioning; agent=KubernetesSlave name: jenkins-slave-gcp-gke-default-0f1x4, template=dcc21005-5414-4e09-b32e-1d53376ffd45
Also: java.lang.Throwable: waiting here
    at io.fabric8.kubernetes.client.utils.Utils.waitUntilReady(Utils.java:174)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilCondition(BaseOperation.java:933)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilReady(BaseOperation.java:921)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilReady(BaseOperation.java:97)
    at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:222)
    at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:297)
    at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
    at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
    at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
io.fabric8.kubernetes.client.KubernetesClientException: Received 400 on websocket. Failure executing: GET at: https://gke-xxx.us-central1.gke.goog/api/v1/namespaces/default/pods?allowWatchBookmarks=true&fieldSelector=metadata.name%3Djenkins-slave-gcp-gke-default-0f1x4&resourceVersion=27395937&timeoutSeconds=600&watch=true. Message: Bad Request.
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:660)
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.lambda$start$3(WatchConnectionManager.java:86)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
    at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:141)
    at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$buildWebSocket$17(StandardHttpClient.java:244)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
    at io.fabric8.kubernetes.client.utils.AsyncUtils.lambda$retryWithExponentialBackoff$3(AsyncUtils.java:90)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
    at io.fabric8.kubernetes.client.okhttp.OkHttpWebSocketImpl$1.onFailure(OkHttpWebSocketImpl.java:88)
    at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.kt:592)
    at okhttp3.internal.ws.RealWebSocket$connect$1.onResponse(RealWebSocket.kt:174)
    at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:519)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Jul 31, 2024 5:52:57 PM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave _terminate
Terminating Kubernetes instance for agent jenkins-slave-gcp-gke-default-0f1x4
Jul 31, 2024 5:52:57 PM WARNING org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave _terminate
Agent pod jenkins-slave-gcp-gke-default-0f1x4 was not deleted due to retention policy Always.
Jul 31, 2024 5:52:57 PM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave _terminate
Disconnected computer jenkins-slave-gcp-gke-default-0f1x4
Jul 31, 2024 5:53:07 PM INFO io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager logEndError
Unknown Watch error received 11571 times without progress, will reconnect if possible
java.net.SocketTimeoutException: connect timed out
    at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.base/java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
On the GKE side, I see the following log entries:
DEFAULT 2024-07-31T17:52:57.799248Z [resource.labels.clusterName: gcp-dev] [protoPayload.serviceName: k8s.io] [protoPayload.methodName: io.k8s.core.v1.pods.get] [protoPayload.resourceName: core/v1/namespaces/default/pods/jenkins-slave-gcp-gke-default-0f1x4] [protoPayload.authenticationInfo.principalEmail: xxx@xxx.iam.gserviceaccount.com] pods "jenkins-slave-gcp-gke-default-0f1x4" not found
DEFAULT 2024-07-31T17:52:57.892006Z [resource.labels.clusterName: gcp-dev] [protoPayload.serviceName: k8s.io] [protoPayload.methodName: io.k8s.core.v1.pods.create] [protoPayload.resourceName: core/v1/namespaces/default/pods/jenkins-slave-gcp-gke-default-0f1x4] [protoPayload.authenticationInfo.principalEmail: xxx@xxx.iam.gserviceaccount.com] audit_log, method: "io.k8s.core.v1.pods.create", principal_email: "xxx@xxx.iam.gserviceaccount.com"
It seems the GET pod API call is made first and returns "pod not found", and only afterwards is the call to create that pod made.
I am working with Alpesh on this and have some additional information. The 400 error on the GET request does not always happen, but it happens most of the time: roughly 1 attempt in 100 succeeds and the worker pod comes up. My suspicion is that the GKE control plane is slower than some other Kubernetes runtimes, so the API does not yet acknowledge that the pod exists. Either this needs a retry loop that tolerates the 400 and retries a few times (preferred), or it needs to allow a delay. I believe the line in question is here: https://github.com/jenkinsci/kubernetes-plugin/blob/cc37496a1d10ad363c7ee5f3f0af66ba8be27896/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesLauncher.java#L222C22-L222C36
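To make the retry-loop suggestion concrete, here is a minimal sketch of what such a loop might look like. It is not code from the plugin or from the fabric8 client: the class name WatchRetry, the method retry, and its parameters are all hypothetical, and it uses only the JDK so it can be read in isolation.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

/** Hypothetical helper for illustration; not part of kubernetes-plugin or fabric8. */
final class WatchRetry {
    private WatchRetry() {}

    /**
     * Calls the action and returns its result. If the action throws an exception
     * accepted by the retryable predicate, sleeps and tries again, doubling the
     * delay after each failed attempt, up to maxAttempts attempts in total.
     * Any other exception, or the failure of the last attempt, is rethrown.
     */
    static <T> T retry(Callable<T> action, Predicate<Exception> retryable,
                       int maxAttempts, long initialDelayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Never retry an interruption; let the caller handle it.
                throw e;
            } catch (Exception e) {
                if (attempt >= maxAttempts || !retryable.test(e)) {
                    throw e;
                }
                Thread.sleep(delay);
                delay *= 2;
            }
        }
    }
}
```

In the plugin, the action would presumably wrap the waitUntilReady call at KubernetesLauncher.java:222, and the predicate would accept only a KubernetesClientException that reports the 400 seen in the log. Both of those are assumptions about where a fix could go, not a description of existing behavior.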