• Bug
    • Resolution: Unresolved
    • Blocker
    • kubernetes-plugin
    • None

      The plugin is able to create the slave pod on GKE, but the watch that follows fails with the error below:

      Jul 31, 2024 5:52:57 PM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher launch
      Created Pod: kubernetes-gke-gcp default/jenkins-slave-gcp-gke-default-0f1x4
      Jul 31, 2024 5:52:57 PM SEVERE io.fabric8.kubernetes.client.informers.impl.cache.Reflector onException
      listSyncAndWatch failed for v1/namespaces/default/pods, will stop
      Also:   java.lang.Throwable: waiting here 
        at io.fabric8.kubernetes.client.utils.Utils.waitUntilReady(Utils.java:174)
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilCondition(BaseOperation.java:933)
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilReady(BaseOperation.java:921)
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilReady(BaseOperation.java:97)
        at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:222)
        at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:297)
        at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
        at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
        at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
      at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)	
      at java.base/java.lang.Thread.run(Unknown Source)
      io.fabric8.kubernetes.client.KubernetesClientException: Received 400 on websocket. Failure executing: GET at: https://gke-xxx.us-central1.gke.goog/api/v1/namespaces/default/pods?allowWatchBookmarks=true&fieldSelector=metadata.name%3Djenkins-slave-gcp-gke-default-0f1x4&resourceVersion=27395937&timeoutSeconds=600&watch=true. Message: Bad Request.	
      at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:660)	
      at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.lambda$start$3(WatchConnectionManager.java:86)
      Caused: java.util.concurrent.CompletionException
      at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source)
      at java.base/java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source)
      at java.base/java.util.concurrent.CompletableFuture.uniHandle(Unknown Source)	
      at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(Unknown Source)	
      at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)	
      at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)	
      at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:141)	
      at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$buildWebSocket$17(StandardHttpClient.java:244)	
      at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)	
      at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)	
      at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)	
      at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)	
      at io.fabric8.kubernetes.client.utils.AsyncUtils.lambda$retryWithExponentialBackoff$3(AsyncUtils.java:90)	
      at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)	
      at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)	
      at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)	
      at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)	
      at io.fabric8.kubernetes.client.okhttp.OkHttpWebSocketImpl$1.onFailure(OkHttpWebSocketImpl.java:88)	
      at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.kt:592)	
      at okhttp3.internal.ws.RealWebSocket$connect$1.onResponse(RealWebSocket.kt:174)	
      at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:519)	
      at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)	
      at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)	
      at java.base/java.lang.Thread.run(Unknown Source)
      
      Jul 31, 2024 5:52:57 PM WARNING org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher launch
      Error in provisioning; agent=KubernetesSlave name: jenkins-slave-gcp-gke-default-0f1x4, template=dcc21005-5414-4e09-b32e-1d53376ffd45
      Also:   java.lang.Throwable: waiting here	
      at io.fabric8.kubernetes.client.utils.Utils.waitUntilReady(Utils.java:174)	
      at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilCondition(BaseOperation.java:933)	
      at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilReady(BaseOperation.java:921)	
      at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilReady(BaseOperation.java:97)	
      at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:222)	
      at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:297)	
      at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)	
      at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)	
      at java.base/java.util.concurrent.FutureTask.run(Unknown Source)	
      at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)	
      at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)	
      at java.base/java.lang.Thread.run(Unknown Source)
      
      io.fabric8.kubernetes.client.KubernetesClientException: Received 400 on websocket. Failure executing: GET at: https://gke-xxx.us-central1.gke.goog/api/v1/namespaces/default/pods?allowWatchBookmarks=true&fieldSelector=metadata.name%3Djenkins-slave-gcp-gke-default-0f1x4&resourceVersion=27395937&timeoutSeconds=600&watch=true. Message: Bad Request.	
      at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:660)	
      at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.lambda$start$3(WatchConnectionManager.java:86)
      at java.base/java.util.concurrent.CompletableFuture.uniHandle(Unknown Source)	
      at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(Unknown Source)	
      at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)	
      at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)	
      at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:141)	
      at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$buildWebSocket$17(StandardHttpClient.java:244)	
      at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)	
      at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)	
      at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)	
      at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)	
      at io.fabric8.kubernetes.client.utils.AsyncUtils.lambda$retryWithExponentialBackoff$3(AsyncUtils.java:90)	
      at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)	
      at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)	
      at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)	
      at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)	
      at io.fabric8.kubernetes.client.okhttp.OkHttpWebSocketImpl$1.onFailure(OkHttpWebSocketImpl.java:88)	
      at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.kt:592)	
      at okhttp3.internal.ws.RealWebSocket$connect$1.onResponse(RealWebSocket.kt:174)	
      at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:519)	
      at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)	
      at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)	
      at java.base/java.lang.Thread.run(Unknown Source)
      
      Jul 31, 2024 5:52:57 PM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave _terminate
      Terminating Kubernetes instance for agent jenkins-slave-gcp-gke-default-0f1x4
      Jul 31, 2024 5:52:57 PM WARNING org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave _terminate
      Agent pod jenkins-slave-gcp-gke-default-0f1x4 was not deleted due to retention policy Always.
      Jul 31, 2024 5:52:57 PM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave _terminate
      Disconnected computer jenkins-slave-gcp-gke-default-0f1x4
      Jul 31, 2024 5:53:07 PM INFO io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager logEndError
      Unknown Watch error received 11571 times without progress, will reconnect if possible
      java.net.SocketTimeoutException: connect timed out	
      at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)	
      at java.base/java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)

       

      On the GKE side, the audit log shows the following:

      DEFAULT 2024-07-31T17:52:57.799248Z [resource.labels.clusterName: gcp-dev] [protoPayload.serviceName: k8s.io]
      [protoPayload.methodName: io.k8s.core.v1.pods.get] 
      [protoPayload.resourceName: core/v1/namespaces/default/pods/jenkins-slave-gcp-gke-default-0f1x4] 
      [protoPayload.authenticationInfo.principalEmail: xxx@xxx.iam.gserviceaccount.com] 
      pods "jenkins-slave-gcp-gke-default-0f1x4" not found
      
      DEFAULT 2024-07-31T17:52:57.892006Z [resource.labels.clusterName: gcp-dev] [protoPayload.serviceName: k8s.io]
      [protoPayload.methodName: io.k8s.core.v1.pods.create] 
      [protoPayload.resourceName: core/v1/namespaces/default/pods/jenkins-slave-gcp-gke-default-0f1x4]
      [protoPayload.authenticationInfo.principalEmail: xxx@xxx.iam.gserviceaccount.com] 
      audit_log, method: "io.k8s.core.v1.pods.create", principal_email: "xxx@xxx.iam.gserviceaccount.com"

      It seems that a GET call for the pod is made first and returns "pod not found", and only afterwards is the call to create that pod made.

          [JENKINS-73538] GKE integration issue

          John Smilanick added a comment - I am working with Alpesh on this and have some additional information. The 400 error on the GET request does not always occur, but it occurs most of the time. Occasionally the request succeeds and the worker pod comes up, but only something like 1 time out of 100. My suspicion is that the GKE control plane is slower than some other Kubernetes runtimes and the API does not yet acknowledge that the pod exists. This needs either a retry loop that tolerates the 400 and retries a few times (preferred) or an option to add a delay. I believe the line in question is here: https://github.com/jenkinsci/kubernetes-plugin/blob/cc37496a1d10ad363c7ee5f3f0af66ba8be27896/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesLauncher.java#L222C22-L222C36
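
          The retry loop suggested in the comment above could look roughly like the sketch below. It is not the plugin's code and does not use the fabric8 client: RetryingWait, retry and looksLikeTransient400 are hypothetical names, and matching on the message text "Received 400 on websocket" (taken from the log in this issue) is only an assumption about how the transient failure might be recognised. A real fix would presumably inspect the client's exception type instead.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.function.Predicate;

// Hypothetical helper, for illustration only; not part of kubernetes-plugin or fabric8.
final class RetryingWait {

    /**
     * Runs the action, retrying with a doubling delay while the failure is
     * accepted by isRetryable and fewer than maxAttempts calls have been made.
     * The last failure is rethrown once attempts are exhausted.
     */
    static <T> T retry(Callable<T> action,
                       Predicate<Exception> isRetryable,
                       int maxAttempts,
                       Duration initialDelay) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Duration delay = initialDelay;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                // Give up on the last attempt or on failures we do not consider transient.
                if (attempt >= maxAttempts || !isRetryable.test(e)) {
                    throw e;
                }
                Thread.sleep(delay.toMillis());
                delay = delay.multipliedBy(2);
            }
        }
    }

    /** Assumption: the transient case is identified by the message seen in the log above. */
    static boolean looksLikeTransient400(Exception e) {
        String message = e.getMessage();
        return message != null && message.contains("Received 400 on websocket");
    }
}
```

          Under these assumptions, the wait at the line referenced above would be wrapped as something like RetryingWait.retry(() -> waitForPod(), RetryingWait::looksLikeTransient400, 5, Duration.ofSeconds(1)), where waitForPod stands in for whatever call currently blocks until the pod is ready. Any other failure is rethrown immediately, so real errors are not hidden by the retries.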

          John Smilanick added a comment - One more follow-up. I tested the same GET request on the pod myself (this required a pod retention policy of Always; otherwise the pod is deleted immediately), and it works just fine, so I am fairly certain this is a race condition specific to the GKE control plane being slower to update than expected.
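
          When repeating that manual check, it may help to see exactly which parameters the failing watch request carries. The sketch below decodes the query string of the URL logged in the exception; WatchUrlParams is a hypothetical helper written for this note, not something from the plugin or the client library.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class WatchUrlParams {

    /** Returns the decoded query parameters of the URL, in the order they appear. */
    static Map<String, String> parse(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        String query = URI.create(url).getRawQuery();
        if (query == null) {
            return params;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            // Decode after splitting so an encoded '=' (%3D) inside a value is preserved.
            params.put(URLDecoder.decode(key, StandardCharsets.UTF_8),
                       URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return params;
    }

    public static void main(String[] args) {
        String logged = "https://gke-xxx.us-central1.gke.goog/api/v1/namespaces/default/pods"
                + "?allowWatchBookmarks=true"
                + "&fieldSelector=metadata.name%3Djenkins-slave-gcp-gke-default-0f1x4"
                + "&resourceVersion=27395937&timeoutSeconds=600&watch=true";
        parse(logged).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

          Decoded, the request is a watch on the pod list filtered by fieldSelector metadata.name=jenkins-slave-gcp-gke-default-0f1x4, starting at resourceVersion 27395937 with a 600 second timeout. A manual test that fetches the pod without watch=true exercises a different request from the one that returned 400.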

            Unassigned
            akhambhala Alpesh
            Votes: 1
            Watchers: 5