  Jenkins / JENKINS-67664

KubernetesClientException: not ready after 5000 MILLISECONDS


Details

    • Blue Ocean - Candidates

    Description

      We have 4 Jenkins servers in 4 AKS clusters.

      All of a sudden, all Jenkins agent pods started giving the errors below. Some pods work and some fail; this happens 1-2 times out of 4-5 attempts.

      AKS version: 1.20.13

      Each cluster runs a different Jenkins version; I can reproduce this error on all of them.

      AKS-1:

      • kubernetes:1.30.1
      • kubernetes-client-api:5.10.1-171.vaa0774fb8c20
      • kubernetes-credentials:0.8.0

      AKS-2:

      • kubernetes:1.31.3
      • kubernetes-client-api:5.11.2-182.v0f1cf4c5904e
      • kubernetes-credentials:0.9.0

      AKS-3:

      • kubernetes:1.30.1
      • kubernetes-client-api:5.10.1-171.vaa0774fb8c20
      • kubernetes-credentials:0.8.0

      AKS-4:

      • kubernetes:1.31.3
      • workflow-job:1145.v7f2433caa07f
      • workflow-aggregator:2.6
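
      The failing call in the trace below is the kubernetes plugin exec-ing into the agent pod for an sh step that runs inside a container(...) block (ContainerExecDecorator -> PodOperationsImpl.exec). A minimal pipeline of that shape is sketched here for context; the label, container name and image are illustrative and not taken from our actual jobs:

      // Sketch only: any sh step inside container(...) goes through ContainerExecDecorator
      podTemplate(containers: [
          containerTemplate(name: 'maven', image: 'maven:3.8-openjdk-11', command: 'sleep', args: '99d')
      ]) {
          node(POD_LABEL) {
              stage('build') {
                  container('maven') {
                      sh 'mvn -version'   // fails intermittently with "not ready after 5000 MILLISECONDS"
                  }
              }
          }
      }

      Console output of a failing run: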
      21:00:49 io.fabric8.kubernetes.client.KubernetesClientException: not ready after 5000 MILLISECONDS
          at io.fabric8.kubernetes.client.utils.Utils.waitUntilReadyOrFail(Utils.java:176)
          at io.fabric8.kubernetes.client.dsl.internal.core.v1.PodOperationsImpl.exec(PodOperationsImpl.java:322)
          at io.fabric8.kubernetes.client.dsl.internal.core.v1.PodOperationsImpl.exec(PodOperationsImpl.java:84)
          at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.doLaunch(ContainerExecDecorator.java:427)
          at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.launch(ContainerExecDecorator.java:344)
          at hudson.Launcher$ProcStarter.start(Launcher.java:507)
          at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:176)
          at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:132)
          at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:324)
          at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:319)
          at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:193)
          at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
          at jdk.internal.reflect.GeneratedMethodAccessor546.invoke(Unknown Source)
          at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.base/java.lang.reflect.Method.invoke(Method.java:566)
          at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
          at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
          at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1213)
          at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
          at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42)
          at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
          at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
          at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:163)
          at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:23)
          at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:158)
          at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:161)
          at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:165)
          at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
          at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
          at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
          at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
          at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
          at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
          at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:17)
          at WorkflowScript.run(WorkflowScript:63)
          at __cps.transform__(Native Method)
          at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:86)
          at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113)
          at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83)
          at jdk.internal.reflect.GeneratedMethodAccessor286.invoke(Unknown Source)
          at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.base/java.lang.reflect.Method.invoke(Method.java:566)
          at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
          at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:107)
          at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83)
          at jdk.internal.reflect.GeneratedMethodAccessor286.invoke(Unknown Source)
          at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.base/java.lang.reflect.Method.invoke(Method.java:566)
          at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
          at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:89)
          at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113)
          at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83)
          at jdk.internal.reflect.GeneratedMethodAccessor286.invoke(Unknown Source)
          at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.base/java.lang.reflect.Method.invoke(Method.java:566)
          at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
          at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
          at com.cloudbees.groovy.cps.Next.step(Next.java:83)
          at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
          at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
          at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:129)
          at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:268)
          at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
          at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18)
          at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:51)
          at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:185)
          at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:402)
          at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$400(CpsThreadGroup.java:96)
          at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:314)
          at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:278)
          at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:67)
          at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
          at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
          at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
          at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
          at java.base/java.lang.Thread.run(Thread.java:829)
      21:00:49 [Bitbucket] Notifying commit build result
      21:00:50 [Bitbucket] Build result notified
      21:00:50 Finished: FAILURE

    Attachments

    Issue Links

    Activity

            mateusvtt Mateus Tanure added a comment -
            I'm wondering if this issue is only happening in the Azure AKS service?

            No, we have been running on AWS (without EKS) for two years and this issue only started happening recently.

            sbeaulie Samuel Beaulieu added a comment - - edited

            For the record, I have a similar issue under load after updating Jenkins core + the k8s plugin; however, I run the Jenkins agent pods on GKE.

            The behavior / error message presents differently (https://issues.jenkins.io/browse/JENKINS-68126), but the threads seem stuck in the watcher.

             

            Edit: what we see is twofold:

            1) It takes time for the plugin to spin up new pods even though there are a lot of jobs in the waiting queue. It's like it's lagging behind reality and doesn't know the queue is huge.

            2) When it does spin them up, they have time to start their JNLP connection, so we see the node connected in the UI, but the k8s plugin is stuck and never marks them as accepting tasks. I think it is still watching them to see whether all containers have started (they obviously did, since they are now connected). If we use a script to force them to accept tasks (below) it works, but the plugin's view and reality become out of sync: it thinks those agents are still provisioning, which skews the 'excess workload' number and exacerbates issue #1 above.

            // Script console workaround: force agents that are online but not yet
            // accepting tasks to start accepting them.
            Jenkins.instance.getNodes().each {
              def computer = it.toComputer()
              if (computer != null && computer.isOnline() && !computer.isAcceptingTasks()) {
                println("computer is online, set accepting tasks ${it}")
                computer.setAcceptingTasks(true)
              }
            }
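
            As a companion to the workaround above, a rough script-console check of the mismatch described in 1) and 2) (plain core Jenkins APIs, nothing plugin-specific; just a sketch):

            import jenkins.model.Jenkins

            // Compare the build queue with what the agents report: if many builds are
            // queued while online agents are not accepting tasks, we are in the state
            // described above.
            def jenkins = Jenkins.get()
            def queued = jenkins.queue.items.length
            def computers = jenkins.computers.toList()
            def online = computers.count { it.isOnline() }
            def accepting = computers.count { it.isOnline() && it.isAcceptingTasks() }
            println("queued builds: ${queued}, online agents: ${online}, accepting tasks: ${accepting}")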
            rohithg534 SparkC added a comment - - edited

            For me this 5000 MILLISECONDS issue was resolved after downgrading Jenkins and a few plugins. I tested it across 4-5 servers.

            Jenkins: version: "2.303.1"

            • kubernetes:1.30.1
            • kubernetes-client-api:5.4.1

            But I started getting another timeout error:

            Error: java.io.IOException: Timed out waiting for websocket connection. You should increase the value of system property org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout currently set at 30 seconds
                at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.doLaunch(ContainerExecDecorator.java:451)

            Issue: JENKINS-68332
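
            For reference, the 30-second limit named in that message is a JVM system property read by the controller, so it has to be set when Jenkins starts; for example (the value 60 is only illustrative, and whether raising it helps depends on why the websocket is slow to connect):

            # Example only: pass the property to the controller JVM at startup,
            # e.g. via JAVA_OPTS when running the official Jenkins container image.
            JAVA_OPTS="-Dorg.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout=60"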


            sbeaulie Samuel Beaulieu added a comment -

            I built my own version of the kubernetes-client-api plugin that includes fabric8 version 5.12.2, essentially this PR: https://github.com/jenkinsci/kubernetes-client-api-plugin/pull/149

            -        <revision>5.12.1</revision>
            +        <revision>5.12.2</revision> 

            And I pushed it to a staging server. I no longer see the error "KubernetesClientException: not ready after X", but my issue is still present where a lot of nodes show as idle / 'suspended' (the node is connected and the pod is running, but the k8s plugin does not know about it and still logs them as 'provisioning'); for example, the NodeProvisioner log does not match the k8s plugin log:

            In k8s logs:
            In provisioning: [REDACTED LIST OF 100+ nodes, jnlp-parent-8j46]
            In NodeProvisioner logs:
            [id=44]	INFO	hudson.slaves.NodeProvisioner#update: jnlp-parent-8j46j provisioning successfully completed. We have now 116 computer(s) 

            kirk Zoltán Haindrich added a comment -

            PR-149 was merged a few days ago and was released in 5.12.2-193.v26a_6078f65a_9, which is the version I see being used today... and I still see these "not ready after" errors.

            sbeaulie, I wonder if it's possible that you've made some other changes which might also have contributed to fixing this for you?


            People

              kylecronin Kyle Cronin
              rohithg534 SparkC
              Votes:
              14
              Watchers:
              34

              Dates

                Created:
                Updated: