
KubernetesClientException: not ready after 5000 MILLISECONDS

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Component: kubernetes-plugin
    • None
    • Prod
    • Blue Ocean - Candidates
    • 3690.va_9ddf6635481

      We have 4 Jenkins servers in 4 AKS clusters.

      All of a sudden, Jenkins agent pods started giving the errors below. Some pods work and some fail; this happens 1-2 times out of every 4-5 attempts.

      AKS version: 1.20.13

      Each cluster runs a different Jenkins/plugin version, and I can reproduce this error on all of them:

      AKS-1:

      • kubernetes:1.30.1
      • kubernetes-client-api:5.10.1-171.vaa0774fb8c20
      • kubernetes-credentials:0.8.0

      AKS-2:

      • kubernetes:1.31.3
      • kubernetes-client-api:5.11.2-182.v0f1cf4c5904e
      • kubernetes-credentials:0.9.0

      AKS-3:

      • kubernetes:1.30.1
      • kubernetes-client-api:5.10.1-171.vaa0774fb8c20
      • kubernetes-credentials:0.8.0

      AKS-4:

      • kubernetes:1.31.3
      • workflow-job:1145.v7f2433caa07f
      • workflow-aggregator:2.6
        21:00:49 io.fabric8.kubernetes.client.KubernetesClientException: not ready after 5000 MILLISECONDS
        21:00:49 at io.fabric8.kubernetes.client.utils.Utils.waitUntilReadyOrFail(Utils.java:176)
        21:00:49 at io.fabric8.kubernetes.client.dsl.internal.core.v1.PodOperationsImpl.exec(PodOperationsImpl.java:322)
        21:00:49 at io.fabric8.kubernetes.client.dsl.internal.core.v1.PodOperationsImpl.exec(PodOperationsImpl.java:84)
        21:00:49 at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.doLaunch(ContainerExecDecorator.java:427)
        21:00:49 at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.launch(ContainerExecDecorator.java:344)
        21:00:49 at hudson.Launcher$ProcStarter.start(Launcher.java:507)
        21:00:49 at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:176)
        21:00:49 at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:132)
        21:00:49 at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:324)
        21:00:49 at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:319)
        21:00:49 at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:193)
        21:00:49 at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
        21:00:49 at jdk.internal.reflect.GeneratedMethodAccessor546.invoke(Unknown Source)
        21:00:49 at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        21:00:49 at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        21:00:49 at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
        21:00:49 at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
        21:00:49 at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1213)
        21:00:49 at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
        21:00:49 at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42)
        21:00:49 at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
        21:00:49 at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
        21:00:49 at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:163)
        21:00:49 at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:23)
        21:00:49 at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:158)
        21:00:49 at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:161)
        21:00:49 at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:165)
        21:00:49 at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
        21:00:49 at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
        21:00:49 at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
        21:00:49 at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
        21:00:49 at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
        21:00:49 at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
        21:00:49 at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:17)
        21:00:49 at WorkflowScript.run(WorkflowScript:63)
        21:00:49 at __cps.transform__(Native Method)
        21:00:49 at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:86)
        21:00:49 at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113)
        21:00:49 at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83)
        21:00:49 at jdk.internal.reflect.GeneratedMethodAccessor286.invoke(Unknown Source)
        21:00:49 at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        21:00:49 at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        21:00:49 at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
        21:00:49 at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:107)
        21:00:49 at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83)
        21:00:49 at jdk.internal.reflect.GeneratedMethodAccessor286.invoke(Unknown Source)
        21:00:49 at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        21:00:49 at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        21:00:49 at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
        21:00:49 at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:89)
        21:00:49 at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113)
        21:00:49 at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83)
        21:00:49 at jdk.internal.reflect.GeneratedMethodAccessor286.invoke(Unknown Source)
        21:00:49 at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        21:00:49 at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        21:00:49 at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
        21:00:49 at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
        21:00:49 at com.cloudbees.groovy.cps.Next.step(Next.java:83)
        21:00:49 at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
        21:00:49 at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
        21:00:49 at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:129)
        21:00:49 at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:268)
        21:00:49 at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
        21:00:49 at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18)
        21:00:49 at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:51)
        21:00:49 at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:185)
        21:00:49 at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:402)
        21:00:49 at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$400(CpsThreadGroup.java:96)
        21:00:49 at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:314)
        21:00:49 at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:278)
        21:00:49 at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:67)
        21:00:49 at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        21:00:49 at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
        21:00:49 at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
        21:00:49 at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
        21:00:49 at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        21:00:49 at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        21:00:49 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        21:00:49 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        21:00:49 at java.base/java.lang.Thread.run(Thread.java:829)
        21:00:49 [Bitbucket] Notifying commit build result
        21:00:50 [Bitbucket] Build result notified
        21:00:50 Finished: FAILURE

          [JENKINS-67664] KubernetesClientException: not ready after 5000 MILLISECONDS

          Alex B added a comment -

          We investigated this and found that increasing the "Connection Timeout" in the Jenkins Kubernetes plugin resolved the issue for us.

          i.e. Manage Jenkins -> Manage Nodes and Clouds -> Configure Clouds -> Kubernetes Cloud details

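          For anyone who prefers to script this rather than click through the UI, below is a minimal Script Console sketch of the same change. The property names (connectTimeout, readTimeout) are assumptions based on the fields shown in the cloud configuration form; verify them against your kubernetes-plugin version before running.

          import jenkins.model.Jenkins
          import org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud

          // Bump the connection/read timeouts on every configured Kubernetes cloud.
          // Values are in seconds; 15/60 below are arbitrary example values.
          Jenkins.get().clouds.findAll { it instanceof KubernetesCloud }.each { KubernetesCloud cloud ->
              println "Before: ${cloud.name} connectTimeout=${cloud.connectTimeout} readTimeout=${cloud.readTimeout}"
              cloud.connectTimeout = 15   // assumed setter, mirrors the "Connection Timeout" field
              cloud.readTimeout = 60      // assumed setter, mirrors the "Read Timeout" field
          }
          Jenkins.get().save()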

          Miguel Alexandre added a comment -

          We're also having the same issue; we use AWS EKS.

          The issue happens way more frequently in parallel blocks reusing the same container.

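          To illustrate the pattern Miguel describes, here is a hypothetical scripted-pipeline sketch (labels and images are placeholders, not taken from this report): several parallel branches exec into the same container of one agent pod, so each sh step opens its own exec websocket against the API server.

          podTemplate(containers: [
              containerTemplate(name: 'build', image: 'maven:3.8-jdk-11', command: 'sleep', args: '99d')
          ]) {
              node(POD_LABEL) {
                  def branches = [:]
                  for (int i = 0; i < 5; i++) {
                      def idx = i   // capture the loop variable for the closure
                      branches["branch-${idx}"] = {
                          // each branch execs into the same 'build' container of this pod
                          container('build') {
                              sh "echo running branch ${idx}"
                          }
                      }
                  }
                  parallel branches
              }
          }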

          SparkC added a comment -

          Thanks Alex and Miguel,
          Currently I have the values below configured on my Jenkins server. How much do I need to increase them? If I'm not wrong, this is the first troubleshooting step we considered. We already increased the timeout and it did not work as expected, so we reverted the change back to the original values. Has this connection timeout value changed in a recent release?

          Connection Timeout: 5
          Read Timeout: 15
          Concurrency Limit: 10
          Max connections to Kubernetes API: 32
          Seconds to wait for pod to be running: 600


          SparkC added a comment -

          Updated Connection Timeout from 5 to 15 and I am still getting the same error. If the value changed, why am I still seeing 5000 MILLISECONDS?

          16:42:34 io.fabric8.kubernetes.client.KubernetesClientException: not ready after 5000 MILLISECONDS
          16:42:34 at io.fabric8.kubernetes.client.utils.Utils.waitUntilReadyOrFail(Utils.java:176)
          16:42:34 at io.fabric8.kubernetes.client.dsl.internal.core.v1.PodOperationsImpl.exec(PodOperationsImpl.java:322)
          16:42:34 at io.fabric8.kubernetes.client.dsl.internal.core.v1.PodOperationsImpl.exec(PodOperationsImpl.java:84)
          16:42:34 at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.doLaunch(ContainerExecDecorator.java:413)
          16:42:34 at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.launch(ContainerExecDecorator.java:330)
          16:42:34 at hudson.Launcher$ProcStarter.start(Launcher.java:507)
          16:42:34 at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:176)
          16:42:34 at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:132)
          16:42:34 at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:324)
          16:42:34 at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:319)
          16:42:34 at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:193)
          16:42:34 at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
          16:42:34 at jdk.internal.reflect.GeneratedMethodAccessor6345.invoke(Unknown Source)
          16:42:34 at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          16:42:34 at java.base/java.lang.reflect.Method.invoke(Method.java:566)
          16:42:34 at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
          16:42:34 at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
          16:42:34 at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1213)
          16:42:34 at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022


          Adam Placzek added a comment -

          rohithg534 you can increase this timeout by adding something like the following to JAVA_OPTS:

          -Dorg.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout=60
          -Dorg.csanchez.jenkins.plugins.kubernetes.pipeline.websocketConnectionTimeout=60000
          -Dkubernetes.websocket.timeout=60000
          More info here:
          https://support.cloudbees.com/hc/en-us/articles/360054642231-Considerations-for-Kubernetes-Clients-Connections-when-using-Kubernetes-Plugin

          Unfortunately, in my case it increased the timeout, but the error still happens, now after 60000 MILLISECONDS.

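          One way to check whether such -D flags were actually picked up by the controller JVM (relevant to the question above about still seeing 5000 MILLISECONDS) is a quick Script Console sketch like the following; it only reads the properties and is safe to run.

          // Print the effective values of the timeout-related system properties.
          [
              'org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout',
              'org.csanchez.jenkins.plugins.kubernetes.pipeline.websocketConnectionTimeout',
              'kubernetes.websocket.timeout'
          ].each { name ->
              println "${name} = ${System.getProperty(name) ?: '<not set>'}"
          }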

          SparkC added a comment -

          Adam, thanks for providing the info. I tried a few more things; let me share them with you. I created a brand new AKS cluster and installed Jenkins, and jobs are still failing with the same error message. This time the error message is short and simple:

          [Pipeline] // stage
          [Pipeline] echo
          io.fabric8.kubernetes.client.KubernetesClientException: not ready after 5000 MILLISECONDS
          [Pipeline] cleanWs
          [WS-CLEANUP] Deleting project workspace...
          [WS-CLEANUP] Deferred wipeout is used...
          [WS-CLEANUP] done

          Where do I need to add this JAVA_OPTS variable with those -D args? Is it on the node or the master? Which file is it?


          Adam Placzek added a comment -

          Hi, add it as an environment variable in the Jenkins master deployment.


          SparkC added a comment -

          Thanks Adam, I installed Jenkins like this, adding these values inside the values.yaml file:

          # Custom values for jenkins.
          controller:
            tag: "2.319.2"
            imagePullPolicy: "Always"
            javaOpts: >-
              -Dorg.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout=90
              -Dorg.csanchez.jenkins.plugins.kubernetes.pipeline.websocketConnectionTimeout=80000
              -Dkubernetes.websocket.timeout=70000
              -Xms2G
              -Xmx2G

          But I still get the same error:

          io.fabric8.kubernetes.client.KubernetesClientException: not ready after 70000 MILLISECONDS

          Does anyone here have any idea what's going on? This is blocking our entire CI/CD in my org.


          SparkC added a comment -

          Hi guys, can anyone help me figure out why this is not working?
          I tried all the inputs you suggested and nothing has worked so far. It's kind of a blocker for us. I'd appreciate your support in helping me with this.


          Mirek Kotek added a comment - - edited

          Hi,
          A few days ago we did a full update of our two Jenkins masters.
          The first one is on AKS, and we do not have any kind of issue:

          k8s: 1.21.7
          Jenkins: 2.319.2
          Kubernetes Client API Plugin: 5.11.2-182.v0f1cf4c5904e
          Kubernetes plugin: 1.31.3
          AKS URL: https://kubernetes.default.svc.cluster.local:443

          Unfortunately the same setup on-prem failed miserably, and we went back to very old versions of the plugins:

          k8s: 1.21.7
          Jenkins: 2.319.2
          Kubernetes Client API Plugin: 4.13.3-1
          Kubernetes plugin: 1.29.7
          AKS URL: https://xxxyyy.azmk8s.io:443

          kubernetes.websocket.timeout and org.csanchez.jenkins.plugins.kubernetes.pipeline.websocketConnectionTimeout did not help us.
          We still sometimes see:

          java.net.SocketTimeoutException: sent ping but didn't receive pong within 30000ms (after 0 successful ping/pongs)

          but at least it does not break our builds anymore.


          SparkC added a comment - - edited

          So after downgrading to the plugin version below, did Jenkins give any errors saying you need to use the latest version of the k8s plugin, or anything like that?

          Looks like you downgraded only the Kubernetes plugin and the Kubernetes Client API plugin, and kept everything else at the latest version?

          • Kubernetes Client API Plugin: 4.13.3-1

          Did this resolve the 5000 MILLISECONDS error?


          Kai B added a comment -

          Hi, I can confirm the issue; as described above, it also usually happens in parallel steps (using the same container).


          SparkC added a comment -

          Kai,
          by any chance did you resolve that error?


          Mirek Kotek added a comment -

          rohithg534 on-prem every plugin is at the latest version; only the Kubernetes Client API Plugin and the Kubernetes plugin have been downgraded.
          The 5000 MILLISECONDS error does not occur any more.


          Peter Simms added a comment -

          Can confirm we are seeing similar issues

          Is the temporary workaround to downgrade plugins to the below?

          • Kubernetes Client API Plugin: 4.13.3-1
          • Kubernetes plugin: 1.29.7


          HerveLeMeur added a comment -

          Related (duplicate?): https://issues.jenkins.io/browse/JENKINS-67474

          Eddie Simeon added a comment -

          The 5s Error went away after we downgraded to:

          Jenkins: 2.303.1

          Kubernetes: 1.30.1

          Kubernetes Client API: 5.4.1

           

          It has been 24 hours of normal load and we have not experienced the 5s error since applying the above downgrades.


          SparkC added a comment -

          Hi Eddie,
          is it happening again after making the above changes?


          Eddie Simeon added a comment -

          rohithg534 - It has not happened again after making the above changes.


          Jesse Glick added a comment -

          See JENKINS-67167. Avoid using the container step as it is not reliable under load.

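          A hedged sketch of what avoiding the container step can look like: bake the build tools into the agent (jnlp) image so that sh runs over the Remoting channel instead of a container exec websocket. The image name below is a placeholder, not an image referenced in this issue.

          podTemplate(yaml: '''
          apiVersion: v1
          kind: Pod
          spec:
            containers:
            - name: jnlp
              image: my-registry/jenkins-agent-with-build-tools:latest
          ''') {
              node(POD_LABEL) {
                  // no container() block: the step executes in the jnlp container itself
                  sh 'mvn -B verify'
              }
          }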

          dev.samples added a comment -

          jglick This seems like quite a significant change/recommendation when it comes to writing Jenkins pipelines running in Kubernetes / using the kubernetes-plugin?

          We have a significant number of scripted pipelines using the container step, and they are also affected by this sporadic error. Before refactoring all our pipelines to basically use only the jnlp container/image, it would be nice to have a public announcement that this will be the new standard for writing pipelines, plus I guess the container step would then be deprecated/removed?

          Or maybe there is some work being done on making it more reliable?


          Mateus Tanure added a comment -

          We also have a lot of pipelines using `container`, and we have several different workloads where each one has its own Docker image. We can't just start using the same JNLP image for all of them; the image would be huge.

          What we have done to mitigate this was to downgrade as eddiesimeon suggested: Kubernetes: 1.30.1 and Kubernetes Client API: 5.4.1 was enough; it wasn't necessary to downgrade the server itself.


          Allan BURDAJEWICZ added a comment - - edited

          It looks like waitUntilReady went through some refactoring, now using a completable future and also does not throw an InterruptedException anymore since k8s client 5.5.0:

          #3271 Waitable.waitUntilReady and Waitable.waitUntilCondition will throw a KubernetesClientTimeoutException instead of an IllegalArgumentException on timeout. The methods will also no longer throw an interrupted exception. Waitable.withWaitRetryBackoff and the associated constants are now deprecated.
          Util Changes:
          #3197 Utils.waitUntilReady now accepts a Future, rather than a BlockingQueue

          See https://github.com/fabric8io/kubernetes-client/pull/3274 and https://github.com/fabric8io/kubernetes-client/pull/3197.

          That would affect the UX in ContainerExecDecorator (https://github.com/jenkinsci/kubernetes-plugin/blob/kubernetes-1.30.1/src/main/java/org/csanchez/jenkins/plugins/kubernetes/pipeline/ContainerExecDecorator.java#L419-L444), which used to show a different message when this issue happened.

          For the logic itself, I am not too sure. The websocket timeout has always been there and applied to that code path as far as I can tell.

          It seems to me that maybe this is the same problem related to the "Max connections to Kubernetes API", as outlined in Considerations for Kubernetes Clients Connections when using Kubernetes Plugin, but reported differently by the client?

          And it looks like the catching of InterruptedException in the decorator is now irrelevant?

          cc vlatombe

          By the way, a later change in k8s client 5.11.0 fixed some kind of open WS connection leak https://github.com/fabric8io/kubernetes-client/issues/3598. Maybe that could also contribute to the problem. The upcoming release of k8s plugin should fix: https://github.com/jenkinsci/kubernetes-plugin/commit/9ec812ed4ba5018d26f3a3744e10a527060a1748.

          • What does a thread dump look like when the issue happens?
          • Anybody impacted on a k8s version earlier than 1.31.0?

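          For the thread-dump question above, a dump can also be captured ad hoc from the Script Console with plain JDK calls; a minimal sketch:

          // Print a thread dump of the controller JVM to the Script Console output.
          Thread.getAllStackTraces().each { thread, frames ->
              println "\"${thread.name}\" ${thread.state}"
              frames.each { println "    at ${it}" }
              println ''
          }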

          Vincent Latombe added a comment -

          allan_burdajewicz your analysis looks fine to me. I've filed https://github.com/jenkinsci/kubernetes-plugin/pull/1159 to adapt to the new kubernetes-client behaviour. This won't change the behaviour, but it will at least give an error message that can be understood a bit more easily.


          dev.samples added a comment - - edited

          As suggested above I tried to install:

          Kubernetes 1.30.1
          Kubernetes Client API:  5.4.1

          Running Jenkins 2.332.1

          But now I just get this instead from time to time:

          java.io.IOException: Timed out waiting for websocket connection. You should increase the value of system property org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout currently set at 30 seconds
          	at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.doLaunch(ContainerExecDecorator.java:451)
          	at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.launch(ContainerExecDecorator.java:338)
          

          So it seems the error is still present in the above versions, but with another stack trace/message?


          Allan BURDAJEWICZ added a comment -

          dev_samples Then, as suspected, this is most likely a problem related to the maximum number of connections to the k8s API. Have you tried increasing the "Max connections to Kubernetes API" in the Kubernetes cloud configuration? Maybe double it and see the impact; maybe it'll happen less frequently.
          Since k8s plugin 1.31.0, a more generic type of exception is thrown for such issues, reporting the message io.fabric8.kubernetes.client.KubernetesClientException: not ready after <NB_MS> MILLISECONDS. But the problem is the same, and so is the solution.
          See also Considerations for Kubernetes Clients Connections when using Kubernetes Plugin, which provides some explanation of what consumes API calls (steps in container blocks, provisioning, ...).


          dev.samples added a comment - - edited

          allan_burdajewicz Right now it's set to 32 (the default, I guess, since I have not changed it). I have one Jenkins instance running one job in a 5-node AKS cluster doing a basic Gradle build, so I don't see how that limit can be reached.

          Also, as I mentioned, the error is:

           

          java.io.IOException: Timed out waiting for websocket connection. You should increase the value of system property org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout currently set at 30 seconds
           at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.doLaunch(ContainerExecDecorator.java:451)
           at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.launch(ContainerExecDecorator.java:338)
           
          

          So it's a timeout error, not a "max number of connections reached" error.

          I don't see that limit (30 seconds) in the jenkins configuration anywhere. I do however see this in the AKS documentation:

          https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/azure-subscription-service-limits#api-management-limits

          Maximum total request duration: 30 seconds

          So maybe the 5000 milliseconds error/timeout is indeed solved by downgrading the plugins above and you just hit that error "before" the 30 seconds timeout (which appears to be cloud provider specific) when NOT downgrading the plugins.


          Jesse Glick added a comment -

          > Or maybe there is some work being done on making it more reliable?

          Not currently. I believe it is possible to rewrite this system to bypass the API server and communicate directly between the controller, the agent container, and the user-defined container, which ought to be at least as reliable as sh steps run on the agent container (i.e. just needs the Remoting channel to the agent pod to be working).


          Artem Chernenko added a comment -

          In my Jenkins installation I even increased
          org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout
          to 60 seconds, and the error still occurs.


          dev.samples added a comment -

          achernenkokaaiot Hm, maybe changing that has no effect if it's the API server (provider-specific) timeout that's the reason. E.g. in AKS it seems to be 30 secs:

          https://issues.jenkins.io/browse/JENKINS-67664?focusedCommentId=423929&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-423929

          so changing limits on the Jenkins side would probably not solve anything. Not sure about this yet, though.


          Allan BURDAJEWICZ added a comment - - edited

          My understanding is that the okhttp client connection pool size (controlled by Max Connection to k8s API) still has an impact because of the asynchronous logic of the webSocket created by the okhttp client.

          The webSocket object is created and returned in k8s plugin at that line: https://github.com/jenkinsci/kubernetes-plugin/blob/3578.vb_9a_92ea_9845a_/src/main/java/org/csanchez/jenkins/plugins/kubernetes/pipeline/ContainerExecDecorator.java#L428. All the way down, this eventually creates a future at https://github.com/fabric8io/kubernetes-client/blob/43bd021ef9acf9bad12be777e49c3716873f77d8/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/okhttp/OkHttpWebSocketImpl.java#L51-L106.

          Per https://square.github.io/okhttp/4.x/okhttp/okhttp3/-web-socket/-factory/new-web-socket/:

          > Creating a web socket initiates an asynchronous process to connect the socket.

          The k8s plugin then waits for the connection to be established with the high-level ContainerExecDecorator.websocketConnectionTimeout timeout at https://github.com/jenkinsci/kubernetes-plugin/blob/3578.vb_9a_92ea_9845a_/src/main/java/org/csanchez/jenkins/plugins/kubernetes/pipeline/ContainerExecDecorator.java#L445. Here we are waiting for the initiated websocket to have eventually connected, in which case the started latch is properly set: https://github.com/jenkinsci/kubernetes-plugin/blob/3578.vb_9a_92ea_9845a_/src/main/java/org/csanchez/jenkins/plugins/kubernetes/pipeline/ContainerExecDecorator.java#L395.

          Failure to get any feedback from this listener may be caused by the okhttp connection pool being overloaded (at the configured limit) and unable to process newer requests within the time allowed for the websocket.

          > I don't see that limit (30 seconds) in the jenkins configuration anywhere. I do however see this in the AKS documentation:

          It is a default value set at https://github.com/jenkinsci/kubernetes-plugin/blob/3578.vb_9a_92ea_9845a_/src/main/java/org/csanchez/jenkins/plugins/kubernetes/pipeline/ContainerExecDecorator.java#L85-L86
          It is not clear to me whether the Maximum total request duration limit in that Azure document applies to AKS. It seems that it only applies to API Management, which is a specific Azure service.


          Adam Placzek added a comment -

          Hi,

          I'm wondering if this issue is only happening with the Azure AKS service?
          In our case, we use only AKS and we face this issue.

          From their documentation, I read that the Free AKS tier has a limit of 50 in-flight requests:
          https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/azure-subscription-service-limits#azure-kubernetes-service-limits


          Mateus Tanure added a comment -

          > I'm wondering if this issue is only happening with the Azure AKS service?

          No, we have been running on AWS (without EKS) for two years and this issue only started happening recently.


          Samuel Beaulieu added a comment - - edited

          For the record, I have a similar issue under load after updating Jenkins core + the k8s plugin; however, I run the Jenkins agent pods on GKE.

          The behavior / error message presents differently (https://issues.jenkins.io/browse/JENKINS-68126) but the threads seem stuck in the watcher.

          Edit: what we see is twofold:

          1) It takes time for the plugin to spin up new pods even though there are a lot of jobs in the waiting queue. It's like it's lagging behind reality and doesn't know the queue is huge.

          2) When it spins them up, they have time to start their JNLP connection, so we see the node connected in the UI, but the k8s plugin is stuck and never marks them as accepting tasks. I think it still watches them to see if all containers have started (they obviously did, since they are now connected). If we use a script to force them to accept tasks it works, but the plugin and reality become out of sync: it thinks those nodes are still provisioning, which skews the 'excess workload' number and so exacerbates issue #1 above.

          Jenkins.instance.getNodes().each{
            if(it.toComputer().isOnline() && !it.toComputer().isAcceptingTasks()) {
              println("computer is online, set accepting tasks ${it}")
              it.toComputer().setAcceptingTasks(true);
            }
          } 


          SparkC added a comment - - edited

          For me this 5000 MILLISECONDS issue was resolved after downgrading Jenkins and a few plugins. I tested it across 4-5 servers.

          Jenkins version: "2.303.1"

          • kubernetes:1.30.1
          • kubernetes-client-api:5.4.1

          But I started getting another timeout error:

          Error: java.io.IOException: Timed out waiting for websocket connection. You should increase the value of system property org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout currently set at 30 seconds
          	at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.doLaunch(ContainerExecDecorator.java:451)

          Issue: JENKINS-68332


          Samuel Beaulieu added a comment -

          I built my own version of the kubernetes-client-api plugin that includes fabric8 version 5.12.2, essentially this PR: https://github.com/jenkinsci/kubernetes-client-api-plugin/pull/149

          -        <revision>5.12.1</revision>
          +        <revision>5.12.2</revision>

          And I pushed it to a staging server. I do not see the error "KubernetesClientException: not ready after X" anymore, but my issue is still present where a lot of nodes show as idle / 'suspended' (the node is connected, the pod is running, but the k8s plugin does not know about it and still logs them as 'provisioning'). For example, the NodeProvisioner logs do not match the k8s logs:

          In k8s logs:
          In provisioning: [REDACTED LIST OF 100+ nodes, jnlp-parent-8j46]
          In NodeProvisioner logs:
          [id=44]	INFO	hudson.slaves.NodeProvisioner#update: jnlp-parent-8j46j provisioning successfully completed. We have now 116 computer(s)


          Zoltán Haindrich added a comment -

          PR-149 was merged a few days ago and was released in 5.12.2-193.v26a_6078f65a_9, which is the version I see being used today... and I also see these "not ready after" errors.

          sbeaulie I wonder if it's possible that you've made some other changes which might have also contributed to fixing this for you?


          Allan BURDAJEWICZ added a comment -

          I have been trying to reproduce this but I can't, at least not with a high value for max connections to the API.
          Does anybody have a reproducible scenario? Something consistent enough that data (such as thread dumps) can be collected while the step is waiting for the websocket connection to be started. Such evidence would be required to understand what is going on. It could be difficult to get, though. If necessary, org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout and kubernetes.websocket.timeout could be set higher to be able to capture such evidence.


          Allan BURDAJEWICZ added a comment - - edited

          For whoever is impacted, I have opened https://github.com/jenkinsci/kubernetes-plugin/pull/1202, which adds a piece of code to generate a thread dump in a file under $JENKINS_HOME/jenkins-67664/ whenever a KubernetesClientException is thrown from ContainerExecDecorator.

          You may install the plugin (binary available at https://ci.jenkins.io/job/Plugins/job/kubernetes-plugin/job/PR-1202/1/artifact/org/csanchez/jenkins/plugins/kubernetes/1.31.2/kubernetes-1.31.2.hpi) and execute org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.JENKINS_67664_ENABLED=true from the Script Console to enable the feature. Then wait for the issue to happen and collect the thread dumps under $JENKINS_HOME/jenkins-67664. You may then disable it with org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.JENKINS_67664_ENABLED=false.

          Please attach a thread dump here when reproduced. Hopefully that will help understand the root cause.

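          For clarity, "execute ... from the Script Console" above means running a Groovy assignment like the following (the JENKINS_67664_ENABLED field exists only in the PR-1202 build referenced above, not in released versions of the plugin):

          // Enable the extra thread-dump capture from the PR build of the plugin.
          org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.JENKINS_67664_ENABLED = true
          // ...and later, to turn it off again:
          // org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.JENKINS_67664_ENABLED = false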

          Allan BURDAJEWICZ added a comment - - edited

          While capturing thread dumps during executions, I realized that the CompletableFuture uses the ForkJoinPool:

          "Running CpsFlowExecution[Owner[generated/declarative-1/147:generated/declarative-1 #147]]" 
             java.lang.Thread.State: TIMED_WAITING
                  at sun.misc.Unsafe.park(Native Method)
                  at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
                  at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1709)
                  at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
                  at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1788)
                  at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
                  at io.fabric8.kubernetes.client.utils.Utils.waitUntilReady(Utils.java:155)
                  at io.fabric8.kubernetes.client.utils.Utils.waitUntilReadyOrFail(Utils.java:175)
                  [...]
          

          Reading about ForkJoinPool, that pool is very much CPU-bound (one thread per CPU, according to the docs). So the problem might not be related to the websocket connection itself but to concurrency on the controller. Following this theory, increasing the number of CPUs available to the Jenkins controller could help mitigate the problem, but would not fix it.
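
          To illustrate the theory (a standalone sketch, not plugin code): the async variants of CompletableFuture that are not given an explicit executor default to ForkJoinPool.commonPool(), whose size tracks the number of available CPUs, so on a small controller their completion competes with everything else using the common pool:

              import java.util.concurrent.CompletableFuture
              import java.util.concurrent.ForkJoinPool
              import java.util.concurrent.TimeUnit

              // The common pool is sized from Runtime.availableProcessors(), typically CPUs - 1
              println "commonPool parallelism: ${ForkJoinPool.commonPool().parallelism}"

              def ready = CompletableFuture.supplyAsync {
                  // Work scheduled here competes with every other commonPool task on the controller
                  Thread.sleep(1000)
                  'ready'
              }

              // Same shape as Utils.waitUntilReady: the calling thread blocks until completion or timeout
              println ready.get(5, TimeUnit.SECONDS)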

          The move to CompletableFuture is a major change that happened in k8s client 5.5.0, and it would explain what users here have been experiencing since k8s plugin 1.31.0, when we bumped the client from 5.4.1 to 5.10.1.

          I reviewed the changes in the k8s client several times and could not find a root cause. Despite the extra layer that was added, the future is completed quite soon.

          I don't have much expertise in how CompletableFuture is managed, though.
          vlatombe WDYT about this theory of the ForkJoinPool causing this?


          Artem Chernenko added a comment - - edited

          I upgraded Jenkins to 2.355 and the Kubernetes plugin to 3651.v908e7db_10d06, and the error is gone.

          UPD: I was wrong. The issue still occurs.


          Allan BURDAJEWICZ added a comment -

          So far I have been able to reproduce this twice in 3 days. In both occurrences, I noticed some memory pressure on the k8s nodes, so I believe it was due to nodes being overloaded and the kubelet being unresponsive for a moment. Checking kube-apiserver and kubelet / containerd, the connection eventually succeeded in 7 seconds (just a bit more than 5s). So maybe not the scenario that people are experiencing here...

          achernenkokaaiot That's good to hear. A notable change in the recent releases of the Kubernetes plugin is https://github.com/jenkinsci/kubernetes-plugin/pull/1192, released in 3636.v84b_a_1dea_6240: the removal of the watcher when launching agents. While it is not directly related to the problem, the watchers were consuming okhttp threads, and given the asynchronous behavior of both the k8s client and okhttp under the hood, that could well contribute to delays handling connections.

          It would be worth knowing whether this version fixes the problem for other users impacted here.


          Artem Chernenko added a comment -

          allan_burdajewicz I was wrong, I am still facing this issue.


          SparkC added a comment -

          I'm still facing this issue. 


          Allan BURDAJEWICZ added a comment -

          The k8s client maintainers are asking to test k8s client api 6.0.0 in https://github.com/fabric8io/kubernetes-client/issues/3795#issuecomment-1185542906, so maybe we need to test the new version. There is already a PR at https://github.com/jenkinsci/kubernetes-client-api-plugin/pull/153. In the meantime, I proposed https://github.com/jenkinsci/kubernetes-plugin/pull/1212 to help mitigate the problem with a retry mechanism.
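
          Independently of that PR (which retries inside the plugin itself), a build-level workaround in the meantime is to wrap the affected steps in the stock retry step. A hypothetical scripted-pipeline fragment (the pod YAML and container name are placeholders):

              podTemplate(yaml: myPodYaml) {      // 'myPodYaml' stands in for your pod definition
                  retry(3) {                      // re-provision the agent and re-run the body on failure
                      node(POD_LABEL) {
                          container('maven') {    // 'maven' is a placeholder container name
                              sh 'mvn -version'
                          }
                      }
                  }
              }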


          Allan BURDAJEWICZ added a comment -

          I have been using a pipeline, evaluateExecResponsiveness.groovy, to troubleshoot the problem in an impacted environment. Users here may use that script to check on this.

          This pipeline starts 2 stages in parallel: one that starts an agent using the kubernetes plugin, and one that uses curl to send exec requests to that test agent pod container (like the kubernetes plugin does) and reports response times in a loop. We use --max-time 5 to stop the request if it takes more than 5s.

          Well, it turns out that in the impacted environment the curl requests respond very quickly (~100ms), and when the problem happens the last curl request times out after 5s.

          [2022-07-19T11:32:47.527Z] [Tue Jul 19 11:32:46 UTC 2022]	[curl]	101	0.122720	 0.000536	0.012001	0.036376
          [...]
          [2022-07-19T11:32:57.390Z] [Tue Jul 19 11:32:51 UTC 2022]	[curl]	101	5.001434	 0.000751	0.005658	0.027702
          

          To me, this is a sign that this is not specific to the kubernetes client but really about the responsiveness of the Kube API server (or the kubelet on the node where the agent pod is running). That could be caused by all sorts of things, I guess, and would require looking at the health of the nodes.
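
          For anyone who wants to check that angle, a few illustrative commands for looking at node and API server health around the time of a failure (nothing here is specific to Jenkins):

              kubectl get --raw '/readyz?verbose'                        # API server health checks
              kubectl top nodes                                          # needs metrics-server; spot CPU/memory saturation
              kubectl describe node <node-name> | grep -A 6 Conditions   # MemoryPressure / DiskPressure / PIDPressure
              kubectl get events -A --field-selector type=Warning        # recent warnings such as evictions or kubelet issues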

          It would be interesting to get users' feedback about this and understand whether the same thing is observed.


          Allan BURDAJEWICZ added a comment -

          For those interested and running the latest version of the kubernetes plugin, the PR https://github.com/jenkinsci/kubernetes-plugin/pull/1212 proposes a retry mechanism to mitigate the problem. HPI available at kubernetes-3680.vd6c863297215.hpi. Feedback appreciated.


          Artem Chernenko added a comment -

          allan_burdajewicz thanks for your work. I will try this hpi in my project and let you know after the week whether I am still facing this issue.


          Alessandro Vozza added a comment -

          allan_burdajewicz thank you very much for your PR. We have been using the 3680 version for a few days now and we haven't seen any of those infamous 5000ms errors. You single-handedly solved a problem that had been afflicting us for weeks and seriously hampering our work. Much appreciated!


          Artem Chernenko added a comment -

          allan_burdajewicz We don't see the issue so far. Thank you. Will let you know if we face it again.


          Adam Placzek added a comment -

          allan_burdajewicz The error is still there, but the exponential backoff allows the pipeline to retry it, continue, and finish successfully. Great stuff!


          SparkC added a comment -

          aplaczek The error is still there with the latest Jenkins version and the latest plugin versions. I have tried all options.

          tag: "2.346.2"

          installPlugins:

          • kubernetes:3663.v1c1e0ec5b_650
          • kubernetes-client-api:5.12.2-193.v26a_6078f65a_9

          06:29:10  io.fabric8.kubernetes.client.KubernetesClientException: not ready after 5000 MILLISECONDS
          06:29:10  	at io.fabric8.kubernetes.client.utils.Utils.waitUntilReadyOrFail(Utils.java:181)
          06:29:10  	at io.fabric8.kubernetes.client.dsl.internal.core.v1.PodOperationsImpl.exec(PodOperationsImpl.java:332)
          06:29:10  	at io.fabric8.kubernetes.client.dsl.internal.core.v1.PodOperationsImpl.exec(PodOperationsImpl.java:85)
          06:29:10  	at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.doLaunch(ContainerExecDecorator.java:425)
          06:29:10  	at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.launch(ContainerExecDecorator.java:328)
          06:29:10  	at hudson.Launcher$ProcStarter.start(Launcher.java:509)


          Jonathan Hardison added a comment -

          Just an FYI, also posted on the PR for this.

          We have deployed the incremental build from the PR to our environment; so far no issues observed.

          Jenkins version: `2.346.2`

          Kubernetes plugin version: `3680.va_31c13cda_9b_5`


          Alessandro Vozza added a comment -

          We are seeing a recurrence of `Failed to start websocket connection: io.fabric8.kubernetes.client.KubernetesClientException: not ready after 5000 MILLISECONDS`, this time with the addition of the following message:

          Caused by: java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal Server Error'
              at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
              at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)

           

          Still running the 3680 version. We run long-running multi-pipeline builds nightly; we saw it once last week but last night it was fine, so it's really a hit-and-miss bug again.


          Olexandr Shamin added a comment - - edited

          Does this problem still exist? We are still using the single-container solution.


          Allan BURDAJEWICZ added a comment -

          oshamin The sporadic 500s and timeouts can still happen, yes. That is by design, due to the fragility of the exec API used by the kubernetes plugin when launching commands in non-jnlp containers. But a retry mechanism has been implemented to improve the stability of builds, which should be suitable in most cases. The single-container solution still has the benefit of relying less on the K8s REST API.


          Olexandr Shamin added a comment -

          allan_burdajewicz, thank you for the feedback.

