
[JENKINS-68126] Jenkins agents in suspended state after upgrade to 2.332.1 with Kubernetes agents, queued builds not executing

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Component: kubernetes-plugin
    • Labels: None

      Setup:

      Jenkins core 2.332.1 (upgrade from 2.303.2)

      Kubernetes plugin: 3568.vde94f6b_41b_c8 (upgrade from 1.29.4)

      Java 11 on both the Jenkins controller and agents

      Agents using Remoting 4.10

      WebSocket for the agent connection in the Kubernetes plugin

       

      I know this is probably hard to troubleshoot, but I want to open this ticket to track the problem and see whether others are experiencing the same issue. We ended up reverting to the previous core version and plugins because builds were not executing, and I wanted to document that we had to roll back the LTS release.

       

      I have also tried updating to the non-LTS 2.340 release, and the same issue was present.

       

      Behavior:

      We have a LOT of jobs that start at the same time (400+) and usually get assigned to a k8s pod each. After the upgrade, Jenkins would still provision agents (300+), I confirmed the pods were starting cleanly, and the agent logs showed them as CONNECTED.

       

      But from a Jenkins perspective only maybe 10-15 of them would actually be running builds, while the rest (300+ of them) would show as nodes with an idle executor and the node itself as (suspended). They could stay in that state for 20+ minutes and never actually run any of the queued jobs.

       
      Adding a couple of screenshots with the nodes showing as suspended and the ramp-up of one of the labels (you can see 80+ online executors, but virtually none of them are running anything after 6-8 minutes).

       

      Process space:

       /etc/alternatives/java -Dcom.sun.akuma.Daemon=daemonized -Xms19853m -Xmx60161m -Dhudson.model.ParametersAction.keepUndefinedParameters=true -Djava.awt.headless=true -XX:+UseG1GC -Dhudson.slaves.ChannelPinger.pingIntervalSeconds=30 -Djenkins.model.Nodes.enforceNameRestrictions=false -Djenkins.security.ApiTokenProperty.adminCanGenerateNewTokens=true -Xlog:gc:/var/lib/jenkins/log/jenkins-gc.log::filecount=5,filesize=20M -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -XX:+DisableExplicitGC -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=20 -XX:+UnlockDiagnosticVMOptions -XX:G1SummarizeRSetStatsPeriod=1 -DJENKINS_HOME=/var/lib/jenkins -jar /usr/lib/jenkins/jenkins.war --logfile=/var/log/jenkins/jenkins.log --webroot=/var/cache/jenkins/war --daemon --httpPort=8080 --debug=5 --handlerCountMax=100 --handlerCountMaxIdle=20 --sessionTimeout=480 --sessionEviction=28800 

       

      Jenkins logs look OK, as do the Kubernetes plugin logs (collected via a UI log recorder tracking org.csanchez.jenkins.plugins.kubernetes).

       

      Nothing to report there. The k8s logs show that we are under the configured limit, e.g. 300 out of 600 max global pods. Again, the pods show up as CONNECTED, but the Jenkins controller never actually allocates a build to the idle executors.
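
      For reference, a Jenkins log recorder like the one above is built on java.util.logging, so it is roughly equivalent to the following sketch (the class name and handler choice here are just an illustration, not what the controller does internally):

          import java.util.logging.ConsoleHandler;
          import java.util.logging.Level;
          import java.util.logging.Logger;

          public class K8sPluginLogLevel {
              public static void enableFineLogging() {
                  // Lower the threshold of the plugin's logger namespace to FINE
                  Logger k8sLogger = Logger.getLogger("org.csanchez.jenkins.plugins.kubernetes");
                  k8sLogger.setLevel(Level.FINE);

                  // Attach a handler that actually emits FINE records; the default
                  // handlers usually stay at INFO, so FINE records would otherwise be dropped
                  ConsoleHandler handler = new ConsoleHandler();
                  handler.setLevel(Level.FINE);
                  k8sLogger.addHandler(handler);
              }
          }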

       

      If you see this issue too, please comment / advise.


          Samuel Beaulieu added a comment - edited

          I was able to reproduce it in our staging environment by launching a matrix job with ~120 cells that each run in their own Jenkins agent.

          The only build step is an execute shell with essentially nothing in it:

          env | sort
          sleep 60
          touch foo
          ls -lrt
          sleep 600
          

          I will test with fewer cells to see if there is a breaking point. Here is the thread dump, in case you notice anything unusual:

          JENKINS-68126-threadDump.txt


          I tried starting with 10 jobs in the queue and ramping up to 120. The issue seems to appear around 60, where about 14 jobs get executed while the other 46 do not run, even though we have connected nodes showing as idle and suspended.

          I also tried upgrading to the latest non-LTS release, Jenkins 2.341, and updated all the plugins, but the issue is still there. My next attempt will be to remove all the plugins, install only the minimum, and see if the issue persists, to check whether there is some kind of lock/conflict with another plugin.


          I tried with most plugins disabled and it did not help.

           

          We tried setting
          -Dio.jenkins.plugins.kubernetes.disableNoDelayProvisioning=true
          but provisioning still lags greatly behind the available capacity.
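
          For context, a flag like this is read from a JVM system property on the controller; a minimal sketch of how such a switch is typically wired (the class and method names here are illustrative, not the plugin's actual code):

              public class NoDelayProvisioningSwitch {
                  // Boolean.getBoolean returns true only if the named system property is set
                  // to "true", so the flag must be passed on the controller's JVM command line,
                  // e.g. -Dio.jenkins.plugins.kubernetes.disableNoDelayProvisioning=true
                  private static final boolean DISABLED =
                          Boolean.getBoolean("io.jenkins.plugins.kubernetes.disableNoDelayProvisioning");

                  public static boolean noDelayProvisioningEnabled() {
                      return !DISABLED;
                  }
              }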

           


          We did some live troubleshooting by changing the kubernetes-plugin code and adding logging to see where things get stuck. It seems that as the load goes up, the first 20-30 pods do get to accepting tasks, but the rest (200+) get stuck and never reach https://github.com/jenkinsci/kubernetes-plugin/blob/3568.vde94f6b_41b_c8/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesLauncher.java#L248

          There seems to be an issue with the watchers w1 and w2 (maybe a bug in the fabric8 library).

          We changed the code to run these outside of the try-with-resources and added logging, and the code would not get past w1 or w2. They would not complete or detect MODIFIED events; they would just wait indefinitely (maybe the fabric8 bug is a waitUntil() call without any timeout set up?).

           

          Problematic code lines:

          https://github.com/jenkinsci/kubernetes-plugin/blob/3568.vde94f6b_41b_c8/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesLauncher.java#L170-L171
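
          For illustration, a watch whose wait is bounded by an explicit timeout might look roughly like the sketch below (written against the fabric8 client API, not the plugin's actual code; exact import paths vary between fabric8 versions):

              import io.fabric8.kubernetes.api.model.Pod;
              import io.fabric8.kubernetes.client.KubernetesClient;
              import io.fabric8.kubernetes.client.Watch;
              import io.fabric8.kubernetes.client.Watcher;
              import io.fabric8.kubernetes.client.WatcherException;
              import io.fabric8.kubernetes.client.readiness.Readiness;
              import java.util.concurrent.CountDownLatch;
              import java.util.concurrent.TimeUnit;

              class BoundedPodReadyWatch {
                  boolean waitForReady(KubernetesClient client, String namespace, String podName,
                                       long timeoutSeconds) throws InterruptedException {
                      CountDownLatch ready = new CountDownLatch(1);
                      try (Watch watch = client.pods().inNamespace(namespace).withName(podName)
                              .watch(new Watcher<Pod>() {
                                  @Override
                                  public void eventReceived(Action action, Pod pod) {
                                      // count down as soon as an event shows the pod as ready
                                      if (Readiness.isPodReady(pod)) {
                                          ready.countDown();
                                      }
                                  }

                                  @Override
                                  public void onClose(WatcherException cause) {
                                      // connection dropped; the bounded await below keeps us from hanging forever
                                  }
                              })) {
                          // the key difference from an unbounded wait: give up after timeoutSeconds
                          return ready.await(timeoutSeconds, TimeUnit.SECONDS);
                      }
                  }
              }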

           

          I'm thinking of removing the watchers and just using the fabric8 library to check and continue once the pods are ready. Something like a loop for the duration of the timeout:

           

          // Sketch: poll until the pod is ready, bounded by the template's launch timeout
          // (the retry count here is just an example)
          boolean done = false;
          for (int attempt = 0; attempt < 10 && !done; attempt++) {
              Pod checkPod = client.pods().inNamespace(namespace).withName(podName).waitUntilReady(1, TimeUnit.MINUTES);
              if (Readiness.isPodReady(checkPod)) {
                  LOGGER.log(INFO, "Pod ready: true " + podName);
                  done = true;
                  break;
              } else {
                  LOGGER.log(INFO, "Pod ready: false " + podName);
              }
          }

           

          vlatombe if you have any insight there?

           


          Running our own version with this code change: https://github.com/jenkinsci/kubernetes-plugin/pull/1167/files#diff-8e43be7dacff0fa974a7614fda2f4103bf4b820be52d747d9498e6f7efbd688fR169-R172

           

          The load graph is much better and we no longer get nodes stuck in a suspended state.

           


          Carsten Pfeiffer added a comment - AFAICS, this issue is already fixed and released in https://github.com/jenkinsci/kubernetes-plugin/releases/tag/3636.v84b_a_1dea_6240

            Assignee: Vincent Latombe (vlatombe)
            Reporter: Samuel Beaulieu (sbeaulie)
            Votes: 5
            Watchers: 10