JENKINS-68126: Jenkins agents in suspended state after upgrade to 2.332.1 with kubernetes agents, queued builds not executing

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Component: kubernetes-plugin
    • Labels: None

      Setup:

      Jenkins core 2.332.1 (upgrade from 2.303.2)

      Kubernetes plugin: 3568.vde94f6b_41b_c8 (upgrade from 1.29.4)

      Java 11 on both the Jenkins server and the agents

      Agents using Remoting 4.10

      WebSocket used for the agent connection in the Kubernetes plugin
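      For reference, a minimal sketch of how the versions above can be confirmed over the REST API (the controller URL and credentials are placeholders, and jq is assumed to be installed):

      {code:bash}
# Controller version is reported in the X-Jenkins response header.
curl -s -I -u "$USER:$TOKEN" "https://jenkins.example.com/api/json" | grep -i '^x-jenkins:'

# Installed kubernetes plugin version via the plugin manager API.
curl -s -u "$USER:$TOKEN" \
  "https://jenkins.example.com/pluginManager/api/json?depth=1&tree=plugins[shortName,version]" \
  | jq -r '.plugins[] | select(.shortName == "kubernetes") | "\(.shortName) \(.version)"'
      {code}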

       

      I know this is probably hard to troubleshoot, but I want to open this ticket to track it and see if others are experiencing the same issue. We ended up reverting to the previous core version and plugins because this was not working, and I wanted to document that we had to roll back the LTS release.

       

      I have also tried updating to the non-LTS 2.340 release, and the same issue was present.

       

      Behavior:

      We have a LOT of jobs that start at the same time (400+), each normally assigned to a Kubernetes pod. After the upgrade, Jenkins would still provision agents (300+); I confirmed the pods were starting cleanly, and the agent logs showed them as CONNECTED.

       

      But from the Jenkins side, maybe 10-15 of them would actually be running builds, while the rest (300+ of them) would show up as nodes with an idle executor and the node itself marked as (suspended). They could stay in that state for 20+ minutes and never actually run any of the queued jobs.

       
      Adding a couple of screenshots showing the nodes as (suspended) and the ramp-up of one of the labels (you can see 80+ online executors, but virtually none of them are running anything after 6-8 minutes).
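      A quick way to spot the same symptom without the UI, sketched here with placeholder host and credentials (jq assumed): list the agents that the controller reports as online yet completely idle, which is how the (suspended) nodes look through the REST API.

      {code:bash}
# Agents that are online but have nothing running on them.
curl -s -u "$USER:$TOKEN" \
  "https://jenkins.example.com/computer/api/json?tree=computer[displayName,offline,idle]" \
  | jq -r '.computer[] | select(.offline == false and .idle == true) | .displayName'
      {code}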

       

      Process space:

       /etc/alternatives/java -Dcom.sun.akuma.Daemon=daemonized -Xms19853m -Xmx60161m -Dhudson.model.ParametersAction.keepUndefinedParameters=true -Djava.awt.headless=true -XX:+UseG1GC -Dhudson.slaves.ChannelPinger.pingIntervalSeconds=30 -Djenkins.model.Nodes.enforceNameRestrictions=false -Djenkins.security.ApiTokenProperty.adminCanGenerateNewTokens=true -Xlog:gc:/var/lib/jenkins/log/jenkins-gc.log::filecount=5,filesize=20M -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -XX:+DisableExplicitGC -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=20 -XX:+UnlockDiagnosticVMOptions -XX:G1SummarizeRSetStatsPeriod=1 -DJENKINS_HOME=/var/lib/jenkins -jar /usr/lib/jenkins/jenkins.war --logfile=/var/log/jenkins/jenkins.log --webroot=/var/cache/jenkins/war --daemon --httpPort=8080 --debug=5 --handlerCountMax=100 --handlerCountMaxIdle=20 --sessionTimeout=480 --sessionEviction=28800 

       

      The Jenkins logs look OK, as do the Kubernetes plugin logs (tracked via a UI log recorder on org.csanchez.jenkins.plugins.kubernetes).

      Nothing to report there. The Kubernetes logs show that we are under the configured limit, e.g. 300 pods out of the 600 max global cap. Again, the pods show up as CONNECTED, but the Jenkins server never actually allocates a build to the idle executors.
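      For the queue side, a similar sketch (same placeholder host and credentials) dumps the pending items together with the reason Jenkins gives for not starting them; the "why" field is what the scheduler reports while builds sit waiting despite the idle executors.

      {code:bash}
# Pending queue items with the scheduler's stated reason.
curl -s -u "$USER:$TOKEN" \
  "https://jenkins.example.com/queue/api/json?tree=items[id,inQueueSince,why,task[name]]" \
  | jq '.items[] | {id, why, task: .task.name}'
      {code}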

       

      If you see this issue too, please comment / advise.

          Samuel Beaulieu added a comment - edited

          I was able to reproduce this in our staging environment by launching a matrix job with ~120 cells, each running on its own Jenkins agent.

          The only build step is an Execute Shell step that does essentially nothing:

          env | sort
          sleep 60
          touch foo
          ls -lrt
          sleep 600
          

          I will test with fewer cells to see if there is a breaking point. Here is the thread dump, in case anything unusual stands out:

          JENKINS-68126-threadDump.txt
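          For anyone wanting to approximate the same burst without a matrix job, here is a rough sketch (the job name, host and credentials are placeholders; it assumes a trivial freestyle job that allows concurrent builds):

          {code:bash}
# Fire ~120 builds at once against a throwaway test job to recreate the burst.
for i in $(seq 1 120); do
  curl -s -X POST -u "$USER:$TOKEN" "https://jenkins.example.com/job/k8s-agent-repro/build" &
done
wait
          {code}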


          I tried starting with 10 builds in the queue and ramping up to 120. The issue seems to appear at around 60: roughly 14 jobs get executed while the other 46 do not run, even though we have connected nodes showing as idle and (suspended).

          I also tried upgrading to the latest non-LTS release, Jenkins 2.341, and updated all the plugins, but the issue is still there. My next step will be to remove all the plugins, install only the minimum set, and see if the issue persists, to check whether there is some kind of lock/conflict with another plugin.

          Samuel Beaulieu added a comment -

          I tried with most plugins disabled and it did not help.

           

          We tried setting
          -Dio.jenkins.plugins.kubernetes.disableNoDelayProvisioning=true
          but scheduling still lags far behind the number of idle agents available.
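          For context, this is roughly how a flag like that ends up on the java command line shown in the description; the exact file and variable name depend on the packaging, so treat the path below as an assumption:

          {code:bash}
# e.g. /etc/sysconfig/jenkins (RPM packaging) or a systemd override -- assumed location.
JENKINS_JAVA_OPTIONS="$JENKINS_JAVA_OPTIONS -Dio.jenkins.plugins.kubernetes.disableNoDelayProvisioning=true"
          {code}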

           


            Assignee: Vincent Latombe (vlatombe)
            Reporter: Samuel Beaulieu (sbeaulie)
            Votes: 5
            Watchers: 10