JENKINS-68126: Jenkins agents in suspended state after upgrade to 2.332.1 with kubernetes agents, queued builds not executing

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Component: kubernetes-plugin
    • Labels: None

      Setup:

      Jenkins core 2.332.1 (upgrade from 2.303.2)

      Kubernetes plugin: 3568.vde94f6b_41b_c8 (upgrade from 1.29.4)

      Java 11 on both the Jenkins server and the agents

      Agents using Remoting 4.10

      WebSocket used for the agent connection in the Kubernetes plugin
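      For reference, a minimal sketch of how the versions above can be confirmed over the REST API (the controller URL and credentials are placeholders, and jq is assumed to be installed):

      {code:bash}
# Controller version is reported in the X-Jenkins response header.
curl -s -I -u "$USER:$TOKEN" "https://jenkins.example.com/api/json" | grep -i '^x-jenkins:'

# Installed kubernetes plugin version via the plugin manager API.
curl -s -u "$USER:$TOKEN" \
  "https://jenkins.example.com/pluginManager/api/json?depth=1&tree=plugins[shortName,version]" \
  | jq -r '.plugins[] | select(.shortName == "kubernetes") | "\(.shortName) \(.version)"'
      {code}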

       

      I know this is probably hard to troubleshoot, but I want to open this ticket to track it and see if others are experiencing the same issue. We ended up reverting to the previous core version and plugins because this was not working, and I wanted to document that we had to roll back the LTS release.

       

      I have also tried updating to the non-LTS 2.340 release, and the same issue was present.

       

      Behavior:

      We have a LOT of jobs that start at the same time (400+), each normally assigned to a Kubernetes pod. After the upgrade, Jenkins would still provision agents (300+); I confirmed the pods were starting cleanly, and the agent logs showed them as CONNECTED.

       

      But from the Jenkins side, maybe 10-15 of them would actually be running builds, while the rest (300+ of them) would show up as nodes with an idle executor and the node itself marked as (suspended). They could stay in that state for 20+ minutes and never actually run any of the queued jobs.

       
      Adding a couple of screenshots showing the nodes as (suspended) and the ramp-up of one of the labels (you can see 80+ online executors, but virtually none of them are running anything after 6-8 minutes).
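      A quick way to spot the same symptom without the UI, sketched here with placeholder host and credentials (jq assumed): list the agents that the controller reports as online yet completely idle, which is how the (suspended) nodes look through the REST API.

      {code:bash}
# Agents that are online but have nothing running on them.
curl -s -u "$USER:$TOKEN" \
  "https://jenkins.example.com/computer/api/json?tree=computer[displayName,offline,idle]" \
  | jq -r '.computer[] | select(.offline == false and .idle == true) | .displayName'
      {code}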

       

      Process space:

       /etc/alternatives/java -Dcom.sun.akuma.Daemon=daemonized -Xms19853m -Xmx60161m -Dhudson.model.ParametersAction.keepUndefinedParameters=true -Djava.awt.headless=true -XX:+UseG1GC -Dhudson.slaves.ChannelPinger.pingIntervalSeconds=30 -Djenkins.model.Nodes.enforceNameRestrictions=false -Djenkins.security.ApiTokenProperty.adminCanGenerateNewTokens=true -Xlog:gc:/var/lib/jenkins/log/jenkins-gc.log::filecount=5,filesize=20M -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -XX:+DisableExplicitGC -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=20 -XX:+UnlockDiagnosticVMOptions -XX:G1SummarizeRSetStatsPeriod=1 -DJENKINS_HOME=/var/lib/jenkins -jar /usr/lib/jenkins/jenkins.war --logfile=/var/log/jenkins/jenkins.log --webroot=/var/cache/jenkins/war --daemon --httpPort=8080 --debug=5 --handlerCountMax=100 --handlerCountMaxIdle=20 --sessionTimeout=480 --sessionEviction=28800 

       

      The Jenkins logs look OK, as do the Kubernetes plugin logs (tracked via a UI log recorder on org.csanchez.jenkins.plugins.kubernetes).

      Nothing to report there. The Kubernetes logs show that we are under the configured limit, e.g. 300 pods out of the 600 max global cap. Again, the pods show up as CONNECTED, but the Jenkins server never actually allocates a build to the idle executors.
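      For the queue side, a similar sketch (same placeholder host and credentials) dumps the pending items together with the reason Jenkins gives for not starting them; the "why" field is what the scheduler reports while builds sit waiting despite the idle executors.

      {code:bash}
# Pending queue items with the scheduler's stated reason.
curl -s -u "$USER:$TOKEN" \
  "https://jenkins.example.com/queue/api/json?tree=items[id,inQueueSince,why,task[name]]" \
  | jq '.items[] | {id, why, task: .task.name}'
      {code}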

       

      If you see this issue too, please comment / advise.

          Samuel Beaulieu added a comment - edited

          I was able to reproduce this in our staging environment by launching a matrix job with ~120 cells, each running on its own Jenkins agent.

          The only build step is an Execute Shell step that does essentially nothing:

          env | sort
          sleep 60
          touch foo
          ls -lrt
          sleep 600
          

          I will test with fewer cells to see if there is a breaking point. Here is the thread dump, in case anything unusual stands out:

          JENKINS-68126-threadDump.txt
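          For anyone wanting to approximate the same burst without a matrix job, here is a rough sketch (the job name, host and credentials are placeholders; it assumes a trivial freestyle job that allows concurrent builds):

          {code:bash}
# Fire ~120 builds at once against a throwaway test job to recreate the burst.
for i in $(seq 1 120); do
  curl -s -X POST -u "$USER:$TOKEN" "https://jenkins.example.com/job/k8s-agent-repro/build" &
done
wait
          {code}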


          I tried starting with 10 builds in the queue and ramping up to 120. The issue seems to appear at around 60: roughly 14 jobs get executed while the other 46 do not run, even though we have connected nodes showing as idle and (suspended).

          I also tried upgrading to the latest non-LTS release, Jenkins 2.341, and updated all the plugins, but the issue is still there. My next step will be to remove all the plugins, install only the minimum set, and see if the issue persists, to check whether there is some kind of lock/conflict with another plugin.

          Samuel Beaulieu added a comment -

          I tried with most plugins disabled and it did not help.

           

          We tried setting
          -Dio.jenkins.plugins.kubernetes.disableNoDelayProvisioning=true
          but scheduling still lags far behind the number of idle agents available.
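          For context, this is roughly how a flag like that ends up on the java command line shown in the description; the exact file and variable name depend on the packaging, so treat the path below as an assumption:

          {code:bash}
# e.g. /etc/sysconfig/jenkins (RPM packaging) or a systemd override -- assumed location.
JENKINS_JAVA_OPTIONS="$JENKINS_JAVA_OPTIONS -Dio.jenkins.plugins.kubernetes.disableNoDelayProvisioning=true"
          {code}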

           


            Assignee: Vincent Latombe (vlatombe)
            Reporter: Samuel Beaulieu (sbeaulie)
            Votes: 5
            Watchers: 10