JENKINS-74957

Jenkins eventually stops responding properly and running jobs since 2.479.1

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Component: core
    • Labels: None

      Jenkins 2.462.x had been running fine for us since May. About two weeks ago we upgraded to 2.479.1 (and all plugins to their latest compatible versions).

      A few days later, the Build Time Trend page stopped loading properly (it said "Computation in progress" forever). The following Saturday Jenkins stopped running jobs (a parent job kicked off its downstream build but was never woken up when the downstream completed); the restart then timed out and the process was killed by systemd.

      This week, the Build Time Trend has stopped loading properly again, and then Jenkins started to accumulate a growing build queue and I had to restart it (I downgraded back to 2.462.3 at the same time, as it's our production instance and we can't afford much downtime).

      I know we are using a lot of plugins and the problem could potentially be there, but I don't know in which one and can't really experiment with disabling random plugins on the production server. I suspect that we wouldn't see the issue on a test instance. Anything that could point to which plugin could be causing the problem would be most helpful.

      Looking at the thread dump, only one thing stands out: we have 10 threads running the logfilesizechecker plugin, all busy trying to check if something is a Gzip stream, which seems like far too many of these threads:

      "jenkins.util.Timer [#10]" Id=67 Group=main RUNNABLE
      	at java.base@17.0.13/sun.nio.fs.UnixNativeDispatcher.open0(Native Method)
      	at java.base@17.0.13/sun.nio.fs.UnixNativeDispatcher.open(UnixNativeDispatcher.java:68)
      	at java.base@17.0.13/sun.nio.fs.UnixChannelFactory.open(UnixChannelFactory.java:258)
      	at java.base@17.0.13/sun.nio.fs.UnixChannelFactory.newFileChannel(UnixChannelFactory.java:133)
      	at java.base@17.0.13/sun.nio.fs.UnixChannelFactory.newFileChannel(UnixChannelFactory.java:146)
      	at java.base@17.0.13/sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:216)
      	at java.base@17.0.13/java.nio.file.Files.newByteChannel(Files.java:380)
      	at java.base@17.0.13/java.nio.file.Files.newByteChannel(Files.java:432)
      	at java.base@17.0.13/java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:422)
      	at java.base@17.0.13/java.nio.file.Files.newInputStream(Files.java:160)
      	at org.kohsuke.stapler.framework.io.LargeText$GzipAwareSession.isGzipStream(LargeText.java:542)
      	at org.kohsuke.stapler.framework.io.LargeText.<init>(LargeText.java:110)
      	at hudson.console.AnnotatedLargeText.<init>(AnnotatedLargeText.java:88)
      	at hudson.model.Run.getLogText(Run.java:1505)
      	at PluginClassLoader for logfilesizechecker//hudson.plugins.logfilesizechecker.LogfilesizecheckerWrapper$LogSizeTimerTask.doRun(LogfilesizecheckerWrapper.java:108)
      	at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:92)
      	at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:67)
      	at java.base@17.0.13/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
      	at java.base@17.0.13/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
      	at java.base@17.0.13/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
      	at java.base@17.0.13/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
      	at java.base@17.0.13/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
      	at java.base@17.0.13/java.lang.Thread.run(Thread.java:840)
      
      	Number of locked synchronizers = 1
      	- java.util.concurrent.ThreadPoolExecutor$Worker@89ed5b6
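
      A minimal Script Console sketch along these lines can confirm that count from a live JVM thread dump (it assumes only the thread names and plugin class names visible in the trace above):

          // Count jenkins.util.Timer threads whose current stack is inside the
          // logfilesizechecker plugin, using the JVM's own thread dump.
          def busy = Thread.getAllStackTraces().findAll { thread, frames ->
              thread.name.startsWith('jenkins.util.Timer') &&
                  frames.any { it.className.contains('logfilesizechecker') }
          }
          println "Timer threads busy in logfilesizechecker: ${busy.size()}"
          busy.each { thread, frames -> println "  ${thread.name}" }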
      

      I also have a job stuck waiting for the previous build to complete, but it already has:

      The previous build "finished" with these log lines, but the UI still shows the animated in-progress dots below them:

      Errors were encountered
      Build step 'Execute shell' marked build as failure
      Sending e-mails to: xxx@xxx.com
      Notifying upstream projects of job completion
      

      And the thread dump shows:

      "Executor #0 for lhr-vexec02-fast : executing nexus_db_replicate_roles #295805" Id=146231 Group=main WAITING on java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@4c072eb0
      at java.base@17.0.13/jdk.internal.misc.Unsafe.park(Native Method)

      • waiting on java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@4c072eb0
        at java.base@17.0.13/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
        at java.base@17.0.13/java.util.concurrent.FutureTask.awaitDone(FutureTask.java:447)
        at java.base@17.0.13/java.util.concurrent.FutureTask.get(FutureTask.java:190)
        at hudson.tasks.BuildTrigger.execute(BuildTrigger.java:268)
        at hudson.model.AbstractBuild$AbstractBuildExecution.cleanUp(AbstractBuild.java:728)
        at hudson.model.Build$BuildExecution.cleanUp(Build.java:194)
        at hudson.model.Run.execute(Run.java:1874)
        at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
        at hudson.model.ResourceController.execute(ResourceController.java:101)
        at hudson.model.Executor.run(Executor.java:445)

      Many jobs are stuck waiting on the same object; I'm going to have to restart Jenkins soon.
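
      A rough Script Console sketch to list which busy executors are parked in BuildTrigger.execute (it assumes hudson.model.Executor is still a Thread, as the thread dump above suggests):

          // List busy executors whose current stack is inside
          // hudson.tasks.BuildTrigger.execute, i.e. waiting on the
          // downstream-trigger future shown above.
          jenkins.model.Jenkins.get().computers.each { computer ->
              computer.executors.findAll { it.isBusy() }.each { executor ->
                  boolean stuck = executor.stackTrace.any {
                      it.className == 'hudson.tasks.BuildTrigger' && it.methodName == 'execute'
                  }
                  if (stuck) {
                      println "${computer.name} / ${executor.displayName}: ${executor.currentExecutable}"
                  }
              }
          }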


      It seems from the logs that builds complete but the old builds (which are usually deleted shortly afterwards) stop being deleted, and jobs then end up stuck in the build queue:

      Dec 05 09:54:59 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:54:59.420+0000 [id=145920]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: blackops_import_marketaxess_arm_respo>
      Dec 05 09:55:57 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:55:57.102+0000 [id=145912]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: intraday_alloc_barcap_fo #96214
      Dec 05 09:55:57 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:55:57.121+0000 [id=145912]        INFO        o.j.p.l.queue.LockRunListener#onDeleted: intraday_alloc_barcap_fo #95820
      Dec 05 09:55:57 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:55:57.402+0000 [id=145916]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: intraday_recon_ms_futures #436179
      Dec 05 09:55:57 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:55:57.412+0000 [id=145916]        INFO        o.j.p.l.queue.LockRunListener#onDeleted: intraday_recon_ms_futures #435398
      Dec 05 09:56:01 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:56:01.741+0000 [id=145882]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: report_blackops_margin_spot_consolida>
      Dec 05 09:56:01 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:56:01.752+0000 [id=145882]        INFO        o.j.p.l.queue.LockRunListener#onDeleted: report_blackops_margin_spot_consolidate>
      Dec 05 09:56:10 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:56:10.959+0000 [id=140281]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: nexus_db_mirror_checker #171910
      Dec 05 09:56:10 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:56:10.969+0000 [id=140281]        INFO        o.j.p.l.queue.LockRunListener#onDeleted: nexus_db_mirror_checker #171775
      Dec 05 09:57:02 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:57:02.155+0000 [id=145926]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: blackops_import_ubs_pb_cash_positions>
      Dec 05 09:57:02 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:57:02.165+0000 [id=145926]        INFO        o.j.p.l.queue.LockRunListener#onDeleted: blackops_import_ubs_pb_cash_positions_c>
      Dec 05 09:57:12 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:57:12.472+0000 [id=138985]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: nexus_db_replicate_roles #295798
      Dec 05 09:57:12 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:57:12.489+0000 [id=138985]        INFO        o.j.p.l.queue.LockRunListener#onDeleted: nexus_db_replicate_roles #295642
      Dec 05 09:57:20 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:57:20.796+0000 [id=145857]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: intraday_block_ubs_fx #228536
      Dec 05 09:57:20 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:57:20.810+0000 [id=145857]        INFO        o.j.p.l.queue.LockRunListener#onDeleted: intraday_block_ubs_fx #228114
      Dec 05 09:57:21 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:57:21.544+0000 [id=138981]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: intraday_alloc_jpm_lme_futures #81195
      Dec 05 09:57:21 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:57:21.549+0000 [id=138981]        INFO        o.j.p.l.queue.LockRunListener#onDeleted: intraday_alloc_jpm_lme_futures #81015
      Dec 05 09:59:52 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:59:52.735+0000 [id=145802]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: it_cam_make_file_windows-netapp-cifs >
      Dec 05 09:59:52 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:59:52.741+0000 [id=145802]        INFO        o.j.p.l.queue.LockRunListener#onDeleted: it_cam_make_file_windows-netapp-cifs #2>
      Dec 05 10:00:02 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 10:00:02.271+0000 [id=145951]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: it_cam_change_file_linux-netapp-nfs #>
      Dec 05 10:00:02 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 10:00:02.288+0000 [id=145951]        INFO        o.j.p.l.queue.LockRunListener#onDeleted: it_cam_change_file_linux-netapp-nfs #29>
      Dec 05 10:00:07 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 10:00:07.582+0000 [id=145977]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: blackops_import_marketaxess_arm_respo>
      Dec 05 10:00:09 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 10:00:09.267+0000 [id=145990]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: intraday_alloc_ne_lme #146288
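
      A quick Script Console sketch to see why each queued item says it is blocked (only standard Queue API calls are assumed):

          // Print each queued item together with its reported cause of blockage,
          // to check whether the stuck items all share the same cause.
          jenkins.model.Jenkins.get().queue.items.each { item ->
              println "${item.task.fullDisplayName}: ${item.getWhy()}"
          }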
      

          [JENKINS-74957] Jenkins eventually stops responding properly and running jobs since 2.479.1

          Mark Waite added a comment - edited

          This week, the Build Time Trend has stopped loading properly again, and then it started to accumulate a growing build queue and I had to restart it (I downgraded back to 2.462.3 at the same time, as it's our production instance and we can't afford much downtime).

          I see that you have over 295000 builds of a particular job. Are you retaining the history of that many builds or is the history a subset of the 295000 builds?

          I see slower response on the build time trend graph when I retain many build history records. In my case, I had 1300+ build history entries so the build history table was rendering the build history with 1300+ rows, even though the build history graph was showing less than 100 items. If you're keeping a lot of build history and the build time trend graph is being opened, maybe it is causing other things to run more slowly.
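
          As a rough way to check, a Script Console sketch like the following could list jobs that retain unusually many build records (the 500 threshold is arbitrary):

              // Report jobs whose retained build history exceeds an arbitrary threshold.
              jenkins.model.Jenkins.get().getAllItems(hudson.model.Job).each { job ->
                  int count = job.builds.size()
                  if (count > 500) {
                      println "${job.fullName}: ${count} retained builds"
                  }
              }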


          Mark Waite added a comment -

          The list of open issues for the logfile size checker plugin suggests that plugin could easily cause problems. If you're unable to adopt the plugin and fix the open issues, you may need to disable the plugin or remove it from your installation. I can't prove that it is the source of the problem, but it has a low plugin health score and is included in your stack traces.
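
          If you go that route, a minimal Script Console sketch to disable it could look like this (the short name "logfilesizechecker" is assumed from the PluginClassLoader line in the stack trace above; a restart is still required):

              // Disable the plugin; this writes a .disabled marker next to the
              // plugin archive and takes effect after the next restart.
              def plugin = jenkins.model.Jenkins.get().pluginManager.getPlugin('logfilesizechecker')
              if (plugin != null) {
                  plugin.disable()
                  println "Disabled ${plugin.shortName} ${plugin.version}; restart Jenkins to apply"
              }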


          Chris Wilson added a comment -

          Thanks for the suggestion markewaite. We only keep 1 day of history for this job, which is ~155 builds. Will consider removing that plugin, but given that it's worked absolutely fine until now, I think that if the plugin is at fault, something must have changed in core to trigger the problem, which might have wider implications.


          Mark Waite added a comment -

          Many, many things changed between 2.462.3 and 2.479.1. Upgrades in 2.479.1 include:

          • Spring Security 5 to Spring Security 6
          • Java EE 8 to Jakarta EE 9
          • Eclipse Jetty 10 to Eclipse Jetty 12
          • Minimum Java requirement raised from Java 11 to Java 17

          If you can find a way to duplicate the problem in a fresh installation, I'm confident there will be interest to investigate the change. I was unable to duplicate the issue in a fresh installation.


          Chris Wilson added a comment -

          The older LTS Jenkins (2.462.3), which had been fine for many months, just failed in the same way. So it seems like the upgrade to 2.479.1 was not the problem. I'm going to try disabling the logfile size checker plugin (thanks markewaite) to see if that helps.


          Chris Wilson added a comment - edited

          More info for future reference/debugging:

          There were many threads stuck in the same state as before, waiting on the same object:

          waiting on java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@4c072eb0
          at java.base@17.0.13/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
          at java.base@17.0.13/java.util.concurrent.FutureTask.awaitDone(FutureTask.java:447)
          at java.base@17.0.13/java.util.concurrent.FutureTask.get(FutureTask.java:190)
          at hudson.tasks.BuildTrigger.execute(BuildTrigger.java:268)
          at hudson.model.AbstractBuild$AbstractBuildExecution.cleanUp(AbstractBuild.java:728)
          at hudson.model.Build$BuildExecution.cleanUp(Build.java:194)
          at hudson.model.Run.execute(Run.java:1874)

          The first one that hung was a job that generates and loads Groovy DSL. I think this might be important as I have a feeling that this was involved in previous hangs.

          I killed that job using this script (inspired by a Stack Overflow post):

          // Interrupt the hung build: walk all busy executors across all nodes
          // and interrupt the one in the "Executor #13" slot.
          jenkins.model.Jenkins.instance
              .computers.collect { c -> c.executors }
              .collectMany { it.findAll { it.isBusy() } }
              .each {
                  println(it.name)
                  if (it.getDisplayName() == "Executor #13") {
                      it.interrupt()
                  }
              }

          This did not immediately cause the other jobs to resume. Eventually (just after the hour) one newly timer-triggered job started and completed, and that seemed to unstick all the blocked jobs at the same time (like a notifyAll); the build queue went back to normal without restarting Jenkins.

          The hung threads all seemed to be waiting for Jenkins.getFutureDependencyGraph() to complete. Possibly recomputing the dependency graph was not hung, just very slow?
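
          A small sketch to check that next time (it only assumes the getFutureDependencyGraph() and rebuildDependencyGraph() calls already referenced above; note the synchronous rebuild may itself be slow on a large instance):

              // Check whether the dependency-graph future has completed, then time
              // a synchronous rebuild to distinguish "hung" from "just very slow".
              def future = jenkins.model.Jenkins.get().getFutureDependencyGraph()
              println "Dependency graph future done: ${future.isDone()}"

              long start = System.currentTimeMillis()
              jenkins.model.Jenkins.get().rebuildDependencyGraph()
              println "Synchronous rebuild took ${System.currentTimeMillis() - start} ms"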

          I was hoping to use this script (again, inspired by a Stack Overflow post) to diagnose who was holding the lock (matching against the many "waiting on" lines in the thread dump), but the stuck queue cleared before I got to it:

          import java.lang.management.ManagementFactory
          import java.lang.management.ThreadInfo
          import java.lang.management.ThreadMXBean

          // Dump, for every thread, the lock it is waiting on and who owns it,
          // to match against the "waiting on ..." lines in the thread dump.
          // Jenkins.get().getFutureDependencyGraph()//.get()
          ThreadMXBean bean = ManagementFactory.getThreadMXBean()
          ThreadInfo[] tis = bean.getThreadInfo(bean.getAllThreadIds(), true, true)
          for (ThreadInfo ti : tis) {
              println("${ti.getThreadName()}: waiting on ${ti.getLockName()} (owned by ${ti.getLockOwnerName()})")
          }


            Assignee: Unassigned
            Reporter: Chris Wilson (qris)
            Votes: 0
            Watchers: 2