Jenkins / JENKINS-74957

Jenkins eventually stops responding properly and running jobs since 2.479.1


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Component/s: core
    • Labels: None

      Jenkins 2.462.x had been running fine for us since May. About two weeks ago we upgraded to 2.479.1 (and all plugins to their latest compatible versions).

      A few days later, the Build Time Trend stopped loading properly (it says "Computation in progress" forever). The following Saturday, Jenkins stopped running jobs (a parent job kicked off a downstream job but was not woken up when the downstream job completed); the restart then timed out and Jenkins was killed by systemd.

      This week, the Build Time Trend stopped loading properly again, and then Jenkins started to accumulate a growing build queue, so I had to restart it (I downgraded back to 2.462.3 at the same time, as it's our production instance and we can't afford much downtime).

      I know we are using a lot of plugins and the problem could potentially be there, but I don't know in which one and can't really experiment with disabling random plugins on the production server. I suspect that we wouldn't see the issue on a test instance. Anything that could point to which plugin could be causing the problem would be most helpful.

      Looking at the thread dump, only one thing stands out: we have 10 threads running the logfilesizechecker plugin, all busy trying to check if something is a Gzip stream, which seems like far too many of these threads:

      "jenkins.util.Timer [#10]" Id=67 Group=main RUNNABLE
      	at java.base@17.0.13/sun.nio.fs.UnixNativeDispatcher.open0(Native Method)
      	at java.base@17.0.13/sun.nio.fs.UnixNativeDispatcher.open(UnixNativeDispatcher.java:68)
      	at java.base@17.0.13/sun.nio.fs.UnixChannelFactory.open(UnixChannelFactory.java:258)
      	at java.base@17.0.13/sun.nio.fs.UnixChannelFactory.newFileChannel(UnixChannelFactory.java:133)
      	at java.base@17.0.13/sun.nio.fs.UnixChannelFactory.newFileChannel(UnixChannelFactory.java:146)
      	at java.base@17.0.13/sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:216)
      	at java.base@17.0.13/java.nio.file.Files.newByteChannel(Files.java:380)
      	at java.base@17.0.13/java.nio.file.Files.newByteChannel(Files.java:432)
      	at java.base@17.0.13/java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:422)
      	at java.base@17.0.13/java.nio.file.Files.newInputStream(Files.java:160)
      	at org.kohsuke.stapler.framework.io.LargeText$GzipAwareSession.isGzipStream(LargeText.java:542)
      	at org.kohsuke.stapler.framework.io.LargeText.<init>(LargeText.java:110)
      	at hudson.console.AnnotatedLargeText.<init>(AnnotatedLargeText.java:88)
      	at hudson.model.Run.getLogText(Run.java:1505)
      	at PluginClassLoader for logfilesizechecker//hudson.plugins.logfilesizechecker.LogfilesizecheckerWrapper$LogSizeTimerTask.doRun(LogfilesizecheckerWrapper.java:108)
      	at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:92)
      	at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:67)
      	at java.base@17.0.13/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
      	at java.base@17.0.13/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
      	at java.base@17.0.13/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
      	at java.base@17.0.13/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
      	at java.base@17.0.13/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
      	at java.base@17.0.13/java.lang.Thread.run(Thread.java:840)
      
      	Number of locked synchronizers = 1
      	- java.util.concurrent.ThreadPoolExecutor$Worker@89ed5b6
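
      One rough way to check how many threads are in this state is to scan a saved thread dump for a frame of interest. The sketch below assumes the dump is plain text in the format shown above (each thread starts with a quoted name on its own line); the function names are made up for illustration and are not part of Jenkins.

```python
import re
from collections import Counter

STATE = re.compile(r"\b(RUNNABLE|BLOCKED|WAITING|TIMED_WAITING)\b")

def split_threads(dump: str) -> list[str]:
    """Split a thread dump into per-thread blocks (a block starts at a quoted thread name)."""
    blocks, current = [], []
    for line in dump.splitlines():
        if line.lstrip().startswith('"'):
            if current:
                blocks.append("\n".join(current))
            current = [line]
        elif current:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks

def count_threads_with_frame(dump: str, needle: str) -> Counter:
    """Count threads, keyed by thread state, whose block contains `needle`."""
    counts = Counter()
    for block in split_threads(dump):
        if needle not in block:
            continue
        header = block.splitlines()[0]
        # Look for the state only after the closing quote of the thread name.
        match = STATE.search(header.rsplit('"', 1)[-1])
        counts[match.group(1) if match else "UNKNOWN"] += 1
    return counts
```

      For example, `count_threads_with_frame(dump_text, "logfilesizechecker")` reports how many threads have a logfilesizechecker frame in each state. Comparing that number across several dumps taken a few minutes apart would show whether the same timer threads stay busy there.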
      

      I also have a job stuck waiting for the previous build to complete, even though that build has already finished.

      The previous build's log ends with these lines, but the console is still showing the animated dots below them:

      Errors were encountered
      Build step 'Execute shell' marked build as failure
      Sending e-mails to: xxx@xxx.com
      Notifying upstream projects of job completion
      

      And the thread dump shows:

      "Executor #0 for lhr-vexec02-fast : executing nexus_db_replicate_roles #295805" Id=146231 Group=main WAITING on java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@4c072eb0
      	at java.base@17.0.13/jdk.internal.misc.Unsafe.park(Native Method)
      	- waiting on java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@4c072eb0
      	at java.base@17.0.13/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
      	at java.base@17.0.13/java.util.concurrent.FutureTask.awaitDone(FutureTask.java:447)
      	at java.base@17.0.13/java.util.concurrent.FutureTask.get(FutureTask.java:190)
      	at hudson.tasks.BuildTrigger.execute(BuildTrigger.java:268)
      	at hudson.model.AbstractBuild$AbstractBuildExecution.cleanUp(AbstractBuild.java:728)
      	at hudson.model.Build$BuildExecution.cleanUp(Build.java:194)
      	at hudson.model.Run.execute(Run.java:1874)
      	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
      	at hudson.model.ResourceController.execute(ResourceController.java:101)
      	at hudson.model.Executor.run(Executor.java:445)

      Many jobs are stuck waiting on the same object, so I'm going to have to restart Jenkins soon:

      	- waiting on java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@4c072eb0
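
      To see how many executors are parked on the same object, a saved thread dump can be grouped by its "waiting on" lines. This is only a sketch, assuming the text format shown above; `threads_by_lock` is a hypothetical helper name, and it only matches the lowercase "waiting on" marker inside the stack, whatever bullet or dash precedes it.

```python
import re
from collections import defaultdict

WAITING_ON = re.compile(r"waiting on (\S+)")

def threads_by_lock(dump: str) -> dict[str, list[str]]:
    """Map each 'waiting on' target to the names of the threads waiting on it."""
    waiters = defaultdict(list)
    name = None
    for line in dump.splitlines():
        stripped = line.strip()
        if stripped.startswith('"'):
            # Thread header: remember the quoted thread name and move on.
            name = stripped.split('"')[1]
            continue
        match = WAITING_ON.search(stripped)
        if match and name is not None and name not in waiters[match.group(1)]:
            waiters[match.group(1)].append(name)
    return dict(waiters)
```

      Sorting the result by `len(names)` puts the most contended object first. If most executors turn out to wait on a single `ScheduledFutureTask`, the next question is which thread is supposed to run that task.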

      From the logs, it seems that builds start completing without the deletion of an older build that normally follows (usually each onCompleted is followed by an onDeleted), and jobs then end up stuck in the build queue:

      Dec 05 09:54:59 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:54:59.420+0000 [id=145920]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: blackops_import_marketaxess_arm_respo>
      Dec 05 09:55:57 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:55:57.102+0000 [id=145912]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: intraday_alloc_barcap_fo #96214
      Dec 05 09:55:57 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:55:57.121+0000 [id=145912]        INFO        o.j.p.l.queue.LockRunListener#onDeleted: intraday_alloc_barcap_fo #95820
      Dec 05 09:55:57 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:55:57.402+0000 [id=145916]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: intraday_recon_ms_futures #436179
      Dec 05 09:55:57 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:55:57.412+0000 [id=145916]        INFO        o.j.p.l.queue.LockRunListener#onDeleted: intraday_recon_ms_futures #435398
      Dec 05 09:56:01 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:56:01.741+0000 [id=145882]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: report_blackops_margin_spot_consolida>
      Dec 05 09:56:01 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:56:01.752+0000 [id=145882]        INFO        o.j.p.l.queue.LockRunListener#onDeleted: report_blackops_margin_spot_consolidate>
      Dec 05 09:56:10 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:56:10.959+0000 [id=140281]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: nexus_db_mirror_checker #171910
      Dec 05 09:56:10 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:56:10.969+0000 [id=140281]        INFO        o.j.p.l.queue.LockRunListener#onDeleted: nexus_db_mirror_checker #171775
      Dec 05 09:57:02 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:57:02.155+0000 [id=145926]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: blackops_import_ubs_pb_cash_positions>
      Dec 05 09:57:02 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:57:02.165+0000 [id=145926]        INFO        o.j.p.l.queue.LockRunListener#onDeleted: blackops_import_ubs_pb_cash_positions_c>
      Dec 05 09:57:12 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:57:12.472+0000 [id=138985]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: nexus_db_replicate_roles #295798
      Dec 05 09:57:12 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:57:12.489+0000 [id=138985]        INFO        o.j.p.l.queue.LockRunListener#onDeleted: nexus_db_replicate_roles #295642
      Dec 05 09:57:20 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:57:20.796+0000 [id=145857]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: intraday_block_ubs_fx #228536
      Dec 05 09:57:20 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:57:20.810+0000 [id=145857]        INFO        o.j.p.l.queue.LockRunListener#onDeleted: intraday_block_ubs_fx #228114
      Dec 05 09:57:21 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:57:21.544+0000 [id=138981]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: intraday_alloc_jpm_lme_futures #81195
      Dec 05 09:57:21 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:57:21.549+0000 [id=138981]        INFO        o.j.p.l.queue.LockRunListener#onDeleted: intraday_alloc_jpm_lme_futures #81015
      Dec 05 09:59:52 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:59:52.735+0000 [id=145802]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: it_cam_make_file_windows-netapp-cifs >
      Dec 05 09:59:52 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 09:59:52.741+0000 [id=145802]        INFO        o.j.p.l.queue.LockRunListener#onDeleted: it_cam_make_file_windows-netapp-cifs #2>
      Dec 05 10:00:02 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 10:00:02.271+0000 [id=145951]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: it_cam_change_file_linux-netapp-nfs #>
      Dec 05 10:00:02 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 10:00:02.288+0000 [id=145951]        INFO        o.j.p.l.queue.LockRunListener#onDeleted: it_cam_change_file_linux-netapp-nfs #29>
      Dec 05 10:00:07 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 10:00:07.582+0000 [id=145977]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: blackops_import_marketaxess_arm_respo>
      Dec 05 10:00:09 jenkins-node-01 jenkins-prod[2672733]: 2024-12-05 10:00:09.267+0000 [id=145990]        INFO        o.j.p.l.queue.LockRunListener#onCompleted: intraday_alloc_ne_lme #146288
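
      The pattern in these logs can be checked mechanically. The sketch below assumes log lines in the format shown above and pairs events by the `[id=...]` thread id; `completed_without_deleted` is a hypothetical helper. A missing onDeleted is only a hint, since a job without a build discard policy would presumably never log one.

```python
import re

EVENT = re.compile(r"\[id=(\d+)\].*LockRunListener#(onCompleted|onDeleted): (.+)$")

def completed_without_deleted(lines):
    """Return (thread_id, build) for each onCompleted with no later onDeleted on the same thread id."""
    pending = {}    # thread id -> build text of the latest unmatched onCompleted
    unmatched = []
    for line in lines:
        match = EVENT.search(line)
        if not match:
            continue
        thread_id, event, build = match.group(1), match.group(2), match.group(3).strip()
        if event == "onCompleted":
            if thread_id in pending:
                # A second onCompleted arrived before any onDeleted for the first.
                unmatched.append((thread_id, pending[thread_id]))
            pending[thread_id] = build
        else:
            pending.pop(thread_id, None)
    unmatched.extend(pending.items())
    return unmatched
```

      The input can be any iterable of lines, for example an open file holding the saved journal output. The timestamp of the first unmatched build gives a rough idea of when the problem began.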
      

            Assignee: Unassigned
            Reporter: Chris Wilson (qris)
            Votes: 0
            Watchers: 2