Jenkins / JENKINS-72678

Thread Deadlock during GitHub Organization Scan and WeatherColumn


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Environment: Running Jenkins controllers on AWS ECS, using EFS for Jenkins home storage.

      Jenkins Controller: 2.426.1, 2.426.3
      github-branch-source-plugin: 1752.vc201a_0235d80
      cloudbees-folder-plugin: 6.858.v898218f3609d

      We are seeing occasional deadlocked/blocked threads during GitHub Organization Scans, which appear to be caused by the health checks/WeatherColumn.

      From the catalina.log:

      Some health checks are reporting as unhealthy: [thread-deadlock : [Executor #-1 for Built-In Node : executing OrganizationScan[CP-Models] locked on hudson.model.RunMap@50446244 (owned by Handling GET /job/CP-Models/ from 192.168.1.143 : Jetty (winstone)-345663 View/index.jelly WeatherColumn/column.jelly):
           at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:526)
           at jenkins.model.lazy.AbstractLazyLoadRunMap.search(AbstractLazyLoadRunMap.java:502)
           at jenkins.model.lazy.AbstractLazyLoadRunMap.newestBuild(AbstractLazyLoadRunMap.java:435)
           at jenkins.model.lazy.LazyBuildMixIn.getLastBuild(LazyBuildMixIn.java:254)
           at org.jenkinsci.plugins.workflow.job.WorkflowJob.getLastBuild(WorkflowJob.java:242)
           at org.jenkinsci.plugins.workflow.job.WorkflowJob.getLastBuild(WorkflowJob.java:105)
           at hudson.model.Job.getBuildHealthReports(Job.java:1200)
           at hudson.model.Job.getBuildHealth(Job.java:1193)
           at com.cloudbees.hudson.plugins.folder.health.FolderHealthMetric.getHealthReport(FolderHealthMetric.java:48)
           at com.cloudbees.hudson.plugins.folder.health.WorstChildHealthMetric$ReporterImpl.observe(WorstChildHealthMetric.java:86)
           at com.cloudbees.hudson.plugins.folder.AbstractFolder.getBuildHealthReports(AbstractFolder.java:924)
           at java.base@11.0.21/java.lang.invoke.DirectMethodHandle$Holder.invokeVirtual(DirectMethodHandle$Holder)
           at java.base@11.0.21/java.lang.invoke.LambdaForm$MH/0x0000000800277840.invoke(LambdaForm$MH)
           at java.base@11.0.21/java.lang.invoke.LambdaForm$MH/0x00000008002a7440.invoke_MT(LambdaForm$MH)
           at org.kohsuke.stapler.export.MethodProperty.getValue(MethodProperty.java:76)
           at org.kohsuke.stapler.export.ExportInterceptor$1.getValue(ExportInterceptor.java:46)
           at io.jenkins.plugins.generic.event.json.ExportedBeanProcessor$IgnoreURLExportInterceptor.getValue(ExportedBeanProcessor.java:71)
           at org.kohsuke.stapler.export.Property.writeTo(Property.java:136)
           at org.kohsuke.stapler.export.Model.writeNestedObjectTo(Model.java:222)
           at org.kohsuke.stapler.export.Model.writeNestedObjectTo(Model.java:218)
           at org.kohsuke.stapler.export.Model.writeNestedObjectTo(Model.java:218)
           at org.kohsuke.stapler.export.Model.writeNestedObjectTo(Model.java:218)
           at org.kohsuke.stapler.export.Model.writeTo(Model.java:193)
           at org.kohsuke.stapler.export.Model.writeTo(Model.java:213)
           at org.kohsuke.stapler.export.Model.writeTo(Model.java:181)
           at io.jenkins.plugins.generic.event.json.ExportedBeanProcessor.processBean(ExportedBeanProcessor.java:43)
           at net.sf.json.JSONObject._fromBean(JSONObject.java:619)
           at net.sf.json.JSONObject.fromObject(JSONObject.java:169)
           at net.sf.json.AbstractJSON._processValue(AbstractJSON.java:250)
      ...

      See the attached catalina.log stack trace for the full trace. I have also attached thread dumps showing the BLOCKED threads.
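
      For context, the thread-deadlock health check fires when two threads each hold a monitor the other one is waiting on. The sketch below is only a generic, hypothetical Groovy illustration of that pattern; the lock and thread names are stand-ins, not the actual monitors from the attached dumps.

           // Hypothetical illustration only: two threads acquire the same pair of
           // monitors in opposite order, so each ends up waiting on the lock the
           // other holds. This script deadlocks by design; the real monitors in the
           // attached dumps differ (e.g. the RunMap owned by the Jetty thread).
           Object lockA = new Object()
           Object lockB = new Object()

           Thread.start("organization-scan") {
               synchronized (lockA) {
                   sleep 100
                   synchronized (lockB) { println "never reached" }
               }
           }
           Thread.start("weather-column-render") {
               synchronized (lockB) {
                   sleep 100
                   synchronized (lockA) { println "never reached" }
               }
           }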

      This looked like the weather health metric/folder performance issue documented by CloudBees (link below). However, we have disabled all health metrics on all folders, including GitHub Organization folders, and we still see deadlocked/blocked threads.

      How to disable the weather column to resolve instance slowness? (cloudbees.com)
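
      As a sanity check, a minimal script-console sketch along the lines below can list any folders (including GitHub Organization folders) that still carry folder health metrics. It assumes the cloudbees-folder plugin's AbstractFolder.getHealthMetrics() getter and is only a sketch, not a fix:

           // Sketch only: list folders that still have health metrics configured.
           // Run from Manage Jenkins > Script Console; assumes cloudbees-folder's
           // AbstractFolder.getHealthMetrics() getter.
           import com.cloudbees.hudson.plugins.folder.AbstractFolder
           import jenkins.model.Jenkins

           Jenkins.get().getAllItems(AbstractFolder.class).each { folder ->
               def metrics = folder.getHealthMetrics()
               if (metrics != null && !metrics.isEmpty()) {
                   def names = metrics.collect { it.getClass().simpleName }
                   println(folder.fullName + ": " + names)
               }
           }

      An empty result confirms that no folder-level health metric remains configured.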

      When this occurs, we have to restart the Jenkins controller. The problem seems to have started after we upgraded to 2.426.1 back in December; our controllers had been rock solid on the previous version, 2.387.3. We are currently running 2.426.3 and still see the issue.

      Attachments:
        1. 20240205_threaddump.txt (679 kB, John Lengeling)
        2. 20240221_threaddump.txt (2.71 MB, John Lengeling)
        3. catalina-stacktrace.log (14 kB, John Lengeling)

            Assignee: Unassigned
            Reporter: John Lengeling (johnlengeling)
            Votes: 2
            Watchers: 6
