JENKINS-59793

Possible thread leak 'QueueSubTaskMetrics' in metrics - Allow finishing builds when SubTask badly fail

    • jenkins-2.205

      In one large instance we can see 3000 threads with this shape:

       

      "QueueSubTaskMetrics [#11342]" #5508106 daemon prio=5 os_prio=0 tid=0x00007efcf085a800 nid=0x52c7 in Object.wait() [0x00007efbccb32000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:502) at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:75) - locked <0x0000000512c01d10> (a hudson.model.queue.FutureImpl) at jenkins.metrics.impl.JenkinsMetricProviderImpl.lambda$asSupplier$3(JenkinsMetricProviderImpl.java:1142) at jenkins.metrics.impl.JenkinsMetricProviderImpl$$Lambda$388/1851215464.get(Unknown Source) at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Locked ownable synchronizers: - <0x00000004dc401fb0> (a java.util.concurrent.ThreadPoolExecutor$Worker)

       

      Far fewer jobs are running or waiting (130), and the number of these threads never decreases.

      After investigating the code:

      • The Metrics plugin listens to the queue with a QueueListener that creates threads named QueueSubTaskMetrics.
      • The 3000 threads shown are all waiting for a future to be completed, in this loop:
      while (!completed)
         wait();

      I think this could be avoided by calling the get method with a timeout instead of blocking there indefinitely: https://github.com/jenkinsci/metrics-plugin/commit/e803bc3b82b54bfe27e66a6393dedf53bdf1896e#diff-b02885f3ba6b4982b5322b73e664c0b6R1049
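
A minimal sketch of the timeout idea, using only java.util.concurrent rather than the plugin's actual code. The class name BoundedWait, the method getOrFallback, and the fallback value are hypothetical and chosen for illustration; the linked commit may do this differently.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedWait {

    /**
     * Waits for the future for at most timeoutMillis milliseconds.
     * Returns the fallback if the future times out, fails, or is cancelled,
     * so the calling thread is always released.
     */
    public static <T> T getOrFallback(Future<T> future, long timeoutMillis, T fallback) {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException | CancellationException e) {
            return fallback;
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it.
            Thread.currentThread().interrupt();
            return fallback;
        }
    }

    public static void main(String[] args) {
        // A future that nobody ever completes, like the ones the leaked threads wait on.
        CompletableFuture<String> neverCompleted = new CompletableFuture<>();
        System.out.println(getOrFallback(neverCompleted, 100, "timed out"));

        // A future that is already complete returns its value immediately.
        CompletableFuture<String> done = CompletableFuture.completedFuture("SUCCESS");
        System.out.println(getOrFallback(done, 100, "timed out"));
    }
}
```

With a bounded wait the thread goes back to the pool even if the future is never set. The cost is that a suitable timeout has to be chosen, and a legitimately long-running task may be reported as timed out.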

       

       


          Ramon Leon created issue -

          Daniel Beck added a comment -

          Interesting, that might have been a source of too many threads on ci.jenkins.io recently as well.

          Baptiste Mathus made changes -
          Priority Original: Major [ 3 ] New: Critical [ 2 ]

          Raihaan Shouhell added a comment -

          Confirming that my instances get the same issue on Jenkins 2.176.2 with Metrics 4.0.2.4

          Ramon Leon added a comment -

          Another approach, I think the best candidate, but it requires fixing Jenkins core:

          I’m wondering whether we could solve the problem of so many waiting QueueSubTaskMetrics threads by ensuring that synchronizeEnd and synchronizeStart always set the future, for example by wrapping the body of these two methods in a try block and setting the future in the finally clause before the exception propagates. My reasoning: if an exception is thrown in these methods, the result is never set, so the waiting thread blocks forever because the completed flag is never set.

          WorkUnitContext.java#L132

          But this is a fix on Jenkins core, not in Metrics plugin.
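
The idea above can be sketched as follows. This is not the Jenkins core code: AlwaysCompleteFuture, Step, and runAndSignal are hypothetical names, and a plain CompletableFuture stands in for the future that the waiting threads block on.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;

public class AlwaysCompleteFuture {

    /** Hypothetical stand-in for a synchronization step that may fail. */
    public interface Step {
        void run() throws Exception;
    }

    /**
     * Runs the step and always completes the future in the finally clause,
     * normally on success and exceptionally on failure. The original
     * exception is still rethrown to the caller.
     */
    public static void runAndSignal(Step step, CompletableFuture<String> future) throws Exception {
        Throwable problem = null;
        try {
            step.run();
        } catch (Throwable t) {
            problem = t;
            throw t; // precise rethrow: only Exception or unchecked types can reach here
        } finally {
            if (problem == null) {
                future.complete("OK");
            } else {
                future.completeExceptionally(problem);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> future = new CompletableFuture<>();
        try {
            runAndSignal(() -> { throw new IllegalStateException("step failed"); }, future);
        } catch (IllegalStateException expected) {
            // The failure still propagates to the caller.
        }
        try {
            // The waiter is released with an ExecutionException instead of blocking forever.
            future.get(1, TimeUnit.SECONDS);
        } catch (ExecutionException e) {
            System.out.println("waiter released: " + e.getCause().getMessage());
        }
    }
}
```

Because the future is set on every path, a thread blocked in get() is released with either the result or the failure, without needing a timeout.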

          Ramon Leon made changes -
          Assignee New: Ramon Leon [ mramonleon ]
          Ramon Leon made changes -
          Status Original: Open [ 1 ] New: In Progress [ 3 ]

          Oliver Gondža added a comment -

          We are being impacted by this as well.

          mramonleon I believe the mistake is in the metrics plugin, which seems to presume that when QueueListener#onLeft is called, its task will eventually start. At first glance, this does not appear to be true when the task is canceled while in the queue. There might be other reasons I am not aware of, too.
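
To illustrate the hypothesis in the comment above, here is a small simulation. It is not the plugin's code or the Jenkins API: LeftItemSketch, LeftItem, its cancelled flag, and the pending counter are all hypothetical stand-ins. The sketch assumes the listener can tell that an item left the queue because it was cancelled, and in that case skips waiting for a start that will never happen.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

public class LeftItemSketch {

    /** Hypothetical, simplified stand-in for an item that has left the queue. */
    public static final class LeftItem {
        public final boolean cancelled;
        public final CompletableFuture<String> future = new CompletableFuture<>();

        public LeftItem(boolean cancelled) {
            this.cancelled = cancelled;
        }
    }

    // Number of items whose completion is still being tracked.
    private final AtomicInteger pending = new AtomicInteger();

    /**
     * Called when an item leaves the queue. Returns true if the item is tracked.
     * Cancelled items are skipped because their future may never be completed.
     */
    public boolean onLeft(LeftItem item) {
        if (item.cancelled) {
            return false;
        }
        pending.incrementAndGet();
        // Register a callback instead of parking a thread in get().
        item.future.whenComplete((result, error) -> pending.decrementAndGet());
        return true;
    }

    public int pending() {
        return pending.get();
    }

    public static void main(String[] args) {
        LeftItemSketch listener = new LeftItemSketch();

        listener.onLeft(new LeftItem(true));          // cancelled in queue: not tracked
        LeftItem started = new LeftItem(false);
        listener.onLeft(started);                     // left the queue to run: tracked
        System.out.println("pending after onLeft: " + listener.pending());

        started.future.complete("SUCCESS");
        System.out.println("pending after completion: " + listener.pending());
    }
}
```

Using a completion callback rather than a blocking wait is a design choice of this sketch only; it means that even an item that is wrongly tracked costs a counter entry rather than a parked thread.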
          CloudBees Foundation Security made changes -
          Comment [ I'm working right now on trying to reproduce these leaks playing with the cancellation of the build. Whatever clue or suggestion is welcome. :sweat: Thanks! ]

          Ramon Leon added a comment -

          I'm currently trying to reproduce these leaks by experimenting with build cancellation. Any clue or suggestion is welcome. :sweat: Thanks!

