
Possible thread leak 'QueueSubTaskMetrics' in metrics - Allow finishing builds when SubTask badly fail

    • jenkins-2.205

      In one large instance we can see 3000 threads with this shape:

      "QueueSubTaskMetrics [#11342]" #5508106 daemon prio=5 os_prio=0 tid=0x00007efcf085a800 nid=0x52c7 in Object.wait() [0x00007efbccb32000]
         java.lang.Thread.State: WAITING (on object monitor)
              at java.lang.Object.wait(Native Method)
              at java.lang.Object.wait(Object.java:502)
              at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:75)
              - locked <0x0000000512c01d10> (a hudson.model.queue.FutureImpl)
              at jenkins.metrics.impl.JenkinsMetricProviderImpl.lambda$asSupplier$3(JenkinsMetricProviderImpl.java:1142)
              at jenkins.metrics.impl.JenkinsMetricProviderImpl$$Lambda$388/1851215464.get(Unknown Source)
              at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)

         Locked ownable synchronizers:
              - <0x00000004dc401fb0> (a java.util.concurrent.ThreadPoolExecutor$Worker)

       

      There are far fewer jobs running or waiting (130), and the number of these threads never decreases.

      After investigating the code:

      • Metrics listens to the queue with a QueueListener, creating threads named QueueSubTaskMetrics.
      • The 3000 threads shown are all waiting for a future to complete, essentially:

      while (!completed)
          wait();

      I think it could be avoided by calling the get method with a timeout instead of staying blocked there indefinitely (see the sketch below): https://github.com/jenkinsci/metrics-plugin/commit/e803bc3b82b54bfe27e66a6393dedf53bdf1896e#diff-b02885f3ba6b4982b5322b73e664c0b6R1049
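
      A minimal sketch of that idea, assuming a plain java.util.concurrent.Future; the helper name and the timeout value are hypothetical and not the actual metrics-plugin change:

      import java.util.concurrent.Future;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.TimeoutException;

      class BoundedGet {
          // Hypothetical helper: wait for the future with an upper bound so the
          // metrics worker thread cannot stay parked forever if the task never finishes.
          static <T> T getOrNull(Future<T> future, long timeout, TimeUnit unit) {
              try {
                  return future.get(timeout, unit); // bounded wait instead of future.get()
              } catch (TimeoutException e) {
                  return null; // drop this sample rather than leak a blocked thread
              } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();
                  return null;
              } catch (Exception e) {
                  throw new RuntimeException(e);
              }
          }
      }

      Returning null on timeout means a sample may occasionally be missed, but the QueueSubTaskMetrics threads can always terminate.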

       

       

          [JENKINS-59793] Possible thread leak 'QueueSubTaskMetrics' in metrics - Allow finishing builds when SubTask badly fail

          Ramon Leon added a comment -

          PR: https://github.com/jenkinsci/jenkins/pull/4346

          Ramon Leon added a comment -

          I've pushed the failing test; after the CI fails, I will push the fix.


          Ramon Leon added a comment (edited) -

          Besides the code in Metrics:

           

          https://github.com/jenkinsci/metrics-plugin/blob/24bf92ebd59095d8f77b2552696b3f210a024bdc/src/main/java/jenkins/metrics/impl/JenkinsMetricProviderImpl.java#L1142

           

          try {
              return future.get();
          } catch (Throwable t) {
              sneakyThrow(t);
              throw new RuntimeException(t);
          }
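
          For reference, sneakyThrow here is presumably the usual generics-based rethrow idiom, roughly like the sketch below (the plugin's own helper may differ); the unreachable throw new RuntimeException(t) only satisfies the compiler, while the actual blocking happens in the unbounded future.get() above it.

          // Common "sneaky throw" idiom: the unchecked cast lets a checked Throwable
          // propagate without being declared. Sketch only; not necessarily the plugin's exact helper.
          @SuppressWarnings("unchecked")
          static <T extends Throwable> void sneakyThrow(Throwable t) throws T {
              throw (T) t;
          }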
          

           

          There is another piece of affected code in Pipeline:

          https://github.com/jenkinsci/workflow-cps-plugin/blob/master/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java#L940

          try {
              return getCurrentExecutions(false).get(1, TimeUnit.SECONDS).stream().anyMatch(StepExecution::blocksRestart);
          } catch (Exception x) {
              // TODO RestartListener.Default.isReadyToRestart can throw checked exceptions, but AsynchronousExecution.blocksRestart does not currently allow it
              Level level = x.getCause() instanceof RejectedExecutionException ? /* stray Executor past program end? */ Level.FINE : Level.WARNING;
              LOGGER.log(level, "Not blocking restart due to problem checking running steps in " + this, x);
              return false;
          }
          

          We see in the log:

           

           

          2019-10-14 13:16:56.104+0000 [id=16386139]  WARNING o.j.p.w.cps.CpsFlowExecution#blocksRestart: Not blocking restart due to problem checking running steps in CpsFlowExecution[******]
          java.util.concurrent.TimeoutException: Timeout waiting for task.
          at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:259)    
          at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:91)    
          at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.blocksRestart(CpsFlowExecution.java:940)    
          at org.jenkinsci.plugins.workflow.job.WorkflowRun$2.blocksRestart(WorkflowRun.java:372)    
          at org.jenkinsci.plugins.workflow.job.WorkflowRun$2.displayCell(WorkflowRun.java:375)    
          at hudson.model.Executor.isDisplayCell(Executor.java:668)
          ...

           

          Fortunately, this code only waits 1 second, so it doesn't block anything, although it fills the log with these messages.


          Jesse Glick added a comment -

          I do not believe the workflow-cps code mentioned has any relation to this issue. Perhaps I am missing something.


          Ramon Leon added a comment -

          I mentioned it because, in the instance where we detected the metrics threads, there were a lot of log messages of this type. But maybe it's not related and it's just a performance issue.


          Ramon Leon added a comment -

          I've added what has been done to the title, for the sake of searches.


          Oliver Gondža added a comment -

          mramonleon, thanks for the prompt fix of the issue. I would appreciate it if you could report back whether this fully remediated the problem in the affected production deployment.


          Ramon Leon added a comment -

          It will take some time to land; I can give you some feedback by the end of December.


          Raihaan Shouhell added a comment -

          We seem to be seeing this issue with metrics. Will this be backported to the LTS release, olivergondza?

          We are currently on 2.176.2.


          Oliver Gondža added a comment -

          This was backported to 2.204.1. The RC is expected to be out on Dec 4th, and the final release on Dec 18th.

