[JENKINS-59793] Possible thread leak 'QueueSubTaskMetrics' in metrics - Allow finishing builds when SubTask badly fail

    • jenkins-2.205

      In one large instance we can see 3000 threads with this shape:

      "QueueSubTaskMetrics [#11342]" #5508106 daemon prio=5 os_prio=0 tid=0x00007efcf085a800 nid=0x52c7 in Object.wait() [0x00007efbccb32000]
         java.lang.Thread.State: WAITING (on object monitor)
              at java.lang.Object.wait(Native Method)
              at java.lang.Object.wait(Object.java:502)
              at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:75)
              - locked <0x0000000512c01d10> (a hudson.model.queue.FutureImpl)
              at jenkins.metrics.impl.JenkinsMetricProviderImpl.lambda$asSupplier$3(JenkinsMetricProviderImpl.java:1142)
              at jenkins.metrics.impl.JenkinsMetricProviderImpl$$Lambda$388/1851215464.get(Unknown Source)
              at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)

         Locked ownable synchronizers:
              - <0x00000004dc401fb0> (a java.util.concurrent.ThreadPoolExecutor$Worker)

      There are far fewer jobs running or waiting (around 130), and the number of these threads never decreases.

      After investigating the code:

      • Metrics listens to the queue with a QueueListener, creating threads named QueueSubTaskMetrics.
      • The 3000 threads shown are all waiting for a future to complete, blocked in the loop inside hudson.remoting.AsyncFutureImpl#get:

      while (!completed)
          wait();

      I think it could be avoided by calling the get method with a timeout instead of blocking there: https://github.com/jenkinsci/metrics-plugin/commit/e803bc3b82b54bfe27e66a6393dedf53bdf1896e#diff-b02885f3ba6b4982b5322b73e664c0b6R1049
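
      A minimal sketch of that idea, assuming the future is a plain java.util.concurrent.Future; the five-minute limit and the fallback behaviour are illustrative, not the actual plugin change:

      import java.util.concurrent.ExecutionException;
      import java.util.concurrent.Future;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.TimeoutException;

      // Sketch only: bound the wait instead of blocking the pool thread forever.
      static <T> T getWithTimeout(Future<T> future) {
          try {
              // Assumed limit; the real value would need tuning.
              return future.get(5, TimeUnit.MINUTES);
          } catch (TimeoutException e) {
              // Give up so the QueueSubTaskMetrics thread can be reclaimed.
              throw new RuntimeException("Timed out waiting for queue task", e);
          } catch (ExecutionException e) {
              throw new RuntimeException(e.getCause());
          } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
              throw new RuntimeException(e);
          }
      }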


          Daniel Beck added a comment -

          Interesting, that might have been a source of too many threads on ci.jenkins.io recently as well.


          Raihaan Shouhell added a comment -

          Confirming that my instances get the same issue on Jenkins 2.176.2 with Metrics 4.0.2.4

          Ramon Leon added a comment -

          Another approach, which I think is the best candidate, though it requires fixing Jenkins core:

          I'm pondering whether we could solve the problem of so many QueueSubTaskMetrics threads waiting by ensuring that synchronizeEnd and synchronizeStart always set the future, for example by wrapping the body of these two methods in a try block and setting the future in the finally clause before the exception propagates. I suggest this because if an exception is thrown in these methods the result is never set, so the waiting thread stays blocked because the completed flag is never set.

          WorkUnitContext.java#L132

          But this is a fix on Jenkins core, not in Metrics plugin.
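
          A rough sketch of that idea against WorkUnitContext (the signature is simplified and isPrimary stands in for the real "is this the primary work unit?" check; this is an illustration of the approach, not the patch that was eventually merged):

          // Sketch only: release waiters on the future even if synchronize() is interrupted.
          // 'future' is the FutureImpl held by WorkUnitContext; 'isPrimary' is a placeholder
          // for the check that this executor carries the primary work unit.
          public void synchronizeEnd(Queue.Executable executable, Throwable problems, long duration)
                  throws InterruptedException {
              try {
                  endLatch.synchronize();
              } finally {
                  // Whatever happened above, make sure anyone blocked on future.get() is unblocked.
                  if (isPrimary) {
                      if (problems == null) {
                          future.set(executable);
                      } else {
                          future.set(problems);
                      }
                  }
              }
          }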


          Oliver Gondža added a comment -

          We are being impacted by this as well.

          mramonleon I believe the mistake is in the metrics plugin, which seems to presume that when QueueListener#onLeft is called, its task will eventually start. At first glance this is not true when the task is cancelled while in the queue. There might be other reasons I am not aware of, too.
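
          A sketch of the guard that observation suggests, using the real QueueListener and Queue.LeftItem APIs but with an illustrative class name; the actual metrics-plugin listener does more bookkeeping than this:

          import hudson.Extension;
          import hudson.model.Queue;
          import hudson.model.queue.QueueListener;

          // Sketch only: skip the bookkeeping for items that left the queue because they were
          // cancelled, since their executable will never run and their future will never be set.
          @Extension
          public class CancellationAwareQueueListener extends QueueListener {
              @Override
              public void onLeft(Queue.LeftItem li) {
                  if (li.isCancelled()) {
                      // The task will never start, so do not spawn a thread waiting on li.getFuture().
                      return;
                  }
                  // ... submit the asynchronous work that waits for the executable to finish ...
              }
          }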

          Ramon Leon added a comment -

          I'm working right now on reproducing these leaks by playing with the cancellation of the build. Any clue or suggestion is welcome. :sweat: Thanks!


          Ramon Leon added a comment -

          I'm trying to cancel jobs while in the Queue (cancel) or once running (getExecutor().interrupt()), but in both cases the job finishes correctly, so the metrics thread is freed. I need to figure out a way to kill a job so that the future is not set in the WorkUnitContext: https://github.com/jenkinsci/jenkins/blob/264980c707c6cafc01369378b1e90d247561df78/core/src/main/java/hudson/model/queue/WorkUnitContext.java#L132

          Most likely playing with several executors and subtasks and interrupting some of them.

          Still working on it; any clue on how to achieve that is welcome danielbeck jglick oleg_nenashev


          Ramon Leon added a comment -

          I've managed to reproduce the problem in a test. I schedule a freestyle project with a long SleepBuilder and add a SubTaskContributor that throws an exception in the run method of the Executable it creates. Then the WorkUnitContext#synchronizeEnd method exits with an InterruptedException and future.set is never called. In addition, the subsequent abort call also exits at if (aborted!=null) return;, so the endLatch is not aborted properly. With this scenario, the wuc.future keeps waiting forever in the JenkinsMetricProviderImpl#asSupplier method.

          From my point of view, synchronizeEnd should guarantee the future is set even when the InterruptedException is thrown, or the abort method shouldn't return without aborting the latches.
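
          A sketch of how such a failing contributor might look (class name and the minimal method bodies are illustrative; the actual regression test lives in the linked pull request). In a JenkinsRule test it would typically be registered with @TestExtension rather than @Extension:

          import hudson.Extension;
          import hudson.model.AbstractProject;
          import hudson.model.Label;
          import hudson.model.Node;
          import hudson.model.Queue;
          import hudson.model.ResourceList;
          import hudson.model.queue.SubTask;
          import hudson.model.queue.SubTaskContributor;
          import java.util.Collection;
          import java.util.Collections;

          // Sketch only: contribute a SubTask whose Executable throws from run(), which is the
          // scenario described above that leaves the WorkUnitContext future unset.
          @Extension
          public class FailingSubTaskContributor extends SubTaskContributor {
              @Override
              public Collection<? extends SubTask> forProject(final AbstractProject<?, ?> p) {
                  return Collections.singleton(new SubTask() {
                      private final SubTask outer = this;
                      public Queue.Executable createExecutable() {
                          return new Queue.Executable() {
                              public SubTask getParent() { return outer; }
                              public void run() {
                                  // Blow up inside the subtask's executable.
                                  throw new ArrayIndexOutOfBoundsException("simulated subtask failure");
                              }
                              public long getEstimatedDuration() { return 0; }
                          };
                      }
                      public Queue.Task getOwnerTask() { return p; }
                      public Object getSameNodeConstraint() { return null; }
                      public long getEstimatedDuration() { return 0; }
                      public Label getAssignedLabel() { return null; }
                      public Node getLastBuiltOn() { return null; }
                      public ResourceList getResourceList() { return new ResourceList(); }
                      public String getDisplayName() { return "Failing subtask of " + p.getDisplayName(); }
                  });
              }
          }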


          Ramon Leon added a comment - - edited

          If the exception is thrown in a Builder, the wrong behavior does not show up:

          FreeStyleProject p = j.createProject(FreeStyleProject.class);

          p.getBuildersList().add(new TestBuilder() {
              @Override
              public boolean perform(AbstractBuild<?, ?> build, Launcher launcher, BuildListener listener) throws InterruptedException {
                  listener.getLogger().println("running failing builder");
                  throw new IndexOutOfBoundsException("Unexpected exception while building");
              }
          });

          It happens because when the build fails this way, the code enters the synchronizeEnd method and then endLatch.synchronize(), and it exits without throwing an exception, which is different from when you reach the same point from a SubTask#run.


          Ramon Leon added a comment -

          PR: https://github.com/jenkinsci/jenkins/pull/4346

          Ramon Leon added a comment -

          I've pushed the failing test; once CI fails I will push the fix.


          Ramon Leon added a comment - - edited

          Besides the code in Metrics:

          https://github.com/jenkinsci/metrics-plugin/blob/24bf92ebd59095d8f77b2552696b3f210a024bdc/src/main/java/jenkins/metrics/impl/JenkinsMetricProviderImpl.java#L1142

          try {
              return future.get();
          } catch (Throwable t) {
              sneakyThrow(t);
              throw new RuntimeException(t);
          }
          


          There is another piece of affected code in Pipeline:

          https://github.com/jenkinsci/workflow-cps-plugin/blob/master/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java#L940

          try {
              return getCurrentExecutions(false).get(1, TimeUnit.SECONDS).stream().anyMatch(StepExecution::blocksRestart);
          } catch (Exception x) {
              // TODO RestartListener.Default.isReadyToRestart can throw checked exceptions, but AsynchronousExecution.blocksRestart does not currently allow it
              Level level = x.getCause() instanceof RejectedExecutionException ? /* stray Executor past program end? */ Level.FINE : Level.WARNING;
              LOGGER.log(level, "Not blocking restart due to problem checking running steps in " + this, x);
              return false;
          }
          

          We see in the log:

          2019-10-14 13:16:56.104+0000 [id=16386139]  WARNING o.j.p.w.cps.CpsFlowExecution#blocksRestart: Not blocking restart due to problem checking running steps in CpsFlowExecution[******]]
          java.util.concurrent.TimeoutException: Timeout waiting for task.
          at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:259)    
          at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:91)    
          at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.blocksRestart(CpsFlowExecution.java:940)    
          at org.jenkinsci.plugins.workflow.job.WorkflowRun$2.blocksRestart(WorkflowRun.java:372)    
          at org.jenkinsci.plugins.workflow.job.WorkflowRun$2.displayCell(WorkflowRun.java:375)    
          at hudson.model.Executor.isDisplayCell(Executor.java:668)
          ...

          Fortunately, this code waits only 1 second, so it doesn't block anything, although it fills the log with these messages.


          Jesse Glick added a comment -

          I do not believe the workflow-cps code mentioned has any relation to this issue. Perhaps I am missing something.


          Ramon Leon added a comment -

          I mentioned it because in the instance where we detected the metrics threads there were a lot of log messages of this type. But maybe it's not related and it's just a performance issue.


          Ramon Leon added a comment -

          I added what has been done to the title for the sake of searches.


          Oliver Gondža added a comment -

          mramonleon, thanks for the prompt fix of the issue. I would appreciate it if you could report back whether this fully mitigated the problem in the affected production deployment.

          Ramon Leon added a comment -

          It will take some time to land; I can give you some feedback by the end of December.


          Raihaan Shouhell added a comment -

          We seem to see this issue with metrics. Will this be backported to the LTS release? olivergondza

          We are currently on 2.176.2

          Oliver Gondža added a comment -

          This was backported to 2.204.1. The RC is expected to be out on Dec 4th and the final release on Dec 18th.

            Assignee: mramonleon (Ramon Leon)
            Reporter: mramonleon (Ramon Leon)
            Votes: 1
            Watchers: 8