Jenkins / JENKINS-58899

ThreadLocal leak causing high CPU usage with DurableTaskStep threads

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Component: workflow-api-plugin
    • Labels: None
    • Environment: Jenkins 2.187, Linux, openjdk version "1.8.0_222"
    • Fixed in: workflow-api 2.37

      Our large Jenkins instance had a CPU usage spike last weekend.

      After some digging, we found that all `org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep` threads tended to get stuck in `java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry` with various stacks, which seems to be the consequence of a ThreadLocal leak:


      "org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#12]: checking <redacted> on i-0db24f1a6b300ea1f" #3845 daemon prio=5 os_prio=0 tid=0x00007f2dd8bd2800 nid=0x30bb runnable [0x00007f2d2a926000]
         java.lang.Thread.State: RUNNABLE
              at java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry(ThreadLocal.java:617)
              at java.lang.ThreadLocal$ThreadLocalMap.remove(ThreadLocal.java:499)
              at java.lang.ThreadLocal$ThreadLocalMap.access$200(ThreadLocal.java:298)
              at java.lang.ThreadLocal.remove(ThreadLocal.java:222)
              at hudson.remoting.Channel.setCurrent(Channel.java:1787)
              at hudson.remoting.UserRequest$NormalResponse.retrieve(UserRequest.java:328)
              at hudson.remoting.Channel.call(Channel.java:957)
              at hudson.FilePath.act(FilePath.java:1072)
              at hudson.FilePath.act(FilePath.java:1061)
              at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:209)
              at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:532)
              at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:489)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)


      We dug into the contents of the ThreadLocalMap of each live thread of the `ScheduledThreadPoolExecutor` and got something similar for every one of them:


      Thread 3841 org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#8]: checking <redacted> on i-04a6c09bbdc596bbd
      
      ThreadLocalMap size: 65536
      
      Elements count per key type:

          org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream$FlushControlledOutputStream$1: 20704
          java.lang.ThreadLocal: 47
          hudson.remoting.Channel$ThreadLastIoId: 34
          null: 42550
          org.codehaus.groovy.runtime.NullObject: 2200
          java.lang.ThreadLocal$SuppliedThreadLocal: 1


      It looks like the following ThreadLocal is never cleared since it gets attached to reused threads from a thread pool: 

      https://github.com/jenkinsci/workflow-api-plugin/blob/907cf64feb2f38e93aebbedf87946b0235c3dc93/src/main/java/org/jenkinsci/plugins/workflow/log/DelayBufferedOutputStream.java#L123
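The leak mechanism can be reproduced in isolation. The sketch below (illustrative names, not the plugin's actual code) shows that a ThreadLocal entry set during one task on a pooled thread is still attached to that thread when the next task runs, unless `remove()` is called:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadLocalLeakDemo {
    // Stands in for the ThreadLocal in DelayBufferedOutputStream: an entry
    // set on a pooled thread outlives the task that set it.
    private static final ThreadLocal<Boolean> FLAG = new ThreadLocal<>();

    public static void main(String[] args) throws Exception {
        // Single-thread pool, so both tasks run on the same reused thread.
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // Task 1 sets the ThreadLocal and exits without calling remove().
        pool.submit(() -> FLAG.set(Boolean.TRUE)).get();

        // Task 2, on the same pooled thread, still sees the stale entry:
        // it never set the value itself, yet the entry is still sitting in
        // the thread's ThreadLocalMap.
        Boolean stale = pool.submit((Callable<Boolean>) FLAG::get).get();
        System.out.println("stale entry visible to next task: " + stale); // true

        // The fix: explicitly detach the entry on the thread that owns it.
        pool.submit(FLAG::remove).get();
        pool.shutdown();
    }
}
```

With 20k+ such entries per thread (as in the dump above), every map operation pays the cost of `expungeStaleEntry` scans, which matches the RUNNABLE stacks seen in the thread dump.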

          [JENKINS-58899] ThreadLocal leak causing high CPU usage with DurableTaskStep threads

          Devin Nusbaum added a comment -

          I think we should call ThreadLocal.remove here instead of trying to reset the original value. Calls to DelayBufferedOutputStream.flushBuffer always come from a thread in the jenkins.util.Timer thread pool, so even if we did care about the original value, it would only matter if we happened to reuse a pool thread we had already used previously, which is not something we can count on. (I don't think we care anyway: the default is true, we set it to false just for the call to flush and then set it back to true, and since it is a ThreadLocal we don't have to worry about interference from other threads, so it will always be true going into flushBuffer IIUC.)
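A minimal sketch of the remove()-based pattern described here; the class and method names are hypothetical, not the actual PR code:

```java
// Hypothetical sketch, not the actual workflow-api-plugin fix.
public class FlushGuard {
    // Defaults to true, which is exactly why remove() is as good as
    // restoring the old value: the next get() re-runs the initializer.
    private static final ThreadLocal<Boolean> ENABLE_FLUSHING =
            ThreadLocal.withInitial(() -> Boolean.TRUE);

    static boolean isFlushingEnabled() {
        return ENABLE_FLUSHING.get();
    }

    // Suppress flushing for the duration of one call, then detach the
    // entry instead of setting it back to true, so a reused pool thread
    // never accumulates a stale ThreadLocalMap entry.
    static void flushBuffer(Runnable flush) {
        ENABLE_FLUSHING.set(Boolean.FALSE);
        try {
            flush.run();
        } finally {
            ENABLE_FLUSHING.remove(); // the fix: remove(), not set(true)
        }
    }
}
```

The try/finally guarantees the entry is detached even if the flush throws, which is what keeps the per-thread map from growing without bound.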


          Gabriel Lavoie added a comment -

          As additional data to the comment above, I forgot to mention that all leaked values were boolean "true".

          Devin Nusbaum added a comment -

          glavoie If you want to review https://github.com/jenkinsci/workflow-api-plugin/pull/100, and/or test it out on your instance to confirm whether it fixes the issue (especially to see whether we should modify FlushControlledOutputStream.flush in the same way, based on real data), that would be appreciated, thanks!


          Gabriel Lavoie added a comment -

          Will get back to you by the end of the week.

          Devin Nusbaum added a comment -

          A fix for this issue was just released in version 2.37 of Pipeline API Plugin. glavoie thank you for the detailed bug report and for helping test the fix!


            Assignee: Devin Nusbaum (dnusbaum)
            Reporter: Gabriel Lavoie (glavoie)
            Votes: 0
            Watchers: 2
