Type: Bug
Resolution: Fixed
Priority: Minor
Labels: None
Environment: Jenkins 2.187, Linux, openjdk version "1.8.0_222", workflow-api 2.37
Our large Jenkins instance had a CPU usage spike last weekend.
After some digging, we found that all `org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep` threads tended to get stuck in `java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry` with various stacks, which seems to be the consequence of a ThreadLocal leak:
"org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#12]: checking <redacted> on i-0db24f1a6b300ea1f" #3845 daemon prio=5 os_prio=0 tid=0x00007f2dd8bd2800 nid=0x30bb runnable [0x00007f2d2a926000] java.lang.Thread.State: RUNNABLE at java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry(ThreadLocal.java:617) at java.lang.ThreadLocal$ThreadLocalMap.remove(ThreadLocal.java:499) at java.lang.ThreadLocal$ThreadLocalMap.access$200(ThreadLocal.java:298) at java.lang.ThreadLocal.remove(ThreadLocal.java:222) at hudson.remoting.Channel.setCurrent(Channel.java:1787) at hudson.remoting.UserRequest$NormalResponse.retrieve(UserRequest.java:328) at hudson.remoting.Channel.call(Channel.java:957) at hudson.FilePath.act(FilePath.java:1072) at hudson.FilePath.act(FilePath.java:1061) at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:209) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:532) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:489) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)
I dug into the contents of the ThreadLocalMap of each live thread of the `ScheduledThreadPoolExecutor` and got something similar for each of them:
```
Thread 3841 org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#8]: checking <redacted> on i-04a6c09bbdc596bbd
ThreadLocalMap size: 65536
Elements count per key type: [org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream$FlushControlledOutputStream$1:20704, java.lang.ThreadLocal:47, hudson.remoting.Channel$ThreadLastIoId:34, null:42550, org.codehaus.groovy.runtime.NullObject:2200, java.lang.ThreadLocal$SuppliedThreadLocal:1]
```
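For reference, here is roughly how such a dump can be produced. This is a sketch of my own (the class and method names are hypothetical, not from the original report) that reflectively walks a thread's `ThreadLocalMap` and counts entries per key class. It relies on JDK 8 internals (`Thread.threadLocals`, `ThreadLocal$ThreadLocalMap.table`), which are not public API; on JDK 9+ it would additionally need `--add-opens java.base/java.lang=ALL-UNNAMED`.

```java
import java.lang.ref.Reference;
import java.lang.reflect.Field;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: reflectively counts ThreadLocalMap entries per key
// class for a given thread, producing output like the dump above.
public class ThreadLocalMapInspector {
    public static void dump(Thread t) throws ReflectiveOperationException {
        Field threadLocals = Thread.class.getDeclaredField("threadLocals");
        threadLocals.setAccessible(true);
        Object map = threadLocals.get(t);
        if (map == null) {
            return; // thread has no ThreadLocalMap yet
        }
        Field tableField = map.getClass().getDeclaredField("table");
        tableField.setAccessible(true);
        Object[] table = (Object[]) tableField.get(map);
        System.out.println(t.getName() + " ThreadLocalMap size: " + table.length);
        Map<String, Integer> counts = new TreeMap<>();
        for (Object entry : table) {
            if (entry == null) {
                continue; // empty slot in the open-addressing table
            }
            // Entry extends WeakReference<ThreadLocal<?>>; a null referent is
            // a stale entry whose ThreadLocal key has been garbage collected.
            Object key = ((Reference<?>) entry).get();
            String name = key == null ? "null" : key.getClass().getName();
            counts.merge(name, 1, Integer::sum);
        }
        System.out.println("Elements count per key type: " + counts);
    }
}
```

Note that the reported size is the capacity of the map's backing table, and the 42550 `null` keys are stale entries: exactly the slots that `expungeStaleEntry` has to crawl through.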
It looks like the anonymous `ThreadLocal` inside `DelayBufferedOutputStream$FlushControlledOutputStream` (the dominant key type in the dump above) is never cleared, since it stays attached to reused threads from the thread pool.
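To illustrate the mechanism, here is a minimal sketch under my reading of the dump above, with hypothetical names rather than the actual workflow-api source: each stream instance creates its own anonymous `ThreadLocal`, and "restoring" its value with `set` instead of `remove` leaves an entry behind in every pooled worker thread that ever touched it. Once a stream and its `ThreadLocal` are garbage collected, those entries become the stale `null`-keyed slots seen above.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical demo of the leak pattern: a per-instance ThreadLocal whose
// value is "restored" with set() on a pooled thread, so the map entry is
// never removed and survives as long as the worker thread does.
public class ThreadLocalLeakDemo {
    static void flushWithControl() {
        // One fresh ThreadLocal per stream instance, analogous to the
        // anonymous FlushControlledOutputStream$1 seen in the dump.
        ThreadLocal<Boolean> enableFlush = new ThreadLocal<Boolean>() {
            @Override protected Boolean initialValue() { return true; }
        };
        enableFlush.set(false);
        try {
            // ... flush the buffer with auto-flush disabled ...
        } finally {
            enableFlush.set(true); // leak: the entry stays in the thread's map
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(4);
        for (int i = 0; i < 100_000; i++) {
            pool.submit(ThreadLocalLeakDemo::flushWithControl);
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        // Each of the 4 worker threads now holds tens of thousands of
        // entries, most of them stale once the ThreadLocal keys are collected.
    }
}
```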
I think we should call `ThreadLocal.remove` here instead of trying to reset the original value. Calls to `DelayBufferedOutputStream.flushBuffer` will always come from a thread in the `jenkins.util.Timer` thread pool. I don't think we care what the original value was: the default is true, we set it to false just for the call to flush and then set it back to true, and since it is a `ThreadLocal` we don't have to worry about interference from other threads, so it will always be true going into `flushBuffer` IIUC. Even if we did care, it would only matter if we happened to get a pool thread that we had already used previously, which is not something we can count on.
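As a sketch of that change (hypothetical names again, mirroring the pattern above rather than the actual workflow-api diff), calling `remove()` in the `finally` block both restores the default (since `initialValue()` returns true) and frees the entry in the pooled thread's `ThreadLocalMap`:

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch of the fix: remove() instead of set(true) in finally.
class FlushControlledStreamSketch extends FilterOutputStream {
    private final ThreadLocal<Boolean> enableFlush = new ThreadLocal<Boolean>() {
        @Override protected Boolean initialValue() { return true; }
    };

    FlushControlledStreamSketch(OutputStream out) { super(out); }

    @Override public void flush() throws IOException {
        if (enableFlush.get()) { // honor the per-thread suppression flag
            super.flush();
        }
    }

    void flushBuffer(byte[] data) throws IOException {
        enableFlush.set(false); // suppress flushes of the underlying stream
        try {
            out.write(data);
        } finally {
            // Was enableFlush.set(true): remove() restores the default *and*
            // deletes the entry from this pooled thread's ThreadLocalMap.
            enableFlush.remove();
        }
    }
}
```

With `remove()`, a later `get()` on the same pooled thread simply re-runs `initialValue()` and sees true again, so behavior is unchanged while the leak disappears.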