- Type: Bug
- Resolution: Fixed
- Priority: Minor
- workflow-durable-task-step 2.29+
- workflow-api 2.35, Jenkins 2.181
I have seen the following exception break Queue maintenance and make it impossible for new jobs to be scheduled:
{noformat}
2019-04-29 14:24:14.748+0000 [id=118] SEVERE hudson.triggers.SafeTimerTask#run: Timer task hudson.model.Queue$MaintainTask@7edb0f64 failed
java.lang.IndexOutOfBoundsException: Index: 0
    at java.util.Collections$EmptyList.get(Collections.java:4454)
    at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.bruteForceScanForEnclosingBlock(StandardGraphLookupView.java:150)
    at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findEnclosingBlockStart(StandardGraphLookupView.java:197)
    at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findAllEnclosingBlockStarts(StandardGraphLookupView.java:217)
    at org.jenkinsci.plugins.workflow.flow.FlowExecution.findAllEnclosingBlockStarts(FlowExecution.java:299)
    at org.jenkinsci.plugins.workflow.graph.FlowNode.getEnclosingBlocks(FlowNode.java:195)
    at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.concatenateAllEnclosingLabels(ExecutorStepExecution.java:527)
    at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getAffinityKey(ExecutorStepExecution.java:545)
    at hudson.model.LoadBalancer$1.assignGreedily(LoadBalancer.java:115)
    at hudson.model.LoadBalancer$1.map(LoadBalancer.java:105)
    at hudson.model.LoadBalancer$2.map(LoadBalancer.java:157)
    at hudson.model.Queue.maintain(Queue.java:1629)
    at hudson.model.Queue$MaintainTask.doRun(Queue.java:2887)
    at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
    at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
{noformat}
As best as I can tell, the root cause was that a channel reading a flow node's parent node XML file was interrupted, which threw an exception and left the flow graph corrupted. (I am not sure how or why the read was interrupted, but since the same thing would happen if the file was corrupted on disk it seems like a reasonable failure mode to handle.) Here is the stack trace for that issue:
{noformat}
2019-04-29 14:24:14.745+0000 [id=118] WARNING o.j.p.workflow.graph.FlowNode#loadParents: failed to load parents of 96
java.nio.channels.ClosedByInterruptException
    at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
    at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:164)
    at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65)
    at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109)
    at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at java.io.PushbackInputStream.read(PushbackInputStream.java:139)
    at com.thoughtworks.xstream.core.util.XmlHeaderAwareReader.getHeader(XmlHeaderAwareReader.java:79)
    at com.thoughtworks.xstream.core.util.XmlHeaderAwareReader.<init>(XmlHeaderAwareReader.java:61)
    at com.thoughtworks.xstream.io.xml.AbstractXppDriver.createReader(AbstractXppDriver.java:65)
Caused: com.thoughtworks.xstream.io.StreamException: : null
    at com.thoughtworks.xstream.io.xml.AbstractXppDriver.createReader(AbstractXppDriver.java:69)
    at com.thoughtworks.xstream.XStream.fromXML(XStream.java:1053)
    at hudson.XmlFile.read(XmlFile.java:147)
Caused: java.io.IOException: Unable to read /var/jenkins_home/jobs/foo/jobs/bar/builds/163/workflow/95.xml
    at hudson.XmlFile.read(XmlFile.java:149)
    at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.load(SimpleXStreamFlowNodeStorage.java:204)
    at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.access$000(SimpleXStreamFlowNodeStorage.java:71)
    at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage$1.load(SimpleXStreamFlowNodeStorage.java:76)
    at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage$1.load(SimpleXStreamFlowNodeStorage.java:74)
    at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568)
    at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350)
    at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313)
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228)
    at com.google.common.cache.LocalCache.get(LocalCache.java:3965)
    at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3969)
    at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4829)
    at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.getNode(SimpleXStreamFlowNodeStorage.java:101)
Caused: java.io.IOException
    at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.getNode(SimpleXStreamFlowNodeStorage.java:109)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$TimingFlowNodeStorage.getNode(CpsFlowExecution.java:1791)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.getNode(CpsFlowExecution.java:1179)
    at org.jenkinsci.plugins.workflow.graph.FlowNode.loadParents(FlowNode.java:165)
    at org.jenkinsci.plugins.workflow.graph.FlowNode.getParents(FlowNode.java:156)
    at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.bruteForceScanForEnclosingBlock(StandardGraphLookupView.java:150)
    at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findEnclosingBlockStart(StandardGraphLookupView.java:197)
    at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findAllEnclosingBlockStarts(StandardGraphLookupView.java:217)
    at org.jenkinsci.plugins.workflow.flow.FlowExecution.findAllEnclosingBlockStarts(FlowExecution.java:299)
    at org.jenkinsci.plugins.workflow.graph.FlowNode.getEnclosingBlocks(FlowNode.java:195)
    at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.concatenateAllEnclosingLabels(ExecutorStepExecution.java:527)
    at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getAffinityKey(ExecutorStepExecution.java:545)
    at hudson.model.LoadBalancer$1.assignGreedily(LoadBalancer.java:115)
    at hudson.model.LoadBalancer$1.map(LoadBalancer.java:105)
    at hudson.model.LoadBalancer$2.map(LoadBalancer.java:157)
    at hudson.model.Queue.maintain(Queue.java:1629)
    at hudson.model.Queue$MaintainTask.doRun(Queue.java:2887)
    at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
    at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
{noformat}
We should make changes over on the Pipeline side so that a deserialization failure after the program is already loaded is handled more gracefully, but I think we should also harden core against {{getAffinityKey}} implementations that throw {{RuntimeException}} so that they do not break Queue maintenance entirely.
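For illustration only, here is a minimal sketch of the kind of core hardening I have in mind (the helper name and placement are hypothetical, not the actual change): wrap the affinity-key lookup so that a {{RuntimeException}} from a misbehaving task degrades to the task's default key instead of aborting {{Queue.maintain}}.
{code:java}
// Hypothetical sketch only: the helper class and its placement are illustrative,
// not the actual change made in Jenkins core.
import hudson.model.Queue;

import java.util.logging.Level;
import java.util.logging.Logger;

class SafeAffinityKey {
    private static final Logger LOGGER = Logger.getLogger(SafeAffinityKey.class.getName());

    /**
     * Returns the task's affinity key, falling back to its full display name
     * (the Queue.Task default) if the task's own implementation throws.
     */
    static String safeAffinityKey(Queue.Task task) {
        try {
            return task.getAffinityKey();
        } catch (RuntimeException x) {
            LOGGER.log(Level.WARNING, "Failed to compute affinity key for " + task, x);
            return task.getFullDisplayName();
        }
    }
}
{code}
The idea is that LoadBalancer (or whatever consumes the key) would go through something like this instead of calling {{getAffinityKey}} directly, so one broken task only loses its custom affinity rather than stalling the whole queue.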
On the Pipeline side, I'm not sure how best to handle this issue, since none of the call sites in the stack trace look like a great place to actually handle the error. Here are some ideas:
- Harden {{StandardGraphLookupView.bruteForceScanForEnclosingBlock}} so that it returns null if there is an error deserializing a node or we come upon a node with no parents (a rough sketch of what this could look like follows this list). This would work fine, but it doesn't really seem right for the error to be handled here, since other code would have to deal with the same corrupt {{FlowNode}} elsewhere.
- Add a try/catch inside of {{CpsFlowExecution.getNode}}, and if an exception is caught, treat it as a fatal error for the execution, handling it like an error loading flow heads in {{CpsFlowExecution.onLoad}}. In this case the program is already loaded and potentially executing, so I'm not sure there would be a clean way to shut it down from that method, and the method might be used in contexts where that is not the desired behavior.
- Eagerly load all flow nodes in {{CpsFlowExecution.onLoad}}. That would probably have bad performance implications (although maybe not as bad as I am imagining, since that happens with many graph traversals anyway), and since loaded nodes are cached with soft references, they could need to be loaded again from disk anyway, in which case the same problem could occur.
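As a concrete illustration of the first idea above, here is a rough sketch (a standalone helper under my own assumptions, not the plugin's actual implementation) of an enclosing-block scan that bails out with null instead of assuming every node has a parent. {{FlowNode.getParents()}} returns an empty list when the parent XML cannot be deserialized, which is exactly what produced the {{IndexOutOfBoundsException}} above.
{code:java}
// Rough sketch of the first idea only; not the actual plugin code.
import java.util.List;

import org.jenkinsci.plugins.workflow.graph.BlockEndNode;
import org.jenkinsci.plugins.workflow.graph.BlockStartNode;
import org.jenkinsci.plugins.workflow.graph.FlowNode;

class EnclosingBlockScan {
    /**
     * Nearest enclosing block start, or null if we reach the start of the flow
     * or hit a node whose parents could not be loaded (corrupt graph).
     */
    static BlockStartNode scanForEnclosingBlock(FlowNode node) {
        FlowNode current = node;
        while (current != null) {
            List<FlowNode> parents = current.getParents();
            if (parents.isEmpty()) {
                // Corrupt node or FlowStartNode: give up instead of calling get(0) on an empty list.
                return null;
            }
            FlowNode parent = parents.get(0);
            if (parent instanceof BlockStartNode) {
                return (BlockStartNode) parent;
            }
            if (parent instanceof BlockEndNode) {
                // A completed nested block: jump over it and keep scanning from its start node.
                current = ((BlockEndNode<?>) parent).getStartNode();
            } else {
                current = parent;
            }
        }
        return null;
    }
}
{code}
Returning null here would presumably let callers that already handle a missing enclosing block (for example at the start of the flow) degrade gracefully, but the underlying corrupt node would still be visible to any other code that traverses the graph.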
Issue links:
- is caused by: JENKINS-36547 Queue.Task.getFullDisplayName is a poor choice of key for LoadBalancer.CONSISTENT_HASH (Resolved)
- relates to: JENKINS-57913 Make errors thrown by CpsFlowExecution.getNode halt Pipeline execution (Open)
- links to: jenkinsci/jenkins#4055 (Web Link)
[JENKINS-57805] Exception thrown from ExecutorStepExecution.PlaceHolderTask.getAffinityKey breaks Queue maintenance
Remote Link | New: This issue links to "jenkinsci/jenkins#4055 (Web Link)" [ 23039 ] |
Status | Original: Open [ 1 ] | New: In Progress [ 3 ] |
Description |
Original:
I have seen the following exception break Queue maintenance and make it impossible for new jobs to be scheduled: {noformat} 2019-04-29 14:24:14.748+0000 [id=118] SEVERE hudson.triggers.SafeTimerTask#run: Timer task hudson.model.Queue$MaintainTask@7edb0f64 failed java.lang.IndexOutOfBoundsException: Index: 0 at java.util.Collections$EmptyList.get(Collections.java:4454) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.bruteForceScanForEnclosingBlock(StandardGraphLookupView.java:150) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findEnclosingBlockStart(StandardGraphLookupView.java:197) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findAllEnclosingBlockStarts(StandardGraphLookupView.java:217) at org.jenkinsci.plugins.workflow.flow.FlowExecution.findAllEnclosingBlockStarts(FlowExecution.java:299) at org.jenkinsci.plugins.workflow.graph.FlowNode.getEnclosingBlocks(FlowNode.java:195) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.concatenateAllEnclosingLabels(ExecutorStepExecution.java:527) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getAffinityKey(ExecutorStepExecution.java:545) at hudson.model.LoadBalancer$1.assignGreedily(LoadBalancer.java:115) at hudson.model.LoadBalancer$1.map(LoadBalancer.java:105) at hudson.model.LoadBalancer$2.map(LoadBalancer.java:157) at hudson.model.Queue.maintain(Queue.java:1629) at hudson.model.Queue$MaintainTask.doRun(Queue.java:2887) at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72) at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat} As best as I can tell, the root cause was that a channel reading a flow node's parent node XML file was interrupted, which threw an exception and left the flow graph corrupted. (I am not sure how or why the read was interrupted, but since the same thing would happen if the file was corrupted on disk it seems like a reasonable failure mode to handle.) 
Here is the stack trace for that issue: {noformat} 2019-04-29 14:24:14.745+0000 [id=118] WARNING o.j.p.workflow.graph.FlowNode#loadParents: failed to load parents of 96 java.nio.channels.ClosedByInterruptException at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202) at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:164) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read(BufferedInputStream.java:265) at java.io.FilterInputStream.read(FilterInputStream.java:83) at java.io.PushbackInputStream.read(PushbackInputStream.java:139) at com.thoughtworks.xstream.core.util.XmlHeaderAwareReader.getHeader(XmlHeaderAwareReader.java:79) at com.thoughtworks.xstream.core.util.XmlHeaderAwareReader.<init>(XmlHeaderAwareReader.java:61) at com.thoughtworks.xstream.io.xml.AbstractXppDriver.createReader(AbstractXppDriver.java:65) Caused: com.thoughtworks.xstream.io.StreamException: : null at com.thoughtworks.xstream.io.xml.AbstractXppDriver.createReader(AbstractXppDriver.java:69) at com.thoughtworks.xstream.XStream.fromXML(XStream.java:1053) at hudson.XmlFile.read(XmlFile.java:147) Caused: java.io.IOException: Unable to read /var/jenkins_home/jobs/foo/jobs/bar/builds/163/workflow/95.xml at hudson.XmlFile.read(XmlFile.java:149) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.load(SimpleXStreamFlowNodeStorage.java:204) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.access$000(SimpleXStreamFlowNodeStorage.java:71) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage$1.load(SimpleXStreamFlowNodeStorage.java:76) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage$1.load(SimpleXStreamFlowNodeStorage.java:74) at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568) at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350) at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313) at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228) at com.google.common.cache.LocalCache.get(LocalCache.java:3965) at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3969) at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4829) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.getNode(SimpleXStreamFlowNodeStorage.java:101) Caused: java.io.IOException at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.getNode(SimpleXStreamFlowNodeStorage.java:109) at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$TimingFlowNodeStorage.getNode(CpsFlowExecution.java:1791) at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.getNode(CpsFlowExecution.java:1179) at org.jenkinsci.plugins.workflow.graph.FlowNode.loadParents(FlowNode.java:165) at org.jenkinsci.plugins.workflow.graph.FlowNode.getParents(FlowNode.java:156) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.bruteForceScanForEnclosingBlock(StandardGraphLookupView.java:150) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findEnclosingBlockStart(StandardGraphLookupView.java:197) at 
org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findAllEnclosingBlockStarts(StandardGraphLookupView.java:217) at org.jenkinsci.plugins.workflow.flow.FlowExecution.findAllEnclosingBlockStarts(FlowExecution.java:299) at org.jenkinsci.plugins.workflow.graph.FlowNode.getEnclosingBlocks(FlowNode.java:195) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.concatenateAllEnclosingLabels(ExecutorStepExecution.java:527) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getAffinityKey(ExecutorStepExecution.java:545) at hudson.model.LoadBalancer$1.assignGreedily(LoadBalancer.java:115) at hudson.model.LoadBalancer$1.map(LoadBalancer.java:105) at hudson.model.LoadBalancer$2.map(LoadBalancer.java:157) at hudson.model.Queue.maintain(Queue.java:1629) at hudson.model.Queue$MaintainTask.doRun(Queue.java:2887) at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72) at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat} We should make changes over on the Pipeline side so that a deserialization failure after the program is already loaded is handled more gracefully, but I think we should also harden core against {{getAffinityKey}} that throw {{RuntimeException}} so that they do not break Queue maintenance entirely. On the Pipeline side, I'm not sure how best to handle this issue. The most obvious thing to me would be to add a try/catch inside of {{CpsFlowExecution.getNode}}, and if an exception is caught, we treat that as a fatal error for the execution and handle it like we would an error loading flow heads in {{CpsFlowExecution.onLoad}}, but in this case the program is already loaded and potentially executing so I'm not sure if there would be a clean way to shut it down from that method. Another thing to try could be to eagerly load all flow nodes in {{CpsFlowExecution.onLoad}}, but that would probably have bad performance implications (although maybe not as bad as I am imagining since that will happen with any flow graph traversal is used anyway), and since loaded nodes are cached with soft references, they could need to be loaded again from disk anyway in which case the same problem could occur. |
New:
I have seen the following exception break Queue maintenance and make it impossible for new jobs to be scheduled: {noformat} 2019-04-29 14:24:14.748+0000 [id=118] SEVERE hudson.triggers.SafeTimerTask#run: Timer task hudson.model.Queue$MaintainTask@7edb0f64 failed java.lang.IndexOutOfBoundsException: Index: 0 at java.util.Collections$EmptyList.get(Collections.java:4454) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.bruteForceScanForEnclosingBlock(StandardGraphLookupView.java:150) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findEnclosingBlockStart(StandardGraphLookupView.java:197) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findAllEnclosingBlockStarts(StandardGraphLookupView.java:217) at org.jenkinsci.plugins.workflow.flow.FlowExecution.findAllEnclosingBlockStarts(FlowExecution.java:299) at org.jenkinsci.plugins.workflow.graph.FlowNode.getEnclosingBlocks(FlowNode.java:195) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.concatenateAllEnclosingLabels(ExecutorStepExecution.java:527) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getAffinityKey(ExecutorStepExecution.java:545) at hudson.model.LoadBalancer$1.assignGreedily(LoadBalancer.java:115) at hudson.model.LoadBalancer$1.map(LoadBalancer.java:105) at hudson.model.LoadBalancer$2.map(LoadBalancer.java:157) at hudson.model.Queue.maintain(Queue.java:1629) at hudson.model.Queue$MaintainTask.doRun(Queue.java:2887) at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72) at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat} As best as I can tell, the root cause was that a channel reading a flow node's parent node XML file was interrupted, which threw an exception and left the flow graph corrupted. (I am not sure how or why the read was interrupted, but since the same thing would happen if the file was corrupted on disk it seems like a reasonable failure mode to handle.) 
Here is the stack trace for that issue: {noformat} 2019-04-29 14:24:14.745+0000 [id=118] WARNING o.j.p.workflow.graph.FlowNode#loadParents: failed to load parents of 96 java.nio.channels.ClosedByInterruptException at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202) at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:164) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read(BufferedInputStream.java:265) at java.io.FilterInputStream.read(FilterInputStream.java:83) at java.io.PushbackInputStream.read(PushbackInputStream.java:139) at com.thoughtworks.xstream.core.util.XmlHeaderAwareReader.getHeader(XmlHeaderAwareReader.java:79) at com.thoughtworks.xstream.core.util.XmlHeaderAwareReader.<init>(XmlHeaderAwareReader.java:61) at com.thoughtworks.xstream.io.xml.AbstractXppDriver.createReader(AbstractXppDriver.java:65) Caused: com.thoughtworks.xstream.io.StreamException: : null at com.thoughtworks.xstream.io.xml.AbstractXppDriver.createReader(AbstractXppDriver.java:69) at com.thoughtworks.xstream.XStream.fromXML(XStream.java:1053) at hudson.XmlFile.read(XmlFile.java:147) Caused: java.io.IOException: Unable to read /var/jenkins_home/jobs/foo/jobs/bar/builds/163/workflow/95.xml at hudson.XmlFile.read(XmlFile.java:149) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.load(SimpleXStreamFlowNodeStorage.java:204) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.access$000(SimpleXStreamFlowNodeStorage.java:71) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage$1.load(SimpleXStreamFlowNodeStorage.java:76) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage$1.load(SimpleXStreamFlowNodeStorage.java:74) at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568) at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350) at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313) at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228) at com.google.common.cache.LocalCache.get(LocalCache.java:3965) at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3969) at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4829) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.getNode(SimpleXStreamFlowNodeStorage.java:101) Caused: java.io.IOException at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.getNode(SimpleXStreamFlowNodeStorage.java:109) at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$TimingFlowNodeStorage.getNode(CpsFlowExecution.java:1791) at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.getNode(CpsFlowExecution.java:1179) at org.jenkinsci.plugins.workflow.graph.FlowNode.loadParents(FlowNode.java:165) at org.jenkinsci.plugins.workflow.graph.FlowNode.getParents(FlowNode.java:156) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.bruteForceScanForEnclosingBlock(StandardGraphLookupView.java:150) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findEnclosingBlockStart(StandardGraphLookupView.java:197) at 
org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findAllEnclosingBlockStarts(StandardGraphLookupView.java:217) at org.jenkinsci.plugins.workflow.flow.FlowExecution.findAllEnclosingBlockStarts(FlowExecution.java:299) at org.jenkinsci.plugins.workflow.graph.FlowNode.getEnclosingBlocks(FlowNode.java:195) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.concatenateAllEnclosingLabels(ExecutorStepExecution.java:527) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getAffinityKey(ExecutorStepExecution.java:545) at hudson.model.LoadBalancer$1.assignGreedily(LoadBalancer.java:115) at hudson.model.LoadBalancer$1.map(LoadBalancer.java:105) at hudson.model.LoadBalancer$2.map(LoadBalancer.java:157) at hudson.model.Queue.maintain(Queue.java:1629) at hudson.model.Queue$MaintainTask.doRun(Queue.java:2887) at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72) at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat} We should make changes over on the Pipeline side so that a deserialization failure after the program is already loaded is handled more gracefully, but I think we should also harden core against {{getAffinityKey}} that throw {{RuntimeException}} so that they do not break Queue maintenance entirely. On the Pipeline side, I'm not sure how best to handle this issue, since none of the call sites in the stack trace look like a great place to actually handle the error. Here are some ideas: * Make the parents field in a FlowNode a critical field in XStream, so if it can't be deserialized the whole node fails to be deserialized rather than returning a corrupt node. I think this is a good idea (maybe for other fields in FlowNode as well), but as far as the symptoms here think it would just change where the error is thrown in {{StandardGraphLookupView.bruteForceScanForEnclosingBlock}}. * Add a try/catch inside of {{CpsFlowExecution.getNode}} and if an exception is caught, treat that as a fatal error for the execution and handle it like we would an error loading flow heads in {{CpsFlowExecution.onLoad}}. In this case the program is already loaded and potentially executing, so I'm not sure if there would be a clean way to shut it down from that method, and the method might be used in contexts where that is not the desired behavior. * Eagerly load all flow nodes in {{CpsFlowExecution.onLoad}}. That would probably have bad performance implications (although maybe not as bad as I am imagining since that will happen with any flow graph traversal is used anyway), and since loaded nodes are cached with soft references, they could need to be loaded again from disk anyway in which case the same problem could occur. * Harden {{StandardGraphLookupView.bruteForceScanForEnclosingBlock}} to make it return null if there is an error deserializing a node. 
Would work fine, but it doesn't really seem right for the error to be handled here. In retrospect the method should probably throw IOException instead of rethrowing internal exceptions as a RuntimeException [here|https://github.com/jenkinsci/workflow-api-plugin/blob/c84dbab8cd35c90f70d84390dc711901fa73b7ad/src/main/java/org/jenkinsci/plugins/workflow/graph/StandardGraphLookupView.java#L143] (even though that is not actually where the error is thrown in this case). |
Description |
Original:
I have seen the following exception break Queue maintenance and make it impossible for new jobs to be scheduled: {noformat} 2019-04-29 14:24:14.748+0000 [id=118] SEVERE hudson.triggers.SafeTimerTask#run: Timer task hudson.model.Queue$MaintainTask@7edb0f64 failed java.lang.IndexOutOfBoundsException: Index: 0 at java.util.Collections$EmptyList.get(Collections.java:4454) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.bruteForceScanForEnclosingBlock(StandardGraphLookupView.java:150) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findEnclosingBlockStart(StandardGraphLookupView.java:197) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findAllEnclosingBlockStarts(StandardGraphLookupView.java:217) at org.jenkinsci.plugins.workflow.flow.FlowExecution.findAllEnclosingBlockStarts(FlowExecution.java:299) at org.jenkinsci.plugins.workflow.graph.FlowNode.getEnclosingBlocks(FlowNode.java:195) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.concatenateAllEnclosingLabels(ExecutorStepExecution.java:527) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getAffinityKey(ExecutorStepExecution.java:545) at hudson.model.LoadBalancer$1.assignGreedily(LoadBalancer.java:115) at hudson.model.LoadBalancer$1.map(LoadBalancer.java:105) at hudson.model.LoadBalancer$2.map(LoadBalancer.java:157) at hudson.model.Queue.maintain(Queue.java:1629) at hudson.model.Queue$MaintainTask.doRun(Queue.java:2887) at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72) at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat} As best as I can tell, the root cause was that a channel reading a flow node's parent node XML file was interrupted, which threw an exception and left the flow graph corrupted. (I am not sure how or why the read was interrupted, but since the same thing would happen if the file was corrupted on disk it seems like a reasonable failure mode to handle.) 
Here is the stack trace for that issue: {noformat} 2019-04-29 14:24:14.745+0000 [id=118] WARNING o.j.p.workflow.graph.FlowNode#loadParents: failed to load parents of 96 java.nio.channels.ClosedByInterruptException at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202) at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:164) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read(BufferedInputStream.java:265) at java.io.FilterInputStream.read(FilterInputStream.java:83) at java.io.PushbackInputStream.read(PushbackInputStream.java:139) at com.thoughtworks.xstream.core.util.XmlHeaderAwareReader.getHeader(XmlHeaderAwareReader.java:79) at com.thoughtworks.xstream.core.util.XmlHeaderAwareReader.<init>(XmlHeaderAwareReader.java:61) at com.thoughtworks.xstream.io.xml.AbstractXppDriver.createReader(AbstractXppDriver.java:65) Caused: com.thoughtworks.xstream.io.StreamException: : null at com.thoughtworks.xstream.io.xml.AbstractXppDriver.createReader(AbstractXppDriver.java:69) at com.thoughtworks.xstream.XStream.fromXML(XStream.java:1053) at hudson.XmlFile.read(XmlFile.java:147) Caused: java.io.IOException: Unable to read /var/jenkins_home/jobs/foo/jobs/bar/builds/163/workflow/95.xml at hudson.XmlFile.read(XmlFile.java:149) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.load(SimpleXStreamFlowNodeStorage.java:204) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.access$000(SimpleXStreamFlowNodeStorage.java:71) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage$1.load(SimpleXStreamFlowNodeStorage.java:76) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage$1.load(SimpleXStreamFlowNodeStorage.java:74) at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568) at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350) at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313) at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228) at com.google.common.cache.LocalCache.get(LocalCache.java:3965) at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3969) at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4829) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.getNode(SimpleXStreamFlowNodeStorage.java:101) Caused: java.io.IOException at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.getNode(SimpleXStreamFlowNodeStorage.java:109) at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$TimingFlowNodeStorage.getNode(CpsFlowExecution.java:1791) at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.getNode(CpsFlowExecution.java:1179) at org.jenkinsci.plugins.workflow.graph.FlowNode.loadParents(FlowNode.java:165) at org.jenkinsci.plugins.workflow.graph.FlowNode.getParents(FlowNode.java:156) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.bruteForceScanForEnclosingBlock(StandardGraphLookupView.java:150) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findEnclosingBlockStart(StandardGraphLookupView.java:197) at 
org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findAllEnclosingBlockStarts(StandardGraphLookupView.java:217) at org.jenkinsci.plugins.workflow.flow.FlowExecution.findAllEnclosingBlockStarts(FlowExecution.java:299) at org.jenkinsci.plugins.workflow.graph.FlowNode.getEnclosingBlocks(FlowNode.java:195) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.concatenateAllEnclosingLabels(ExecutorStepExecution.java:527) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getAffinityKey(ExecutorStepExecution.java:545) at hudson.model.LoadBalancer$1.assignGreedily(LoadBalancer.java:115) at hudson.model.LoadBalancer$1.map(LoadBalancer.java:105) at hudson.model.LoadBalancer$2.map(LoadBalancer.java:157) at hudson.model.Queue.maintain(Queue.java:1629) at hudson.model.Queue$MaintainTask.doRun(Queue.java:2887) at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72) at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat} We should make changes over on the Pipeline side so that a deserialization failure after the program is already loaded is handled more gracefully, but I think we should also harden core against {{getAffinityKey}} that throw {{RuntimeException}} so that they do not break Queue maintenance entirely. On the Pipeline side, I'm not sure how best to handle this issue, since none of the call sites in the stack trace look like a great place to actually handle the error. Here are some ideas: * Make the parents field in a FlowNode a critical field in XStream, so if it can't be deserialized the whole node fails to be deserialized rather than returning a corrupt node. I think this is a good idea (maybe for other fields in FlowNode as well), but as far as the symptoms here think it would just change where the error is thrown in {{StandardGraphLookupView.bruteForceScanForEnclosingBlock}}. * Add a try/catch inside of {{CpsFlowExecution.getNode}} and if an exception is caught, treat that as a fatal error for the execution and handle it like we would an error loading flow heads in {{CpsFlowExecution.onLoad}}. In this case the program is already loaded and potentially executing, so I'm not sure if there would be a clean way to shut it down from that method, and the method might be used in contexts where that is not the desired behavior. * Eagerly load all flow nodes in {{CpsFlowExecution.onLoad}}. That would probably have bad performance implications (although maybe not as bad as I am imagining since that will happen with any flow graph traversal is used anyway), and since loaded nodes are cached with soft references, they could need to be loaded again from disk anyway in which case the same problem could occur. * Harden {{StandardGraphLookupView.bruteForceScanForEnclosingBlock}} to make it return null if there is an error deserializing a node. 
Would work fine, but it doesn't really seem right for the error to be handled here. In retrospect the method should probably throw IOException instead of rethrowing internal exceptions as a RuntimeException [here|https://github.com/jenkinsci/workflow-api-plugin/blob/c84dbab8cd35c90f70d84390dc711901fa73b7ad/src/main/java/org/jenkinsci/plugins/workflow/graph/StandardGraphLookupView.java#L143] (even though that is not actually where the error is thrown in this case). |
New:
I have seen the following exception break Queue maintenance and make it impossible for new jobs to be scheduled: {noformat} 2019-04-29 14:24:14.748+0000 [id=118] SEVERE hudson.triggers.SafeTimerTask#run: Timer task hudson.model.Queue$MaintainTask@7edb0f64 failed java.lang.IndexOutOfBoundsException: Index: 0 at java.util.Collections$EmptyList.get(Collections.java:4454) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.bruteForceScanForEnclosingBlock(StandardGraphLookupView.java:150) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findEnclosingBlockStart(StandardGraphLookupView.java:197) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findAllEnclosingBlockStarts(StandardGraphLookupView.java:217) at org.jenkinsci.plugins.workflow.flow.FlowExecution.findAllEnclosingBlockStarts(FlowExecution.java:299) at org.jenkinsci.plugins.workflow.graph.FlowNode.getEnclosingBlocks(FlowNode.java:195) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.concatenateAllEnclosingLabels(ExecutorStepExecution.java:527) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getAffinityKey(ExecutorStepExecution.java:545) at hudson.model.LoadBalancer$1.assignGreedily(LoadBalancer.java:115) at hudson.model.LoadBalancer$1.map(LoadBalancer.java:105) at hudson.model.LoadBalancer$2.map(LoadBalancer.java:157) at hudson.model.Queue.maintain(Queue.java:1629) at hudson.model.Queue$MaintainTask.doRun(Queue.java:2887) at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72) at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat} As best as I can tell, the root cause was that a channel reading a flow node's parent node XML file was interrupted, which threw an exception and left the flow graph corrupted. (I am not sure how or why the read was interrupted, but since the same thing would happen if the file was corrupted on disk it seems like a reasonable failure mode to handle.) 
Here is the stack trace for that issue: {noformat} 2019-04-29 14:24:14.745+0000 [id=118] WARNING o.j.p.workflow.graph.FlowNode#loadParents: failed to load parents of 96 java.nio.channels.ClosedByInterruptException at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202) at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:164) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read(BufferedInputStream.java:265) at java.io.FilterInputStream.read(FilterInputStream.java:83) at java.io.PushbackInputStream.read(PushbackInputStream.java:139) at com.thoughtworks.xstream.core.util.XmlHeaderAwareReader.getHeader(XmlHeaderAwareReader.java:79) at com.thoughtworks.xstream.core.util.XmlHeaderAwareReader.<init>(XmlHeaderAwareReader.java:61) at com.thoughtworks.xstream.io.xml.AbstractXppDriver.createReader(AbstractXppDriver.java:65) Caused: com.thoughtworks.xstream.io.StreamException: : null at com.thoughtworks.xstream.io.xml.AbstractXppDriver.createReader(AbstractXppDriver.java:69) at com.thoughtworks.xstream.XStream.fromXML(XStream.java:1053) at hudson.XmlFile.read(XmlFile.java:147) Caused: java.io.IOException: Unable to read /var/jenkins_home/jobs/foo/jobs/bar/builds/163/workflow/95.xml at hudson.XmlFile.read(XmlFile.java:149) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.load(SimpleXStreamFlowNodeStorage.java:204) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.access$000(SimpleXStreamFlowNodeStorage.java:71) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage$1.load(SimpleXStreamFlowNodeStorage.java:76) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage$1.load(SimpleXStreamFlowNodeStorage.java:74) at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568) at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350) at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313) at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228) at com.google.common.cache.LocalCache.get(LocalCache.java:3965) at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3969) at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4829) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.getNode(SimpleXStreamFlowNodeStorage.java:101) Caused: java.io.IOException at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.getNode(SimpleXStreamFlowNodeStorage.java:109) at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$TimingFlowNodeStorage.getNode(CpsFlowExecution.java:1791) at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.getNode(CpsFlowExecution.java:1179) at org.jenkinsci.plugins.workflow.graph.FlowNode.loadParents(FlowNode.java:165) at org.jenkinsci.plugins.workflow.graph.FlowNode.getParents(FlowNode.java:156) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.bruteForceScanForEnclosingBlock(StandardGraphLookupView.java:150) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findEnclosingBlockStart(StandardGraphLookupView.java:197) at 
org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findAllEnclosingBlockStarts(StandardGraphLookupView.java:217) at org.jenkinsci.plugins.workflow.flow.FlowExecution.findAllEnclosingBlockStarts(FlowExecution.java:299) at org.jenkinsci.plugins.workflow.graph.FlowNode.getEnclosingBlocks(FlowNode.java:195) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.concatenateAllEnclosingLabels(ExecutorStepExecution.java:527) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getAffinityKey(ExecutorStepExecution.java:545) at hudson.model.LoadBalancer$1.assignGreedily(LoadBalancer.java:115) at hudson.model.LoadBalancer$1.map(LoadBalancer.java:105) at hudson.model.LoadBalancer$2.map(LoadBalancer.java:157) at hudson.model.Queue.maintain(Queue.java:1629) at hudson.model.Queue$MaintainTask.doRun(Queue.java:2887) at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72) at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat} We should make changes over on the Pipeline side so that a deserialization failure after the program is already loaded is handled more gracefully, but I think we should also harden core against {{getAffinityKey}} that throw {{RuntimeException}} so that they do not break Queue maintenance entirely. On the Pipeline side, I'm not sure how best to handle this issue, since none of the call sites in the stack trace look like a great place to actually handle the error. Here are some ideas: * Make the parents field in a FlowNode a critical field in XStream, so if it can't be deserialized the whole node fails to be deserialized rather than returning a corrupt node. I think this is a good idea (maybe for other fields in FlowNode as well), but as far as the symptoms here think it would just change where the error is thrown in {{StandardGraphLookupView.bruteForceScanForEnclosingBlock}}. * Add a try/catch inside of {{CpsFlowExecution.getNode}} and if an exception is caught, treat that as a fatal error for the execution and handle it like we would an error loading flow heads in {{CpsFlowExecution.onLoad}}. In this case the program is already loaded and potentially executing, so I'm not sure if there would be a clean way to shut it down from that method, and the method might be used in contexts where that is not the desired behavior. * Eagerly load all flow nodes in {{CpsFlowExecution.onLoad}}. That would probably have bad performance implications (although maybe not as bad as I am imagining since that will happen with any flow graph traversal is used anyway), and since loaded nodes are cached with soft references, they could need to be loaded again from disk anyway in which case the same problem could occur. * Harden {{StandardGraphLookupView.bruteForceScanForEnclosingBlock}} to make it return null if there is an error deserializing a node. 
Would work fine, but it doesn't really seem right for the error to be handled here, since other code would have to deal with the same corrupt {{FlowNode}} elsewhere. |
Description |
Original:
I have seen the following exception break Queue maintenance and make it impossible for new jobs to be scheduled: {noformat} 2019-04-29 14:24:14.748+0000 [id=118] SEVERE hudson.triggers.SafeTimerTask#run: Timer task hudson.model.Queue$MaintainTask@7edb0f64 failed java.lang.IndexOutOfBoundsException: Index: 0 at java.util.Collections$EmptyList.get(Collections.java:4454) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.bruteForceScanForEnclosingBlock(StandardGraphLookupView.java:150) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findEnclosingBlockStart(StandardGraphLookupView.java:197) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findAllEnclosingBlockStarts(StandardGraphLookupView.java:217) at org.jenkinsci.plugins.workflow.flow.FlowExecution.findAllEnclosingBlockStarts(FlowExecution.java:299) at org.jenkinsci.plugins.workflow.graph.FlowNode.getEnclosingBlocks(FlowNode.java:195) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.concatenateAllEnclosingLabels(ExecutorStepExecution.java:527) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getAffinityKey(ExecutorStepExecution.java:545) at hudson.model.LoadBalancer$1.assignGreedily(LoadBalancer.java:115) at hudson.model.LoadBalancer$1.map(LoadBalancer.java:105) at hudson.model.LoadBalancer$2.map(LoadBalancer.java:157) at hudson.model.Queue.maintain(Queue.java:1629) at hudson.model.Queue$MaintainTask.doRun(Queue.java:2887) at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72) at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat} As best as I can tell, the root cause was that a channel reading a flow node's parent node XML file was interrupted, which threw an exception and left the flow graph corrupted. (I am not sure how or why the read was interrupted, but since the same thing would happen if the file was corrupted on disk it seems like a reasonable failure mode to handle.) 
Here is the stack trace for that issue: {noformat} 2019-04-29 14:24:14.745+0000 [id=118] WARNING o.j.p.workflow.graph.FlowNode#loadParents: failed to load parents of 96 java.nio.channels.ClosedByInterruptException at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202) at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:164) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read(BufferedInputStream.java:265) at java.io.FilterInputStream.read(FilterInputStream.java:83) at java.io.PushbackInputStream.read(PushbackInputStream.java:139) at com.thoughtworks.xstream.core.util.XmlHeaderAwareReader.getHeader(XmlHeaderAwareReader.java:79) at com.thoughtworks.xstream.core.util.XmlHeaderAwareReader.<init>(XmlHeaderAwareReader.java:61) at com.thoughtworks.xstream.io.xml.AbstractXppDriver.createReader(AbstractXppDriver.java:65) Caused: com.thoughtworks.xstream.io.StreamException: : null at com.thoughtworks.xstream.io.xml.AbstractXppDriver.createReader(AbstractXppDriver.java:69) at com.thoughtworks.xstream.XStream.fromXML(XStream.java:1053) at hudson.XmlFile.read(XmlFile.java:147) Caused: java.io.IOException: Unable to read /var/jenkins_home/jobs/foo/jobs/bar/builds/163/workflow/95.xml at hudson.XmlFile.read(XmlFile.java:149) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.load(SimpleXStreamFlowNodeStorage.java:204) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.access$000(SimpleXStreamFlowNodeStorage.java:71) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage$1.load(SimpleXStreamFlowNodeStorage.java:76) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage$1.load(SimpleXStreamFlowNodeStorage.java:74) at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568) at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350) at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313) at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228) at com.google.common.cache.LocalCache.get(LocalCache.java:3965) at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3969) at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4829) at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.getNode(SimpleXStreamFlowNodeStorage.java:101) Caused: java.io.IOException at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.getNode(SimpleXStreamFlowNodeStorage.java:109) at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$TimingFlowNodeStorage.getNode(CpsFlowExecution.java:1791) at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.getNode(CpsFlowExecution.java:1179) at org.jenkinsci.plugins.workflow.graph.FlowNode.loadParents(FlowNode.java:165) at org.jenkinsci.plugins.workflow.graph.FlowNode.getParents(FlowNode.java:156) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.bruteForceScanForEnclosingBlock(StandardGraphLookupView.java:150) at org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findEnclosingBlockStart(StandardGraphLookupView.java:197) at 
org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findAllEnclosingBlockStarts(StandardGraphLookupView.java:217) at org.jenkinsci.plugins.workflow.flow.FlowExecution.findAllEnclosingBlockStarts(FlowExecution.java:299) at org.jenkinsci.plugins.workflow.graph.FlowNode.getEnclosingBlocks(FlowNode.java:195) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.concatenateAllEnclosingLabels(ExecutorStepExecution.java:527) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getAffinityKey(ExecutorStepExecution.java:545) at hudson.model.LoadBalancer$1.assignGreedily(LoadBalancer.java:115) at hudson.model.LoadBalancer$1.map(LoadBalancer.java:105) at hudson.model.LoadBalancer$2.map(LoadBalancer.java:157) at hudson.model.Queue.maintain(Queue.java:1629) at hudson.model.Queue$MaintainTask.doRun(Queue.java:2887) at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72) at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat}

We should make changes over on the Pipeline side so that a deserialization failure after the program is already loaded is handled more gracefully, but I think we should also harden core against {{getAffinityKey}} implementations that throw {{RuntimeException}}, so that they do not break Queue maintenance entirely (a sketch of what that guard could look like is included after the list below).

On the Pipeline side, I'm not sure how best to handle this issue, since none of the call sites in the stack trace look like a great place to actually handle the error. Here are some ideas (rough sketches of several of them follow the list):
# Make the parents field in a {{FlowNode}} a critical field in XStream, so that if it can't be deserialized the whole node fails to deserialize rather than returning a corrupt node. I think this is a good idea (maybe for other fields in {{FlowNode}} as well), but as far as the symptoms here go, I think it would just change where the error is thrown in {{StandardGraphLookupView.bruteForceScanForEnclosingBlock}}.
# Harden {{StandardGraphLookupView.bruteForceScanForEnclosingBlock}} to make it return null if there is an error deserializing a node or we come upon a node with no parents. That would work fine, but it doesn't really seem right for the error to be handled here, since other code would have to deal with the same corrupt {{FlowNode}} elsewhere.
# Add a try/catch inside {{CpsFlowExecution.getNode}}, and if an exception is caught, treat it as a fatal error for the execution, handling it like we would an error loading flow heads in {{CpsFlowExecution.onLoad}}. In this case the program is already loaded and potentially executing, so I'm not sure there would be a clean way to shut it down from that method, and the method might be used in contexts where that is not the desired behavior.
# Eagerly load all flow nodes in {{CpsFlowExecution.onLoad}}. That would probably have bad performance implications (although maybe not as bad as I am imagining, since that happens whenever any flow graph traversal is used anyway), and since loaded nodes are cached with soft references, they could need to be loaded again from disk later anyway, in which case the same problem could occur.
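To make the core-side hardening concrete, here is roughly what I have in mind. This is a minimal sketch only: the helper class and method names are made up for illustration and are not an existing core API; the fallback to {{getFullDisplayName}} just mirrors what the default {{getAffinityKey}} returns.

{noformat}
import hudson.model.Queue;
import java.util.logging.Level;
import java.util.logging.Logger;

class AffinityKeyGuard { // hypothetical helper for illustration, not an existing core class
    private static final Logger LOGGER = Logger.getLogger(AffinityKeyGuard.class.getName());

    /**
     * Wraps Queue.Task#getAffinityKey so that a RuntimeException from a broken task
     * (e.g. a PlaceholderTask whose flow graph is corrupt) cannot abort Queue.maintain.
     */
    static String safeAffinityKey(Queue.Task task) {
        try {
            return task.getAffinityKey();
        } catch (RuntimeException e) {
            LOGGER.log(Level.WARNING, "getAffinityKey failed for " + task, e);
            // Fall back to something stable so load balancing can still proceed.
            return task.getFullDisplayName();
        }
    }
}
{noformat}

With something like this in place, Queue maintenance would just log the broken task and keep scheduling other items instead of aborting the whole pass.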
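For idea 1, I believe core already has a hook for this via {{RobustReflectionConverter}}; if I am remembering the API right, it would be something like the sketch below. The field name "parentIds" is only a guess at how a node's parents are serialized, and actually wiring this into the storage's real XStream instance is hand-waved here.

{noformat}
import hudson.util.XStream2;
import org.jenkinsci.plugins.workflow.graph.FlowNode;

class CriticalParentsSketch { // illustration of idea 1 only
    static XStream2 buildStorageXStream() {
        XStream2 xs = new XStream2();
        // "parentIds" is a guess at the serialized field holding a node's parents.
        // Marking it critical should make the robust converter rethrow the read error
        // instead of skipping the field, so the node fails to load as a whole rather
        // than coming back with an empty parent list.
        xs.addCriticalField(FlowNode.class, "parentIds");
        return xs;
    }
}
{noformat}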
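For idea 2, the shape of the guard would be roughly the following. This is a standalone sketch rather than a patch to {{StandardGraphLookupView}}, so the walk is simplified, but it shows the check I mean: bail out with null when a node has no loadable parents instead of indexing into an empty list.

{noformat}
import java.util.List;
import javax.annotation.CheckForNull;
import org.jenkinsci.plugins.workflow.graph.BlockEndNode;
import org.jenkinsci.plugins.workflow.graph.BlockStartNode;
import org.jenkinsci.plugins.workflow.graph.FlowNode;

class EnclosingBlockGuardSketch { // illustration of idea 2, not the real lookup view
    /**
     * Brute-force walk toward the start of the flow, returning the nearest enclosing
     * BlockStartNode, or null if there is none or the parent chain is broken
     * (the corrupt-node case from this issue).
     */
    @CheckForNull
    static BlockStartNode findEnclosingStart(FlowNode node) {
        FlowNode current = node;
        while (true) {
            if (current instanceof BlockEndNode) {
                // Skip over a completed block: its start does not enclose 'node'.
                current = ((BlockEndNode<?>) current).getStartNode();
            } else if (current instanceof BlockStartNode && current != node) {
                return (BlockStartNode) current;
            }
            List<FlowNode> parents = current.getParents();
            if (parents.isEmpty()) {
                // Either we reached the start of the flow, or the parent XML could not
                // be read and getParents() came back empty; returning null here avoids
                // the IndexOutOfBoundsException from parents.get(0).
                return null;
            }
            current = parents.get(0);
        }
    }
}
{noformat}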
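And idea 4 amounts to forcing a walk of the whole graph right after load, something like this (again just a sketch; a real change would live in {{CpsFlowExecution.onLoad}} and would probably reuse one of the existing graph-analysis scanners instead of a hand-rolled walk):

{noformat}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import org.jenkinsci.plugins.workflow.flow.FlowExecution;
import org.jenkinsci.plugins.workflow.graph.FlowNode;

class EagerLoadSketch { // illustration of idea 4
    /**
     * Walks every node reachable from the current heads so that the per-node XML is
     * read up front rather than lazily during Queue maintenance. Returns the number
     * of nodes visited.
     */
    static int touchAllNodes(FlowExecution execution) {
        Set<String> visited = new HashSet<>();
        Deque<FlowNode> queue = new ArrayDeque<>(execution.getCurrentHeads());
        while (!queue.isEmpty()) {
            FlowNode node = queue.pop();
            if (!visited.add(node.getId())) {
                continue;
            }
            // getParents() is what forces each parent node's XML to be read from disk.
            for (FlowNode parent : node.getParents()) {
                queue.push(parent);
            }
        }
        return visited.size();
    }
}
{noformat}

Even with this, nodes evicted from the soft-reference cache would still be re-read from disk later, so it only narrows the window for the failure rather than eliminating it.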
org.jenkinsci.plugins.workflow.graph.StandardGraphLookupView.findAllEnclosingBlockStarts(StandardGraphLookupView.java:217) at org.jenkinsci.plugins.workflow.flow.FlowExecution.findAllEnclosingBlockStarts(FlowExecution.java:299) at org.jenkinsci.plugins.workflow.graph.FlowNode.getEnclosingBlocks(FlowNode.java:195) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.concatenateAllEnclosingLabels(ExecutorStepExecution.java:527) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getAffinityKey(ExecutorStepExecution.java:545) at hudson.model.LoadBalancer$1.assignGreedily(LoadBalancer.java:115) at hudson.model.LoadBalancer$1.map(LoadBalancer.java:105) at hudson.model.LoadBalancer$2.map(LoadBalancer.java:157) at hudson.model.Queue.maintain(Queue.java:1629) at hudson.model.Queue$MaintainTask.doRun(Queue.java:2887) at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72) at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat}
We should make changes on the Pipeline side so that a deserialization failure after the program is already loaded is handled more gracefully, but I think we should also harden core against {{getAffinityKey}} implementations that throw {{RuntimeException}} so that they do not break Queue maintenance entirely (see the first sketch below). On the Pipeline side, I'm not sure how best to handle this issue, since none of the call sites in the stack trace look like a great place to actually handle the error. Here are some ideas:
# Harden {{StandardGraphLookupView.bruteForceScanForEnclosingBlock}} so that it returns null if there is an error deserializing a node or we come upon a node with no parents (see the second sketch below). This would work fine, but it doesn't really seem right to handle the error here, since other code would still have to deal with the same corrupt {{FlowNode}} elsewhere.
# Add a try/catch inside of {{CpsFlowExecution.getNode}}, and if an exception is caught, treat it as a fatal error for the execution and handle it the way we handle an error loading flow heads in {{CpsFlowExecution.onLoad}}. In this case the program is already loaded and potentially executing, so I'm not sure there would be a clean way to shut it down from that method, and the method might be used in contexts where that is not the desired behavior.
# Eagerly load all flow nodes in {{CpsFlowExecution.onLoad}}. That would probably have bad performance implications (although maybe not as bad as I am imagining, since that happens with many graph traversals anyway), and since loaded nodes are cached with soft references, they could need to be loaded from disk again later, in which case the same problem could occur. |
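The core-side hardening could look roughly like the following. This is a minimal sketch, not actual Jenkins core code: the {{SafeAffinityKey}} class and its {{of}} method are hypothetical names, and exactly where the guard belongs (for example around the load balancer's use of the affinity key) is left open. It assumes only that {{Queue.Task#getAffinityKey}} and {{Queue.Task#getFullDisplayName}} exist, as they do in current core, and that the full display name is an acceptable fallback since affinity keys are conventionally derived from it.
{code:java}
import hudson.model.Queue;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not actual Jenkins core code: call sites in core (for example
// the load balancer) could go through something like this instead of calling
// task.getAffinityKey() directly, so that a misbehaving implementation such as
// PlaceholderTask.getAffinityKey() above cannot abort the whole Queue.maintain() pass.
class SafeAffinityKey {
    private static final Logger LOGGER = Logger.getLogger(SafeAffinityKey.class.getName());

    static String of(Queue.Task task) {
        try {
            return task.getAffinityKey();
        } catch (RuntimeException e) {
            LOGGER.log(Level.WARNING, "getAffinityKey failed for " + task, e);
            // Fall back to the task's full display name, which keeps scheduling behavior
            // reasonable for the degraded task while leaving other queued items unaffected.
            return task.getFullDisplayName();
        }
    }
}
{code}
Catching only {{RuntimeException}} (rather than {{Throwable}}) keeps genuinely fatal errors visible while still covering the {{IndexOutOfBoundsException}} seen in the first stack trace.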
Link |
New:
This issue is caused by |
Labels | New: robustness |
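For idea #1 above, the guard itself is small. The method below is a simplified stand-in for {{StandardGraphLookupView.bruteForceScanForEnclosingBlock}}, written only to show the defensive check rather than to copy the real implementation: the class and method names are made up, and the single-parent walk glosses over cases the real lookup view handles.
{code:java}
import java.util.List;
import org.jenkinsci.plugins.workflow.graph.BlockEndNode;
import org.jenkinsci.plugins.workflow.graph.BlockStartNode;
import org.jenkinsci.plugins.workflow.graph.FlowNode;

// Illustrative only: a simplified upward scan for the nearest enclosing block start.
class EnclosingBlockScan {
    /** Returns the nearest enclosing block start, or null if none can be determined. */
    static BlockStartNode scanUpForEnclosingBlock(FlowNode node) {
        int closedBlocksToSkip = 0;
        FlowNode current = node;
        while (true) {
            List<FlowNode> parents = current.getParents();
            if (parents.isEmpty()) {
                // Either we reached the start of the flow, or FlowNode.loadParents hit an
                // IOException (it logs the failure and returns an empty list) because the
                // parent's XML file could not be read. Returning null here, instead of
                // blindly calling parents.get(0), avoids the IndexOutOfBoundsException in
                // the first stack trace above.
                return null;
            }
            current = parents.get(0);
            if (current instanceof BlockEndNode) {
                // This block closed before reaching `node`, so skip its matching start.
                closedBlocksToSkip++;
            } else if (current instanceof BlockStartNode) {
                if (closedBlocksToSkip == 0) {
                    return (BlockStartNode) current; // first unmatched start encloses node
                }
                closedBlocksToSkip--;
            }
        }
    }
}
{code}
As noted in the description, handling the corruption at this level only papers over it for this one caller; any other flow graph traversal would still hit the same empty parent list on the corrupt {{FlowNode}}.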