JENKINS-46507: Parallel Pipeline random java.lang.InterruptedException

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • None
    • workflow-durable-task-step 2.29

      In my Pipeline job, I sometimes randomly receive the java.lang.InterruptedException below:

      java.lang.InterruptedException
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1302)
      	at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:275)
      	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:111)
      	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadGroupSynchronously(CpsStepContext.java:248)
      	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadSynchronously(CpsStepContext.java:237)
      	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.doGet(CpsStepContext.java:294)
      	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:61)
      	at org.jenkinsci.plugins.workflow.steps.StepDescriptor.checkContextAvailability(StepDescriptor.java:251)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:179)
      	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:126)
      	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:108)
      	at org.jenkinsci.plugins.workflow.cps.CpsScript.println(CpsScript.java:207)
      	at org.jenkinsci.plugins.workflow.cps.CpsScript.print(CpsScript.java:202)
      	at sun.reflect.GeneratedMethodAccessor103253.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
      ....
      ....
      

      Please refer to the attached files for the full console log and the pipeline's Jenkinsfile code.

        1. workflow-durable-task-step.hpi
          85 kB
        2. jenkins.log
          266 kB
        3. stuff.tgz
          310 kB
        4. hs_err_pid239040.log
          84 kB
        5. Jenkinsfile
          7 kB
        6. Jenkinsfile.txt
          6 kB
        7. consoleText_ERROR.txt
          95 kB

          [JENKINS-46507] Parallel Pipeline random java.lang.InterruptedException

          Andrew Bayer added a comment -

          svanoort - may be of interest to you.


          Kieran Webber added a comment -

          +1 to this issue. Experiencing a hang followed by this error when a large number of pipelines run concurrently.


          Ann Beeskau added a comment -

          +1 we're experiencing this as well.


          Syed Imran added a comment - - edited

          +1 we're experiencing this as well. Expecting a resolution.


          Georg Henzler added a comment -

          +1 we're experiencing this as well.


          Suriyanath added a comment -

          We are also experiencing the same issue, on Jenkins version 2.69.


          Lorenzo Cubero added a comment -

          Same here using Jenkins 2.73.2

          shyam singh added a comment - - edited

          +1 Same here, we are facing this issue very frequently


          Bruno Sousa added a comment -

          +1 same here


          Jarrett B added a comment -

          +1 Same here


          Georg Henzler added a comment -

          We also had this problem again today... reading jglick's comment in https://issues.jenkins-ci.org/browse/JENKINS-35710?focusedCommentId=268398&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-268398 I suspect adding more slaves does not help (as the stack trace clearly points to CPS, which according to that comment always runs on the master)... but I suppose using https://github.com/jenkinsci/throttle-concurrent-builds-plugin to ensure only a maximum number of builds run in parallel globally could help. svanoort, before trying, do you think this is a valid way of getting rid of the problem? (Asking before trying, as this is not easy to reproduce.)


          Sam Van Oort added a comment -

          ghenzler Executors have nothing to do with this, and this issue is generally within the scope of a single build, so limiting the maximum number should have no impact – except when restarting Jenkins, where there might be enough of a bottleneck due to loading FlowExecutions to hit the 1 minute timeout. Maybe if you had 10x as many Pipelines as CPU cores on the master, or really complex Program state – in that case we'd be blocking on the code here: https://github.com/jenkinsci/workflow-job-plugin/blob/master/src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java#L1024

          Do you have a message like this in your logs? WARNING, failed to wait for $someRun to be loaded (and then a stack trace).

          More likely there's something subtly wrong with the concurrency/error-handling model here – some of what I'm doing right now to untangle some nasty issues in these guts may help – and there was a change in the last release of workflow-cps that might help if you don't have it already (it caught more exceptions and prevented threads from potentially waiting forever).


          Alexey Grigorov added a comment - - edited

          Same here. I'm running a @NonCPS-labeled method in a shared library and it fails with this error randomly in the middle:

          import jenkins.*
          import jenkins.model.*
          import hudson.*
          import hudson.model.*
          
          /* This method will find all jobs and verify their disk usage.
          Then it will remove job runs to fit  quota
          
          That will help to save disk space
          
          */
          @NonCPS
          def call() {
              def maxRunSize = 21474836480 //20Gb
              def maxJobSize = 42949672960 //40Gb
              Jenkins.instance.getAllItems(hudson.model.Job).each
                      {
                          println it
                          def runs = it.getBuilds()
                          def job = it
                          def totalSize = 0
                          def jobSizeReached = false
                          runs.each{
                              def currentRun = it
                              def currentRunSize = it.getRootDir().directorySize()
                              println it
                              println currentRunSize
                              switch (currentRunSize) {
                                  case {it > maxRunSize} :
                                      currentRun.delete()
                                      println "deleted " + currentRun + " for having size " + it + " over allowed " +  maxRunSize
                                      break
                                  case {it + totalSize > maxJobSize || jobSizeReached } :
                                      jobSizeReached = true
                                      currentRun.delete()
                                      println "deleted " + currentRun + " for not fitting total  " + job + " jobs size " + (totalSize+currentRunSize) +" to max  allowed " +  maxJobSize
                                      break
                                  default :
                                      totalSize += currentRunSize
                                      println 'leaving '  + currentRun + ". current total = " + totalSize
                                      break
                              }
                              println '--------------------------'
                          }
                          println '=========================='
                      }
          }
          
          
          <<Some output from the loop here, sometimes a few lines, sometimes a few hundred; I guess it fails after some random number of iterations>>
          --------------------------
          [Pipeline] echo
          [Pipeline] End of Pipeline
          java.lang.InterruptedException
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1302)
          	at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:275)
          	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:111)
          	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadGroupSynchronously(CpsStepContext.java:248)
          	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadSynchronously(CpsStepContext.java:237)
          	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.doGet(CpsStepContext.java:298)
          	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:61)
          	at org.jenkinsci.plugins.workflow.steps.StepDescriptor.checkContextAvailability(StepDescriptor.java:258)
          	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:206)
          	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:153)
          	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
          	at org.jenkinsci.plugins.workflow.cps.CpsScript.println(CpsScript.java:221)
          	at sun.reflect.GeneratedMethodAccessor1023.invoke(Unknown Source)
          	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          	at java.lang.reflect.Method.invoke(Method.java:498)
          	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
          	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
          	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1213)
          	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
          	at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:925)
          	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:908)
          	at org.codehaus.groovy.runtime.DefaultGroovyMethods.println(DefaultGroovyMethods.java:678)
          	at org.codehaus.groovy.runtime.dgm$500.doMethodInvoke(Unknown Source)
          	at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:294)
          	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
          	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.callCurrent(PogoMetaClassSite.java:69)
          	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:166)
          

          Jenkins 2.107.3


          Toby Broyles added a comment -

          We are seeing this error quite often lately. Is there anything I can do to help troubleshoot?


          Sam Van Oort added a comment -

          dnusbaum Please, could you take a look?


          Devin Nusbaum added a comment - - edited

          I finally had a chance to take a look at this.

          iceiceice: You are using println inside of an @NonCPS method. It might seem like this should be ok, but behind the scenes println ends up calling the echo step, which means that you are executing Pipeline Steps inside of an @NonCPS method, which should not be done even though it may appear to work correctly in some cases.
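
          A minimal sketch of the distinction (hypothetical method names, not the reporter's actual code): keep the @NonCPS method free of Pipeline steps by only returning data, and do any echo/println from regular CPS-transformed Pipeline code.

          import jenkins.model.Jenkins

          // Hypothetical sketch: no Pipeline steps inside the @NonCPS method.
          @NonCPS
          def collectJobNames() {
              def names = []
              Jenkins.instance.getAllItems(hudson.model.Job).each { names << it.fullName }
              return names // plain Groovy only; no echo/println in here
          }

          // Back in CPS-transformed Pipeline code, steps are fine:
          for (name in collectJobNames()) {
              echo name
          }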

          tobybro If you have a minimal pipeline that can reproduce the issue, or a scenario in which you can reproduce the problem consistently, it would be very helpful if you could post the Pipeline script and the build folder of one of the builds that failed with the exception here. Otherwise, it would be great to see your actual pipeline that is hitting the error so we can check for any similarities with Alexey's or Rick's Pipelines that might give us an idea of what is happening.

          EDIT: My original hypothesis was misleading and incorrect, so I've edited it based on new information.

          Notably, the exception being thrown in all these cases comes from AbstractQueuedSynchronizer#acquireSharedInterruptibly Line 1302, which checks the interrupt status before the thread tries to acquire the lock. This means that the thread was actually interrupted before the step started executing, and CpsStepContext#getThreadGroupSynchronously just happens to be the first piece of code to check the interrupted flag and throw an exception. I got access to the flow nodes of another user's pipeline hitting the same exception, and noticed that the duration between the last successfully executed node and the start of the step that threw the exception was just over 5 minutes, which leads me to believe that the code that is doing the interruption is the Timeout in CpsThread#runNextChunk.

          None of the Pipelines that I've seen that have thrown this error look like they should have taken more than 5 minutes in runNextChunk, so I am suspicious that there is something going wrong inside of that method (maybe it is executing more code than it should), but I do not know how to validate my hypothesis.
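
          As a plain-JVM illustration of that reasoning (a hedged sketch, independent of any Jenkins code): a thread whose interrupt flag is already set keeps running CPU-bound code without error, and the InterruptedException only surfaces at the next interruptible call, such as the acquireSharedInterruptibly frame in the traces above.

          def latch = new java.util.concurrent.CountDownLatch(1)
          def t = Thread.start {
              Thread.currentThread().interrupt()                // simulate an earlier interrupt (e.g. a timeout)
              long x = 0
              for (int i = 0; i < 1_000_000; i++) { x += i }    // CPU-bound work keeps running unaffected
              try {
                  latch.await()                                 // first interruptible call checks the flag and throws
              } catch (InterruptedException e) {
                  println "interrupt only noticed here: ${e}"
              }
          }
          t.join()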


          Andrew a added a comment -

          Hi everyone. A little info on this issue, which we've been hitting as well. We're using Jenkins and its Pipeline plugin to run our regression and CI builds for a SoC (System-on-Chip) we're designing. As a side effect of what we're doing, some tests (under the parallel step) take a very long time to complete and are very CPU-intensive. Here are some statistics I've noticed:

          1. CI builds that have a small number (~10) of short tests under the parallel step almost always complete successfully.
          2. Regression runs that take a few days to complete and have a huge list of longer tests almost always hit this bug.
          3. If Jenkins was restarted fresh, the chances of hitting this bug are somewhat lower. At first I thought it was related to a memory leak in the Jenkins test-results-analyzer plugin, which (due to our huge, sometimes over 500 MiB, log files) crashed Jenkins often, until I started filtering logs to mitigate the issue (https://issues.jenkins-ci.org/browse/JENKINS-34134). Tuning GC options seemed to further improve things.
          4. Since we're running builds on an NFS filesystem to later distribute and execute tests on different servers, thus balancing the load, I first considered filesystem lags due to other tasks running on our servers. Adding the snippet below seemed to improve things a little bit. The disk load is not evenly distributed: it's heavy at the start of a test (when the model is loaded into memory, and during reset due to extensive logging), and later it's usually very low throughout the whole run.

           

          /* Workaround laggy filesystem */
          script {
             System.setProperty("org.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL", "3800");
          }

           

           

          5. Another thing I've noticed: some tests seem to randomly fail. According to the simulation log, the simulation process was just killed (by something other than OOM) rather than getting stuck, and that happens quite fast (e.g. the simulation hasn't even completed the reset sequence, which takes a few seconds of real time but usually spews a lot of info into the log).

          6. In the Blue Ocean UI, all failures in parallel steps seem to be grouped in the very bottom half of the screen, after 150+ successfully executed tests. Since these are mostly in the order they were started (I assume), it makes me believe it might be somehow related to either GC taking a lot of time or some memory leaking here and there.

           

          The grand total is 5-10 tests out of 300+ randomly crashing, and almost always the annoying java.lang.InterruptedException, with the whole run (that took 3-4 days to complete!) just freezing.

          A full regression that takes a few days to complete almost always triggers the bug. I'm willing to help solve this issue, but since it requires some sacred knowledge about Jenkins' guts that I do not have, I can either post my pipeline script (which will most likely be useless, due to the huge number of internal tools being called), or give some of the proposed fixes a try and post the results, since I have the right environment.

           

           


          Devin Nusbaum added a comment -

          ncrmnt Thank you for including all of those details in your comment! It's interesting that you are hitting the issue consistently on multi-day builds. How much of your build is running external processes sh/bat vs Pipeline/Groovy code (i.e. Are you more or less running a multi-day bash script (or many such scripts in a parallel block), or is most of the time spent running Groovy code in your Jenkinsfile, or is it about even)?

          I am not aware of any workarounds for the issue. One thing that would be helpful would be if you could upload the build folder for one of your builds which failed with this issue. In particular, I am interested in the XML files in the workflow/ directory to see if the serialized flow nodes can tell us what was executing when the issue occurred which may give us some insight.


          Andrew a added a comment -

          dnusbaum, my last run was inconsistent due to the fact that the server with Jenkins was out of disk space (sic!). I'll start the next regression run today and post results once it's done or hangs. This will take a few days though, depending on the load. Once it's done I'll try to provide you with as much data as I can.

          > How much of your build is running external processes sh/bat vs Pipeline/Groovy code (i.e. Are you more or less running a multi-day bash script (or many such scripts in a parallel block), or is most of the time spent running Groovy code in your Jenkinsfile, or is it about even)?

          Basically it all comes down to running a single "cmake's ctest" process with one test in the list, which may take anything from a few minutes to a few days with little to no stdout. Here's my pipeline script:

          Jenkinsfile

          Please note that it may seem weird in places due to our internal server config and my attempts to work around this bug. It configures and builds the project on one selected node (the build is done on an NFS share with the exact same path on all nodes), the touchstone simulation step creates simulation checkpoints, and finally the result is executed in parallel during the regression phase, with throttling enabled and the load distributed over our servers.
           

           

           


          Andrew a added a comment - - edited

          dnusbaum, an update. It looks like I got a hit on the first try and now see a handful of
           

          [rumboot-default-rumboot-basis-Debug-basis-stress-test-uart-eth-dma] Cannot contact oribi: java.lang.InterruptedException
          [rumboot-default-rumboot-basis-Debug-simple-iram-mirror-hello-arinc_1] Cannot contact bison: java.lang.InterruptedException
          [rumboot-default-rumboot-basis-Debug-simple-iram-mirror-hello-arinc_6] Cannot contact oribi: java.lang.InterruptedException
          

          in the log. I now have the build in the stuck state and can provide the data. Can you give a bit more detail about where to find the relevant XML files? A quick grep on the slaves didn't reveal any workflow directories. Where should I look for them?

          EDIT: Hm... I've been running tests with a maximum timeout of 260000 s, and it seems it failed once it stumbled upon the first test that actually timed out. Actually, all tests that have an InterruptedException are those that are currently timing out and being worked on! They should've hung for at least 260000 seconds, consuming 100% of one CPU core with no stdout, before being killed internally by the ctest process. The InterruptedException arrives earlier.

          I wonder if wrapping 'sleep 260000' in Groovy will produce the same results, as sketched below.
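
          A minimal sketch of such a reproduction attempt (branch count and labels are placeholders), in case anyone wants to try it:

          // Hypothetical reproduction sketch: long external sleeps in parallel branches.
          def branches = [:]
          for (int i = 1; i <= 3; i++) {
              branches["sleep-${i}"] = {
                  node {
                      sh 'sleep 260000'
                  }
              }
          }
          parallel branches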


          Devin Nusbaum added a comment -

          Can you give a bit more detail about where to find the relevant XML files? A quick grep on the slaves didn't reveal any workflow directories. Where should I look for them?

          Yep, sorry for not specifying earlier. They will be on the master, not on the agents. The build directory will be $JENKINS_HOME/jobs/<path-to-job>/builds/<failing_build_number>/. Inside, the main files of interest are build.xml and everything in workflow/.
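
          If it is easier, the same directory can also be located from the Script Console; a hedged sketch (job name and build number are placeholders):

          import jenkins.model.Jenkins

          // Hypothetical example: print the build directory and the workflow/ flow node files.
          def job = Jenkins.instance.getItemByFullName('my-folder/my-pipeline')
          def build = job.getBuildByNumber(42)
          println build.rootDir                                  // .../jobs/<path-to-job>/builds/42
          new File(build.rootDir, 'workflow').eachFile { println it.name }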

          If it is reproducible with sleep 260000 that would be interesting. At those lengths of time I would start to suspect something going wrong with a network call from the master to the agent causing code somewhere to hang that Pipeline does not handle gracefully. Do your overall Jenkins master logs have any warnings/errors around the time the InterruptedException was thrown?


          Andrew a added a comment -

          stuff.tgz Here is the tarball with all the XML files. Be careful, there's a ton of them in there.

          I do see some exceptions like this in the Jenkins master logs:

          Oct 01, 2018 1:34:10 PM org.jenkinsci.plugins.workflow.cps.CpsStepContext completed
          INFO: new failure
          org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
                  at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.interrupt(CpsFlowExecution.java:1125)
                  at org.jenkinsci.plugins.workflow.job.WorkflowRun$2$1.run(WorkflowRun.java:378)
                  at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
                  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
                  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                  at java.lang.Thread.run(Thread.java:748)
                  Suppressed: java.lang.InterruptedException
                          at java.lang.Object.wait(Native Method)
                          at hudson.remoting.Request.call(Request.java:177)
                          at hudson.remoting.Channel.call(Channel.java:954)
                          at hudson.Launcher$RemoteLauncher.kill(Launcher.java:1084)
                          at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.stop(FileMonitoringTask.java:304)
                          at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.stop(DurableTaskStep.java:258)
                          at org.jenkinsci.plugins.workflow.cps.CpsThread.stop(CpsThread.java:296)
                          at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6.onSuccess(CpsFlowExecution.java:1139)
                          at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6.onSuccess(CpsFlowExecution.java:1128)
                          at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4$1.run(CpsFlowExecution.java:906)
                          at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35)
                          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                          at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
                          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
                          at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
                          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                        ... 3 more
          Oct 01, 2018 1:34:10 PM org.jenkinsci.plugins.workflow.cps.CpsStepContext completed
          INFO: previously delivered here
          java.lang.Throwable
                  at org.jenkinsci.plugins.workflow.cps.CpsStepContext.completed(CpsStepContext.java:341)
                  at org.jenkinsci.plugins.workflow.cps.CpsStepContext.onFailure(CpsStepContext.java:329)
                  at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution$1.run(DurableTaskStep.java:254)
                  at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
                  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
                  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                  at java.lang.Thread.run(Thread.java:748)
          

          But I'm not sure if these are due to the build failing, or to restarting the master earlier today. Our instance has become quite bloated recently.

          Btw, I restarted the build with an insanely huge heartbeat interval that should match the tests' timeout. We'll see how it goes now.


          Christoph Amshoff added a comment -

          Same issue here...

          We have a quite small Jenkinsfile that calls a RESTful service, which takes (depending on parameters) from a few seconds up to an hour or more to complete. The JSON returned by the service should be attached to the build, which is why it's called synchronously. The longer the service call takes, the more likely the InterruptedException occurs. There are no parallel steps in our pipeline, so I'm pretty sure it's related to the long-running step.

          The REST call is done by a self-made Groovy function and classes in our shared library, basically setting up an HttpURLConnection instance. However, that works nicely for fast service calls, so I don't think there is an issue in this code.

          Let me know if I can help with any files/logs.


          Devin Nusbaum added a comment - - edited

          chamshoff Without seeing your Jenkinsfile, it's hard to say, but my guess would be that you are hitting the 5 minute timeout here because your code is waiting for a network response or similar in a method that doesn't respond to being interrupted (so it doesn't fail until the next step is invoked which notices that the thread was interrupted, assuming your stack trace is original to the OP's). I would highly recommend not making direct HTTP calls in your Jenkinsfile/Shared Library even if it seems like it works fine in some cases. You can create a bash script that does whatever you want (curl, java -jar yourRestClient.jar, etc.) and call that script using the sh or bat steps which run asynchronously and are not subject to a 5 minute timeout.
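
          For illustration, a hedged sketch of that suggestion (the URL and file name are placeholders): run the long call as an external process via sh, which is not subject to the 5 minute step timeout, then read the result back into the build.

          node {
              // Hypothetical example: the long-running REST call is done by an external process.
              sh 'curl -sS -o response.json "https://service.example.com/long-running-report"'
              def json = readFile 'response.json'
              echo "service returned ${json.length()} characters"
              archiveArtifacts artifacts: 'response.json'        // attach the JSON to the build
          }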

          ncrmnt Based on those logs, there does appear to be a network issue with the remoting layer involved as well. I will check through the build files you sent to see if they give any indication of a root cause that we can address.


          Andrew a added a comment -

          dnusbaum Thanks a lot for looking into it. Let me know if I can be of any help with this issue.

           

          P.S. Another run with a super-high heartbeat interval ran for a little longer, but ended up with another InterruptedException as well on one of our tests that is timing out.


          Devin Nusbaum added a comment -

          ncrmnt The exception in your most recent build looks like a networking issue after you manually aborted the build. Here is the stack trace:

          org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
              at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.interrupt(CpsFlowExecution.java:1125)
              at org.jenkinsci.plugins.workflow.job.WorkflowRun$2$1.run(WorkflowRun.java:378)
              at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
              Suppressed: java.lang.InterruptedException
                  at java.lang.Object.wait(Native Method)
                  at hudson.remoting.Request.call(Request.java:177)
                  at hudson.remoting.Channel.call(Channel.java:954)
                  at hudson.Launcher$RemoteLauncher.kill(Launcher.java:1084)
                  at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.stop(FileMonitoringTask.java:304)
                  at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.stop(DurableTaskStep.java:258)
                  at org.jenkinsci.plugins.workflow.cps.CpsThread.stop(CpsThread.java:296)
                  at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6.onSuccess(CpsFlowExecution.java:1139)
                  at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6.onSuccess(CpsFlowExecution.java:1128)
                  at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4$1.run(CpsFlowExecution.java:906)
                  at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35)
                  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                  at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
                  at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
                  at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
                  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                  at java.lang.Thread.run(Thread.java:748)
              ... Suppressed exception repeated for each parallel block
          

          And the serialized FlowInterruptedException has:

          <causes class="java.util.Arrays$ArrayList">
            <a class="jenkins.model.CauseOfInterruption-array">
              <jenkins.model.CauseOfInterruption_-UserInterruption>
                <user>andrianov</user>
              </jenkins.model.CauseOfInterruption_-UserInterruption>
            </a>
          </causes>
          

          Have you been manually aborting the builds, or is something else happening? Either way, it seems like you might have a distinct issue, so might be better to move the discussion to a new ticket or to the mailing list to help diagnose your remoting problems.


          Devin Nusbaum added a comment -

          For anyone still seeing the original exception in the description, try adding a custom logger (Manage Jenkins -> System Log -> Add new log recorder) at FINE level for org.jenkinsci.plugins.workflow.support.concurrent.Timeout so that we see these log messages. Then, the next time you see the issue, attach the output of the custom logger here; it will tell us whether my hypothesis is correct and, if so, what code was actually running at the time of the issue.

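          For reference, such a log recorder boils down to standard java.util.logging configuration. The sketch below only illustrates that idea with plain JUL; the class name and handler wiring are illustrative assumptions and are not the plugin's or Jenkins' actual code.

          import java.util.logging.ConsoleHandler;
          import java.util.logging.Level;
          import java.util.logging.Logger;

          // Minimal java.util.logging sketch of a FINE-level recorder for the Timeout class.
          // Illustrative only; a Jenkins log recorder does comparable wiring for you.
          public class TimeoutLoggerSketch {
              public static void main(String[] args) {
                  Logger timeoutLogger =
                          Logger.getLogger("org.jenkinsci.plugins.workflow.support.concurrent.Timeout");
                  timeoutLogger.setLevel(Level.FINE);   // let FINE records through the logger itself

                  ConsoleHandler handler = new ConsoleHandler();
                  handler.setLevel(Level.FINE);         // the handler must also accept FINE, or records are dropped
                  timeoutLogger.addHandler(handler);

                  timeoutLogger.fine("custom FINE logging enabled"); // sample record to verify the setup
              }
          }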

          Andrew a added a comment - - edited

          Yes, I have been manually aborting the builds, but only after they were already stuck with the exception. I'll clean up all the logs and fire up a fresh regression run this weekend with the FINE logging you've mentioned. Let's see if that gives us any better results.


          Devin Nusbaum added a comment -

          Sounds good. When you abort the build, it can change the stored info, so I would collect an archive of the build files prior to aborting it.

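          For reference, capturing such an archive before aborting can be as simple as recursively copying the build directory. The sketch below assumes a conventional $JENKINS_HOME/jobs/<job>/builds/<number> layout; the paths are made-up examples and this is not part of any Jenkins tooling.

          import java.io.IOException;
          import java.nio.file.Files;
          import java.nio.file.Path;
          import java.nio.file.Paths;
          import java.util.stream.Stream;

          // Copies one build directory to a backup location before the build is aborted.
          // Source and target paths are illustrative assumptions about a typical layout.
          public class BuildDirSnapshot {
              public static void main(String[] args) throws IOException {
                  Path source = Paths.get("/var/lib/jenkins/jobs/my-pipeline/builds/123");
                  Path target = Paths.get("/tmp/my-pipeline-123-backup");
                  try (Stream<Path> paths = Files.walk(source)) {
                      for (Path p : (Iterable<Path>) paths::iterator) {
                          Path dest = target.resolve(source.relativize(p).toString());
                          if (Files.isDirectory(p)) {
                              Files.createDirectories(dest);   // recreate the directory structure
                          } else {
                              Files.copy(p, dest);             // copy build.xml, log, flow node files, etc.
                          }
                      }
                  }
              }
          }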

          Andrew a added a comment - - edited

          Before actually starting the job for this weekend, I went to the folks at our IT department and talked about the issue. We looked at the Zabbix logs and could see no problem with network connectivity during the previous runs, and the nodes still had plenty of free resources. The VM running Jenkins was NOT being backed up (a VM backup might break/freeze TCP connections). The only thing that might have interfered is the NFS home-directory backup that runs every evening, since that will introduce some filesystem lag. We'll have a good look at the Zabbix logs once again on Monday. Meanwhile, here are the FINE exception logs:

          Oct 28, 2018 12:58:13 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#222]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=541659 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 12:58:14 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#159]: checking /home/jenkins/basis_rtl_regression/build on oryx / waiting for oryx id=541661 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 1:26:27 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#227]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=546106 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 2:32:40 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#235]: checking /home/jenkins/basis_rtl_regression/build on oryx / waiting for oryx id=556575 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 2:32:40 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#234]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=556578 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 2:32:40 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#225]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=556579 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 2:51:49 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#238]: checking /home/jenkins/basis_rtl_regression/build on oryx / waiting for oryx id=559586 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 2:51:49 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#236]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=559587 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 2:51:49 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#230]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=559588 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 3:00:59 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#232]: checking /home/jenkins/basis_rtl_regression/build on oryx / waiting for oryx id=561098 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 3:00:59 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#225]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=561099 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 3:00:59 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#221]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=561100 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 3:03:24 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#242]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=561455 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 3:03:24 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#245]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=561456 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          Oct 28, 2018 3:03:24 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#228]: checking /home/jenkins/basis_rtl_regression/build on oryx / waiting for oryx id=561457 after 10 SECONDS
          java.lang.Throwable
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	at hudson.remoting.Channel.call(Channel.java:954)
          	at hudson.FilePath.act(FilePath.java:1070)
          	at hudson.FilePath.act(FilePath.java:1059)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          

          Here is the jenkins.log, full of exceptions:

          jenkins.log

          P.S. I haven't interrupted the build yet, so I can collect some more info if you need any. I'll fetch files from the nodes later.


          Andrew a added a comment - - edited

          A little follow-up. Carefully inspecting the Zabbix logs, it turned out that the issue occurred while we were performing the routine nightly backups of the NFS home directories. iowait was still very low, but a little higher than usual. We didn't hit this bug every night, but almost always on the second night. My hypothesis: since the builds run on the NFS share, the backup may have caused filesystem lags. I have applied the following patch (600 s is quite an overkill, but let's see if it helps) and restarted the regression.

          +++ b/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java
          @@ -451,7 +451,7 @@ public abstract class DurableTaskStep extends Step {
                           return; // slave not yet ready, wait for another day
                       }
                       TaskListener listener = listener();
          -            try (Timeout timeout = Timeout.limit(10, TimeUnit.SECONDS)) {
          +            try (Timeout timeout = Timeout.limit(600, TimeUnit.SECONDS)) {
                           if (watching) {
                               Integer exitCode = controller.exitStatus(workspace, launcher(), listener);
                               if (exitCode == null) {
          
          

          A question to the other folks experiencing the issue: how much RAM/disk I/O do your processes use? Could it be that a huge memory-hungry process, when terminated, freezes for a while (>10 seconds) while syncing data to disk?

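          As a side note, one way to test the NFS-lag hypothesis is to time the same kind of workspace check that DurableTaskStep performs and flag anything slower than the 10-second window. The probe below is only a sketch: the workspace path is taken from the logs above, while the polling interval and threshold are assumptions, and none of it is part of the plugin.

          import java.nio.file.Files;
          import java.nio.file.Path;
          import java.nio.file.Paths;

          // Standalone probe: repeatedly time a directory check on the NFS-backed workspace
          // and report any call that would have exceeded a 10-second timeout window.
          public class NfsLatencyProbe {
              public static void main(String[] args) throws InterruptedException {
                  Path workspace = Paths.get("/home/jenkins/basis_rtl_regression/build");
                  while (true) {
                      long start = System.nanoTime();
                      boolean isDir = Files.isDirectory(workspace);   // roughly the kind of check the step runs remotely
                      long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
                      if (elapsedMs > 10_000L) {
                          System.out.printf("SLOW: isDirectory=%s took %d ms%n", isDir, elapsedMs);
                      }
                      Thread.sleep(5_000L);                           // poll every 5 seconds
                  }
              }
          }

          Running something like this on the agent that mounts the share during the backup window would show whether individual file operations ever stall past the 10-second limit.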

          Devin Nusbaum added a comment -

          ncrmnt Are you able to upload the content of the custom logger for org.jenkinsci.plugins.workflow.support.concurrent.Timeout (or was it blank?)? Custom loggers are not included in the normal jenkins.log, so you'll have to add it separately. Along with your patch, it could be interesting to add another custom logger at FINE level for org.jenkinsci.plugins.workflow.steps.durable_task.DurableTask.


          Ian Cotton added a comment -

          I have the custom logger turned on. I have just added additional logging for DurableTask, so there are no results from that yet, only from the Timeout class. We are getting many exceptions like the one below. Our Jenkins master runs on an NFS file system, which ought to be reasonably quick but apparently may have issues.

          Interrupting Thread[org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep 78969: checking /var/lib/jenkins/workspace/anches_ert_gw-ert-custard-3-GARHGYOSDUCJ4EHKSORKR3E2RAFWD3A5ZVBAKP6LQ5ZKQZBMWUQA on RedHatSlave3 / waiting for RedHatSlave3 id=14810179,5,main] after 10 SECONDS
          java.lang.Throwable
          at java.lang.Object.wait(Native Method)
          at hudson.remoting.Request.call(Request.java:169)
          at hudson.remoting.Channel.call(Channel.java:904)
          at hudson.FilePath.act(FilePath.java:987)
          at hudson.FilePath.act(FilePath.java:976)
          at hudson.FilePath.isDirectory(FilePath.java:1514)
          at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.getWorkspace(DurableTaskStep.java:193)
          at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:306)
          at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:289)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at java.lang.Thread.run(Thread.java:748)


          Devin Nusbaum added a comment - - edited

          icotton64 Thanks for adding the stack trace! If ncrmnt is seeing identical stack traces, then the patch they are testing will not have any effect, because the timeout that is firing is the one in DurableTaskStep#getWorkspace, before the timeout in DurableTaskStep#check itself. It could be worth increasing that timeout just to see whether it changes the behavior (my initial guess is that if we don't get a response after 10 seconds we will never get one), but given that the stuck operation is a remoting call, my guess is that this is a networking issue; perhaps a problem in Jenkins' networking stack (remoting) rather than the network itself, if your system monitoring shows no network issues. Did you see any network-related issues on the machine running Jenkins around the time of the problem?
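
          To make the mechanism concrete: the traces above are what appears when a helper interrupts the calling thread because a blocking remote call has not returned within the limit. The following is a toy, self-contained sketch of that interrupt-after-limit pattern; the 2-second limit and the sleep are placeholders, and this is not the plugin's actual code.

          import java.util.concurrent.Executors;
          import java.util.concurrent.ScheduledExecutorService;
          import java.util.concurrent.ScheduledFuture;
          import java.util.concurrent.TimeUnit;

          // Toy illustration of the watchdog behaviour: interrupt the caller if a
          // blocking "remote" call is still stuck after the limit.
          public class WatchdogTimeoutDemo {
              public static void main(String[] args) {
                  ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
                  Thread caller = Thread.currentThread();
                  ScheduledFuture<?> interrupter =
                          watchdog.schedule(caller::interrupt, 2, TimeUnit.SECONDS);
                  try {
                      Thread.sleep(10_000); // stands in for a blocked remoting call such as FilePath.isDirectory()
                      System.out.println("remote call returned in time");
                  } catch (InterruptedException x) {
                      System.out.println("remote call interrupted after the limit, as in the traces above");
                  } finally {
                      interrupter.cancel(false);
                      watchdog.shutdownNow();
                  }
              }
          }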


          Ian Cotton added a comment -

          Thanks dnusbaum. No, I didn't observe any networking-related issues. I monitor both machines with Nagios and Prometheus; the Prometheus scrape granularity is one minute, and it didn't record any disconnect.


          Andrew a added a comment -

          dnusbaum, yes, I've just included those in the code section; see my comments above (apparently JIRA has hidden them already). All of them are related to file operations.

          Anyway, I think I have resolved the issue on my setup with this patch (600 seconds is overkill, but I just had to test it):

           

          @@ -311,7 +311,7 @@ public abstract class DurableTaskStep extends Step {
                           }
                       }
                       boolean directory;
          -            try (Timeout timeout = Timeout.limit(10, TimeUnit.SECONDS)) {
          +            try (Timeout timeout = Timeout.limit(600, TimeUnit.SECONDS)) {
                           directory = ws.isDirectory();
                       } catch (Exception x) {
                           getWorkspaceProblem(x);
          @@ -451,7 +451,7 @@ public abstract class DurableTaskStep extends Step {
                           return; // slave not yet ready, wait for another day
                       }
                       TaskListener listener = listener();
          -            try (Timeout timeout = Timeout.limit(10, TimeUnit.SECONDS)) {
          +            try (Timeout timeout = Timeout.limit(600, TimeUnit.SECONDS)) {
                           if (watching) {
                               Integer exitCode = controller.exitStatus(workspace, launcher(), listener);
                               if (exitCode == null) {
          
          

          It seems that the 10-second timeout for file-based operations is what causes the problem under heavy disk/network load. During all three runs with FINE exception logging, the timeouts occurred during routine backups.

          I also found the following workaround: a restart of the master causes the pipeline to resume and correctly reconnect with all the nodes, effectively unfreezing the job.

           


          Ian Cotton added a comment -

          I now get the following exception from the DurableTaskStep log:

          Interrupting Thread[org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep 88150: checking /var/lib/jenkins/workspace/GW_GW_Pipeline_Build_trunk-KY5NRFS7BWLOEMDFRPJGO5IAG4Y6DFVRUT333ENYFD5NOOFQINTA on RedHatSlave10 / waiting for RedHatSlave10 id=16567391,5,main] after 10 SECONDS
          java.lang.Throwable
          at java.lang.Object.wait(Native Method)
          at hudson.remoting.Request.call(Request.java:169)
          at hudson.remoting.Channel.call(Channel.java:904)
          at hudson.FilePath.act(FilePath.java:987)
          at hudson.FilePath.act(FilePath.java:976)
          at hudson.FilePath.isDirectory(FilePath.java:1514)
          at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.getWorkspace(DurableTaskStep.java:193)
          at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:306)
          at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:289)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at java.lang.Thread.run(Thread.java:748)

           

          So it looks like it is timing out while calling the node.


          Andrew a added a comment - - edited

          icotton64, can you give my proposed patch with the increased timeouts a try?

          Here's my pre-compiled version of the plugin for your convenience (I hope sharing a precompiled plugin is allowed here; if not, sorry): workflow-durable-task-step.hpi


          Devin Nusbaum added a comment - - edited

          ncrmnt Thanks for working on the issue! Any chance you are able to play with the timeout to figure out a lower bound that works for you? I would be fine increasing the limit to maybe 30-60 seconds by default and providing a system property to increase it further, but I'm not sure about raising the default beyond that.

           


          Ian Cotton added a comment -

          Sorry ncrmnt, the system we are seeing the issue on is our production system, and I really don't want to risk installing a pre-compiled plugin on it.


          Andrew a added a comment - - edited

          dnusbaum I'll recompile the plugin with a 60-second timeout and fire up the next regression tomorrow; expect results by Friday or Saturday. A facility to override this timeout would be very useful, because (according to Zabbix) the iowait fluctuations were barely noticeable during the backup. At higher loads things will get far worse, so I'd also put a note somewhere in the README/TROUBLESHOOTING section.

          icotton64 No problem, I fully understand. Can you recompile it yourself with a 30-second timeout (see my patch above) and give it a try? That way we'll get dnusbaum some data about a suitable timeout faster.


          Ian Cotton added a comment -

          ncrmnt I am not able to build the plugin at the moment; my Jenkins server doesn't have the required plugins. We are setting up some new servers and hopefully I can use one of them.


          Devin Nusbaum added a comment -

          For now, I went ahead and filed https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/90 to allow the timeout to be configured via a system property. Feel free to pull down the incremental build of that PR for testing once it is complete (if you are already running workflow-durable-task-step 2.26), or to review/comment on the PR. Thanks!


          Ian Cotton added a comment -

          I realised we are running an old version of the plugin (2.17). Unfortunately we are also running Jenkins 2.73.2, and plugin versions beyond 2.22 require 2.73.3. We can try the plugin on our newer replacement Jenkins instances, but we don't see the failing builds there. We are working to improve matters, but I don't think we will be in a position to test this properly for at least a few days.


          Andrew a added a comment - - edited

          I've just run a few regressions with that insanely large timeout, and the bad news is that the problem didn't completely go away. Moreover, two different problems have emerged (I'm not really sure whether they are directly related to this issue or whether I should open a new ticket; posting everything here for now).

          First one:
          I'm now seeing a pipeline freezing AFTER all the tasks under the parallel statement have completed. A restart of Jenkins causes some of the steps under parallel to be rerun with the following warning:

          Queue item for node block in SoC » RTL_REGRESSION #255 is missing (perhaps JENKINS-34281); rescheduling
          

          But the pipeline completes. I'm also seeing runaway simulation processes that have to be killed by hand. They kept running after the pipeline had completed, perhaps due to the master node restart (and thus preventing further builds in that workspace). I'm not yet sure how to debug this one.

           

          Second one:

          In an attempt to mitigate another issue (this time with an old ctest on RHEL not always handling timeouts correctly), I've added a timeout() block inside parallel, and that exposed another filesystem/timeout problem:

           Cancelling nested steps due to timeout
           Sending interrupt signal to process
           Cancelling nested steps due to timeout
           After 10s process did not stop
           java.nio.file.FileSystemException: /home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/.nfs0000000029ee028d00002716: Device or resource busy
           at sun.nio.fs.UnixException.translateToIOException(Unknown Source)
           at sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)
           at sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)
           at sun.nio.fs.UnixFileSystemProvider.implDelete(Unknown Source)
           at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(Unknown Source)
           at java.nio.file.Files.deleteIfExists(Unknown Source)
           at hudson.Util.tryOnceDeleteFile(Util.java:316)
           at hudson.Util.deleteFile(Util.java:272)
           Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to taruca
           at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1741)
           at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
           at hudson.remoting.Channel.call(Channel.java:955)
           at hudson.FilePath.act(FilePath.java:1070)
           at hudson.FilePath.act(FilePath.java:1059)
           at hudson.FilePath.deleteRecursive(FilePath.java:1266)
           at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.cleanup(FileMonitoringTask.java:340)
           at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution$1.run(DurableTaskStep.java:382)
           at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
           at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:748)
           Caused: java.io.IOException: Unable to delete '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/.nfs0000000029ee028d00002716'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
           at hudson.Util.deleteFile(Util.java:277)
           at hudson.FilePath.deleteRecursive(FilePath.java:1303)
           at hudson.FilePath.deleteContentsRecursive(FilePath.java:1312)
           at hudson.FilePath.deleteRecursive(FilePath.java:1302)
           at hudson.FilePath.access$1600(FilePath.java:211)
           at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1272)
           at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1268)
           at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3084)
           at hudson.remoting.UserRequest.perform(UserRequest.java:212)
           at hudson.remoting.UserRequest.perform(UserRequest.java:54)
           at hudson.remoting.Request$2.run(Request.java:369)
           at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
           at java.util.concurrent.FutureTask.run(Unknown Source)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
           at java.lang.Thread.run(Unknown Source)
           Sending interrupt signal to process
           After 10s process did not stop
           java.nio.file.FileSystemException: /home/jenkins/ws/BootromSignoff/build@tmp/durable-e53cb05b/.nfs0000000029ee0a9d00010597: Device or resource busy
           at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
           at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
           at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
           at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:244)
           at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
           at java.nio.file.Files.deleteIfExists(Files.java:1165)
           at hudson.Util.tryOnceDeleteFile(Util.java:316)
           at hudson.Util.deleteFile(Util.java:272)
           Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to oryx
           at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1741)
           at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
           at hudson.remoting.Channel.call(Channel.java:955)
           at hudson.FilePath.act(FilePath.java:1070)
           at hudson.FilePath.act(FilePath.java:1059)
           at hudson.FilePath.deleteRecursive(FilePath.java:1266)
           at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.cleanup(FileMonitoringTask.java:340)
           at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution$1.run(DurableTaskStep.java:382)
           at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
           at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           Caused: java.io.IOException: Unable to delete '/home/jenkins/ws/BootromSignoff/build@tmp/durable-e53cb05b/.nfs0000000029ee0a9d00010597'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
           at hudson.Util.deleteFile(Util.java:277)
           at hudson.FilePath.deleteRecursive(FilePath.java:1303)
           at hudson.FilePath.deleteContentsRecursive(FilePath.java:1312)
           at hudson.FilePath.deleteRecursive(FilePath.java:1302)
           at hudson.FilePath.access$1600(FilePath.java:211)
           at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1272)
           at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1268)
           at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3084)
           at hudson.remoting.UserRequest.perform(UserRequest.java:212)
           at hudson.remoting.UserRequest.perform(UserRequest.java:54)
           at hudson.remoting.Request$2.run(Request.java:369)
           at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
           at java.lang.Thread.run(Thread.java:748)
           [Pipeline] }
           [Pipeline] }
           [Pipeline] // timeout
           [Pipeline] // timeout
           [Pipeline] echo
           EXCEPTION: org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
           [Pipeline] echo
           CTEST BUG: Ctest didn't honor timeout setting?
           [Pipeline] }
           [Pipeline] echo
           EXCEPTION: org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
           [Pipeline] echo
           CTEST BUG: Ctest didn't honor timeout setting?
           [Pipeline] }
           [Pipeline] // dir
           [Pipeline] // dir
           [Pipeline] }
           [Pipeline] }
           [Pipeline] // node
           [Pipeline] // node
           [Pipeline] }
           [Pipeline] }
           sh: line 1: 104849 Terminated sleep 3
           sh: line 1: 163732 Terminated { while [ ( -d /proc/$pid -o ! -d /proc/$$ ) -a -d '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03' -a ! -f '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/jenkins-result.txt' ]; do
           touch '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/jenkins-log.txt'; sleep 3;
           done; }
           sh: line 1: 163733 Terminated JENKINS_SERVER_COOKIE=$jsc '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/script.sh' > '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/jenkins-log.txt' 2>&1
           1/1 Test #56: rumboot-default-rumboot-Production-bootrom-integration-no-selftest-host-easter-egg ...***Failed 20250.70 sec
          
          
          

          It looks like when Jenkins tries to kill off the simulation, it takes much more than 10 seconds (perhaps because the simulator interprets the signal as a crash and starts collecting logs/core dumps, which takes a lot of time). I'll try to patch this timeout as well and see how it goes.

          P.S. I've just updated Jenkins and all plugins, updated workflow-durable-task-step-plugin from git, and applied the following patch. I hope 60s timeouts will do nicely.

          diff --git a/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java b/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java
          index 9b449d7..b338690 100644
          --- a/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java
          +++ b/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java
          @@ -311,7 +311,7 @@ public abstract class DurableTaskStep extends Step {
                           }
                       }
                       boolean directory;
          -            try (Timeout timeout = Timeout.limit(10, TimeUnit.SECONDS)) {
          +            try (Timeout timeout = Timeout.limit(60, TimeUnit.SECONDS)) {
                           directory = ws.isDirectory();
                       } catch (Exception x) {
                           getWorkspaceProblem(x);
          @@ -374,7 +374,7 @@ public abstract class DurableTaskStep extends Step {
                                   stopTask = null;
                                   if (recurrencePeriod > 0) {
                                       recurrencePeriod = 0;
          -                            listener().getLogger().println("After 10s process did not stop");
          +                            listener().getLogger().println("After 60s process did not stop");
                                       getContext().onFailure(cause);
                                       try {
                                           FilePath workspace = getWorkspace();
          @@ -386,7 +386,7 @@ public abstract class DurableTaskStep extends Step {
                                       }
                                   }
                               }
          -                }, 10, TimeUnit.SECONDS);
          +                }, 60, TimeUnit.SECONDS);
                           controller.stop(workspace, launcher());
                       } else {
                           listener().getLogger().println("Could not connect to " + node + " to send interrupt signal to process");
          @@ -451,7 +451,7 @@ public abstract class DurableTaskStep extends Step {
                           return; // slave not yet ready, wait for another day
                       }
                       TaskListener listener = listener();
          -            try (Timeout timeout = Timeout.limit(10, TimeUnit.SECONDS)) {
          +            try (Timeout timeout = Timeout.limit(60, TimeUnit.SECONDS)) {
                           if (watching) {
                               Integer exitCode = controller.exitStatus(workspace, launcher(), listener);
                               if (exitCode == null) {
          
          


          Devin Nusbaum added a comment -

          ncrmnt did those timeouts end up helping? If so, I can roll them up into https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/90 and release that so they can be configured without needing to run custom code.


          Andrew a added a comment -

          dnusbaum Sorry for not reporting earlier. 60 seconds seems to have fixed all the issues for me. The rest of the problems were due to ctest (and our NUMA scheduler, wrapped within it in front of the actual simulator) not dying correctly when Jenkins asked them to.


          Devin Nusbaum added a comment -

          ncrmnt No problem! I will move forward with my PR (adding an additional timeout), thanks so much for interactively debugging the issue!


          Devin Nusbaum added a comment -

          As of version 2.29 of the Pipeline Nodes and Process Plugin, the default timeout for remote calls is 20 seconds, and the value can be configured using the system property org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.REMOTE_TIMEOUT.

          I am marking this ticket as closed, since that addresses the main cause of the issue identified in the discussion in the comments (thanks ncrmnt!). If this issue is still occurring frequently for someone after increasing that value, please comment and we can investigate further.
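
          For administrators who only need to raise the limit, the property is set on the controller's JVM, for example by adding -Dorg.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.REMOTE_TIMEOUT=60 to the Jenkins startup arguments. The snippet below is a minimal, standalone illustration of the usual Long.getLong override pattern using the documented property name and 20-second default; it is not the plugin's exact implementation.

          import java.util.concurrent.TimeUnit;

          // Standalone sketch of the system-property override pattern: the documented
          // property name and the 20-second default come from the comment above; the
          // surrounding class is purely illustrative.
          public class RemoteTimeoutDemo {
              private static final long REMOTE_TIMEOUT_SECONDS = Long.getLong(
                      "org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.REMOTE_TIMEOUT",
                      20L);

              public static void main(String[] args) {
                  System.out.println("Effective remote-call timeout: "
                          + TimeUnit.SECONDS.toMillis(REMOTE_TIMEOUT_SECONDS) + " ms");
              }
          }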


          Tony Poerio added a comment -

          Hi dnusbaum – when will this fix be released? My team needs it (or at least we think we do).

          I see that the message above is from about 9 months back at the time of writing.

          As of right now, the current release is only `2.200` (released 10-14-2019).

          However, the post immediately above references version `2.29`.

          Is it possible that this update is already present in `2.200`?

          If not, when will it become available in a stable release? Many thanks for the help.


            Assignee: Devin Nusbaum (dnusbaum)
            Reporter: Rick Liu (totoroliu)
            Votes: 19
            Watchers: 30