I've just run a few regressions with that insanely huge timeout, and the bad news is that the problem didn't completely go away. What's more, two different problems have emerged (I'm not sure whether they're directly related to this issue or I should open a new ticket; posting everything here for now).
First one:
I'm now seeing the pipeline freeze AFTER all the tasks under the parallel statement have completed. A restart of Jenkins causes some of the steps under parallel to be rerun with the following warning:
Queue item for node block in SoC » RTL_REGRESSION #255 is missing (perhaps JENKINS-34281); rescheduling
But the pipeline completes. I'm also seeing runaway simulation processes that have to be killed by hand: they kept running after the pipeline had completed, perhaps due to the master node restart, and they prevent further builds in that workspace. Not yet sure how I should debug this one.
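For now I clean them up by hand on the affected node, roughly like this (a sketch; the match pattern is just our workspace path, adjust it for your setup):

# List leftover simulation processes still referencing the workspace,
# then ask them to exit and force-kill whatever survives.
pgrep -af 'ws/BootromSignoff'
pkill -TERM -f 'ws/BootromSignoff'
sleep 10
pkill -KILL -f 'ws/BootromSignoff'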
Second one:
In an attempt to mitigate another issue (this time with an old ctest on RHEL that doesn't always handle timeouts correctly), I've added a timeout() block inside parallel, and that exposed another filesystem/timeout problem.
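For context, each parallel branch now looks roughly like this (a minimal sketch, not our real Jenkinsfile; the branch name, node label, timeout value and script name are made up for illustration):

// Each branch wraps its simulation run in timeout() so that a hung
// ctest gets interrupted by Jenkins instead of blocking the regression.
parallel(
    'regression': {
        node('rtl-sim') {
            timeout(time: 12, unit: 'HOURS') {
                sh './run_regression.sh' // wrapper around the ctest invocation
            }
        }
    }
)

With timeout() in place, a timed-out branch now produces the following: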
Cancelling nested steps due to timeout
Sending interrupt signal to process
Cancelling nested steps due to timeout
After 10s process did not stop
java.nio.file.FileSystemException: /home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/.nfs0000000029ee028d00002716: Device or resource busy
at sun.nio.fs.UnixException.translateToIOException(Unknown Source)
at sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)
at sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)
at sun.nio.fs.UnixFileSystemProvider.implDelete(Unknown Source)
at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(Unknown Source)
at java.nio.file.Files.deleteIfExists(Unknown Source)
at hudson.Util.tryOnceDeleteFile(Util.java:316)
at hudson.Util.deleteFile(Util.java:272)
Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to taruca
at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1741)
at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
at hudson.remoting.Channel.call(Channel.java:955)
at hudson.FilePath.act(FilePath.java:1070)
at hudson.FilePath.act(FilePath.java:1059)
at hudson.FilePath.deleteRecursive(FilePath.java:1266)
at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.cleanup(FileMonitoringTask.java:340)
at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution$1.run(DurableTaskStep.java:382)
at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused: java.io.IOException: Unable to delete '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/.nfs0000000029ee028d00002716'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
at hudson.Util.deleteFile(Util.java:277)
at hudson.FilePath.deleteRecursive(FilePath.java:1303)
at hudson.FilePath.deleteContentsRecursive(FilePath.java:1312)
at hudson.FilePath.deleteRecursive(FilePath.java:1302)
at hudson.FilePath.access$1600(FilePath.java:211)
at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1272)
at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1268)
at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3084)
at hudson.remoting.UserRequest.perform(UserRequest.java:212)
at hudson.remoting.UserRequest.perform(UserRequest.java:54)
at hudson.remoting.Request$2.run(Request.java:369)
at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Sending interrupt signal to process
After 10s process did not stop
java.nio.file.FileSystemException: /home/jenkins/ws/BootromSignoff/build@tmp/durable-e53cb05b/.nfs0000000029ee0a9d00010597: Device or resource busy
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:244)
at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
at java.nio.file.Files.deleteIfExists(Files.java:1165)
at hudson.Util.tryOnceDeleteFile(Util.java:316)
at hudson.Util.deleteFile(Util.java:272)
Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to oryx
at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1741)
at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
at hudson.remoting.Channel.call(Channel.java:955)
at hudson.FilePath.act(FilePath.java:1070)
at hudson.FilePath.act(FilePath.java:1059)
at hudson.FilePath.deleteRecursive(FilePath.java:1266)
at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.cleanup(FileMonitoringTask.java:340)
at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution$1.run(DurableTaskStep.java:382)
at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Caused: java.io.IOException: Unable to delete '/home/jenkins/ws/BootromSignoff/build@tmp/durable-e53cb05b/.nfs0000000029ee0a9d00010597'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
at hudson.Util.deleteFile(Util.java:277)
at hudson.FilePath.deleteRecursive(FilePath.java:1303)
at hudson.FilePath.deleteContentsRecursive(FilePath.java:1312)
at hudson.FilePath.deleteRecursive(FilePath.java:1302)
at hudson.FilePath.access$1600(FilePath.java:211)
at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1272)
at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1268)
at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3084)
at hudson.remoting.UserRequest.perform(UserRequest.java:212)
at hudson.remoting.UserRequest.perform(UserRequest.java:54)
at hudson.remoting.Request$2.run(Request.java:369)
at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
[Pipeline] }
[Pipeline] }
[Pipeline]
touch '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/jenkins-log.txt'; sleep 3;
done; }
sh: line 1: 163733 Terminated              JENKINS_SERVER_COOKIE=$jsc '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/script.sh' > '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/jenkins-log.txt' 2>&1
1/1 Test #56: rumboot-default-rumboot-Production-bootrom-integration-no-selftest-host-easter-egg ...***Failed 20250.70 sec
It looks like when Jenkins tries to kill off the simulation, it takes way more than 10 seconds (perhaps because the simulator interprets the signal as a crash and starts collecting logs/core dumps, which takes a lot of time). I'll try to patch this timeout as well and see how it goes.
P.S. I've just updated Jenkins and all the plugins, built workflow-durable-task-step-plugin from git, and applied the following patch. I hope 60s timeouts will do nicely.
diff --git a/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java b/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java
index 9b449d7..b338690 100644
--- a/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java
+++ b/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java
@@ -311,7 +311,7 @@ public abstract class DurableTaskStep extends Step {
}
}
boolean directory;
- try (Timeout timeout = Timeout.limit(10, TimeUnit.SECONDS)) {
+ try (Timeout timeout = Timeout.limit(60, TimeUnit.SECONDS)) {
directory = ws.isDirectory();
} catch (Exception x) {
getWorkspaceProblem(x);
@@ -374,7 +374,7 @@ public abstract class DurableTaskStep extends Step {
stopTask = null;
if (recurrencePeriod > 0) {
recurrencePeriod = 0;
- listener().getLogger().println("After 10s process did not stop");
+ listener().getLogger().println("After 60s process did not stop");
getContext().onFailure(cause);
try {
FilePath workspace = getWorkspace();
@@ -386,7 +386,7 @@ public abstract class DurableTaskStep extends Step {
}
}
}
- }, 10, TimeUnit.SECONDS);
+ }, 60, TimeUnit.SECONDS);
controller.stop(workspace, launcher());
} else {
listener().getLogger().println("Could not connect to " + node + " to send interrupt signal to process");
@@ -451,7 +451,7 @@ public abstract class DurableTaskStep extends Step {
return;
}
TaskListener listener = listener();
- try (Timeout timeout = Timeout.limit(10, TimeUnit.SECONDS)) {
+ try (Timeout timeout = Timeout.limit(60, TimeUnit.SECONDS)) {
if (watching) {
Integer exitCode = controller.exitStatus(workspace, launcher(), listener);
if (exitCode == null) {
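For reference, rebuilding the patched plugin goes roughly like this (a sketch; the patch file name is a placeholder for the diff above):

# Build the plugin from source with the timeout patch applied, then
# upload target/*.hpi via Manage Jenkins > Manage Plugins > Advanced.
git clone https://github.com/jenkinsci/workflow-durable-task-step-plugin.git
cd workflow-durable-task-step-plugin
git apply /path/to/timeout-60s.patch
mvn clean package -DskipTests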
dnusbaum I'll recompile the plugin with a 60-second timeout and fire up the next regression tomorrow; expect results somewhere around Friday/Saturday. A facility to override this timeout would be very useful, because (according to Zabbix) the iowait fluctuations were barely noticeable during the backup, and at higher loads things will get way worse. I'd also put a note somewhere in the README/TROUBLESHOOTING section.
icotton64 No problem, I fully understand. Can you recompile it yourself with a 30-second timeout (see my patch above) and give it a try? That way we'll get dnusbaum some results about a suitable timeout faster.