Investigated yesterday and sharing my findings below. Here's an example of a wrapper script running:
In my case on the slave, we have 2 folders:
1. the workspace folder /home/ubuntu/workspace/xxx, which holds the files downloaded from the git repo
2. the control folder for the durable plugin: /home/ubuntu/workspace/xxx@tmp/durable-yyy containing:
- script.sh which is the command line we are executing, taken from the `sh` command of the Jenkinsfile
- jenkins-log.txt which is the output of the command
- jenkins-result.txt which will contain the exit code of the command (when finished)
- pid which is not created by default but could contain the pid of the command
In my environment, it throws the error message after 605-915 seconds when both of these conditions are met:
1. there is no exit code in the jenkins-result.txt file
2. the log file (jenkins-log.txt) has not been modified for more than 300 seconds
(1) happens when the wrapper script is no longer running
(2) happens when the wrapper script is not running, or when the command has not written anything to the log for more than 300 seconds
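The two conditions above can be sketched as a small shell check that can be run by hand against a control folder on the agent. This is only an illustration of the behaviour I observed, not the plugin's actual implementation: the function name `durable_state` and its output labels are made up, the 300-second default is simply the value seen in my tests, and it assumes GNU `stat` (as on Ubuntu).

```shell
#!/bin/sh
# Hypothetical helper (not part of the plugin): classify a durable-task
# control directory as "finished", "active" or "stalled".
durable_state() {
    dir=$1
    threshold=${2:-300}   # seconds; 300 is the value observed in my tests

    # Condition 1: an exit code is present in jenkins-result.txt.
    # -s is true only if the file exists AND is non-empty, so an empty
    # result file does not count as finished.
    if [ -s "$dir/jenkins-result.txt" ]; then
        echo finished
        return 0
    fi

    # Condition 2: jenkins-log.txt has not been modified for too long.
    # Assumption: a missing log file is treated as stale (mtime 0).
    now=$(date +%s)
    mtime=$(stat -c %Y "$dir/jenkins-log.txt" 2>/dev/null || echo 0)   # GNU stat
    if [ $((now - mtime)) -gt "$threshold" ]; then
        echo stalled
    else
        echo active
    fi
}
```

For example, `durable_state /home/ubuntu/workspace/xxx@tmp/durable-yyy` would print `stalled` when both conditions are met.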
The source code is here:
https://github.com/jenkinsci/durable-task-plugin/blob/master/src/main/java/org/jenkinsci/plugins/durabletask/BourneShellScript.java
Tested scenarios:
- when adding an exit code to the jenkins-result.txt file, it stops successfully
- when the jenkins-result.txt file is only created, without a code, it keeps waiting until the file contains one
- long script (30mins) without output with wrapper script running = no issue
- long script (30mins) with output without wrapper script running = no issue
- long script (30mins) without output without wrapper script running = failing after 608s
- long script (30mins) without output without wrapper script running, but with a pid file = no issue
That means that a _workaround_ could be to manually create a pid file before starting the command and then remove it when finished. When the command runs for a long time, it will kindly throw this warning every 5 minutes but wait for the process to finish:
It could be done by a simple shell script like the one below. However, we should still follow best practices and set up pipeline/step timeouts to keep pipelines from running forever.
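# workaround for the wrapper issue: create a pid file
# (assumes exactly one durable-* control directory matches the glob)
MYCONTROLDIR=$(echo "$PWD@tmp/durable"*)
echo $$ > "$MYCONTROLDIR/pid"
# replace the next line with the long-running command (uncommented)
# ./my_longcommand_to_run.sh
exitcode=$?
# let's clean up what we did
rm -f "$MYCONTROLDIR/pid"
exit $exitcode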
In my case, the root cause is still unknown. I suspect that the wrapper script is being killed; the system OOM killer (triggered by the pipelines filling the slave's memory) is a strong candidate, but I still need evidence of it.
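One way to look for that evidence is to search the kernel log on the agent for OOM-killer traces. The filter below is a minimal sketch: the function name `oom_evidence` is made up, and the patterns are only the usual wording of Linux kernel OOM messages, so the exact text may differ between kernel versions.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and keep only the
# lines that look like OOM-killer activity.
oom_evidence() {
    grep -Ei 'out of memory|oom-killer|oom-kill'
}
```

It can be fed from whichever kernel log source is available, for example `dmesg -T | oom_evidence` or `journalctl -k | oom_evidence`. A line naming the wrapper's `sh` process around the time the build stalled would support the OOM theory.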
A suggestion to improve the durable-task-plugin: it would help if it could tell us which scenario we are in:
1. the wrapper script is not running any more (killed?)
2. or, the wrapper script is running but with some error (and if it writes an error to stderr, please log it somewhere...)
3. or, the wrapper script is running, but very slowly due to some CPU/storage/filesystem slowness.
We have the same issue: during the JS build, the job executes an "ng build" command, and after 32 minutes the job is killed because it seems to be unresponsive.