Type: Bug
Resolution: Unresolved
Priority: Minor
Labels: None
Environment: Jenkins ver. 2.107.1 on CentOS 7
I have a Jenkins pipeline that runs a shell script that takes about 5 minutes and generates no output. The job fails and I'm seeing the following in the output:
wrapper script does not seem to be touching the log file in /home/jenkins/workspace/job_Pipeline@2@tmp/durable-595950a5
(JENKINS-48300: if on a laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300)
script returned exit code -1
Based on JENKINS-48300 it seems that Jenkins is intentionally killing my script while it is still running. IMHO it is a bug for Jenkins to assume that a shell script will generate output every n seconds for any finite n. As a workaround I've set -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL to one hour. But what happens when I have a script that takes an hour and one minute!?
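(For reference: HEARTBEAT_CHECK_INTERVAL is a JVM system property, so it has to be passed on the Java command line of the Jenkins controller; the value is in seconds and the war path below is only illustrative.)
```
java -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=3600 \
     -jar /usr/share/jenkins/jenkins.war
```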
Relates to:
- JENKINS-25503 Use setsid instead of nohup (Resolved)
- JENKINS-48300 Pipeline shell step aborts prematurely with ERROR: script returned exit code -1 (Resolved)
[JENKINS-50379] Jenkins kills long running sh script with no output
Travis does this sort of thing too: if there's no output for a while, it assumes the process is hung and stops the build.
If you don't want to mess with Jenkins, something like the following shell snippet can help.
It forks your long-running process and echoes dots to the console for as long as it's still running:
```
# suppress command output unless there is a failure
function quiet() {
  if [[ $- =~ x ]]; then set +x; XTRACE=1; fi
  if [[ $- =~ e ]]; then set +e; ERREXIT=1; fi
  tmp=$(mktemp) || return    # this will be the temp file w/ the output
  echo -ne "quiet running: ${@} "
  ts_elapsed=0
  ts_start=$(date +%s)
  "${@}" > "${tmp}" 2>&1 &
  cmd_pid=$!
  while [ 1 ]; do
    if [ `uname` == 'Linux' ]; then
      ps -q ${cmd_pid} > /dev/null 2>&1
      running=${?}
    else
      ps -ef ${cmd_pid} > /dev/null 2>&1
      running=${?}
    fi
    if [ "${running}" -eq 0 ]; then
      echo -ne '.'
      sleep 3
      continue
    fi
    break
  done
  wait ${cmd_pid}
  ret=${?}
  ts_end=$(date +%s)
  let "ts_elapsed = ${ts_end} - ${ts_start}"
  if [ "${ret}" -eq 0 ]; then
    echo -ne " finished with code ${ret} in ${ts_elapsed} secs, last lines were:\n"
    tail -n 4 "${tmp}"
  else
    cat "${tmp}"
  fi
  rm -f "${tmp}"
  if [ "${ERREXIT}" ]; then unset ERREXIT; set -e; fi
  if [ "${XTRACE}" ]; then unset XTRACE; set -x; fi
  return "${ret}"    # return the exit status of the command
}
```
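Usage is just `quiet <your-command...>`; for example (the make invocation is only an illustration):
```
# prints "quiet running: make modules " and then a dot every 3 seconds until make exits
quiet make modules
```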
danielbeck the script initially generates some output to show that it started and then generates no output for a long time. I think this has the same effect as your suggestion of using echo.
I see this issue on scripts which do generate some output, but it happens that parts of the script take some time to run: in my case I'm compiling a kernel module and even when the make output is sent to the console, sometimes individual steps take longer than the timeout...
Hi evanward1,
We are also facing the same issue. Could you please explain how to change the -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL interval?
Hi there,
Same issue here: I run tasks with heavy disk load. I set the durable task plugin to the "none" option and tried HEARTBEAT_CHECK_INTERVAL, but neither worked for me. As a workaround I created an additional mount point on the Jenkins slave. But IMHO I would prefer an option to disable this check entirely.
I have the same problem sometimes. How do I change the HEARTBEAT_CHECK_INTERVAL?
Hi there,
I am also facing this issue in our environment. Is there any workaround for it?
```
wrapper script does not seem to be touching the log file in /home/****/workspace/demo@tmp/durable-549a8a8c
(JENKINS-48300: if on a laggy filesystem, consider -Dorg.****ci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300)
```
Same for us. Out of nowhere jobs are killed on our Jenkins nodes. Manually setting the heartbeat check interval to 300 seems to work for now.
Btw, on Debian-like machines you need to edit `/etc/default/jenkins` and add the above-mentioned variable setting to the line starting with JAVA_ARGS=
It should then look something like this:
JAVA_ARGS="-Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300 -Djava.awt.headless=true ..."
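After changing the file, Jenkins has to be restarted for the property to take effect; a quick way to confirm it reached the JVM (assuming a systemd-managed service):
```
sudo systemctl restart jenkins
# the flag should now appear in the controller's command line
ps -ef | grep -v grep | grep -o 'HEARTBEAT_CHECK_INTERVAL=[0-9]*'
```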
Hi guys,
This issue is due to the Durable Task plugin; it has been resolved in the latest release, Durable Task plugin 1.25.
Reference: https://issues.jenkins-ci.org/browse/JENKINS-52881
More likely related to JENKINS-48300. Impossible to diagnose merely from this message.
The problem is not that your script stops producing output for a while. That is perfectly normal and supported. The problem is that a side process which is supposed to be detecting this fact and touching the log file every three seconds is either not running, or not producing the right timestamp as observed by the Jenkins agent JVM.
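For illustration, that side process is essentially a background touch loop that the wrapper forks next to the user script, something like this (the control directory path is a placeholder):
```
# simplified sketch of the heartbeat loop the wrapper runs in the background
CONTROL_DIR='/path/to/workspace@tmp/durable-xxxxxxxx'   # placeholder
while [ -d "$CONTROL_DIR" ] && [ ! -f "$CONTROL_DIR/jenkins-result.txt" ]; do
  touch "$CONTROL_DIR/jenkins-log.txt"
  sleep 3
done
```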
After updating Jenkins and its plugins, we're now experiencing this issue too.
We're on Jenkins 2.193 with Durable Task Plugin 1.30.
wrapper script does not seem to be touching the log file in /local/user_data/s__t/jenkins/workspace/S___K@7@tmp/durable-ad608bf9
(JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)
This 'extremely laggy filesystem' is a local hard disk which isn't laggy whatsoever.
About 50% of our jobs get aborted due to this.
Do you have any suggestions on how this can be solved without the workaround of redefining the HEARTBEAT_CHECK_INTERVAL?
Only by diagnosing and figuring out how to reproduce, so the issue can be fixed.
> The problem is not that your script stops producing output for a while. That is perfectly normal and supported. The problem is that a side process which is supposed to be detecting this fact and touching the log file every three seconds is either not running, or not producing the right timestamp as observed by the Jenkins agent JVM.
Right, so it sounds like we need to investigate why the side process that should be touching the log file isn't working properly.
I should have mentioned that JENKINS-25503 would completely reimplement the code involved here, possibly solving this issue (possibly introducing others).
We have the same issue: during a JS build job that executes an "ng build" command, the job is killed after 32 minutes because it appears not to respond.
Cannot contact Node 02: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@70fad4d7:JNLP4-connect connection from prd-cm-as-09.lan/10.1.3.72:56702": Remote call on JNLP4-connect connection from prd-cm-as-09.lan/10.1.3.72:56702 failed. The channel is closing down or has closed down
wrapper script does not seem to be touching the log file in /var/lib/jenkins/workspace/xxx@tmp/durable-476d6be2
(JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)
For the record, I was affected by this issue a while ago. I was running Jenkins agents on k8s, and increasing the pod memory limit solved it, at least in my case.
I ran into this issue too.
[2021-05-25T13:42:16.469Z] wrapper script does not seem to be touching the log file in @tmp/durable-c284507c
[2021-05-25T13:42:16.469Z] (JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)
The reason was "No space left on device".
Running into this. One of my playbooks restarts the Jenkins service if it needs to reload init.groovy.d scripts. Once Jenkins comes back, the job fails with this error. This was working fine for months and then stopped working with this same error.
- Plenty of memory and space on device.
- Durable task plugin is fully up to date.
- Jenkins 2.360
We have made all possible upgrades and changes; even then we are still facing the issue.
This issue started occurring a few days ago. Never happened before.
(Jenkins 2.361.1)
Investigated yesterday and sharing my findings below. Here's an example of a wrapper script running:
10:48:01 root 15345 0.0 0.0 4632 92 ? S 08:47 0:00 sh -c ({ while [ -d '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da' -a \! -f '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/jenkins-result.txt' ]; do touch '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/jenkins-log.txt'; sleep 3; done } & jsc=durable-bffc1fdb28b03efb823f6939b2ccaf2e; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/script.sh' > '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/jenkins-log.txt' 2>&1; echo $? > '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/jenkins-result.txt.tmp'; mv '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/jenkins-result.txt.tmp' '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/jenkins-result.txt'; wait) >&- 2>&- &
In my case on the slave, we have 2 folders:
1. the workspace folder /home/ubuntu/workspace/xxx, which hosts the files downloaded from the git repo
2. the control folder for the durable plugin, /home/ubuntu/workspace/xxx@tmp/durable-yyy, containing:
- script.sh, which is the command line we are executing, taken from the `sh` step of the Jenkinsfile
- jenkins-log.txt which is the output of the command
- jenkins-result.txt which will contain the exit code of the command (when finished)
- pid which is not created by default but could contain the pid of the command
In my environment it throws the error message after 605-915 seconds when both conditions are met:
1. no exit code in the jenkins-result.txt file
2. the log file (jenkins-log.txt) not modified for more than 300 seconds
(1) happens when the wrapper script is no longer running
(2) happens when the wrapper script is not running, or when the command has not written anything to the log for more than 300 seconds
The source code is here:
https://github.com/jenkinsci/durable-task-plugin/blob/master/src/main/java/org/jenkinsci/plugins/durabletask/BourneShellScript.java
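A rough shell equivalent of that check, handy for inspecting a stuck build by hand (GNU stat assumed; 300 seconds is the threshold observed above; adjust the control directory to your job):
```
CONTROL_DIR=$(echo /home/ubuntu/workspace/xxx@tmp/durable-*)   # adjust to your workspace
if [ ! -f "$CONTROL_DIR/jenkins-result.txt" ]; then
  age=$(( $(date +%s) - $(stat -c %Y "$CONTROL_DIR/jenkins-log.txt") ))
  [ "$age" -gt 300 ] && echo "log not touched for ${age}s and no result file: the step would be aborted"
fi
```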
tested scenarios:
- when adding an exit code to the jenkins-result.txt file, it stops successfully
- when only creating jenkins-result.txt file, it continues to wait till having a code
- long script (30mins) without output with wrapper script running = no issue
- long script (30mins) with output without wrapper script running = no issue
- long script (30mins) without output without wrapper script running = failing after 608s
- long script (30mins) without output without wrapper script running, but with a pid file = no issue
That means a workaround could be to manually create a pid file before starting the command and remove it when finished. When the command runs for a long time, the plugin will kindly print this warning every 5 minutes but wait for the process to finish:
still have /home/ubuntu/workspace/xxxx@tmp/durable-yyy/pid so heartbeat checks unreliable; process may or may not be alive
It can be done with a simple shell script like the one below. However, we should follow best practices, set up pipeline/step timeouts, and avoid pipelines that run forever.
```
# workaround for the wrapper issue: create a pid file
MYCONTROLDIR=`echo "$PWD@tmp/durable"*`
echo $$ > $MYCONTROLDIR/pid

# run the long command
./my_longcommand_to_run.sh
exitcode=$?

# let's clean up what we did
rm -f $MYCONTROLDIR/pid
exit $exitcode
```
In my case the root cause is still unknown. Suspecting that the wrapper script gets killed, the system OOM killer (triggered by pipelines filling the slave's memory) is a strong candidate, but I still need evidence of it.
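If the OOM killer is the suspect, the kernel log usually keeps the evidence; something like the following on the slave can confirm or rule it out:
```
# look for OOM killer activity around the time of the failed build
dmesg -T | grep -iE 'out of memory|killed process' | tail
# or, on systemd hosts
journalctl -k --since "1 hour ago" | grep -i 'killed process'
```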
A suggestion to improve the durable-task-plugin: it would help if we could tell which scenario we are in:
1. the wrapper script is no longer running (killed?)
2. or, the wrapper script is running but hitting some error (and if it writes an error to stderr, please log it somewhere...)
3. or, the wrapper script is running, but very slowly due to some CPU/storage/filesystem slowness.
OOMKiller is a likely culprit. In general there is not going to be a reliable record of when the wrapper was killed or why, nor would there be any error messages from the wrapper script to log.
The newer binary wrapper may behave differently, but it could still be killed.
Originally the wrapper’s PID was recorded, and the agent would periodically check to see whether that process was still alive. This was switched to a heartbeat because it was more portable (no need for JNI-based system calls) and also worked without special considerations when Launcher was wrapped in a container (such as by the withDockerContainer step, or container in the kubernetes plugin). The only requirement is that the agent JVM and the user process and (if applicable) the filesystem server can agree on clock skew to within a few minutes. I doubt there is much difference either way in diagnosability: in either case, if the wrapper is killed, the agent JVM manages to detect this sooner or later, but does not have any information as to the ultimate reason for the error.
(Freestyle projects do not survive controller or agent JVM restarts, because the user script is a child of the agent JVM. Thus if e.g. OOMKiller kicks in, the build will simply abort right away—again with no particularly informative message beyond the fact that it got a SIGKILL.)
After more testing and investigation, we identified two possible root causes for this common issue:
- a race condition: the agent terminates before the wrapper script can complete and report the status/log of the running command. For example, with the docker workflow plugin, when a docker container agent suddenly stops:
- when the image has a quick command as its entrypoint,
- or when it is caused by an OutOfMemory.
- the wrapper script fails or is killed for whatever reason while the command keeps running. This should be very rare; we have never observed evidence of this situation.
Cause 1.1: quick command as entrypoint
When using a docker agent, Jenkins starts the following commands in that order:
- docker run: to start the container
- docker top: to validate that the container correctly started with the expected command
- docker exec: to start, inside the container, a wrapper script which in turn starts the command you have defined.
The default ENTRYPOINT of the docker image could be a quick command (mdspell, for example). So the container could:
- stop before the docker top command, causing the "the container started but didn't run the expected command" error.
- stop before the end of the docker exec command, causing the wrapper script error.
Resolution: force the entrypoint, even to an empty value, by passing --entrypoint to the docker agent.
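At the docker CLI level the difference looks roughly like this (my-image is a placeholder; the docker agent support typically keeps the container alive by running cat in it):
```
# with the image's default ENTRYPOINT, "cat" becomes an argument to that entrypoint,
# so a short-lived entrypoint exits before "docker top"/"docker exec" get a chance to run
docker run -t -d my-image cat

# forcing an empty entrypoint leaves a long-lived "cat" keeping the container up
docker run -t -d --entrypoint "" my-image cat
```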
Cause 1.2: when caused by an OutOfMemory
I could not see any way to control the cgroup of jobs, nor to limit capacity per executor. When using docker, the containers are created in a dedicated cgroup with no memory limit. We only rely on best practices, like writing pipelines that pass the --memory option to docker.
When memory consumption reaches a critical level on the slave, the OOM killer can kill processes more or less at random and terminate docker agents.
Resolution: best practices on memory usage and better sizing. We created a script that runs every minute and forces a memory limit on all running docker containers, which solved most of the issues.
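A minimal sketch of such a script (run from cron every minute; the 4g limit and the blanket application to every container are assumptions to adapt):
```
#!/bin/sh
# clamp every running container to a 4g memory limit (example value)
for id in $(docker ps -q); do
  docker update --memory 4g --memory-swap 4g "$id" >/dev/null 2>&1
done
```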
1.1 sounds like a bug in the docker-workflow plugin which may be fixable, though I do not advise use of that plugin to begin with.
1.2 is basically outside the control of Jenkins, but the durable-task plugin could suggest this as a possible root cause when printing error messages. (An older version of durable-task recorded the PID of the wrapper process and then checked the process list to see if it was still running. This proved to be hard to maintain in various environments, however; the heartbeat system is more portable.)
Does it work when you run `echo whatever; yourscript.sh` instead of just the latter?