Type: Bug
Resolution: Unresolved
Priority: Minor
Labels: None
Environment: Jenkins ver. 2.107.1 on CentOS 7
I have a Jenkins pipeline that runs a shell script that takes about 5 minutes and generates no output. The job fails and I'm seeing the following in the output:
wrapper script does not seem to be touching the log file in /home/jenkins/workspace/job_Pipeline@2@tmp/durable-595950a5
(JENKINS-48300: if on a laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300)
script returned exit code -1
Based on JENKINS-48300 it seems that Jenkins is intentionally killing my script while it is still running. IMHO it is a bug for Jenkins to assume that a shell script will generate output every n seconds for any finite n. As a workaround I've set -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL to one hour. But what happens when I have a script that takes an hour and one minute!?
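(For reference: HEARTBEAT_CHECK_INTERVAL is a JVM system property, so it has to be passed on the Java command line of the Jenkins controller; the value is in seconds and the war path below is only illustrative.)
```
java -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=3600 \
     -jar /usr/share/jenkins/jenkins.war
```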
Relates to:
- JENKINS-25503 Use setsid instead of nohup (Resolved)
- JENKINS-48300 Pipeline shell step aborts prematurely with ERROR: script returned exit code -1 (Resolved)
[JENKINS-50379] Jenkins kills long running sh script with no output
Travis does this sort of thing too: if there's no output for a while, it assumes the process is hung and stops the build.
If you don't want to mess with Jenkins, something like the following shell snippet can help.
It forks your long-running process and echoes dots to the console for as long as it's still running:
```
# suppress command output unless there is a failure
function quiet() {
  if [[ $- =~ x ]]; then set +x; XTRACE=1; fi
  if [[ $- =~ e ]]; then set +e; ERREXIT=1; fi
  tmp=$(mktemp) || return    # this will be the temp file w/ the output
  echo -ne "quiet running: ${@} "
  ts_elapsed=0
  ts_start=$(date +%s)
  "${@}" > "${tmp}" 2>&1 &
  cmd_pid=$!
  while [ 1 ]; do
    if [ `uname` == 'Linux' ]; then
      ps -q ${cmd_pid} > /dev/null 2>&1
      running=${?}
    else
      ps -ef ${cmd_pid} > /dev/null 2>&1
      running=${?}
    fi
    if [ "${running}" -eq 0 ]; then
      echo -ne '.'
      sleep 3
      continue
    fi
    break
  done
  wait ${cmd_pid}
  ret=${?}
  ts_end=$(date +%s)
  let "ts_elapsed = ${ts_end} - ${ts_start}"
  if [ "${ret}" -eq 0 ]; then
    echo -ne " finished with code ${ret} in ${ts_elapsed} secs, last lines were:\n"
    tail -n 4 "${tmp}"
  else
    cat "${tmp}"
  fi
  rm -f "${tmp}"
  if [ "${ERREXIT}" ]; then unset ERREXIT; set -e; fi
  if [ "${XTRACE}" ]; then unset XTRACE; set -x; fi
  return "${ret}"    # return the exit status of the command
}
```
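Usage is just `quiet <your-command...>`; for example (the make invocation is only an illustration):
```
# prints "quiet running: make modules " and then a dot every 3 seconds until make exits
quiet make modules
```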
danielbeck the script initially generates some output to show that it started and then generates no output for a long time. I think this has the same effect as your suggestion of using echo.
I see this issue on scripts which do generate some output, but it happens that parts of the script take some time to run: in my case I'm compiling a kernel module and even when the make output is sent to the console, sometimes individual steps take longer than the timeout...
Hi evanward1,
We are also facing the same issue. Could you please explain how to change the -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL interval?
Hi there,
Same issue here: I run tasks with heavy disk load. I set the durable task plugin to the "none" option and tried HEARTBEAT_CHECK_INTERVAL, but neither worked for me. As a workaround I created an additional mount point on the Jenkins slave. But IMHO I would prefer an option to disable this check entirely.
I have the same problem sometimes. How do I change the HEARTBEAT_CHECK_INTERVAL?
Hi there,
I am also facing this issue in our environment. Is there any workaround for it?
```
wrapper script does not seem to be touching the log file in /home/****/workspace/demo@tmp/durable-549a8a8c
(JENKINS-48300: if on a laggy filesystem, consider -Dorg.****ci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300)
```
Same for us. Out of nowhere jobs are killed on our Jenkins nodes. Manually setting the heartbeat check interval to 300 seems to work for now.
Btw, on Debian-like machines you need to edit `/etc/default/jenkins` and add the above-mentioned variable setting to the line starting with JAVA_ARGS=
It should then look something like this:
JAVA_ARGS="-Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300 -Djava.awt.headless=true ..."
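After changing the file, Jenkins has to be restarted for the property to take effect; a quick way to confirm it reached the JVM (assuming a systemd-managed service):
```
sudo systemctl restart jenkins
# the flag should now appear in the controller's command line
ps -ef | grep -v grep | grep -o 'HEARTBEAT_CHECK_INTERVAL=[0-9]*'
```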
Hi guys,
This issue is due to the Durable Task plugin; it has been resolved in the latest release, Durable Task plugin 1.25.
Reference: https://issues.jenkins-ci.org/browse/JENKINS-52881
More likely related to JENKINS-48300. Impossible to diagnose merely from this message.
The problem is not that your script stops producing output for a while. That is perfectly normal and supported. The problem is that a side process which is supposed to be detecting this fact and touching the log file every three seconds is either not running, or not producing the right timestamp as observed by the Jenkins agent JVM.
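For illustration, that side process is essentially a background touch loop that the wrapper forks next to the user script, something like this (the control directory path is a placeholder):
```
# simplified sketch of the heartbeat loop the wrapper runs in the background
CONTROL_DIR='/path/to/workspace@tmp/durable-xxxxxxxx'   # placeholder
while [ -d "$CONTROL_DIR" ] && [ ! -f "$CONTROL_DIR/jenkins-result.txt" ]; do
  touch "$CONTROL_DIR/jenkins-log.txt"
  sleep 3
done
```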
After updating Jenkins and its plugins, we're now experiencing this issue too.
We're on Jenkins 2.193 with Durable Task Plugin 1.30.
wrapper script does not seem to be touching the log file in /local/user_data/s__t/jenkins/workspace/S___K@7@tmp/durable-ad608bf9
(JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)
This 'extremely laggy filesystem' is a local hard disk which isn't laggy whatsoever.
About 50% of our jobs get aborted due to this.
Do you have any suggestions on how this can be solved without the workaround of redefining the HEARTBEAT_CHECK_INTERVAL?
Only by diagnosing and figuring out how to reproduce, so the issue can be fixed.
> The problem is not that your script stops producing output for a while. That is perfectly normal and supported. The problem is that a side process which is supposed to be detecting this fact and touching the log file every three seconds is either not running, or not producing the right timestamp as observed by the Jenkins agent JVM.
Right, so it sounds like we need to investigate why the side process that should be touching the log file isn't working properly.
I should have mentioned that JENKINS-25503 would completely reimplement the code involved here, possibly solving this issue (possibly introducing others).
We have the same issue: during a JS build job that executes an "ng build" command, the job is killed after 32 minutes because it appears not to respond.
Cannot contact Node 02: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@70fad4d7:JNLP4-connect connection from prd-cm-as-09.lan/10.1.3.72:56702": Remote call on JNLP4-connect connection from prd-cm-as-09.lan/10.1.3.72:56702 failed. The channel is closing down or has closed down
wrapper script does not seem to be touching the log file in /var/lib/jenkins/workspace/xxx@tmp/durable-476d6be2
(JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)
For the record, I was affected by this issue a while ago. I was running Jenkins agents on k8s, and increasing the pod memory limit solved it, at least in my case.
I ran into this issue too.
[2021-05-25T13:42:16.469Z] wrapper script does not seem to be touching the log file in @tmp/durable-c284507c
[2021-05-25T13:42:16.469Z] (JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)
The reason was "No space left on device".
Running into this. One of my playbooks restarts the Jenkins service if it needs to reload init.groovy.d scripts. Once Jenkins comes back, the job fails with this error. This was working fine for months and then stopped working with this same error.
- Plenty of memory and space on device.
- Durable task plugin is fully up to date.
- Jenkins 2.360
We have made all possible upgrades and changes; even then we are still facing the issue.
This issue started occurring a few days ago. Never happened before.
(Jenkins 2.361.1)
Investigated yesterday and sharing my findings below. Here's an example of a wrapper script running:
10:48:01 root 15345 0.0 0.0 4632 92 ? S 08:47 0:00 sh -c ({ while [ -d '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da' -a \! -f '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/jenkins-result.txt' ]; do touch '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/jenkins-log.txt'; sleep 3; done } & jsc=durable-bffc1fdb28b03efb823f6939b2ccaf2e; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/script.sh' > '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/jenkins-log.txt' 2>&1; echo $? > '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/jenkins-result.txt.tmp'; mv '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/jenkins-result.txt.tmp' '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/jenkins-result.txt'; wait) >&- 2>&- &
In my case on the slave, we have 2 folders:
1. the workspace folder /home/ubuntu/workspace/xxx, which hosts the files downloaded from the git repo
2. the control folder for the durable plugin, /home/ubuntu/workspace/xxx@tmp/durable-yyy, containing:
- script.sh, which is the command line we are executing, taken from the `sh` step of the Jenkinsfile
- jenkins-log.txt which is the output of the command
- jenkins-result.txt which will contain the exit code of the command (when finished)
- pid which is not created by default but could contain the pid of the command
In my environment it throws the error message after 605-915 seconds when both conditions are met:
1. no exit code in the jenkins-result.txt file
2. the log file (jenkins-log.txt) not modified for more than 300 seconds
(1) happens when the wrapper script is no longer running
(2) happens when the wrapper script is not running, or when the command has not written anything to the log for more than 300 seconds
The source code is here:
https://github.com/jenkinsci/durable-task-plugin/blob/master/src/main/java/org/jenkinsci/plugins/durabletask/BourneShellScript.java
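A rough shell equivalent of that check, handy for inspecting a stuck build by hand (GNU stat assumed; 300 seconds is the threshold observed above; adjust the control directory to your job):
```
CONTROL_DIR=$(echo /home/ubuntu/workspace/xxx@tmp/durable-*)   # adjust to your workspace
if [ ! -f "$CONTROL_DIR/jenkins-result.txt" ]; then
  age=$(( $(date +%s) - $(stat -c %Y "$CONTROL_DIR/jenkins-log.txt") ))
  [ "$age" -gt 300 ] && echo "log not touched for ${age}s and no result file: the step would be aborted"
fi
```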
tested scenarios:
- when adding an exit code to the jenkins-result.txt file, it stops successfully
- when only creating jenkins-result.txt file, it continues to wait till having a code
- long script (30mins) without output with wrapper script running = no issue
- long script (30mins) with output without wrapper script running = no issue
- long script (30mins) without output without wrapper script running = failing after 608s
- long script (30mins) without output without wrapper script running, but with a pid file = no issue
That means a workaround could be to manually create a pid file before starting the command and remove it when finished. When the command runs for a long time, the plugin will kindly print this warning every 5 minutes but wait for the process to finish:
still have /home/ubuntu/workspace/xxxx@tmp/durable-yyy/pid so heartbeat checks unreliable; process may or may not be alive
It can be done with a simple shell script like the one below. However, we should follow best practices, set up pipeline/step timeouts, and avoid pipelines that run forever.
```
# workaround for the wrapper issue: create a pid file
MYCONTROLDIR=`echo "$PWD@tmp/durable"*`
echo $$ > $MYCONTROLDIR/pid

# run the long command
./my_longcommand_to_run.sh
exitcode=$?

# let's clean up what we did
rm -f $MYCONTROLDIR/pid
exit $exitcode
```
In my case the root cause is still unknown. Suspecting that the wrapper script gets killed, the system OOM killer (triggered by pipelines filling the slave's memory) is a strong candidate, but I still need evidence of it.
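If the OOM killer is the suspect, the kernel log usually keeps the evidence; something like the following on the slave can confirm or rule it out:
```
# look for OOM killer activity around the time of the failed build
dmesg -T | grep -iE 'out of memory|killed process' | tail
# or, on systemd hosts
journalctl -k --since "1 hour ago" | grep -i 'killed process'
```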
A suggestion to improve the durable-task-plugin: it would help if we could tell which scenario we are in:
1. the wrapper script is no longer running (killed?)
2. or, the wrapper script is running but hitting some error (and if it writes an error to stderr, please log it somewhere...)
3. or, the wrapper script is running, but very slowly due to some CPU/storage/filesystem slowness.
OOMKiller is a likely culprit. In general there is not going to be a reliable record of when the wrapper was killed or why, nor would there be any error messages from the wrapper script to log.
The newer binary wrapper may behave differently, but it could still be killed.
Originally the wrapper’s PID was recorded, and the agent would periodically check to see whether that process was still alive. This was switched to a heartbeat because it was more portable (no need for JNI-based system calls) and also worked without special considerations when Launcher was wrapped in a container (such as by the withDockerContainer step, or container in the kubernetes plugin). The only requirement is that the agent JVM and the user process and (if applicable) the filesystem server can agree on clock skew to within a few minutes. I doubt there is much difference either way in diagnosability: in either case, if the wrapper is killed, the agent JVM manages to detect this sooner or later, but does not have any information as to the ultimate reason for the error.
(Freestyle projects do not survive controller or agent JVM restarts, because the user script is a child of the agent JVM. Thus if e.g. OOMKiller kicks in, the build will simply abort right away—again with no particularly informative message beyond the fact that it got a SIGKILL.)
After more testing and investigation, we identified two possible root causes for this common issue:
- a race condition: the agent terminates before the wrapper script can complete and report the status/log of the running command. For example, with the docker workflow plugin, when a docker container agent suddenly stops:
- when the image has a quick command as its entrypoint,
- or when it is caused by an OutOfMemory.
- the wrapper script fails or is killed for whatever reason while the command keeps running. This should be very rare; we have never observed evidence of this situation.
Cause 1.1: quick command as entrypoint
When using a docker agent, Jenkins starts the following commands in that order:
- docker run: to start the container
- docker top: to validate that the container correctly started with the expected command
- docker exec: to start, inside the container, a wrapper script which in turn starts the command you have defined.
The default ENTRYPOINT of the docker image could be a quick command (mdspell, for example). So the container could:
- stop before the docker top command, causing the "the container started but didn't run the expected command" error.
- stop before the end of the docker exec command, causing the wrapper script error.
Resolution: force the entrypoint, even to an empty value, by passing --entrypoint to the docker agent.
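At the docker CLI level the difference looks roughly like this (my-image is a placeholder; the docker agent support typically keeps the container alive by running cat in it):
```
# with the image's default ENTRYPOINT, "cat" becomes an argument to that entrypoint,
# so a short-lived entrypoint exits before "docker top"/"docker exec" get a chance to run
docker run -t -d my-image cat

# forcing an empty entrypoint leaves a long-lived "cat" keeping the container up
docker run -t -d --entrypoint "" my-image cat
```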
Cause 1.2: when caused by an OutOfMemory
I could not see any way to control the cgroup of jobs, nor to limit capacity per executor. When using docker, the containers are created in a dedicated cgroup with no memory limit. We only rely on best practices, like writing pipelines that pass the --memory option to docker.
When memory consumption reaches a critical level on the slave, the OOM killer can kill processes more or less at random and terminate docker agents.
Resolution: best practices on memory usage and better sizing. We created a script that runs every minute and forces a memory limit on all running docker containers, which solved most of the issues.
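A minimal sketch of such a script (run from cron every minute; the 4g limit and the blanket application to every container are assumptions to adapt):
```
#!/bin/sh
# clamp every running container to a 4g memory limit (example value)
for id in $(docker ps -q); do
  docker update --memory 4g --memory-swap 4g "$id" >/dev/null 2>&1
done
```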
1.1 sounds like a bug in the docker-workflow plugin which may be fixable, though I do not advise use of that plugin to begin with.
1.2 is basically outside the control of Jenkins, but the durable-task plugin could suggest this as a possible root cause when printing error messages. (An older version of durable-task recorded the PID of the wrapper process and then checked the process list to see if it was still running. This proved to be hard to maintain in various environments, however; the heartbeat system is more portable.)
Does it work when you run `echo whatever; yourscript.sh` instead of just the latter?