Type: Bug
Resolution: Duplicate
Priority: Major
Labels: None
Environment: Fedora 26, Jenkins ver. 2.60.2, java-1.8.0-openjdk
Simple pipeline below sporadically hangs after completing the last "sh" step. The command has already completed and can't be seen in the process list, but the pipeline is still in the running state and won't finish.
There are 2 nodes, "builder" and "runner", which (for testing) were both set up to run on localhost (via ssh). Jenkins did ~15 builds of the pipeline below before running into this problem; there are no other jobs/builds.
I'll try to keep the system in this failed state for a couple of days, in case anyone has tips on what further data would be useful to gather:
Jenkinsfile:
node('builder') {
    stage('Build/Fetch') {
        git ...
        sh '''curl -O http://file/skt/sktrc
              curl -O http://file/skt/default.config'''
    }
    stage('Build') {
        sh '''./skt.py -vv --state --rc sktrc merge
              ./skt.py -vv --state --rc sktrc build
              ./skt.py -vv --state --rc sktrc publish'''
        sktrc = readFile 'sktrc'
        sh '''./skt.py -vv --state --rc sktrc cleanup'''
    }
}
node('runner') {
    stage('Test/Fetch') {
        git 'http://git/skt.git'
        sh '''curl -O http://filejob.xml'''
        writeFile file: 'sktrc', text: "${sktrc}"
    }
    stage('Test') {
        sh '''PATH="/home/worker/bin:$PATH" ./skt.py -vv --state --rc sktrc run --wait
              ./skt.py -vv --state --rc sktrc cleanup'''
    }
}
Pipeline threadDump:
Thread #12
    at WorkflowScript.run(WorkflowScript:36)
    at DSL.stage(Native Method)
    at WorkflowScript.run(WorkflowScript:35)
    at DSL.node(running on runner_localhost)
    at WorkflowScript.run(WorkflowScript:29)
- relates to
  - JENKINS-50892 Pipeline jobs stuck after restart (Closed)
  - JENKINS-51568 Pipeline jobs hanging in Build Executor even if it is finished (Open)
[JENKINS-46283] pipeline hangs after executing sh step command
I am not sure what jstancek’s issue is; nothing apparently to do with the sh step. reinholdfuereder’s issue is unrelated and I think a duplicate of something open in workflow-cps-plugin.
I haven't seen this issue in the last couple of months - not sure if it got fixed or I've just been lucky.
I'm OK if you want to close this as "insufficient data".
Same for me: (a) not seen anymore – admittedly I also did not try to provoke it; and (b) OK for closing
I can confirm this is still happening in Jenkins 2.89.4 with all plugins up to date at the time of writing.
In our situation we have one step to start a web server, followed by a step to wait until the web server is ready.
node {
    stage('Running Tests') {
        setupWebserver()
        checkStart('http://webserver:8080')
    }
}

def checkStart(String url) {
    sh """
    ...
    ...curl command in a for loop etc...
    ...
    echo 'Web server up and running!!!'
    """
}

def setupWebserver() {
    sh """#!/bin/bash
    ...
    ...some pre-steps...
    ...
    webServer/bin/server.sh restart
    """
}
In the logs you can see that the server is started...
18:20:20 Starting jetty using port 8080
18:20:20 Stopping Jetty: OK
18:20:25 Starting Jetty: STARTED Jetty Sat Mar 3 18:20:24 CET 2018 under PID 21653
On the agent, you can see that the process was started successfully and that the web server is available:
# check for PID
jenkins@p-agent:~$ ps aux | grep [2]1653 | awk '{ print $1" "$2 }'
jenkins 21653
# check for url
jenkins@p-agent:~$ curl --write-out '%{http_code}\n' -o /dev/null -qsSL http://localhost:8080
200
Looking for durable tasks, I could not find any processes containing the word 'durable':
jenkins@p-agent:~$ ps auxww | grep [d]urable | awk '{ print $1" "$2 }'
jenkins@p-agent:~$
Question: would it be worth adding a pre-script to our sh scripts (a rough sketch follows the list below) to:
- print out the current PID of the durable task
- print out the current parent directory of the durable task
- do a ps aux | grep <parent directory> for the running script
- add a trap to print when the script is exiting
- which signals should be best handled in this case? EXIT, INT, TERM, HUP?
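A rough sketch of such a diagnostic pre-text, assuming bash; the variable names below are purely illustrative and not anything Jenkins provides:
#!/bin/bash
# hypothetical diagnostic pre-text: print the durable-task script's pid/dir and trap exits
myPid=$$
myDir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"   # the durable-*/ dir the script runs from
echo "durable script pid: $myPid"
echo "durable script dir: $myDir"
ps -eaf | grep "[d]urable" || true                          # any processes still referencing a durable dir
trap 'echo "script exiting, last status: $?"' EXIT
trap 'echo "caught signal, exiting"; exit 1' HUP INT TERM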
Hi jglick / jstancek / reinholdfuereder
I think I have the cause, or at least one of the possible causes, and can reproduce the hanging agent in principle. It has to do with the parent process being killed. Why that happens I cannot say, but perhaps the OS does this when resources are low, etc. (in a dockerized agent environment, for example).
I also have a bit of a hacky workaround - feedback and improvements welcome.
Disclaimer: I'm using the exit codes calculated for bash in my solution, as well as the bash built-in for finding my current directory. You'd need to take this into account if using a different shell.
It's a bit of a long explanation but here goes...
Summary
- investigated how signals can affect shell scripts
- investigated the processes involved
- discovered an unexpected sibling process
- found I could reproduce the behaviour by killing the parent process
- scripted a workaround for the case that the parent process is missing
Long version
Investigated what signals can do to a shell script.
Using the following script I tested what signals did to a bash script. I wanted to find out:
- which signals would cause a script to exit immediately
- which signals would first send an EXIT signal before exiting
- which signals would do nothing, etc.
[jenkins@jenkins-server] ~ $ cat /tmp/test.sh
#!/bin/bash
set -euo pipefail

latestSignalRc=

# register all known traps
typeset -i sig=1
while (( sig < 65 )); do
    trap "signum=${sig};test ${sig}" "$sig"
    let sig=sig+1
done
trap "test EXIT" "EXIT"
trap "test ERR" "ERR"

test() {
    local rc=$?
    if [[ "$1" != "EXIT" ]] && [[ "$1" != "ERR" ]]; then
        #echo "Non EXIT or ERR. Making latestSignalRc from signum '$signum'."
        latestSignalRc=$(( $signum + 128 ))
    fi
    echo "Got sig: ${1:-n/a}, signum: '${signum:-n/a}', rc: $rc, latestSignalRc: ${latestSignalRc:-n/a}"
    # Reset to a default signal handler.
    trap - $1
    unset signum
    if [[ "$1" != "EXIT" ]] && [[ "$1" != "ERR" ]]; then
        # kill process with signal
        kill -$1 $$
    else
        # if we receive an error, reset the EXIT trap
        # because we are leaving now anyway
        [[ "$1" == "ERR" ]] && trap - EXIT
        exit ${latestSignalRc:-$rc}
    fi
}

sleep 0.1

if [[ "ERR" == "${1:-}" ]]; then
    ls /tmp/nnnn &> /dev/null
elif [ -n "${1:-}" ]; then
    kill -$1 $$
fi

echo "After kill..."
Testing looked something like this:
[jenkins@jenkins-server] ~ $ for i in ERR 1 2 3 6 15; do echo "-----------------------------"; (/tmp/test.sh $i; echo "Exited: $?"); done
-----------------------------
Got sig: ERR, signum: 'n/a', rc: 2, latestSignalRc: n/a
Exited: 2
-----------------------------
Got sig: 1, signum: '1', rc: 0, latestSignalRc: 129
Exited: 129
-----------------------------
Got sig: 2, signum: '2', rc: 0, latestSignalRc: 130
Got sig: EXIT, signum: 'n/a', rc: 0, latestSignalRc: 130
Exited: 130
-----------------------------
Got sig: 3, signum: '3', rc: 0, latestSignalRc: 131
After kill...
Got sig: EXIT, signum: 'n/a', rc: 0, latestSignalRc: 131
Exited: 131
-----------------------------
Got sig: 6, signum: '6', rc: 0, latestSignalRc: 134
Exited: 134
-----------------------------
Got sig: 15, signum: '15', rc: 0, latestSignalRc: 143
Exited: 143
I settled on catching
1) SIGHUP
2) SIGINT
6) SIGABRT
15) SIGTERM
since they (1) could be caught, and (2) caused the script to exit.
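As a minimal standalone distillation of that conclusion (a sketch, assuming bash and the 128+signum exit-code convention used above):
for sig in 1 2 6 15; do                                         # HUP, INT, ABRT, TERM
    trap "echo \"caught signal $sig\"; exit $(( 128 + sig ))" $sig
done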
Investigated the processes involved
At first I simply grep'ed the processes using derived values:
node('agent') {
    sh '''#!/bin/bash
    set -euo pipefail
    set +x

    function onExit() {
        local exitCode=\$?
        echo ">>>> Exiting now with exitCode \$exitCode"
        echo ">>>> Could place exitCode in \$myResultFile"
        echo "one last log" >> "\$myDir/jenkins-log.txt"
        echo \$exitCode > \$myResultFile
        sleep 5
        echo "Still running..."
        exit \$exitCode
    }
    trap onExit EXIT

    myPid=\$\$
    myDir="\$( cd "\$( dirname "\${BASH_SOURCE[0]}" )" && pwd )"
    myResultFile="\$myDir/jenkins-result.txt"
    myResultFileGrep="\$myDir/jenkins-result.tx[t]"

    echo "my pid is \$myPid"
    echo "my result file is \$myResultFile"
    echo '----------------------------------'
    ps -eaf | head -n 1
    echo '--------- script PID -------------'
    ps -eaf | grep [d]urable | grep \$myPid
    echo '--------- script result file -------------'
    myCmdsParentPid=\$(ps aux | grep "\$myResultFileGrep" | awk '{ print \$3 }' | sort -u )
    ps -eaf | grep "\$myResultFileGrep"
    echo '----------- script commands parent pid -----------'
    ps -eaf | grep "[j]enkins.*\$myCmdsParentPid"
    echo '---------------------------------'
    echo "hello from \$(hostname)"
    '''
}
Resulting in something like:
...
my pid is 10781
my result file is /var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-result.txt
UID        PID  PPID  C STIME TTY          TIME CMD
--------- script PID -------------
jenkins  10781 10778  0 14:02 ?        00:00:00 /bin/bash /var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/script.sh
--------- script result file -------------
jenkins  10778 12979  0 14:02 ?        00:00:00 sh -c { while [ -d '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0' -a \! -f '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-result.txt' ]; do touch '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-log.txt'; sleep 3; done } & jsc=durable-14f8a02757bd1625e0536d94affe2a93; JENKINS_SERVER_COOKIE=$jsc '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/script.sh' > '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-log.txt' 2>&1; echo $? > '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-result.txt'; wait
jenkins  10780 10778  0 14:02 ?        00:00:00 sh -c { while [ -d '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0' -a \! -f '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-result.txt' ]; do touch '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-log.txt'; sleep 3; done } & jsc=durable-14f8a02757bd1625e0536d94affe2a93; JENKINS_SERVER_COOKIE=$jsc '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/script.sh' > '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-log.txt' 2>&1; echo $? > '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-result.txt'; wait
...
Question: I still do not know why there are two instances of the same command "sh -c ...."
I used a recursePid function to follow the trail and realised that the second sh -c ... process was actually a sibling of the script.sh process rather than a parent.
function recursePid() {
    local currentPid=$1
    local pidEntry=$(ps --no-headers -f --pid $currentPid)
    pidParent=$(ps --no-headers -o ppid --pid $currentPid | xargs)
    echo "--------- Current: $currentPid-------------"
    echo "$pidEntry"
    if [ $pidParent -ne 1 ]; then
        local pidSiblings=$(ps --no-headers -f --ppid $pidParent | grep -v $currentPid)
        if [ -n "$pidSiblings" ]; then
            echo "--------- has following siblings -------------"
            echo "$pidSiblings"
            pidSibling=$(echo "$pidSiblings" | awk '{ print $2 }')
            echo "Siblings pid = $pidSibling"
        fi
        echo "Sending signal to parent ${SIGNAL}"
        sleep 3
        kill -${SIGNAL:-0} $pidParent
        sleep 3
        echo "Parent killed..."
        # check whether the parent is still alive
        if ps --no-headers -p $pidParent; then
            echo "Parent still there"
        else
            echo "Parent gone"
        fi
        #recursePid $pidParent
    fi
}
So, the process tree looks like:
slave.jar process
|__ parent "sh -c ..." process
    |__ sibling "sh -c ..." process
    |__ "script.sh" process
Found I could reproduce the behaviour by killing the parent process
I tested sending various signals to the script.sh and sibling sh -c ... processes but:
- script.sh - worked as expected
- sibling sh -c ... - the signals didn't seem to have any effect.
Moving up one level, I sent the TERM signal to the parent, which caused the build to hang.
In the build log...
14:37:15 Single pid = 90
14:37:15 Sending signal to parent TERM
14:37:21 So far, I've got signal: 'EXIT', signum: 'n/a', exitCode: '0', latestSignalRc: 'n/a'
14:37:21 >>>> Exiting now with exitCode 0
14:37:21 >>>> Could place exitCode in /home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a/jenkins-result.txt
14:37:21 I am placing this log directly into the log file...
<spinning-wheel>
On the agent only the sibling "sh -c ..." process remains...
jenkins@cd8c03e15e58:~$ ps aux | grep "[s]h -c"
jenkins       90  0.0  0.0   4512   928 ?        S    14:37   0:00 sh -c { while [ -d '/home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a' -a \! -f '/home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a/jenkins-log.txt'; sleep 3; done } & jsc=durable-190a2420fb163bce1cd2a8d2213b499c; JENKINS_SERVER_COOKIE=$jsc '/home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a/script.sh' > '/home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a/jenkins-result.txt'; wait
The sibling process keeps touching jenkins-log.txt every 3 seconds, which means Jenkins can't determine that the step is hung.
Entering an exit code into jenkins-result.txt causes the build to continue (see the timestamps):
14:37:21 >>>> Could place exitCode in /home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a/jenkins-result.txt
14:37:21 I am placing this log directly into the log file...
... placed the exit code into the jenkins-result.txt now ...
[Pipeline] }
[Pipeline] // node
[Pipeline] node
14:44:14 Running on Jenkins in /var/lib/jenkins/workspace/EXP-signal-test
[Pipeline] {
[Pipeline] tool
[Pipeline] retry
[Pipeline] {
[Pipeline] node
14:44:14 Running on Jenkins in /var/lib/jenkins/workspace/EXP-signal-tester@2
[Pipeline] {
[Pipeline] tool
[Pipeline] sh
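In other words, a manual way to unstick such a hung step on the agent is simply to write a plausible exit code into that file yourself; 143 is 128+SIGTERM, and the durable-* path is per-step (taken here from the log above, so adjust it for your build):
echo 143 > /home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a/jenkins-result.txt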
Scripted a workaround for the case that the parent is missing
I experimented further by catching and resending the signal as in the examples above. However, some signals such as TERM did not allow the script to write its exit code to the jenkins-result.txt file.
So, with the knowledge that
- we can catch the common signals in the script which could cause the script to exit
- we can determine the parent pid and its status
I scripted the following to echo the 'would-be' exit code into the appropriate jenkins-result.txt in the case of the parent process being killed.
The best way to explain how it works is by using an example. Please check the following Jenkinsfile job:
def bashPreText(def script, def quiet = false, def login = false) {
    String verboseFlag = quiet ? '' : 'set -x'
    String loginFlag = login ? '-l' : ''
    return '''#!/bin/bash ''' + loginFlag + '''
    set -euo pipefail # fail fast and fail on unset variables

    # don't print the pre-text stuff
    set +x

    # some shell options
    shopt -s globstar
    shopt -s expand_aliases

    # traps
    typeset -i sig=1
    for sig in 1 2 6 15; do
        trap "signum=\${sig};handleSignal \${sig}" "\$sig"
    done
    trap "handleSignal EXIT" "EXIT"
    trap "handleSignal ERR" "ERR"

    function handleSignal() {
        local exitCode=\$?
        local signal=\$1
        if [[ "\$signal" != "EXIT" ]] && [[ "\$signal" != "ERR" ]]; then
            # TODO: account for non-bash exit codes (ksh = signum + 256)
            latestSignalRc=\$(( \$signum + 128 ))
        fi
        local finalExitCode=\${latestSignalRc:-\$exitCode}
        echo "SIGNAL STATUS: signal: '\${signal:-n/a}' signum: '\${signum:-n/a}' exitCode: '\$exitCode' latestSignalRc: '\${latestSignalRc:-n/a}' finalExitCode: '\${finalExitCode}'"
        # don't trap EXIT if already in ERR
        [[ "\$signal" != "EXIT" ]] && trap - EXIT
        # React if no parent found
        local currentPidParent=
        currentPidParent=$(ps --no-headers -o ppid --pid \$myPid | xargs)
        if [ \$pidParent -ne \$currentPidParent ]; then
            echo "WARNING: Parent process missing..."
            if [[ "true" == "\$ACTIVATE_WORKAROUND" ]]; then
                echo "Activating workaround - writing the exitCode directly into the '\$myResultFile'."
                echo \$finalExitCode > \$myResultFile
            else
                echo "Not activating workaround - script has probably hung by now. Fix by aborting or by writing the exitCode directly into the '\$myResultFile'."
            fi
        fi
        exit \$finalExitCode
    }

    myPid=\$\$
    pidParent=\$(ps --no-headers -o ppid --pid \$myPid | xargs)
    myDir="\$( cd "\$( dirname "\${BASH_SOURCE[0]}" )" && pwd )"
    myResultFile="\$myDir/jenkins-result.txt"

    # verbose flag
    ''' + verboseFlag + '''
    ''' + script + '''
    '''.trim().stripIndent()
}

def bash(Map vars = [:]) {
    vars.script = bashPreText(vars.script, vars.quiet, vars.login)
    sh(vars)
}

/* Convenience overload */
def bash(String script) {
    return bash(script: script)
}

pipeline {
    agent any
    options {
        skipDefaultCheckout()
        timestamps()
        disableConcurrentBuilds()
        buildDiscarder(logRotator(numToKeepStr:'30'))
    }
    parameters {
        booleanParam(defaultValue: true, description: 'Activate the workaround', name: 'ACTIVATE_WORKAROUND')
        string(defaultValue: '', description: 'The signal to send to the SCRIPT (int or HUP, TERM, etc)', name: 'SIGNAL_SCRIPT')
        string(defaultValue: 'TERM', description: 'The signal to send to the PARENT (int or HUP, TERM, etc)', name: 'SIGNAL_PARENT')
    }
    stages {
        stage('Test') {
            steps {
                script {
                    bash quiet: true, script: '''
                        echo "Starting script with...
                            myPid=\$myPid
                            pidParent=\$pidParent
                            myDir="\$myDir"
                            myResultFile="\$myResultFile"
                        "
                        if [ -n "\${SIGNAL_PARENT:-}" ]; then
                            echo "Sending signal \${SIGNAL_PARENT} to parent"
                            sleep 0.1
                            kill -\${SIGNAL_PARENT:-0} \$pidParent
                            sleep 0.1
                        fi
                        echo "Middle of script..."
                        if [ -n "\${SIGNAL_SCRIPT:-}" ]; then
                            if [[ "ERR" == "\${SIGNAL_SCRIPT}" ]]; then
                                echo "Failing with an ERR"
                                ls /bla/bla/bla
                            else
                                echo "Sending signal \${SIGNAL_SCRIPT} to script"
                                kill -\${SIGNAL_SCRIPT:-0} \$myPid
                            fi
                        fi
                        echo "End of script..."
                    '''
                }
            }
        }
    }
}
Final Workaround
The final workaround for me was to put the traps and the handleSignal function as a kind of pre-text in a vars/bash.groovy, as in the job above (NOTE: don't forget to remove the ACTIVATE_WORKAROUND == true condition), and to use it from my global library.
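For illustration, the library entry point could look roughly like this; this is an untested sketch that assumes bashPreText from the pipeline above is also available in the library, and it drops the quiet/login options before delegating to the sh step:
// vars/bash.groovy (sketch)
def call(Map vars = [:]) {
    // build the final script with the signal-handling pre-text prepended
    String script = bashPreText(vars.script, vars.quiet ?: false, vars.login ?: false)
    // pass only the arguments the sh step understands
    Map shArgs = vars.findAll { k, v -> !(k in ['quiet', 'login']) }
    shArgs.script = script
    sh(shArgs)
}

// convenience overload
def call(String script) {
    call(script: script)
}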
Hope this maybe helps find a proper solution to the problem, even though it is rather a hacky workaround than a fix.
One thing I noted was that the first part of the script in our case was executed in one workspace, and the command that was hanging was executed in another workspace, with the @tmp suffix. Since the @tmp workspace didn't contain the right stuff, the command in the shell script failed.
Skimming this, sounds like it could be a dupe of JENKINS-50892. Needs to be determined if there is a non-contrived way to reproduce part of the controller script being killed but not the rest of it; and, either way, whether there is a safe way to ensure that the controller script lives or dies atomically. It seems that use of { rather than ( does not suffice to avoid creation of a cloned sh process.
Most likely a dupe. Please use JENKINS-50892 for discussion.
lostinberlin for some background: the second copy of sh is the stuff inside curly braces which is touching the log file. (That is a way for Jenkins to tell the difference between a process which just declines to produce output for a long time, as opposed to the whole computer having been rebooted and all these processes are dead.)
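Simplified (cookie setup omitted, and with DURABLE_DIR standing in for the per-step .../durable-xxxxxxxx directory), the command the durable-task plugin launches has this shape, which is why a second copy of sh shows up:
sh -c "
  { while [ -d DURABLE_DIR -a ! -f DURABLE_DIR/jenkins-result.txt ]; do
      touch DURABLE_DIR/jenkins-log.txt; sleep 3              # heartbeat so Jenkins can tell the agent is alive
    done } &                                                  # the '&' forks a copy of this sh -- the 'sibling'
  DURABLE_DIR/script.sh > DURABLE_DIR/jenkins-log.txt 2>&1    # the user's script, run by the 'parent' copy
  echo \$? > DURABLE_DIR/jenkins-result.txt                   # exit code goes into the result file
  wait
"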
As to why the first copy of sh is getting killed to begin with, your guess is as good as mine. You suggested that low resources in a container could trigger some processes to be killed, but why one and not the other?
I may have experienced this issue ("pipeline hangs after executing sh step command") as well – jglick maybe it is more similar to JENKINS-37730 though (because the thread dump looks more similar: emphasizing "DSL.sh(completed process...") – and in my case an admittedly accidental and unmotivated, but at least gentle, Jenkins restart happened DURING the 'sh' step execution (thus without prior 'Manage Jenkins > Prepare for Shutdown').