JENKINS-46283

pipeline hangs after executing sh step command

      Description

      The simple pipeline below sporadically hangs after completing its last "sh" step. The command has already completed and no longer appears in the process list, but the pipeline stays in the running state and never finishes.

      There are 2 nodes, "builder" and "runner", which (for testing) were both set up to run on localhost (via ssh). Jenkins did ~15 builds of the pipeline below before running into this problem; there are no other jobs/builds.

      I'll try to keep the system in this failed state for a couple of days, in case anyone has tips on what further data would be useful to gather:

      Jenkinsfile:

      node('builder') {
          stage('Build/Fetch') {
      	git ...
      	sh '''curl -O http://file/skt/sktrc
                    curl -O http://file/skt/default.config'''
          }
          stage('Build') {
      	sh '''./skt.py -vv --state --rc sktrc merge
      	      ./skt.py -vv --state --rc sktrc build
      	      ./skt.py -vv --state --rc sktrc publish'''
      	sktrc = readFile 'sktrc'
      	sh '''./skt.py -vv --state --rc sktrc cleanup'''
          }
      }
      
      node('runner') {
          stage('Test/Fetch') {
      	git 'http://git/skt.git'
      	sh '''curl -O http://filejob.xml'''
      	writeFile file: 'sktrc', text: "${sktrc}"
          }
          stage('Test') {
      	sh '''PATH="/home/worker/bin:$PATH" ./skt.py -vv --state --rc sktrc run --wait
      	      ./skt.py -vv --state --rc sktrc cleanup'''
          }
      }
      

      Pipeline threadDump:

      Thread #12
      	at WorkflowScript.run(WorkflowScript:36)
      	at DSL.stage(Native Method)
      	at WorkflowScript.run(WorkflowScript:35)
      	at DSL.node(running on runner_localhost)
      	at WorkflowScript.run(WorkflowScript:29)
      

            Activity

            lostinberlin Steve Boardwell added a comment -

            I can confirm this is still happening in Jenkins 2.89.4 with all plugins up to date at the time of writing. 

            In our situation we have one step to start a web server, followed by a step to wait until the web server is ready.

            node {
                stage('Running Tests') {
                    setupWebserver()
                    checkStart('http://webserver:8080')
                }
            }
            
            def checkStart(String url) {
                sh """
                ...
                ...curl command in a for loop etc...
                ...
                echo 'Web server up and running!!!'
                """
            }
            
            def setupWebserver() {
                sh """#!/bin/bash
                ...
                ...some pre-steps...
                ...
                webServer/bin/server.sh restart
                """
            }
            

            In the logs you can see that the server is started...

            18:20:20 Starting jetty using port 8080
            18:20:20 Stopping Jetty: OK
            18:20:25 Starting Jetty: STARTED Jetty Sat Mar  3 18:20:24 CET 2018 under PID 21653
            

            On the agent, you can see that the process was started successfully and that the web server is available:

            # check for PID
            jenkins@p-agent:~$ ps aux | grep [2]1653 | awk '{ print $1" "$2 }'
            jenkins 21653
            
            # check for url
            jenkins@p-agent:~$ curl --write-out '%{http_code}\n' -o /dev/null -qsSL http://localhost:8080
            200
            

            Looking for durable tasks, I could not find any processes containing the word 'durable':

            jenkins@p-agent:~$ ps auxww | grep [d]urable | awk '{ print $1" "$2 }'
            jenkins@p-agent:~$
            

            Question: would it be worth adding a pre-script to our sh scripts to:

            • print out the current PID of the durable task
            • print out the current parent directory of the durable task
            • do a ps aux | grep <parent directory> for the running script
            • add a trap to print when the script is exiting
              • which signals should be best handled in this case? EXIT, INT, TERM, HUP?
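
One possible shape for such a pre-script, as a minimal sketch rather than a tested fix. It assumes bash and a Linux `ps`; the function name `durable_debug_preamble` and the message format are made up for illustration. It prints the script's PID and directory, lists processes whose command line mentions that directory, and reports the exit code from an EXIT trap only (which other signals to trap is left open here).

```shell
#!/bin/bash
# Hypothetical diagnostic preamble; names and messages are illustrative only.
durable_debug_preamble() {
    local my_pid=$$
    local my_dir
    # Directory holding the running script (for a durable task this would be
    # the durable-xxxx control directory).
    my_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
    echo "debug: pid=${my_pid} dir=${my_dir}"
    # List processes whose command line mentions that directory, if any.
    ps aux | grep -F -- "${my_dir}" | grep -v grep || true
    # Report the exit code when the script leaves, whatever the reason.
    trap 'echo "debug: exiting with code $?"' EXIT
}
```

Calling `durable_debug_preamble` as the first line of an `sh` step body would put this information at the top of the build log.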
            lostinberlin Steve Boardwell added a comment - - edited

            Hi Jesse Glick / Jan Stancek / Reinhold Füreder

            I think I have the cause, or at least one of the possible causes, and can reproduce the hanging agent in principle. It has to do with the parent process being killed. Why that happens I cannot say, but perhaps the OS does this when resources are low (in a dockerized agent environment, for example).

            I also have a bit of a hacky workaround; feedback and improvements are welcome.

            Disclaimer: I'm using the exit codes calculated for bash in my solution as well as the bash built-in for finding my current directory. You'd need to take account of this if using a different shell.

            It's a bit of a long explanation but here goes...

            Summary

            • investigated how signals can affect shell scripts
            • investigated the processes involved
              • discovered an unexpected sibling process
            • found I could reproduce the behaviour by killing the parent process
            • scripted a workaround for the case that the parent process is missing

            Long version

            Investigated what signals can do to a shell script.

            Using the following script I tested what signals did to a bash script. I wanted to find out:

            • which signals would cause a script to exit immediately
            • which signals would first trigger the EXIT trap before exiting
            • which signals would do nothing, etc.

            [jenkins@jenkins-server] ~ $ cat /tmp/test.sh
            #!/bin/bash
            
            set -euo pipefail
            latestSignalRc=
            
            # register all known traps
            typeset -i sig=1
            while (( sig < 65 )); do
                trap "signum=${sig};test ${sig}" "$sig"
                let sig=sig+1
            done
            trap "test EXIT" "EXIT"
            trap "test ERR" "ERR"
            test() {
                local rc=$?
                if [[ "$1" != "EXIT" ]] && [[ "$1" != "ERR" ]]; then
            	#echo "Non EXIT or ERR. Making latestSignalRc from signum '$signum'."
            	latestSignalRc=$(( $signum + 128 ))
                fi
                echo "Got sig: ${1:-n/a}, signum: '${signum:-n/a}', rc: $rc, latestSignalRc: ${latestSignalRc:-n/a}"
                # Reset to a default signal handler.
                trap - $1
                unset signum
                if [[ "$1" != "EXIT" ]] && [[ "$1" != "ERR" ]]; then
            	# kill process with signal
            	kill -$1 $$
                else
                    # if we receive an error, reset the EXIT trap
                    # because we are leaving now anyway
                    [[ "$1" == "ERR" ]] && trap - EXIT
            	exit ${latestSignalRc:-$rc}
                fi
            }
            
            sleep 0.1
            if [[ "ERR" == "${1:-}" ]]; then
                 ls /tmp/nnnn &> /dev/null
            elif [ -n "${1:-}" ]; then
                kill -$1 $$
            fi
            echo "After kill..."
            

            Testing looked something like this:

            [jenkins@jenkins-server] ~ $ for i in ERR 1 2 3 6 15; do echo "-----------------------------"; (/tmp/test.sh $i; echo "Exited: $?"); done
            -----------------------------
            Got sig: ERR, signum: 'n/a', rc: 2, latestSignalRc: n/a
            Exited: 2
            -----------------------------
            Got sig: 1, signum: '1', rc: 0, latestSignalRc: 129
            Exited: 129
            -----------------------------
            Got sig: 2, signum: '2', rc: 0, latestSignalRc: 130
            Got sig: EXIT, signum: 'n/a', rc: 0, latestSignalRc: 130
            Exited: 130
            -----------------------------
            Got sig: 3, signum: '3', rc: 0, latestSignalRc: 131
            After kill...
            Got sig: EXIT, signum: 'n/a', rc: 0, latestSignalRc: 131
            Exited: 131
            -----------------------------
            Got sig: 6, signum: '6', rc: 0, latestSignalRc: 134
            Exited: 134
            -----------------------------
            Got sig: 15, signum: '15', rc: 0, latestSignalRc: 143
            Exited: 143
            

            I settled on catching
            1) SIGHUP
            2) SIGINT
            6) SIGABRT
            15) SIGTERM
            since they (1) could be caught, and (2) caused the script to exit.
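
The exit codes in the test output follow the bash convention of 128 plus the signal number. A small sketch of that arithmetic for the four signals chosen above (the function name `signal_exit_code` is hypothetical; other shells may use a different offset):

```shell
#!/bin/bash
# Map a signal number to the exit status bash reports for a process
# terminated by that signal (128 + signal number).
signal_exit_code() {
    local signum=$1
    echo $(( 128 + signum ))
}

# The four signals settled on above: HUP, INT, ABRT, TERM.
for sig in 1 2 6 15; do
    echo "signal ${sig} -> exit code $(signal_exit_code "${sig}")"
done
```

This reproduces the 129, 130, 134 and 143 seen in the test run.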

            Investigated the processes involved

            At first I simply grep'ed the processes using derived values

            node('agent') {
                sh '''#!/bin/bash
                    set -euo pipefail
                    set +x
                    function onExit() {
                        local exitCode=\$?
                        echo ">>>> Exiting now with exitCode \$exitCode"
                        echo ">>>> Could place exitCode in \$myResultFile"
                        echo "one last log" >> "\$myDir/jenkins-log.txt"
                        echo \$exitCode > \$myResultFile
                        sleep 5
                        echo "Still running..."
                        exit \$exitCode
                    }
                    trap onExit EXIT
                    
                    myPid=\$\$
                    myDir="\$( cd "\$( dirname "\${BASH_SOURCE[0]}" )" && pwd )"
                    myResultFile="\$myDir/jenkins-result.txt"
                    myResultFileGrep="\$myDir/jenkins-result.tx[t]"
                    echo "my pid is \$myPid"
                    echo "my result file is \$myResultFile"
                    echo '----------------------------------'
                    ps -eaf | head -n 1
                    echo '--------- script PID -------------'
                    ps -eaf | grep [d]urable | grep \$myPid
                    echo '--------- script result file -------------'
                    myCmdsParentPid=\$(ps aux | grep "\$myResultFileGrep" | awk '{ print \$3 }' | sort -u )
                    ps -eaf | grep "\$myResultFileGrep"
                    echo '----------- script commands parent pid -----------'
                    ps -eaf | grep "[j]enkins.*\$myCmdsParentPid"
                    echo '---------------------------------'
                    echo "hello from \$(hostname)"
                '''
            }
            

            Resulting in something like:

            ...
            my pid is 10781
            my result file is /var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-result.txt
            UID        PID  PPID  C STIME TTY          TIME CMD
            --------- script PID -------------
            jenkins  10781 10778  0 14:02 ?        00:00:00 /bin/bash /var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/script.sh
            --------- script result file -------------
            jenkins  10778 12979  0 14:02 ?        00:00:00 sh -c { while [ -d '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0' -a \! -f '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-result.txt' ]; do touch '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-log.txt'; sleep 3; done } & jsc=durable-14f8a02757bd1625e0536d94affe2a93; JENKINS_SERVER_COOKIE=$jsc '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/script.sh' > '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-log.txt' 2>&1; echo $? > '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-result.txt'; wait
            jenkins  10780 10778  0 14:02 ?        00:00:00 sh -c { while [ -d '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0' -a \! -f '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-result.txt' ]; do touch '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-log.txt'; sleep 3; done } & jsc=durable-14f8a02757bd1625e0536d94affe2a93; JENKINS_SERVER_COOKIE=$jsc '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/script.sh' > '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-log.txt' 2>&1; echo $? > '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-result.txt'; wait
            ...
            

            Question: I still do not know why there are two instances of the same command "sh -c ...."

             

            I used a recursePid function to follow the trail and realised that the second sh -c ... process was actually a sibling of the script.sh process rather than a parent.

            function recursePid() {
                local currentPid=$1
                local pidEntry=$(ps --no-headers -f --pid $currentPid)
                pidParent=$(ps --no-headers -o ppid --pid $currentPid | xargs)
                echo "--------- Current: $currentPid-------------"
                echo "$pidEntry"
                if [ $pidParent -ne 1 ]; then
                    local pidSiblings=$(ps --no-headers -f --ppid $pidParent | grep -v $currentPid)
                    if [ -n "$pidSiblings" ]; then
                        echo "--------- has following siblings -------------"
                        echo "$pidSiblings"
                        pidSibling=$(echo "$pidSiblings" | awk '{ print $2 }')
                        echo "Siblings pid = $pidSibling"
                   fi
                    echo "Sending signal to parent ${SIGNAL}"
                    sleep 3
                    kill -${SIGNAL:-0} $pidParent
                    sleep 3
                    echo "Parent killed..."
                    if ps --no-headers -$pidParent; then 
                        echo "Parent still there"
                    else 
                        echo "Parent gone"
                    fi
                    #recursePid $pidParent
                fi
            }
            

            So, the process tree looks like:

            slave.jar process
              |__ parent "sh -c ..." process
                   |__ sibling "sh -c ..." process
                   |__ "script.sh" process
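
The sibling relationship can be read directly from the PID and PPID columns of the earlier ps output. A minimal sketch, assuming "PID PPID" pairs on stdin (for example from `ps -e -o pid=,ppid=`); the function name `siblings_of` is made up for illustration:

```shell
#!/bin/bash
# Print the PIDs that share a parent with the given PID.
# Reads "PID PPID" pairs on stdin, e.g. from: ps -e -o pid=,ppid=
siblings_of() {
    local target=$1
    awk -v target="${target}" '
        { ppid[$1] = $2 }
        END {
            if (!(target in ppid)) exit 1
            for (pid in ppid)
                if (pid != target && ppid[pid] == ppid[target]) print pid
        }'
}

# PID/PPID pairs taken from the ps output quoted earlier in this comment:
# script.sh is 10781, both "sh -c ..." entries are 10778 and 10780.
printf '%s\n' '10781 10778' '10778 12979' '10780 10778' | siblings_of 10781
# prints 10780
```

With those numbers, 10780 shares parent 10778 with script.sh, which is what makes it a sibling and not an ancestor.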
            

            Found I could reproduce the behaviour by killing the parent process

            I tested sending various signals to the script.sh and sibling sh -c ... processes but:

            • script.sh - worked as expected
            • sibling sh -c ... didn't seem to have an effect.

            Moving up one level, I sent the TERM signal to the parent which causes the agent to hang.

            In the build log...

            14:37:15 Single pid = 90
            14:37:15 Sending signal to parent TERM
            14:37:21 So far, I've got signal: 'EXIT', signum: 'n/a', exitCode: '0', latestSignalRc: 'n/a'
            14:37:21 >>>> Exiting now with exitCode 0
            14:37:21 >>>> Could place exitCode in /home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a/jenkins-result.txt
            14:37:21 I am placing this log directly into the log file...
            <spinning-wheel>
            

            On the agent only the sibling "sh -c ..." process remains...

            jenkins@cd8c03e15e58:~$ ps aux | grep "[s]h -c"
            jenkins      90  0.0  0.0   4512   928 ?        S    14:37   0:00 sh -c { while [ -d '/home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a' -a \! -f '/home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a/jenkins-log.txt'; sleep 3; done } & jsc=durable-190a2420fb163bce1cd2a8d2213b499c; JENKINS_SERVER_COOKIE=$jsc '/home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a/script.sh' > '/home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a/jenkins-result.txt'; wait
            

            The sibling process touches jenkins-log.txt every 3 seconds, which means Jenkins can't determine that the step is hung.
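
The same effect can be imitated without Jenkins. The sketch below is only a rough stand-in for the wrapper command line quoted above: the file names `log.txt` and `result.txt`, the timings, and the use of a plain `sleep` as the "script" are all assumptions. It starts a wrapper of the same shape, kills the wrapper with TERM while the "script" is running, and then checks whether a result file was written and whether the heartbeat is still touching the log.

```shell
#!/bin/bash
# Stand-alone imitation of the durable-task wrapper; no Jenkins involved.
ctl="$(mktemp -d)"

# Wrapper: background heartbeat, then the "script" (a plain sleep), then the
# exit code is recorded, then wait. Same shape as the "sh -c ..." line above.
sh -c "{ while [ -d '$ctl' ] && [ ! -f '$ctl/result.txt' ]; do touch '$ctl/log.txt' 2>/dev/null; sleep 1; done; } & sleep 2; echo \$? > '$ctl/result.txt'; wait" &
wrapper_pid=$!

sleep 1
kill -TERM "$wrapper_pid"               # kill the parent mid-"script"
wait "$wrapper_pid" 2>/dev/null || true
sleep 3                                 # the orphaned "script" is done by now

# Nobody is left to record the exit code.
if [ -f "$ctl/result.txt" ]; then result_written=yes; else result_written=no; fi

# If the heartbeat survived, it recreates the log file within a second or so.
rm -f "$ctl/log.txt"
sleep 2
if [ -f "$ctl/log.txt" ]; then heartbeat_alive=yes; else heartbeat_alive=no; fi

echo "result written: $result_written, heartbeat alive: $heartbeat_alive"

# Removing the control directory ends the heartbeat loop.
rm -rf "$ctl"
```

If this behaves like the Jenkins case, no result file appears while the log keeps being refreshed, which is the combination that keeps the step looking alive.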

            Entering an exit code into jenkins-result.txt causes the build to continue (see the timestamps):

            14:37:21 >>>> Could place exitCode in /home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a/jenkins-result.txt
            14:37:21 I am placing this log directly into the log file...
            
            
            ...
            placed the exit code into the jenkins-result.txt now
            ...
            
            
            [Pipeline] }
            [Pipeline] // node
            [Pipeline] node
            14:44:14 Running on Jenkins in /var/lib/jenkins/workspace/EXP-signal-test
            [Pipeline] {
            [Pipeline] tool
            [Pipeline] retry
            [Pipeline] {
            [Pipeline] node
            14:44:14 Running on Jenkins in /var/lib/jenkins/workspace/EXP-signal-tester@2
            [Pipeline] {
            [Pipeline] tool
            [Pipeline] sh
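
Writing the exit code by hand, as done above, could be wrapped in a small helper. This is a sketch only: the function name `unblock_durable_task` is hypothetical, and it assumes the control directory is the durable-xxxx directory shown in the ps output and that a non-empty jenkins-result.txt should never be overwritten.

```shell
#!/bin/bash
# Hypothetical helper: record an exit code for a hung durable task by hand.
# $1 = durable control directory (the durable-xxxx directory), $2 = exit code.
unblock_durable_task() {
    local ctl_dir=$1 exit_code=${2:-1}
    local result_file="${ctl_dir}/jenkins-result.txt"
    if [ ! -d "${ctl_dir}" ]; then
        echo "no such control directory: ${ctl_dir}" >&2
        return 1
    fi
    # Leave an existing result alone; the wrapper may have written it already.
    if [ -s "${result_file}" ]; then
        echo "result already recorded: $(cat "${result_file}")" >&2
        return 0
    fi
    echo "${exit_code}" > "${result_file}"
}
```

On the agent this would be called with the control directory from the build log and the exit code the step should report, for example 143 for a step that was terminated.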
            

            Scripted a workaround for the case that the parent is missing

            I experimented further by catching and resending the signal as with the examples above. However, some signals such as TERM did not allow the script to write its exit code to the jenkins-result.txt file.

            So, with the knowledge that

            • we can catch the common signals in the script which could cause the script to exit
            • we can determine the parent pid and its status

            I scripted the following to echo the 'would-be' exit code into the appropriate jenkins-result.txt in the case of the parent process being killed.

            The best way to explain how it works is by using an example. Please check the following Jenkinsfile job:

            def bashPreText(def script, def quiet = false, def login = false) {
                String verboseFlag = quiet ? '' : 'set -x'
                String loginFlag = login ? '-l' : ''
                return '''#!/bin/bash ''' + loginFlag + '''
            
            set -euo pipefail # fail fast and fail on unset variables
            
            # don't print the pre-text stuff 
            set +x
            
            # some shell options
            shopt -s globstar
            shopt -s expand_aliases
            
            # traps
            typeset -i sig=1
            for sig in 1 2 6 15; do
                trap "signum=\${sig};handleSignal \${sig}" "\$sig"
            done
            trap "handleSignal EXIT" "EXIT"
            trap "handleSignal ERR" "ERR"
            
            
            function handleSignal() {
                local exitCode=\$?
                local signal=\$1
                
                if [[ "\$signal" != "EXIT" ]] && [[ "\$signal" != "ERR" ]]; then
                    # TODO: account for non-bash exit codes (ksh = signum + 256) 
                	latestSignalRc=\$(( \$signum + 128 ))
                fi
                local finalExitCode=\${latestSignalRc:-\$exitCode}
            
                echo "SIGNAL STATUS: 
                signal: '\${signal:-n/a}'
                signum: '\${signum:-n/a}'
                exitCode: '\$exitCode'
                latestSignalRc: '\${latestSignalRc:-n/a}'
                finalExitCode: '\${finalExitCode}'"
            
                # don't trap EXIT if already in ERR
                [[ "\$signal" != "EXIT" ]] && trap - EXIT 
                
                # React if no parent found
                local currentPidParent=
                currentPidParent=$(ps --no-headers -o ppid --pid \$myPid | xargs)
                if [ \$pidParent -ne \$currentPidParent ]; then
                    echo "WARNING: Parent process missing..."
                    if [[ "true" == "\$ACTIVATE_WORKAROUND" ]]; then
                        echo "Activating workaround - writing the exitCode directly into the '\$myResultFile'."
                        echo \$finalExitCode > \$myResultFile
                    else 
                        echo "Not activating workaround - script has probably hung by now. Fix by aborting or by writing the exitCode directly into the '\$myResultFile'."
                    fi
                fi
                exit \$finalExitCode
            }
            
            myPid=\$\$
            pidParent=\$(ps --no-headers -o ppid --pid \$myPid | xargs)
            myDir="\$( cd "\$( dirname "\${BASH_SOURCE[0]}" )" && pwd )"
            myResultFile="\$myDir/jenkins-result.txt"
            
            
            # verbose flag
            ''' + verboseFlag + '''
            ''' +
            script + '''
            '''.trim().stripIndent()
                }
            
            def bash(Map vars = [:]) {
                vars.script = bashPreText(vars.script, vars.quiet, vars.login)
                sh(vars)
            }
            /* Convenience overload */
            def bash(String script) {
                return bash(script: script)
            }
            
            pipeline {
                agent any
                options {
                    skipDefaultCheckout()
                    timestamps()
                    disableConcurrentBuilds()
                    buildDiscarder(logRotator(numToKeepStr:'30'))
                }
                parameters {
            		booleanParam(defaultValue: true, description: 'Activate the workaround', name: 'ACTIVATE_WORKAROUND')
            		string(defaultValue: '', description: 'The signal to send to the SCRIPT (int or HUP, TERM, etc)', name: 'SIGNAL_SCRIPT')
            		string(defaultValue: 'TERM', description: 'The signal to send to the PARENT (int or HUP, TERM, etc)', name: 'SIGNAL_PARENT')
            	}
                stages {
                    stage('Test') {
                        steps {
                            script {
                    bash quiet: true, script: '''
            
            echo "Starting script with...
                myPid=\$myPid
                pidParent=\$pidParent
                myDir="\$myDir"
                myResultFile="\$myResultFile"
            "
            
            if [ -n "\${SIGNAL_PARENT:-}" ]; then
                echo "Sending signal \${SIGNAL_PARENT} to parent"
                sleep 0.1
                kill -\${SIGNAL_PARENT:-0} \$pidParent
                sleep 0.1
            fi
            
            echo "Middle of script..."
            
            if [ -n "\${SIGNAL_SCRIPT:-}" ]; then
                if [[ "ERR" == "\${SIGNAL_SCRIPT}" ]]; then
                    echo "Failing with an ERR"
                    ls /bla/bla/bla
                else
                    echo "Sending signal \${SIGNAL_SCRIPT} to script"
                    kill -\${SIGNAL_SCRIPT:-0} \$myPid
                fi
            fi
            
            echo "End of script..."
            '''
                            }
                        }
                    }            
                }
            }
            

            Final Workaround

            The final workaround for me was to put the traps and the handleSignal function into a vars/bash.groovy as a kind of pre-text, as in the job above, and to use it from my global library. (NOTE: don't forget to remove the ACTIVATE_WORKAROUND == true condition.)
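
Stripped of the Groovy escaping, the core of the check is a comparison between the parent PID recorded at start-up and the parent PID seen when the handler runs. A plain-bash sketch of just that part; the function names `current_parent_of` and `write_result_if_orphaned` are hypothetical, and `ps -o ppid= -p` is used here in place of the `--no-headers` form in the job above:

```shell
#!/bin/bash
# Parent PID of the given process, with surrounding spaces removed.
current_parent_of() {
    ps -o ppid= -p "$1" | tr -d ' '
}

# Write the exit code into the result file only when the parent has changed,
# i.e. the original wrapper is gone and nobody else will record the result.
write_result_if_orphaned() {
    local original_ppid=$1 current_ppid=$2 result_file=$3 exit_code=$4
    if [ "${original_ppid}" != "${current_ppid}" ]; then
        echo "WARNING: parent process missing, writing ${exit_code} to ${result_file}"
        echo "${exit_code}" > "${result_file}"
    fi
}

# Intended use inside a trap handler (variable names as in the job above):
#   write_result_if_orphaned "$pidParent" "$(current_parent_of "$myPid")" \
#       "$myResultFile" "$finalExitCode"
```

Keeping the comparison separate from the `ps` call makes it possible to exercise the logic without actually killing a parent process.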

            I hope this helps find a proper solution to the problem; what I have here is more of a hack than a real workaround.

            Show
            lostinberlin Steve Boardwell added a comment - - edited Hi Jesse Glick / Jan Stancek / Reinhold Füreder I think I have the cause, or at least one of the possible causes, and can reproduce the hanging agent in principle. It has to do with the parent process being killed. Why that happens I cannot say, but perhaps the OS does this when the resources are low, etc (in any dockerized agent environment for example) I also have a bit of a hacky workaround - feedback and improvements welcome . Disclaimer : I'm using the exit codes calculated for bash in my solution as well as the bash built-in for finding my current directory. You'd need to take account of this if using a different shell. It's a bit of a long explanation but here goes... Summary investigated how signals can affect shell scripts investigated the processes involved discovered an unexpected sibling process found I could reproduce the behaviour by killing the parent process scripted a workaround for the case that the parent process is missing Long version Investigated what signals can do to a shell script. Using the following script I tested what signals did to a bash script. I wanted to find out: which signals would cause a script to exit immediately which signals would first send an EXIT signal before exiting which signals would do nothing, etc. [jenkins@jenkins-server] ~ $ cat /tmp/test.sh #!/bin/bash set -euo pipefail latestSignalRc= # register all known traps typeset -i sig=1 while (( sig < 65 )); do trap "signum=${sig};test ${sig}" "$sig" let sig=sig+1 done trap "test EXIT" "EXIT" trap "test ERR" "ERR" test() { local rc=$? if [[ "$1" != "EXIT" ]] && [[ "$1" != "ERR" ]]; then #echo "Non EXIT or ERR. Making latestSignalRc from signum '$signum'." latestSignalRc=$(( $signum + 128 )) fi echo "Got sig: ${1:-n/a}, signum: '${signum:-n/a}', rc: $rc, latestSignalRc: ${latestSignalRc:-n/a}" # Reset to a default signal handler. 
trap - $1 unset signum if [[ "$1" != "EXIT" ]] && [[ "$1" != "ERR" ]]; then # kill process with signal kill -$1 $$ else # if we receive an error, reset the EXIT trap # because we are leaving now anyway [[ "$1" == "ERR" ]] && trap - EXIT exit ${latestSignalRc:-$rc} fi } sleep 0.1 if [[ "ERR" == "${1:-}" ]]; then ls /tmp/nnnn &> /dev/null elif [ -n "${1:-}" ]; then kill -$1 $$ fi echo "After kill..." Testing looked something like this: [jenkins@jenkins-server] ~ $ for i in ERR 1 2 3 6 15; do echo "-----------------------------"; (/tmp/test.sh $i; echo "Exited: $?"); done ----------------------------- Got sig: ERR, signum: 'n/a', rc: 2, latestSignalRc: n/a Exited: 2 ----------------------------- Got sig: 1, signum: '1', rc: 0, latestSignalRc: 129 Exited: 129 ----------------------------- Got sig: 2, signum: '2', rc: 0, latestSignalRc: 130 Got sig: EXIT, signum: 'n/a', rc: 0, latestSignalRc: 130 Exited: 130 ----------------------------- Got sig: 3, signum: '3', rc: 0, latestSignalRc: 131 After kill... Got sig: EXIT, signum: 'n/a', rc: 0, latestSignalRc: 131 Exited: 131 ----------------------------- Got sig: 6, signum: '6', rc: 0, latestSignalRc: 134 Exited: 134 ----------------------------- Got sig: 15, signum: '15', rc: 0, latestSignalRc: 143 Exited: 143 I settled on catching 1) SIGHUP 2) SIGINT 6) SIGABRT 15) SIGTERM since they (1) could be caught, and (2) caused the script to exit. Investigated the processes involved At first I simply grep'ed the processes using derived values node('agent') { sh '''#!/bin/bash set -euo pipefail set +x function onExit() { local exitCode=\$? echo ">>>> Exiting now with exitCode \$exitCode" echo ">>>> Could place exitCode in \$myResultFile" echo "one last log" >> "\$myDir/jenkins-log.txt" echo \$exitCode > \$myResultFile sleep 5 echo "Still running..." 
exit \$exitCode } trap onExit EXIT myPid=\$\$ myDir="\$( cd "\$( dirname "\${BASH_SOURCE[0]}" )" && pwd )" myResultFile="\$myDir/jenkins-result.txt" myResultFileGrep="\$myDir/jenkins-result.tx[t]" echo "my pid is \$myPid" echo "my result file is \$myResultFile" echo '----------------------------------' ps -eaf | head -n 1 echo '--------- script PID -------------' ps -eaf | grep [d]urable | grep \$myPid echo '--------- script result file -------------' myCmdsParentPid=\$(ps aux | grep "\$myResultFileGrep" | awk '{ print \$3 }' | sort -u ) ps -eaf | grep "\$myResultFileGrep" echo '----------- script commands parent pid -----------' ps -eaf | grep "[j]enkins.*\$myCmdsParentPid" echo '---------------------------------' echo "hello from \$(hostname)" ''' } Resulting in something like: ... my pid is 10781 my result file is /var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-result.txt UID PID PPID C STIME TTY TIME CMD --------- script PID ------------- jenkins 10781 10778 0 14:02 ? 00:00:00 /bin/bash /var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/script.sh --------- script result file ------------- jenkins 10778 12979 0 14:02 ? 00:00:00 sh -c { while [ -d '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0' -a \! -f '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-result.txt' ]; do touch '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-log.txt'; sleep 3; done } & jsc=durable-14f8a02757bd1625e0536d94affe2a93; JENKINS_SERVER_COOKIE=$jsc '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/script.sh' > '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-log.txt' 2>&1; echo $? > '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-result.txt'; wait jenkins 10780 10778 0 14:02 ? 00:00:00 sh -c { while [ -d '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0' -a \! 
-f '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-result.txt' ]; do touch '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-log.txt'; sleep 3; done } & jsc=durable-14f8a02757bd1625e0536d94affe2a93; JENKINS_SERVER_COOKIE=$jsc '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/script.sh' > '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-log.txt' 2>&1; echo $? > '/var/lib/jenkins/workspace/EXP-signal-tester@tmp/durable-e19db7e0/jenkins-result.txt'; wait ... Question: I still do not know why there are two instances of the same command " sh -c ...."   I used a recursePid function to follow the trail and realised that the second sh -c ... process was actually a sibling of the script.sh process rather a parent. function recursePid() { local currentPid=$1 local pidEntry=$(ps --no-headers -f --pid $currentPid) pidParent=$(ps --no-headers -o ppid --pid $currentPid | xargs) echo "--------- Current: $currentPid-------------" echo "$pidEntry" if [ $pidParent -ne 1 ]; then local pidSiblings=$(ps --no-headers -f --ppid $pidParent | grep -v $currentPid) if [ -n "$pidSiblings" ]; then echo "--------- has following siblings -------------" echo "$pidSiblings" pidSibling=$(echo "$pidSiblings" | awk '{ print $2 }') echo "Siblings pid = $pidSibling" fi echo "Sending signal to parent ${SIGNAL}" sleep 3 kill -${SIGNAL:-0} $pidParent sleep 3 echo "Parent killed..." if ps --no-headers -$pidParent; then echo "Parent still there" else echo "Parent gone" fi #recursePid $pidParent fi } So, the process tree looks like: slave.jar process |__ parent "sh -c ..." process |__ sibling "sh -c ..." process |__ "script.sh" process Found I could reproduce the behaviour by killing the parent process I tested sending various signals to the script.sh and sibling sh -c ... processes but: script.sh - worked as expected sibling sh -c ... didn't seem to have an effect. 
            Moving up one level, I sent the TERM signal to the parent, which caused the agent to hang. In the build log...

            14:37:15 Single pid = 90
            14:37:15 Sending signal to parent TERM
            14:37:21 So far, I've got signal: 'EXIT', signum: 'n/a', exitCode: '0', latestSignalRc: 'n/a'
            14:37:21 >>>> Exiting now with exitCode 0
            14:37:21 >>>> Could place exitCode in /home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a/jenkins-result.txt
            14:37:21 I am placing this log directly into the log file...
            <spinning-wheel>

            On the agent, only the sibling "sh -c ..." process remains...

            jenkins@cd8c03e15e58:~$ ps aux | grep "[s]h -c"
            jenkins 90 0.0 0.0 4512 928 ? S 14:37 0:00 sh -c { while [ -d '/home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a' -a \! -f '/home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a/jenkins-log.txt'; sleep 3; done } & jsc=durable-190a2420fb163bce1cd2a8d2213b499c; JENKINS_SERVER_COOKIE=$jsc '/home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a/script.sh' > '/home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a/jenkins-result.txt'; wait

            The sibling process keeps touching jenkins-log.txt every 3 seconds, which means Jenkins can't determine that the step has hung. Writing an exit code into jenkins-result.txt by hand causes the build to continue (see the timestamps):

            14:37:21 >>>> Could place exitCode in /home/jenkins/workspace/EXP-signal-tester@tmp/durable-889afc3a/jenkins-result.txt
            14:37:21 I am placing this log directly into the log file...
            ... placed the exit code into the jenkins-result.txt now ...
            [Pipeline] }
            [Pipeline] // node
            [Pipeline] node
            14:44:14 Running on Jenkins in /var/lib/jenkins/workspace/EXP-signal-test
            [Pipeline] {
            [Pipeline] tool
            [Pipeline] retry
            [Pipeline] {
            [Pipeline] node
            14:44:14 Running on Jenkins in /var/lib/jenkins/workspace/EXP-signal-tester@2
            [Pipeline] {
            [Pipeline] tool
            [Pipeline] sh

            Scripted a workaround for the case that the parent is missing

            I experimented further by catching and resending the signal, as in the examples above. However, some signals, such as TERM, did not allow the script to write its exit code to the jenkins-result.txt file.

            Knowing that the script can trap the common signals that would cause it to exit, and that it can determine the parent pid and its status, I scripted the following to echo the 'would-be' exit code into the appropriate jenkins-result.txt when the parent process has been killed.

            The best way to explain how it works is with an example. Please check the following Jenkinsfile job:

            def bashPreText(def script, def quiet = false, def login = false) {
                String verboseFlag = quiet ? '' : 'set -x'
                String loginFlag = login ? '-l' : ''
                return '''#!/bin/bash ''' + loginFlag + '''
            set -euo pipefail # fail fast and fail on unset variables
            # don't print the pre-text stuff
            set +x
            # some shell options
            shopt -s globstar
            shopt -s expand_aliases
            # traps
            typeset -i sig=1
            for sig in 1 2 6 15; do
                trap "signum=\${sig};handleSignal \${sig}" "\$sig"
            done
            trap "handleSignal EXIT" "EXIT"
            trap "handleSignal ERR" "ERR"
            function handleSignal() {
                local exitCode=\$?
                local signal=\$1
                if [[ "\$signal" != "EXIT" ]] && [[ "\$signal" != "ERR" ]]; then
                    # TODO: account for non-bash exit codes (ksh = signum + 256)
                    latestSignalRc=\$(( \$signum + 128 ))
                fi
                local finalExitCode=\${latestSignalRc:-\$exitCode}
                echo "SIGNAL STATUS: signal: '\${signal:-n/a}' signum: '\${signum:-n/a}' exitCode: '\$exitCode' latestSignalRc: '\${latestSignalRc:-n/a}' finalExitCode: '\${finalExitCode}'"
                # don't trap EXIT if already in ERR
                [[ "\$signal" != "EXIT" ]] && trap - EXIT
                # React if no parent found
                local currentPidParent=
                currentPidParent=$(ps --no-headers -o ppid --pid \$myPid | xargs)
                if [ \$pidParent -ne \$currentPidParent ]; then
                    echo "WARNING: Parent process missing..."
                    if [[ "true" == "\$ACTIVATE_WORKAROUND" ]]; then
                        echo "Activating workaround - writing the exitCode directly into the '\$myResultFile'."
                        echo \$finalExitCode > \$myResultFile
                    else
                        echo "Not activating workaround - script has probably hung by now. Fix by aborting or by writing the exitCode directly into the '\$myResultFile'."
                    fi
                fi
                exit \$finalExitCode
            }
            myPid=\$\$
            pidParent=\$(ps --no-headers -o ppid --pid \$myPid | xargs)
            myDir="\$( cd "\$( dirname "\${BASH_SOURCE[0]}" )" && pwd )"
            myResultFile="\$myDir/jenkins-result.txt"
            # verbose flag
            ''' + verboseFlag + '''
            ''' + script + '''
            '''.trim().stripIndent()
            }

            def bash(Map vars = [:]) {
                vars.script = bashPreText(vars.script, vars.quiet, vars.login)
                sh(vars)
            }

            /* Convenience overload */
            def bash(String script) {
                return bash(script: script)
            }

            pipeline {
                agent any
                options {
                    skipDefaultCheckout()
                    timestamps()
                    disableConcurrentBuilds()
                    buildDiscarder(logRotator(numToKeepStr:'30'))
                }
                parameters {
                    booleanParam(defaultValue: true, description: 'Activate the workaround', name: 'ACTIVATE_WORKAROUND')
                    string(defaultValue: '', description: 'The signal to send to the SCRIPT (int or HUP, TERM, etc)', name: 'SIGNAL_SCRIPT')
                    string(defaultValue: 'TERM', description: 'The signal to send to the PARENT (int or HUP, TERM, etc)', name: 'SIGNAL_PARENT')
                }
                stages {
                    stage('Test') {
                        steps {
                            script {
                                bash quiet: true, script: '''
                                    echo "Starting script with...
                                    myPid=\$myPid
                                    pidParent=\$pidParent
                                    myDir="\$myDir"
                                    myResultFile="\$myResultFile"
                                    "
                                    if [ -n "\${SIGNAL_PARENT:-}" ]; then
                                        echo "Sending signal \${SIGNAL_PARENT} to parent"
                                        sleep 0.1
                                        kill -\${SIGNAL_PARENT:-0} \$pidParent
                                        sleep 0.1
                                    fi
                                    echo "Middle of script..."
                                    if [ -n "\${SIGNAL_SCRIPT:-}" ]; then
                                        if [[ "ERR" == "\${SIGNAL_SCRIPT}" ]]; then
                                            echo "Failing with an ERR"
                                            ls /bla/bla/bla
                                        else
                                            echo "Sending signal \${SIGNAL_SCRIPT} to script"
                                            kill -\${SIGNAL_SCRIPT:-0} \$myPid
                                        fi
                                    fi
                                    echo "End of script..."
                                '''
                            }
                        }
                    }
                }
            }

            Final Workaround

            The final workaround for me was to put the traps and the handleSignal function into a vars/bash.groovy as a kind of pre-text, as in the job above (NOTE: don't forget to remove the ACTIVATE_WORKAROUND == true condition), and to use it from my global library.

            I hope this helps find a proper solution to the problem rather than just a hacky workaround.
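The core of the workaround above can be isolated into two small helpers, which may make it easier to reason about or test outside Jenkins. This is only a sketch: the function names `signal_exit_code` and `write_result_if_orphaned` are hypothetical and do not appear in the job above or in any Jenkins plugin. It assumes the bash convention of reporting a signal death as 128 plus the signal number.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the workaround; names are illustrative only.

# Exit status bash reports for a process killed by signal number $1.
signal_exit_code() {
    echo $(( 128 + $1 ))
}

# Write the would-be exit code to the result file, but only when the parent
# PID recorded at start-up differs from the current parent PID.
# $1 = original parent PID, $2 = current parent PID,
# $3 = exit code to record, $4 = path of the result file.
# Returns 0 if the file was written, 1 otherwise.
write_result_if_orphaned() {
    original_ppid=$1
    current_ppid=$2
    exit_code=$3
    result_file=$4
    if [ "$original_ppid" -ne "$current_ppid" ]; then
        echo "$exit_code" > "$result_file"
        return 0
    fi
    return 1
}
```

Inside a trap handler, the current parent could be looked up with something like `ps -o ppid= -p $$ | xargs` and passed as the second argument, mirroring the pidParent / currentPidParent comparison in handleSignal.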
            robinrosenberg Robin Rosenberg added a comment -

            One thing I noted was that, in our case, the first part of the script was executed in one workspace, while the command that hung was executed in another workspace, the one with the @tmp suffix. Since the @tmp workspace didn't contain the right stuff, the command in the shell script failed.

             

             

            jglick Jesse Glick added a comment -

            Skimming this, sounds like it could be a dupe of JENKINS-50892. Needs to be determined if there is a non-contrived way to reproduce part of the controller script being killed but not the rest of it; and, either way, whether there is a safe way to ensure that the controller script lives or dies atomically. It seems that use of { rather than ( does not suffice to avoid creation of a cloned sh process.

            jglick Jesse Glick added a comment -

            Most likely a dupe. Please use JENKINS-50892 for discussion.

            Steve Boardwell for some background: the second copy of sh is the stuff inside curly braces which is touching the log file. (That is a way for Jenkins to tell the difference between a process which just declines to produce output for a long time, as opposed to the whole computer having been rebooted and all these processes are dead.)
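To make the two-process structure concrete, here is a minimal sketch that mimics the shape of the wrapper command visible in the ps output earlier in this thread. It is not the plugin's actual code: the function name `run_wrapped` is hypothetical, and the heartbeat interval is shortened to 1 second (the real command line shows `sleep 3`).

```shell
#!/bin/sh
# Hypothetical re-creation of the wrapper pattern; not the plugin's code.
# $1 = control directory, $2 = command string to run.
run_wrapped() {
    dir=$1
    cmd=$2
    log="$dir/jenkins-log.txt"
    result="$dir/jenkins-result.txt"

    # Heartbeat: a background copy of the shell keeps touching the log
    # until the result file appears or the control directory is removed.
    {
        while [ -d "$dir" ] && [ ! -f "$result" ]; do
            touch "$log"
            sleep 1
        done
    } &

    # Run the command with its output captured in the log, then record
    # its exit status in the result file.
    status=0
    sh -c "$cmd" > "$log" 2>&1 || status=$?
    echo "$status" > "$result"

    # Wait for the heartbeat loop to notice the result file and stop.
    wait
}
```

For example, `run_wrapped "$(mktemp -d)" 'echo hello; exit 3'` leaves `3` in jenkins-result.txt. If the shell running the command is killed before it reaches the `echo`, nothing ever creates the result file, so the background loop keeps touching the log indefinitely, which matches the hang reported above.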

            As to why the first copy of sh is getting killed to begin with, your guess is as good as mine. You suggested that low resources in a container could trigger some processes to be killed, but why one and not the other?


              People

              Assignee:
              Unassigned
              Reporter:
              jstancek Jan Stancek
              Votes:
              1
              Watchers:
              8
