JENKINS-50379

Jenkins kills long running sh script with no output


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Component: durable-task-plugin
    • Labels: None
    • Environment: Jenkins ver. 2.107.1 on CentOS 7

    Description

      I have a Jenkins pipeline that runs a shell script that takes about 5 minutes and generates no output. The job fails and I'm seeing the following in the output:

      wrapper script does not seem to be touching the log file in /home/jenkins/workspace/job_Pipeline@2@tmp/durable-595950a5
       (--JENKINS-48300--: if on a laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300)
       script returned exit code -1
      

      Based on JENKINS-48300 it seems that Jenkins is intentionally killing my script while it is still running. IMHO it is a bug for Jenkins to assume that a shell script will generate output every n seconds for any finite n. As a workaround I've set -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL to one hour. But what happens when I have a script that takes an hour and one minute!?
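
      For reference, a minimal sketch of how that property can be passed, assuming the controller is launched directly from the WAR (service installations configure JVM options elsewhere):

      # assumed example: raise the heartbeat check interval to 3600 s (1 hour) by
      # passing the system property to the Jenkins controller JVM at startup
      java -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=3600 \
           -jar jenkins.war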


          Activity

             jhack Giacomo Boccardo added a comment -

             This issue started occurring a few days ago. It never happened before.

             (Jenkins 2.361.1)
             bbourdin Benoit Bourdin added a comment - edited

            Investigated yesterday and sharing my findings below. Here's an example of a wrapper script running:

            10:48:01  root     15345  0.0  0.0   4632    92 ?        S    08:47   0:00 sh -c ({ while [ -d '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da' -a \! -f '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/jenkins-result.txt' ]; do touch '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/jenkins-log.txt'; sleep 3; done } & jsc=durable-bffc1fdb28b03efb823f6939b2ccaf2e; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe  '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/script.sh' > '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/jenkins-log.txt' 2>&1; echo $? > '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/jenkins-result.txt.tmp'; mv '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/jenkins-result.txt.tmp' '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/jenkins-result.txt'; wait) >&- 2>&- &
            

             In my case on the slave, we have 2 folders:
             1. the workspace folder /home/ubuntu/workspace/xxx, which hosts the files checked out from the git repo
             2. the control folder for the durable-task plugin: /home/ubuntu/workspace/xxx@tmp/durable-yyy, containing:

             • script.sh, which is the command line we are executing, taken from the `sh` step of the Jenkinsfile
             • jenkins-log.txt, which is the output of the command
             • jenkins-result.txt, which will contain the exit code of the command (once finished)
             • pid, which is not created by default but can contain the pid of the command

             In my environment, the error message is thrown after 605-915 seconds, when both conditions are met:
             1. no exit code in the jenkins-result.txt file
             2. the log file (jenkins-log.txt) not modified for more than 300 seconds
             (1) happens when the wrapper script is no longer running
             (2) happens when the wrapper script is not running, or when the command has not written anything to the log for more than 300 seconds

            The source code is here:
            https://github.com/jenkinsci/durable-task-plugin/blob/master/src/main/java/org/jenkinsci/plugins/durabletask/BourneShellScript.java
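
             To make the two conditions concrete, here is a rough sketch of the check (an illustration only, not the plugin's actual implementation; the control-directory path and the GNU stat call are assumptions):

             # illustration only: mimic the plugin's two conditions against a control dir
             CTRL=/home/ubuntu/workspace/xxx@tmp/durable-yyy      # example path

             if [ ! -f "$CTRL/jenkins-result.txt" ]; then         # (1) no exit code yet
               last_touch=$(stat -c %Y "$CTRL/jenkins-log.txt")   # last modification (epoch seconds)
               now=$(date +%s)
               if [ $((now - last_touch)) -gt 300 ]; then         # (2) log silent for >300 s
                 echo "no heartbeat: the plugin would assume the wrapper script is gone"
               fi
             fi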

             Tested scenarios:

             • when adding an exit code to the jenkins-result.txt file, it stops successfully
             • when only creating the jenkins-result.txt file, it keeps waiting until an exit code is written
             • long script (30 min) without output, with the wrapper script running = no issue
             • long script (30 min) with output, without the wrapper script running = no issue
             • long script (30 min) without output, without the wrapper script running = failing after 608 s
             • long script (30 min) without output, without the wrapper script running, but with a pid file = no issue

             That means a workaround could be to manually create a pid file before starting the command and remove it when finished. When the command is quite long, the plugin will kindly print this warning every 5 minutes but keep waiting for the process to finish:

            still have /home/ubuntu/workspace/xxxx@tmp/durable-yyy/pid so heartbeat checks unreliable; process may or may not be alive
            

             This could be done with a simple shell script like the one below. However, we should still follow best practices and set up pipeline/step timeouts so that pipelines cannot run forever.

             # workaround for the wrapper issue: create a pid file in the durable control dir
             MYCONTROLDIR=$(echo "$PWD@tmp/durable"*)
             echo $$ > "$MYCONTROLDIR/pid"

             ./my_longcommand_to_run.sh   # placeholder for the long-running command
             exitcode=$?

             # let's clean up what we did
             rm -f "$MYCONTROLDIR/pid"
             exit $exitcode
            

             In my case, the root cause is still unknown. I suspect the wrapper script is being killed; the system OOM killer (triggered by pipelines filling the slave's memory) is a strong candidate, but I still need evidence of it.

             A suggestion to improve the durable-task-plugin: it would help to report which scenario we are in (a manual check is sketched below):
             1. the wrapper script is no longer running (killed?)
             2. or, the wrapper script is running but hitting some error (and if it writes an error to stderr, please log it somewhere...)
             3. or, the wrapper script is running, but very slowly due to CPU/storage/filesystem slowness.
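
             As a manual check along those lines (a hypothetical diagnostic run by hand on the agent, not something the plugin does), one could inspect the process list and the control directory:

             # hypothetical manual diagnostics on the agent; the path is a placeholder
             CTRL=/home/ubuntu/workspace/xxx@tmp/durable-yyy

             # any process whose command line references the control dir? if nothing is
             # listed, the wrapper script (and its heartbeat loop) is gone -> scenario 1
             pgrep -af "$CTRL" || echo "wrapper script no longer running"

             # if the wrapper is alive but the log mtime is very old, that points to
             # scenario 3 (stalled heartbeat loop / slow filesystem)
             stat -c '%y' "$CTRL/jenkins-log.txt"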

            jglick Jesse Glick added a comment -

            OOMKiller is a likely culprit. In general there is not going to be a reliable record of when the wrapper was killed or why, nor would there be any error messages from the wrapper script to log.

            The newer binary wrapper may behave differently, but it could still be killed.

            Originally the wrapper’s PID was recorded, and the agent would periodically check to see whether that process was still alive. This was switched to a heartbeat because it was more portable (no need for JNI-based system calls) and also worked without special considerations when Launcher was wrapped in a container (such as by the withDockerContainer step, or container in the kubernetes plugin). The only requirement is that the agent JVM and the user process and (if applicable) the filesystem server can agree on clock skew to within a few minutes. I doubt there is much difference either way in diagnosability: in either case, if the wrapper is killed, the agent JVM manages to detect this sooner or later, but does not have any information as to the ultimate reason for the error.

            (Freestyle projects do not survive controller or agent JVM restarts, because the user script is a child of the agent JVM. Thus if e.g. OOMKiller kicks in, the build will simply abort right away—again with no particularly informative message beyond the fact that it got a SIGKILL.)


             bbourdin Benoit Bourdin added a comment -

             After more testing and investigation, we identified two possible root causes for this common issue:

             1. a race condition: the agent terminates before the wrapper script can complete and report the status/log of the running command. For example, using the docker workflow plugin, when a docker container agent suddenly stops:
               1. when the entrypoint is a quick command.
               2. or when caused by an OutOfMemory.
             2. the wrapper script fails or is killed for whatever reason, but the command keeps running. This should be very rare; we never observed any evidence of it.

             Cause 1.1: quick command as entrypoint
             When using a docker agent, Jenkins starts the following commands, in that order (a simplified sketch follows the list):

             1. docker run: to start the container
             2. docker top: to validate that the container started correctly with the expected command
             3. docker exec: to start, inside the container, the wrapper script, which in turn starts the command you defined.
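
             A simplified illustration of that sequence (the actual flags, image name, and workdir used by the docker-workflow plugin differ; these are placeholders):

             # simplified sketch of the docker agent startup sequence
             CID=$(docker run -t -d -w /home/ubuntu/workspace/xxx my-agent-image:latest cat)   # 1. start the container
             docker top "$CID"                                                                 # 2. verify it is still running
             docker exec "$CID" sh -c 'echo "the durable wrapper script would run here"'       # 3. run the wrapper inside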

             The default ENTRYPOINT of the docker image could be a quick command (mdspell for example). So the container could:

             1. stop before the docker top command, causing the "container started but didn't run the expected command" error.
             2. stop before the end of the docker exec command, causing the wrapper script error.

             Resolution: force the entrypoint, even to an empty value, by passing --entrypoint to the docker agent.
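
             In docker terms, the idea is to neutralize whatever entrypoint the image defines, for example (the image name is a placeholder; in a pipeline this is passed through the docker agent's arguments):

             # force an empty entrypoint so the container is kept alive by `cat`
             # instead of exiting as soon as the image's default entrypoint finishes
             docker run -t -d --entrypoint "" my-agent-image:latest cat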

             Cause 1.2: when caused by an OutOfMemory
             I could not see any way to control the cgroup of jobs, nor to limit the capacity per executor. When using docker, the containers are created in a dedicated cgroup with no memory limit. We can only rely on best practices, like writing pipelines with the --memory option to docker.
             When memory consumption reaches a critical level on the slave, the OOM killer can kill processes more or less at random and terminate docker agents.
             Resolution: best practices on memory usage and better sizing. We created a script, running every minute, to force a memory limit on all running docker containers (sketched below), which solved most of the issues.
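
             A rough sketch of such a periodic script, under the assumption that a blanket limit is acceptable (the 4g value is a placeholder):

             # run from cron every minute: cap the memory of every running container so
             # the host OOM killer is less likely to pick the agent or the wrapper script
             for c in $(docker ps -q); do
               docker update --memory 4g --memory-swap 4g "$c" >/dev/null 2>&1
             done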

            jglick Jesse Glick added a comment -

            1.1 sounds like a bug in the docker-workflow plugin which may be fixable, though I do not advise use of that plugin to begin with.

            1.2 is basically outside the control of Jenkins, but the durable-task plugin could suggest this as a possible root cause when printing error messages. (An older version of durable-task recorded the PID of the wrapper process and then checked the process list to see if it was still running. This proved to be hard to maintain in various environments, however; the heartbeat system is more portable.)


             People

               Assignee: Unassigned
               Reporter: evanward1 Evan Ward
               Votes: 15
               Watchers: 34
