
Jenkins kills long running sh script with no output

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Component: durable-task-plugin
    • Environment: Jenkins ver. 2.107.1 on CentOS 7

      I have a Jenkins pipeline that runs a shell script that takes about 5 minutes and generates no output. The job fails and I'm seeing the following in the output:

      wrapper script does not seem to be touching the log file in /home/jenkins/workspace/job_Pipeline@2@tmp/durable-595950a5
       (JENKINS-48300: if on a laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300)
       script returned exit code -1
      

      Based on JENKINS-48300 it seems that Jenkins is intentionally killing my script while it is still running. IMHO it is a bug for Jenkins to assume that a shell script will generate output every n seconds for any finite n. As a workaround I've set -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL to one hour. But what happens when I have a script that takes an hour and one minute!?
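
      For reference, here is a minimal sketch of how that system property can be passed (assuming a controller launched directly with java; packaged installs such as systemd services or Docker images pass JVM options through their own configuration):

      # raise the heartbeat interval to one hour (value is in seconds);
      # set this on the JVM that runs the Jenkins controller
      java -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=3600 \
           -jar jenkins.war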

          [JENKINS-50379] Jenkins kills long running sh script with no output

          Nikolas Falco added a comment - edited

          We have the same issue. During the JS build, the job executes an "ng build" command, and after 32 minutes the job is killed because it seems not to respond.

          Cannot contact Node 02: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@70fad4d7:JNLP4-connect connection from prd-cm-as-09.lan/10.1.3.72:56702": Remote call on JNLP4-connect connection from prd-cm-as-09.lan/10.1.3.72:56702 failed. The channel is closing down or has closed down
          wrapper script does not seem to be touching the log file in /var/lib/jenkins/workspace/xxx@tmp/durable-476d6be2
          (JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)


          Olivier Vernin added a comment -

          For the record, I was affected by this issue a while ago. I was running Jenkins agents on Kubernetes, and increasing the pod memory limit solved it, at least in my case.


          bright.ma added a comment -

          I ran into this issue as well.

          [2021-05-25T13:42:16.469Z] wrapper script does not seem to be touching the log file in @tmp/durable-c284507c
          
          [2021-05-25T13:42:16.469Z] (JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)
          

          The reason turned out to be "No space left on device".
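
          A quick way to check for this cause on the agent (paths here are examples, adjust to where your workspaces live):

          # is the filesystem holding the workspace full?
          df -h /var/lib/jenkins/workspace
          # inodes can run out even when df -h still shows free space
          df -i /var/lib/jenkins/workspace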


          Matt Dee added a comment - edited

          Running into this. One of my playbooks restarts the Jenkins service if it needs to reload init.groovy.d scripts. Once Jenkins comes back, the job fails with this error. This was working fine for months, then stopped working with this same error.

          • Plenty of memory and space on device. 
          • Durable task plugin is fully up to date.
          • Jenkins 2.360


          Tejaswi Battuwar added a comment -

          We have made all possible upgrades and changes; even then, we are still facing the issue.


          Giacomo Boccardo added a comment -

          This issue started occurring a few days ago. It never happened before.

          (Jenkins 2.361.1)


          Benoit Bourdin added a comment - edited

          I investigated yesterday and am sharing my findings below. Here's an example of a running wrapper script, as seen in ps output:

          10:48:01  root     15345  0.0  0.0   4632    92 ?        S    08:47   0:00 sh -c ({ while [ -d '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da' -a \! -f '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/jenkins-result.txt' ]; do touch '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/jenkins-log.txt'; sleep 3; done } & jsc=durable-bffc1fdb28b03efb823f6939b2ccaf2e; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe  '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/script.sh' > '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/jenkins-log.txt' 2>&1; echo $? > '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/jenkins-result.txt.tmp'; mv '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/jenkins-result.txt.tmp' '/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da/jenkins-result.txt'; wait) >&- 2>&- &
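
          Reformatted for readability (the control dir path is pulled out into a $DIR variable), the wrapper amounts to:

          DIR=/home/ubuntu/workspace/rdin_tests_jenkins-wrapper-issue@tmp/durable-11c396da   # control dir
          (
            {
              # heartbeat: touch the log file every 3 s as long as the control dir
              # exists and no result file has been written yet
              while [ -d "$DIR" -a ! -f "$DIR/jenkins-result.txt" ]; do
                touch "$DIR/jenkins-log.txt"
                sleep 3
              done
            } &
            jsc=durable-bffc1fdb28b03efb823f6939b2ccaf2e
            # run the user script, capturing all output into jenkins-log.txt
            JENKINS_SERVER_COOKIE=$jsc sh -xe "$DIR/script.sh" > "$DIR/jenkins-log.txt" 2>&1
            # write the exit code atomically into jenkins-result.txt
            echo $? > "$DIR/jenkins-result.txt.tmp"
            mv "$DIR/jenkins-result.txt.tmp" "$DIR/jenkins-result.txt"
            wait
          ) >&- 2>&- &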
          

          In my case, on the slave, we have 2 folders:
          1. the workspace folder /home/ubuntu/workspace/xxx, which hosts the files checked out from the git repo
          2. the control folder for the durable plugin: /home/ubuntu/workspace/xxx@tmp/durable-yyy, containing:

          • script.sh, which contains the command line we are executing, taken from the `sh` step of the Jenkinsfile
          • jenkins-log.txt, which captures the output of the command
          • jenkins-result.txt, which will contain the exit code of the command (when finished)
          • pid, which is not created by default but can contain the pid of the command

          In my environment, the error message is thrown after 605-915 seconds once both conditions are met:
          1. there is no exit code in the jenkins-result.txt file
          2. the log file (jenkins-log.txt) has not been modified for more than 300 seconds
          (1) happens when the wrapper script is no longer running
          (2) happens when the wrapper script is not running, or when the command has not written anything to the log for more than 300 seconds

          The source code is here:
          https://github.com/jenkinsci/durable-task-plugin/blob/master/src/main/java/org/jenkinsci/plugins/durabletask/BourneShellScript.java
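
          In rough pseudo-shell, the check described above amounts to something like this (my simplification of the observed behaviour, not the plugin's actual Java code):

          CONTROL_DIR="$WORKSPACE@tmp/durable-xxxxxxxx"   # placeholder control dir
          HEARTBEAT=300                                   # default HEARTBEAT_CHECK_INTERVAL, in seconds

          if [ ! -s "$CONTROL_DIR/jenkins-result.txt" ]; then               # (1) no exit code yet
            last_touch=$(stat -c %Y "$CONTROL_DIR/jenkins-log.txt")         # last modification time (GNU stat)
            now=$(date +%s)
            if [ $((now - last_touch)) -gt "$HEARTBEAT" ]; then             # (2) log stale for > 300 s
              if [ -f "$CONTROL_DIR/pid" ]; then
                echo "still have $CONTROL_DIR/pid so heartbeat checks unreliable; process may or may not be alive"
              else
                echo "wrapper script does not seem to be touching the log file in $CONTROL_DIR"
                # the step is then failed with "script returned exit code -1"
              fi
            fi
          fi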

          Tested scenarios:

          • when adding an exit code to the jenkins-result.txt file, it stops successfully
          • when only creating the jenkins-result.txt file, it keeps waiting until the file contains a code
          • long script (30 min) without output, with the wrapper script running = no issue
          • long script (30 min) with output, without the wrapper script running = no issue
          • long script (30 min) without output, without the wrapper script running = failing after 608 s
          • long script (30 min) without output, without the wrapper script running, but with a pid file = no issue

          That means a workaround could be to manually create a pid file before starting the command and remove it when finished. When the command is quite long, the plugin will kindly print this warning every 5 minutes but keep waiting for the process to finish:

          still have /home/ubuntu/workspace/xxxx@tmp/durable-yyy/pid so heartbeat checks unreliable; process may or may not be alive
          

          It can be done with a simple shell script like the one below. However, we should still follow best practices: set up pipeline/step timeouts and avoid pipelines that run forever.

          # workaround for the wrapper issue: create a pid file in the durable control dir
          MYCONTROLDIR=$(echo "$PWD@tmp/durable"*)
          echo $$ > "$MYCONTROLDIR/pid"

          ./my_longcommand_to_run.sh   # the long-running command goes here
          exitcode=$?

          # clean up the pid file we created
          rm -f "$MYCONTROLDIR/pid"
          exit $exitcode
          

          In my case, the root cause is still unknown. Suspecting that the wrapper script is being killed, the system OOM killer (triggered by pipelines filling the slave's memory) is a strong candidate, but I still need evidence of it.

          A suggestion to improve the durable-task-plugin: it would help if it could tell us which scenario we are in:
          1. the wrapper script is not running any more (killed?)
          2. the wrapper script is running but hitting some error (and if it writes an error to stderr, please log it somewhere...)
          3. the wrapper script is running, but very slowly, due to some CPU/storage/filesystem slowness.


          Jesse Glick added a comment -

          OOMKiller is a likely culprit. In general there is not going to be a reliable record of when the wrapper was killed or why, nor would there be any error messages from the wrapper script to log.

          The newer binary wrapper may behave differently, but it could still be killed.

          Originally the wrapper’s PID was recorded, and the agent would periodically check to see whether that process was still alive. This was switched to a heartbeat because it was more portable (no need for JNI-based system calls) and also worked without special considerations when Launcher was wrapped in a container (such as by the withDockerContainer step, or container in the kubernetes plugin). The only requirement is that the agent JVM and the user process and (if applicable) the filesystem server can agree on clock skew to within a few minutes. I doubt there is much difference either way in diagnosability: in either case, if the wrapper is killed, the agent JVM manages to detect this sooner or later, but does not have any information as to the ultimate reason for the error.

          (Freestyle projects do not survive controller or agent JVM restarts, because the user script is a child of the agent JVM. Thus if e.g. OOMKiller kicks in, the build will simply abort right away—again with no particularly informative message beyond the fact that it got a SIGKILL.)


          Benoit Bourdin added a comment -

          After more testing and investigation, we identified two possible root causes for this common issue:

          1. A race condition: the agent terminates before the wrapper script can complete and report the status/log of the running command. For example, with the docker-workflow plugin, when a docker container agent suddenly stops:
            1. because the image has a quick command as its entrypoint,
            2. or because of an OutOfMemory condition.
          2. The wrapper script fails or is killed for whatever reason, but the command keeps running. This should be very rare; I have never observed evidence of this situation.

          Cause 1.1: quick command as entrypoint
          When using a docker agent, Jenkins runs the following commands in this order (sketched below):

          1. docker run: to start the container
          2. docker top: to validate that the container started correctly with the expected command
          3. docker exec: to start, inside the container, a wrapper script which in turn starts the command you have defined.
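
          In plain docker CLI terms, the sequence looks roughly like this (simplified; the plugin passes many more options for volumes, user, workdir, environment, ...):

          CID=$(docker run -t -d my-build-image:latest)                      # 1. start the container
          docker top "$CID"                                                  # 2. check the expected command is running
          docker exec "$CID" sh -c '... durable-task wrapper script ...'     # 3. run the wrapper inside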

          The default ENTRYPOINT of the docker image could be a quick command (mdspell, for example). So the container could:

          1. stop before the docker top command runs, causing the "container started but didn't run the expected command" error.
          2. stop before the docker exec command finishes, causing the wrapper script error.

          Resolution: force the entrypoint, even to an empty value, by passing --entrypoint to the docker agent.
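
          For illustration only (plain docker CLI, not the exact arguments the plugin generates), overriding the entrypoint so the container does not exit early looks like:

          # reset whatever ENTRYPOINT the image declares and run a long-lived
          # command instead (cat keeps the container alive until it is stopped)
          docker run -t -d --entrypoint "" my-build-image:latest cat

          With a Jenkinsfile docker agent, this would typically go into the extra arguments passed to docker run.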

          Cause 1.2: when caused by an OutOfMemory
          I could not find any way to control the cgroup of jobs, nor to limit capacity per executor. When using docker, the containers are created in a dedicated cgroup with no memory limit. We only rely on best practices, like writing pipelines that pass the --memory option to docker.
          When memory consumption reaches a critical level on the slave, the OOM killer can kill processes more or less at random, which terminates docker agents.
          Resolution: best practices on memory usage and better sizing. We created a script running every minute to force a memory limit on all running docker containers, which resolved most of the issues for us.
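
          A minimal sketch of such a periodic script (the limit value, container selection, and scheduling are placeholders, not our exact script):

          #!/bin/sh
          # run from cron every minute, e.g.: * * * * * /usr/local/bin/limit-docker-memory.sh
          LIMIT=4g   # hypothetical per-container memory limit

          for c in $(docker ps -q); do
            # HostConfig.Memory is 0 when no limit has been set on the container
            if [ "$(docker inspect -f '{{.HostConfig.Memory}}' "$c")" = "0" ]; then
              docker update --memory "$LIMIT" --memory-swap "$LIMIT" "$c"
            fi
          done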


          Jesse Glick added a comment -

          1.1 sounds like a bug in the docker-workflow plugin which may be fixable, though I do not advise use of that plugin to begin with.

          1.2 is basically outside the control of Jenkins, but the durable-task plugin could suggest this as a possible root cause when printing error messages. (An older version of durable-task recorded the PID of the wrapper process and then checked the process list to see if it was still running. This proved to be hard to maintain in various environments, however; the heartbeat system is more portable.)
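
          For comparison, the older PID-based liveness check amounted to roughly the following (a sketch, not the plugin's actual implementation):

          CONTROL_DIR="$WORKSPACE@tmp/durable-xxxxxxxx"   # placeholder control dir
          # kill -0 sends no signal; it only tests whether the process still exists
          if kill -0 "$(cat "$CONTROL_DIR/pid")" 2>/dev/null; then
            echo "wrapper process still alive"
          else
            echo "wrapper process is gone; the step can be failed"
          fi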


            Assignee: Unassigned
            Reporter: Evan Ward
            Votes: 16
            Watchers: 35