Jenkins / JENKINS-32264

sh step returns ERROR -1, but shell step still running

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Component/s: durable-task-plugin
    • Labels: None
    • Environment: FreeBSD 10.2
      workflow-plugin 1.12
      Jenkins 1.641
      durable-task plugin 1.7

      I have this workflow script:
      https://github.com/freebsd/freebsd-ci/blob/master/scripts/build/build-test.groovy

      which runs this script:
      https://github.com/freebsd/freebsd-ci/blob/master/scripts/build/build1.sh

      In this Jenkins workflow job:
      https://jenkins.freebsd.org/job/FreeBSD_HEAD_sparc64/4/console

      I get ERROR: script returned exit code -1

      I have tried instrumenting my script, but nothing in it is failing.
      In fact, even though the script is reported as failed,
      if I log into my machine and look at the process table,
      the script is still running:

       7911  -  Is     0:00.00 |-- daemon: /usr/local/openjdk8/bin/java[7912] (daemon)
       7912  -  I      3:29.71 | `-- /usr/local/openjdk8/bin/java -DJENKINS_HOME=/usr/local/jenkins -jar /usr/local/share/jenkins/jenkins.war --webroot=/usr/local/jenkins/war --httpPort=8180 --prefix=/jenkins start
      11634  -  I      0:00.00 |   `-- sh -c echo $$ > '/usr/local/jenkins/jobs/FreeBSD_HEAD_sparc64/workspace/src/.jenkins-8cedefcf/pid'; jsc=durable-2216cfc607681c7235484b12b6ac8c19; JENKINS_SERVER_COOKIE=$jsc '/usr/local/jenkins/jobs/FreeBSD_HEAD_sparc64/workspace/src/.jenkins-8cedefcf/script.sh' > '/usr/local/jenkins/jobs/FreeBSD_HEAD_sparc64/workspace/src/.jenkins-8cedefcf/jenkins-log.txt' 2>&1; echo $? > '/usr/local/jenkins/jobs/FreeBSD_HEAD_sparc64/workspace/src/.jenkins-8cedefcf/jenkins-result.txt'
      11635  -  I      0:00.00 |     `-- /bin/sh -xe /usr/local/jenkins/jobs/FreeBSD_HEAD_sparc64/workspace/src/.jenkins-8cedefcf/script.sh
      11636  -  I      0:00.00 |       `-- /bin/sh /usr/local/jenkins/jobs/FreeBSD_HEAD_sparc64/workspace/freebsd-ci/scripts/build/build1.sh
      11642  -  I      0:00.01 |         `-- make -d xl buildworld __MAKE_CONF=/usr/local/jenkins/jobs/FreeBSD_HEAD_sparc64/workspace/make.conf
      11652  -  I      0:00.03 |           `-- make -m /usr/local/jenkins/jobs/FreeBSD_HEAD_sparc64/workspace/src/share/mk -f Makefile.inc1 TARGET=sparc64 TARGET_ARCH=sparc64 buildworld
      79938  -  S      0:00.02 |             `-- make -f Makefile.inc1 DESTDIR=/usr/local/jenkins/jobs/FreeBSD_HEAD_sparc64/workspace/obj/sparc64.sparc64/usr/local/jenkins/jobs/FreeBSD_HEAD_sparc64/workspace/src/tmp depend
      82835  -  S      0:00.01 |               `-- make depend DIRPRFX=kerberos5/
      82897  -  S      0:00.01 |                 `-- make depend DIRPRFX=kerberos5/tools/
      82898  -  S      0:00.02 |                   `-- make depend DIRPRFX=kerberos5/tools/make-roken/
      82899  -  R      0:00.02 |                     `-- awk -f awk /usr/local/jenkins/jobs/FreeBSD_HEAD_sparc64/workspace/src/kerberos5/tools/make-roken/../../../crypto/heimdal/lib/roken/roken.h.in
      

      The underlying script is still running, but the logs are going nowhere.
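
The controlling command at PID 11634 in the listing follows a simple pattern: record the controller's PID, run the script with all output redirected to a log file, then record the script's exit status. A minimal sketch of that pattern follows. The helper name launch_durable and the caller-supplied control directory are assumptions for illustration; the file names come from the listing, and the JENKINS_SERVER_COOKIE variable is left out.

```shell
#!/bin/sh
# launch_durable is a hypothetical helper, not the plugin's code.
# Usage: launch_durable CONTROL_DIR COMMAND [ARGS...]
launch_durable() {
    control_dir=$1
    shift
    # The controlling shell writes its own PID, runs the command with
    # stdout and stderr sent to the log, then writes the exit status.
    sh -c 'dir=$1; shift
           echo $$ > "$dir/pid"
           "$@" > "$dir/jenkins-log.txt" 2>&1
           echo $? > "$dir/jenkins-result.txt"' sh "$control_dir" "$@" &
}

# Example (commented out so that sourcing this file has no side effects):
#   launch_durable /tmp/ctl sh -c 'echo building; exit 2'
```

If the controlling shell dies before its last line runs, jenkins-result.txt is never written, so whatever watches the directory needs a separate way to tell a dead controller from a slow one.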


          Craig Rodrigues added a comment -

          This function in akuma does a pretty good job at taking a pid as an argument, and then figuring out what arguments were passed to the pid:

          https://github.com/kohsuke/akuma/blob/master/src/main/java/com/sun/akuma/JavaVMArguments.java#L101

          It works on Linux, Solaris, MacOS X, and FreeBSD.

          If you can use something like that instead of ps, I would suggest going in that direction.

          Jesse Glick added a comment - - edited

          Well hudson.os.PosixAPI.jnr().getpgid(pid) != -1 might be easier than writing fresh JNA code or using Akuma for something other than its intended purpose.

          Whatever you do, be sure to test it. Connect to a slave (running on a different computer). Run a sh 'sleep 9999' step. Then look up the PID of the controlling script (the one mentioned in the pid file, not the one directly running sleep) and kill -9 it to simulate a reboot of the machine. Or actually reboot the machine. You should see the step exit with status -1. Otherwise you should see a real exit status. (If you abort the step with the red X, on Linux you will typically get status 143, which IIRC is 128 plus the value of SIGTERM.)

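The manual test described in this comment can be approximated on a single machine. The sketch below is not the plugin's code: poll_status is a hypothetical helper, the control-directory layout (pid and jenkins-result.txt files) is assumed from the process listing in the issue description, and kill -0 stands in for the getpgid call as the liveness probe. Note that kill -0 only works for processes the caller is allowed to signal.

```shell
#!/bin/sh
# poll_status is a hypothetical helper. It prints:
#   the recorded exit status, if the controller finished normally;
#   "running", if the controller process still exists;
#   -1, if the controller is gone and left no result file.
poll_status() {
    dir=$1
    if [ -s "$dir/jenkins-result.txt" ]; then
        cat "$dir/jenkins-result.txt"
    elif kill -0 "$(cat "$dir/pid" 2>/dev/null)" 2>/dev/null; then
        echo running
    else
        printf '%s\n' -1
    fi
}

# To simulate the reboot scenario by hand (commented out):
#   kill -9 "$(cat "$dir/pid")"   # kill the controller, not the child sleep
#   poll_status "$dir"            # expected to report -1
```

A missing or empty pid file is also reported as -1 here, which is a simplification; a real watcher would have to allow for the short window before the controller has written its PID.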

          Jesse Glick added a comment -

          Ah, found that there is a BourneShellScriptTest.reboot which you could try running on FreeBSD.


          Jesse Glick added a comment -

          "now seems to always return 0 whether or not the process exist"

          False alarm. I am running

          ps -o pid= 1234
          

          which either prints 1234 and exits 0, or prints nothing and exits 1. This seems to suggest the same should work on FreeBSD. So I am still unsure why the current plugin would fail on that platform.


          Jesse Glick added a comment -

          Filed a PR that I think will work. Please help test it.


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Jesse Glick
          Path:
          src/test/java/org/jenkinsci/plugins/docker/workflow/WithContainerStepTest.java
          http://jenkins-ci.org/commit/docker-workflow-plugin/20791b4a48b8d6d27d0aedec6cb6894a92e8cab6
          Log:
          JENKINS-32264 Verifying that fallback behavior of ProcessLiveness for decorated launchers actually works to detect killed scripts.

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Jesse Glick
          Path:
          src/test/java/org/jenkinsci/plugins/docker/workflow/WithContainerStepTest.java
          http://jenkins-ci.org/commit/docker-workflow-plugin/5238693984b1a20ff5823c7b4ee59ac2cfcaddce
          Log:
          Merge pull request #22 from jglick/ProcessLiveness-JENKINS-32264

          JENKINS-32264 Verifying that fallback behavior of ProcessLiveness works

          Compare: https://github.com/jenkinsci/docker-workflow-plugin/compare/182d9f903f7d...5238693984b1

          Daniel Beck added a comment -

          PR build can be downloaded from here for testing: https://jenkins.ci.cloudbees.com/job/plugins/job/durable-task-plugin/64/org.jenkins-ci.plugins$durable-task/

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Jesse Glick
          Path:
          src/main/java/org/jenkinsci/plugins/durabletask/ProcessLiveness.java
          http://jenkins-ci.org/commit/durable-task-plugin/a58303d43ce4ac46821d0decfb4b78415aef58d7
          Log:
          JENKINS-32264 Use native POSIX calls, not /proc, to determine liveness.
          Also try to autodetect OS-specific failures and fall back to assuming liveness.
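
The idea of autodetecting a broken liveness check and falling back to assuming liveness can be illustrated in shell. This is only one way such a self-test could work and is not taken from ProcessLiveness.java: the function names and the LIVENESS_PROBE hook are hypothetical, and the default probe uses the ps -o pid= check discussed earlier in the thread, written with the POSIX -p option.

```shell
#!/bin/sh
# probe_ps: exits 0 if ps reports a process with the given PID.
probe_ps() {
    [ -n "$(ps -o pid= -p "$1" 2>/dev/null)" ]
}

# LIVENESS_PROBE is a hypothetical hook naming the probe to use.
: "${LIVENESS_PROBE:=probe_ps}"

is_alive_with_fallback() {
    # Self-test: this shell is certainly running, so a probe that cannot
    # see it must be broken on this platform.
    if ! "$LIVENESS_PROBE" "$$"; then
        return 0    # fall back to assuming the process is alive
    fi
    "$LIVENESS_PROBE" "$1"
}
```

The fallback trades one failure mode for another: a job whose controller really died may appear to hang, but a healthy job is no longer reported as failed with exit code -1 while it is still running.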

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Jesse Glick
          Path:
          src/main/java/org/jenkinsci/plugins/durabletask/BourneShellScript.java
          src/main/java/org/jenkinsci/plugins/durabletask/ProcessLiveness.java
          http://jenkins-ci.org/commit/durable-task-plugin/89b0bb990f66b346cccb79904f0f89af0362e784
          Log:
          Merge pull request #14 from jglick/ProcessLiveness-JENKINS-32264

          JENKINS-32264 Fix process liveness check for non-Linux platforms

          Compare: https://github.com/jenkinsci/durable-task-plugin/compare/b27495f91909...89b0bb990f66

            Assignee: Jesse Glick (jglick)
            Reporter: Craig Rodrigues (rodrigc)