JENKINS-38682

Pipeline plugin sh script returned exit code -1 on Windows node after a short while

      I have many nodes running different Linux distributions, as well as a few Windows nodes running with SSH slaves and Cygwin (https://wiki.jenkins-ci.org/display/JENKINS/SSH+slaves+and+Cygwin).

      Using the following pipeline script:

      node('linuxHost') {
          stage('linuxHost linux') {
              sh '''sleep 30'''
          }
      }
      node('windowsHost') {
          stage('windowsHost cygwin') {
              bat 'sleep 30'
              sh '''sleep 30'''
          }
      }

      On my Windows node I always get the error "ERROR: script returned exit code -1" whenever the script runs for longer than about 250 ms, no matter what the command is (which is why I use a basic sleep in this example). All of my Windows nodes show exactly the same behavior.

      Here are the execution logs:
      Started by user Administrateur Jenkins
      [Pipeline] node
      Running on linuxHost in /var/jenkins_home/workspace/Unstable/TestShInPipeline
      [Pipeline] {
      [Pipeline] stage
      [Pipeline] { (linuxHost linux)
      [Pipeline] sh
      [TestShInPipeline] Running shell script
      + sleep 30
      [Pipeline] }
      [Pipeline] // stage
      [Pipeline] }
      [Pipeline] // node
      [Pipeline] node
      Running on windowsHost in C:/Users/localuser/jenkins_home/workspace/Unstable/TestShInPipeline
      [Pipeline] {
      [Pipeline] stage
      [Pipeline] { (windowsHost cygwin)
      [Pipeline] bat
      [C:/Users/localuser/jenkins_home/workspace/Unstable/TestShInPipeline] Running batch script
      C:\Users\localuser\jenkins_home\workspace\Unstable\TestShInPipeline>sleep 30
      [Pipeline] sh
      [TestShInPipeline] Running shell script
      + sleep 30
      [Pipeline] }
      [Pipeline] // stage
      [Pipeline] }
      [Pipeline] // node
      [Pipeline] End of Pipeline
      ERROR: script returned exit code -1
      Finished: FAILURE

      This may be related to JENKINS-32017.
      It works fine with the same node and the "Execute shell" build step in a freestyle project.

          [JENKINS-38682] Pipeline plugin sh script returned exit code -1 on windows node after a short while

          Jochen Ulrich added a comment -

          Ensure nohup is on the %PATH%. See JENKINS-33708
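One way to check this advice from the JVM side is to scan the PATH that the process actually inherited, much as which does. The sketch below is illustrative only: PathLookup and findOnPath are hypothetical names (not part of Jenkins), it tries the bare name plus a ".exe" suffix, and it only checks that a regular file exists, not that it is executable. It assumes the PATH of interest is the one seen by the JVM, which on Windows is ";"-separated and may differ from the PATH seen inside a Cygwin or MSYS shell.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

public class PathLookup {
    // Hypothetical helper (not part of Jenkins): returns the first PATH entry
    // containing a regular file named `name` or `name + ".exe"`.
    // It does not check the executable bit.
    static Optional<Path> findOnPath(String name, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        // File.pathSeparator is ";" on Windows and ":" elsewhere.
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String candidate : new String[] {name, name + ".exe"}) {
                try {
                    Path p = Paths.get(dir, candidate);
                    if (Files.isRegularFile(p)) {
                        return Optional.of(p);
                    }
                } catch (InvalidPathException e) {
                    // Skip malformed PATH entries (e.g. stray quotes on Windows).
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Inspect the PATH inherited by this JVM (e.g. when run on the agent machine).
        Optional<Path> nohup = findOnPath("nohup", System.getenv("PATH"));
        System.out.println(nohup.map(p -> "found: " + p).orElse("nohup not found on PATH"));
    }
}
```

Running it on the Windows machine with the same environment as the agent shows whether nohup is visible to that process at all.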

          Jochen Ulrich added a comment - Ensure nohup is on the %PATH% . See JENKINS-33708

          Ben Gamari added a comment - - edited

          I am also seeing this behavior on Windows 10 running an MSYS sshd. I have added C:\msys64\usr\bin to the Windows %PATH% environment variable and confirmed that I can run nohup from a standard cmd.exe instance. I can also confirm that nohup is in %PATH% by running the following test Jenkinsfile:

          node(label: 'windows') {
            stage('Hello') {
              sh 'echo $PATH'
              sh "which nohup"
              sh "nohup echo Hello world"
              sh "sleep 5; echo Hello world"
              sh "echo Hello world"
            }
          }
          

          The echo $PATH step confirms that /usr/bin is in $PATH,

          [C:\msys64\home\ben\jenkins\workspace\gitlab-test_master-GOUXDG2ZFVA4TFFLTYLXH3EXZCZTQQEMXKUO2E4QRK5ESVKIB7UQ] Running shell script
          
          + echo /usr/bin:/usr/bin:/c/ProgramData/Oracle/Java/javapath:/c/Program Files '(x86)/I...
          

          which nohup again confirms that nohup is present,

          [C:\msys64\home\ben\jenkins\workspace\gitlab-test_master-GOUXDG2ZFVA4TFFLTYLXH3EXZCZTQQEMXKUO2E4QRK5ESVKIB7UQ] Running shell script
          
          + which nohup
          
          /usr/bin/nohup
          

          Moreover, nohup echo shows that nohup is indeed functional,

          [C:\msys64\home\ben\jenkins\workspace\gitlab-test_master-GOUXDG2ZFVA4TFFLTYLXH3EXZCZTQQEMXKUO2E4QRK5ESVKIB7UQ] Running shell script
          
          + nohup echo Hello world
          
          Hello world
          

          However, sleeping in a shell script still fails:

          [C:\msys64\home\ben\jenkins\workspace\gitlab-test_master-GOUXDG2ZFVA4TFFLTYLXH3EXZCZTQQEMXKUO2E4QRK5ESVKIB7UQ] Running shell script
          
          + sleep 5
          
          script returned exit code -1
          

          Ben Gamari added a comment - - edited I am also seeing this behavior on Windows 10 running a msys sshd . I have added C:\msys64\usr\bin to the Windows %PATH% environment variable and have confirmed that I can run nohup from a standard cmd.exe instance. I can also confirm that nohup is in %PATH% by running the following test Jenkinsfile , node(label: 'windows') { stage('Hello') { sh 'echo $PATH' sh "which nohup" sh "nohup echo Hello world" sh "sleep 5; echo Hello world" sh "echo Hello world" } } The echo $PATH step confirms that /usr/bin is in $PATH , [C:\msys64\home\ben\jenkins\workspace\gitlab-test_master-GOUXDG2ZFVA4TFFLTYLXH3EXZCZTQQEMXKUO2E4QRK5ESVKIB7UQ] Running shell script + echo /usr/bin:/usr/bin:/c/ProgramData/Oracle/Java/javapath:/c/Program Files '(x86)/I... which nohup again confirms that nohup is present, [C:\msys64\home\ben\jenkins\workspace\gitlab-test_master-GOUXDG2ZFVA4TFFLTYLXH3EXZCZTQQEMXKUO2E4QRK5ESVKIB7UQ] Running shell script + which nohup /usr/bin/nohup Moreover, nohup echo shows that nohup is indeed functional, [C:\msys64\home\ben\jenkins\workspace\gitlab-test_master-GOUXDG2ZFVA4TFFLTYLXH3EXZCZTQQEMXKUO2E4QRK5ESVKIB7UQ] Running shell script + nohup echo Hello world Hello world However, {{sleep}}ing in a shell script nevertheless fails, [C:\msys64\home\ben\jenkins\workspace\gitlab-test_master-GOUXDG2ZFVA4TFFLTYLXH3EXZCZTQQEMXKUO2E4QRK5ESVKIB7UQ] Running shell script + sleep 5 script returned exit code -1

          Ben Gamari added a comment -

          I've bumped the priority of this issue, as it renders Jenkins' sh step completely unusable on Windows.

          Ben Gamari added a comment - I've bumped the priority of this as it renders Jenkins' sh task completely unusable on Windows.

          Ben Gamari added a comment - - edited

          I believe I have worked out what is going on here. The issue was introduced by 3c10cc0cb0a4738646966494fbeaab54bca115e1, which switched from JNR to JNA for access to getpgid. Unfortunately, JNA cannot load libc on Windows: it looks for libc rather than msvcrt, and MSYS2 (the Unix-like environment I'm running under) only provides a static libc archive. As a result, ProcessLiveness fails to load on the slave. This was quite difficult to track down because of the poor error messages from the slave (the failure surfaces only as a vague "remote call failed" IOException; debugging would be much easier if remoting exceptions were consistently thrown as a distinct exception type).

          Anyway, reverting 3c10cc0cb0a4738646966494fbeaab54bca115e1 appears to fix the issue. I believe this should also be safe, since JNR has apparently received some BSD fixes in recent months. I have confirmed that with the commit reverted, my simple sh test case works on both Windows and FreeBSD slaves.
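For comparison only: the check that breaks here depends on a native libc binding (JNR or JNA) to call getpgid. A pid liveness check can also be written against the JDK alone using ProcessHandle, which assumes Java 9 or later (newer than the setup in this report). This is not what the plugin, the reverted commit, or the PR mentioned here does; LivenessSketch and isAlive are hypothetical names used for illustration.

```java
import java.util.Optional;

public class LivenessSketch {
    // Hypothetical helper (not the plugin's ProcessLiveness): returns true if a
    // process with this pid exists and is still running. Uses only the JDK's
    // ProcessHandle (Java 9+), so no JNA/JNR libc binding has to load on the agent.
    static boolean isAlive(long pid) {
        Optional<ProcessHandle> handle = ProcessHandle.of(pid);
        // An empty Optional means no such process is known to the OS.
        return handle.map(ProcessHandle::isAlive).orElse(false);
    }

    public static void main(String[] args) {
        // The current JVM is, by definition, alive.
        System.out.println(isAlive(ProcessHandle.current().pid())); // prints true
    }
}
```

The design point is only that liveness checks need not go through a native libc at all; whether that trade-off suits the plugin is a separate question.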

          Ben Gamari added a comment - - edited I believe I have worked out what is going on here. The issue was introduced by 3c10cc0cb0a4738646966494fbeaab54bca115e1, which switches from JNR to JNA for access to getpgid . Unfortunately, JNA isn't able to load libc under Windows (as it looks for libc instead of msvcrt ; moreover, msys2 (the Unix-like environment I'm running under) only provides a static libc archive). This results in ProcessLiveness failing to load on the slave. This was quite difficult to track down due to the poor error messages provided by the slave (manifesting as a vague "remote call failed" IOException ; it would make debugging much easier if remoting exceptions were consistently thrown as a distinct exception type). Anyways, reverting 3c10cc0cb0a4738646966494fbeaab54bca115e1 appears to fix the issue. Moreover, I believe this should be safe to do since JNR has apparently seen some BSD fixes in the past months. I have confirmed that with the reverted patch my simple sh testcase works on both Windows and FreeBSD slaves.

          Ben Gamari added a comment - - edited

          See this PR for what I believe should be a fairly robust solution, given the in-progress upstream support in JNR.

          Ben Gamari added a comment - - edited See this PR for what I believe should be a rather robust solution to this given in-progress upstream support in JNR.

          Jesse Glick added a comment -

          If the JNR/JNA analysis is in fact correct, then JENKINS-47791 should have obsoleted this.

          Jesse Glick added a comment - In JNR/JNA analysis is in fact correct, then  JENKINS-47791 should have obsoleted this.

            Assignee: Unassigned
            Reporter: Benoît Desvages (bdesvages)
            Votes: 2
            Watchers: 7