Pipeline shell step aborts prematurely with ERROR: script returned exit code -1

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • durable-task-plugin
    • None
    • durable-task 1.26

      A few of my Jenkins pipelines failed last night with this failure mode:

      01:19:19 Running on blackbox-slave2 in /var/tmp/jenkins_slaves/jenkins-regression/path/to/workspace.   [Note: this is an SSH slave]
      [Pipeline] {
      [Pipeline] ws
      01:19:19 Running in /net/nas.delphix.com/nas/regression-run-workspace/jenkins-regression/workspace@10. [Note: This is an NFS share on a NAS]
      [Pipeline] {
      [Pipeline] sh
      01:20:10 [qa-gate] Running shell script
      [... script output ...]
      01:27:19 Running test_create_domain at 2017-11-29 01:27:18.887531... 
      [Pipeline] // dir
      [Pipeline] }
      [Pipeline] // ws
      [Pipeline] }
      [Pipeline] // node
      [Pipeline] }
      [Pipeline] // timestamps
      [Pipeline] }
      [Pipeline] // timeout
      ERROR: script returned exit code -1
      Finished: FAILURE
      

      As far as I can tell the script was running fine, but apparently Jenkins killed it prematurely because Jenkins didn't think the process was still alive.

      The interesting thing is that this is normally working, but failed last night at exactly the same time in multiple pipeline jobs. And I only started seeing this after upgrading durable-task-plugin from 1.14 to 1.17. I looked at the code change and saw that the main change has been the change in ProcessLiveness from using a ps-based system to a timestamp-based system. What I suspect is that the NFS server on which this workspace is hosted wasn't processing I/O operations fast enough at the time this problem occurred, so the timestamp wasn't updated even though the script continued running. Note that I am not using Docker here, this is just a regular SSH slave.

      The ps-based approach may have been suboptimal, but it was more reliable for us than the new timestamp-based approach, at least when using NFS-based workspaces. Expecting a timestamp to increase on a file every 15 seconds may be a tall order for some system and network administrators, especially over NFS – network issues can and do happen, and they shouldn't take down Jenkins jobs when they do. Our Jenkins jobs used to just hang when there was an NFS outage; now the script liveness check kills the job. I view this as a regression. As flawed as the old approach may have been, it was immune to this failure mode. Is there anything I can do here besides increasing various timeouts to avoid hitting this? The fact that no diagnostic information was printed to the Jenkins log or the SSH slave remoting log is also problematic here.
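To make the suspected failure mode concrete, here is a minimal Java sketch of a timestamp-based liveness check. It is not the plugin's code: the class `HeartbeatLiveness`, the method `looksAlive`, and the temp file standing in for the heartbeat file are all hypothetical. The 15-second window comes from the description above, and the 60-second staleness is an arbitrary value chosen to imitate a stalled NFS write.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.time.Duration;
import java.time.Instant;

/** Hypothetical stand-in for a timestamp-based liveness check; not the plugin's code. */
class HeartbeatLiveness {
    /**
     * Treats the script as alive only if the heartbeat file was modified
     * no longer than maxAge before the given instant.
     */
    static boolean looksAlive(Path heartbeatFile, Instant now, Duration maxAge) throws IOException {
        if (!Files.exists(heartbeatFile)) {
            return false; // no heartbeat file at all
        }
        Instant lastTouched = Files.getLastModifiedTime(heartbeatFile).toInstant();
        return Duration.between(lastTouched, now).compareTo(maxAge) <= 0;
    }

    public static void main(String[] args) throws IOException {
        Path heartbeat = Files.createTempFile("heartbeat", ".log");
        Instant now = Instant.now();
        // Imitate a stalled NFS write: the file was last touched 60 seconds ago,
        // even though the (imaginary) script is still running.
        Files.setLastModifiedTime(heartbeat, FileTime.from(now.minusSeconds(60)));
        System.out.println("15 s window:  " + looksAlive(heartbeat, now, Duration.ofSeconds(15)));
        System.out.println("300 s window: " + looksAlive(heartbeat, now, Duration.ofSeconds(300)));
        Files.delete(heartbeat);
    }
}
```

Under these assumptions, a check like this cannot tell a dead process from a live one whose heartbeat write is merely delayed. Widening the window only makes the false positive less likely.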

          [JENKINS-48300] Pipeline shell step aborts prematurely with ERROR: script returned exit code -1

          Basil Crow created issue -
          Basil Crow made changes -
          Link New: This issue relates to JENKINS-47791 [ JENKINS-47791 ]

          Basil Crow added a comment -

          jglick, since you've been working on this subsystem here, any ideas on a way forward?


          Basil Crow added a comment -

          When I add -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300 to my JVM options, this problem goes away. I remain concerned about the general strategy in the case of this new NFS failure mode.
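For readers unfamiliar with `-D` options, the following Java sketch shows one common way a value like this is picked up from a JVM system property. The property name is the one quoted in this comment. The class `HeartbeatConfig`, its method, and the default of 15 (taken from the 15-second expectation in the description) are assumptions for illustration, not the plugin's actual source.

```java
/** Hypothetical sketch of reading an interval override from a JVM system property. */
class HeartbeatConfig {
    // Property name as quoted in the comment above.
    static final String PROPERTY =
            "org.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL";

    // Assumed default, based on the 15-second expectation mentioned in the report.
    static final int ASSUMED_DEFAULT_SECONDS = 15;

    static int heartbeatCheckIntervalSeconds() {
        // Integer.getInteger reads the named system property and falls back to
        // the supplied default when it is unset or not a valid integer.
        return Integer.getInteger(PROPERTY, ASSUMED_DEFAULT_SECONDS);
    }

    public static void main(String[] args) {
        System.out.println("Heartbeat check interval: " + heartbeatCheckIntervalSeconds() + " s");
    }
}
```

Starting the JVM with the `-D...HEARTBEAT_CHECK_INTERVAL=300` option from this comment would make the sketch report 300 instead of the assumed default.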


          Jesse Glick added a comment -

          The default heartbeat interval could be increased, though using NFS for workspaces is generally not a good plan to begin with. Also there is an outstanding to-do task to adjust the durable-task API to allow a TaskListener to be injected into more calls, such as Controller.exitStatus, which would allow the implementation to print a helpful diagnostic before returning -1.
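As a rough illustration of the proposed API adjustment, the Java sketch below shows an exit-status method gaining an overload that accepts a logger, so a diagnostic can be printed before -1 is returned. Everything here is a simplified assumption: `ExitStatusSketch` is not the durable-task `Controller`, a plain `PrintStream` stands in for the stream a `TaskListener` would provide, the boolean parameters replace real file checks, and the message wording is invented.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

/** Hypothetical, simplified stand-in for a controller; not the durable-task API. */
class ExitStatusSketch {
    /** Older-style call: returns -1 for a presumed-dead process with no explanation. */
    static Integer exitStatus(boolean resultFileExists, int recordedCode, boolean heartbeatFresh) {
        PrintStream discard = new PrintStream(new ByteArrayOutputStream());
        return exitStatus(resultFileExists, recordedCode, heartbeatFresh, discard);
    }

    /** Overload taking a logger, so the caller's build log can explain the -1. */
    static Integer exitStatus(boolean resultFileExists, int recordedCode,
                              boolean heartbeatFresh, PrintStream logger) {
        if (resultFileExists) {
            return recordedCode; // the script finished and recorded its exit code
        }
        if (heartbeatFresh) {
            return null; // still running, nothing to report yet
        }
        // Invented wording; the point is only that something is printed.
        logger.println("heartbeat not updated recently; assuming the process died");
        return -1;
    }

    public static void main(String[] args) {
        // Stale heartbeat and no result file: prints the diagnostic, then -1.
        System.out.println(exitStatus(false, 0, false, System.out));
    }
}
```

The overload keeps existing callers working while letting newer callers surface the reason in the build log.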

          Jesse Glick made changes -
          Remote Link New: This issue links to "durable-task PR 57 (Web Link)" [ 19953 ]
          Jesse Glick made changes -
          Remote Link New: This issue links to "workflow-durable-task-step PR 62 (Web Link)" [ 19954 ]

          Jesse Glick added a comment -

          "there is an outstanding to-do task"

          Filed.


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Jesse Glick
          Path:
          src/main/java/org/jenkinsci/plugins/durabletask/BourneShellScript.java
          src/main/java/org/jenkinsci/plugins/durabletask/Controller.java
          src/main/java/org/jenkinsci/plugins/durabletask/FileMonitoringTask.java
          src/test/java/org/jenkinsci/plugins/durabletask/BourneShellScriptTest.java
          src/test/java/org/jenkinsci/plugins/durabletask/PowershellScriptTest.java
          src/test/java/org/jenkinsci/plugins/durabletask/WindowsBatchScriptTest.java
          http://jenkins-ci.org/commit/durable-task-plugin/bc0e2357e7ee49e0046f3a76ecf87802acd3934a
          Log:
          JENKINS-48300 Add an overload for exitStatus taking TaskListener.

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Sam Van Oort
          Path:
          src/main/java/org/jenkinsci/plugins/durabletask/BourneShellScript.java
          src/main/java/org/jenkinsci/plugins/durabletask/Controller.java
          src/main/java/org/jenkinsci/plugins/durabletask/FileMonitoringTask.java
          src/test/java/org/jenkinsci/plugins/durabletask/BourneShellScriptTest.java
          src/test/java/org/jenkinsci/plugins/durabletask/PowershellScriptTest.java
          src/test/java/org/jenkinsci/plugins/durabletask/WindowsBatchScriptTest.java
          http://jenkins-ci.org/commit/durable-task-plugin/7c12b3a72cb402d89f5d51b7a88811f2ac075891
          Log:
          Merge pull request #57 from jglick/exitStatus-JENKINS-48300

          JENKINS-48300 Add an overload for exitStatus taking TaskListener

          Compare: https://github.com/jenkinsci/durable-task-plugin/compare/7f57bb297ee3...7c12b3a72cb4

            Assignee: Jesse Glick (jglick)
            Reporter: Basil Crow (basil)
            Votes: 6
            Watchers: 33
