• Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Component/s: durable-task-plugin
    • Environment: jenkins 2.19.4
      durable-task 1.12
      Linux server
      Windows slave
      java 1.8.0_51

      Pipeline jobs occasionally hang on bat steps. This seems similar to JENKINS-34150, but we are using durable-task 1.12, which has the fix for that.

      Thread dump from the job (after 18 hours):

      Thread #26
      	at DSL.bat(awaiting process completion in C:\j\w\<folder>\<job>@tmp\durable-56b1eae1 on <slave>)
      	at WorkflowScript.run(WorkflowScript:349)
      	at DSL.withEnv(Native Method)
      	at WorkflowScript.run(WorkflowScript:242)
      	at DSL.stage(Native Method)
      	at WorkflowScript.run(WorkflowScript:156)
      	at DSL.node(running on <slave>)
      	at WorkflowScript.run(WorkflowScript:34)
      

      The bat step is running a batch file:

      cmd /c call test.bat ....
      

      which in turn runs a Python script that, in this case, throws an exception (I can see this from the log files on the slave). On the slave, the "durable-56b1eae1" folder is present and contains jenkins-log.txt, jenkins-main.bat and jenkins-wrap.bat. There is no sign of the batch process on the slave, so I presume that it has completed. The build continues to occupy an executor slot. There are also several flyweight tasks from the matrix plugin on the same slave.
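The check described above (which files are present in the durable folder) can be sketched as a small script run on the agent. This is a hypothetical helper, not part of Jenkins or durable-task. It assumes that the plugin records the exit code in a file named jenkins-result.txt next to jenkins-log.txt once the wrapper script finishes; that file name is an assumption here and should be verified against the installed plugin version.

```python
from pathlib import Path

# File names seen in the report, plus one ASSUMED name (jenkins-result.txt)
# for the exit-code marker written when the wrapper script finishes.
LOG_FILE = "jenkins-log.txt"
RESULT_FILE = "jenkins-result.txt"  # assumption: verify against your plugin version
SCRIPT_FILES = ("jenkins-main.bat", "jenkins-wrap.bat")


def inspect_control_dir(control_dir):
    """Summarise which durable-task files exist in control_dir (hypothetical helper)."""
    root = Path(control_dir)
    result_path = root / RESULT_FILE
    exit_code = None
    if result_path.is_file():
        text = result_path.read_text(errors="replace").strip()
        # The marker is expected to hold a bare integer; keep None otherwise.
        if text.lstrip("-").isdigit():
            exit_code = int(text)
    return {
        "has_log": (root / LOG_FILE).is_file(),
        "has_scripts": all((root / name).is_file() for name in SCRIPT_FILES),
        "has_result": result_path.is_file(),
        "exit_code": exit_code,
    }


def looks_orphaned(summary):
    """True when output was captured but no exit code was ever recorded."""
    return summary["has_log"] and not summary["has_result"]
```

Pass it the directory named in the thread dump. If `looks_orphaned` returns True and no batch process is running, the step may be waiting for a result that will never be written, which would be consistent with the hang reported here.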

      Please let me know if there is anything else I can do to help diagnose this.

          [JENKINS-41482] Pipeline bat step hangs (after restart)

          Russell Gallop created issue -

          Russell Gallop added a comment -

          Seen again today. This time the slave was running Windows Server 2012 R2. There is no evidence of abnormal termination (e.g. an exception) inside the script, but the slave was restarted during the job. There is no message in the console output saying that the job was resuming (would there be?), and no evidence of the script running on the slave. Maybe this is another resume problem (like JENKINS-33761).

          The job thread dump is similar:

          Thread #28
          	at DSL.bat(awaiting process completion in C:\j\w\<folder>\<job>\<repo>@tmp\durable-df209aae on <slave>)
          	at WorkflowScript.run(WorkflowScript:87)
          	at DSL.dir(Native Method)
          	at WorkflowScript.run(WorkflowScript:77)
          	at DSL.withEnv(Native Method)
          	at WorkflowScript.run(WorkflowScript:73)
          	at DSL.stage(Native Method)
          	at WorkflowScript.run(WorkflowScript:62)
          	at DSL.node(running on jagent-win11)
          	at WorkflowScript.run(WorkflowScript:25)
          

          And the scripts in the durable folder are still present.
          Russell Gallop made changes -
          Environment Original: jenkins 2.19.4
          durable-task 1.12
          Linux server
          Windows10 slave
          java 1.8.0_51
          New: jenkins 2.19.4
          durable-task 1.12
          Linux server
          Windows slave
          java 1.8.0_51

          Russell Gallop added a comment -

          Trying to recreate this on my test Jenkins system (Windows 7, jenkins 2.32.1, durable-task 1.13, pipeline 2.4, 1 local executor on master). I'm using:

          node() {
              bat 'ping 127.0.0.1 -n 50'
              echo 'batch completed'
          }
          

          and I restart the Windows service while the ping command is running. 3 out of 10 of my runs resulted in a hang, so there appears to be a race condition here. I can't be certain, but it may be related to system load.

          When it hangs it has a thread dump like this:

          Thread #2
          	at DSL.bat(awaiting process completion in C:\j\jobs\JENKINS-41482\ws@tmp\durable-82758db8; recurrence period: 6628ms; check task scheduled; cancelled? false done? false)
          	at WorkflowScript.run(WorkflowScript:2)
          	at DSL.node(running on )
          	at WorkflowScript.run(WorkflowScript:1)
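When comparing many hung builds, it can help to pull the status fields out of the `DSL.bat(...)` line of each thread dump. The sketch below is a hypothetical helper, not a Jenkins API. Its pattern is inferred only from the dumps quoted in this issue, so the field layout is an assumption and may differ in other plugin versions.

```python
import re

# Pattern inferred from the thread dumps quoted in this issue (an assumption,
# not a documented format). Every field after the directory is optional.
STATUS_RE = re.compile(
    r"awaiting process completion in (?P<dir>[^;)]+?)"
    r"(?: on (?P<agent>[^;)]+))?"
    r"(?:; recurrence period: (?P<period_ms>\d+)ms)?"
    r"(?:; (?P<check>check task scheduled))?"
    r"(?:; cancelled\? (?P<cancelled>true|false) done\? (?P<done>true|false))?"
    r"\s*\)"
)


def _to_bool(value):
    """Map 'true'/'false' to bool, keeping None when the field is absent."""
    return None if value is None else value == "true"


def parse_bat_status(line):
    """Return the status fields of a DSL.bat dump line, or None if it does not match."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    agent = m.group("agent")
    period = m.group("period_ms")
    return {
        "dir": m.group("dir").strip(),
        "agent": agent.strip() if agent else None,
        "period_ms": int(period) if period else None,
        "check_scheduled": m.group("check") is not None,
        "cancelled": _to_bool(m.group("cancelled")),
        "done": _to_bool(m.group("done")),
    }
```

A dump line that reports `done? false` while a check task is still scheduled, long after the process has exited, matches the hang described in this comment.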
          Russell Gallop made changes -
          Summary Original: Pipeline bat step hangs New: Pipeline bat step hangs (after restart)
          Andrew Bayer made changes -
          Component/s New: workflow-durable-task-step-plugin [ 21715 ]
          Component/s Original: pipeline [ 21692 ]

          Jesse Glick added a comment -

          Probably a duplicate.

          Being able to reproduce from scratch is key.

          Jesse Glick made changes -
          Component/s Original: workflow-durable-task-step-plugin [ 21715 ]
          Labels New: windows

            Assignee: Unassigned
            Reporter: Russell Gallop
            Votes: 3
            Watchers: 5