Type: Bug
Resolution: Unresolved
Priority: Critical
jenkins 2.19.4
durable-task 1.12
Linux server
Windows slave
java 1.8.0_51
Pipeline jobs occasionally hang on bat steps. This seems similar to JENKINS-34150, but we are using durable-task 1.12, which includes the fix for that issue.
Thread dump from the job (after 18 hours):
Thread #26
at DSL.bat(awaiting process completion in C:\j\w\<folder>\<job>@tmp\durable-56b1eae1 on <slave>)
at WorkflowScript.run(WorkflowScript:349)
at DSL.withEnv(Native Method)
at WorkflowScript.run(WorkflowScript:242)
at DSL.stage(Native Method)
at WorkflowScript.run(WorkflowScript:156)
at DSL.node(running on <slave>)
at WorkflowScript.run(WorkflowScript:34)
The bat step is running a batch file:
cmd /c call test.bat ....
which in turn runs a Python script that (in this case) throws an exception (I can see this from the log files on the slave). On the slave, the "durable-56b1eae1" folder is present and contains jenkins-log.txt, jenkins-main.bat and jenkins-wrap.bat. There is no sign of the batch process on the slave, so I presume that it has completed. The build continues to occupy an executor slot. There are also several flyweight tasks from the matrix plugin on the same slave.
Please let me know if there is anything else I can do to help diagnose this.
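For anyone trying to reproduce this, the check of the control folder described above could be scripted roughly as follows. This is only a sketch: the three file names are the ones listed in this report, while `jenkins-result.txt` as the name of the exit-code file and the helper `inspect_control_dir` are assumptions for illustration, not something confirmed in this ticket.

```python
from pathlib import Path

# File names reported in this issue as present in the durable-* folder.
REPORTED_FILES = ("jenkins-log.txt", "jenkins-main.bat", "jenkins-wrap.bat")
# ASSUMPTION: name of the file the wrapper writes the exit code to.
RESULT_FILE = "jenkins-result.txt"


def inspect_control_dir(control_dir):
    """Return a dict saying which control files exist in control_dir.

    The "exit_code" entry holds the stripped contents of the (assumed)
    result file, or None when that file is missing.
    """
    root = Path(control_dir)
    status = {name: (root / name).is_file() for name in REPORTED_FILES}
    result = root / RESULT_FILE
    status[RESULT_FILE] = result.is_file()
    # Only read the exit code if the result file is actually there.
    status["exit_code"] = result.read_text().strip() if result.is_file() else None
    return status
```

Calling `inspect_control_dir` with the path of the durable-* folder on the slave returns the presence flags. If the log and scripts are there but the assumed result file is not, that would be consistent with the step still "awaiting process completion" even though the process is gone.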
[JENKINS-41482] Pipeline bat step hangs (after restart)
Environment | Original: jenkins 2.19.4 durable-task 1.12 Linux server Windows10 slave java 1.8.0_51 | New: jenkins 2.19.4 durable-task 1.12 Linux server Windows slave java 1.8.0_51 |
Summary | Original: Pipeline bat step hangs | New: Pipeline bat step hangs (after restart) |
Component/s | New: workflow-durable-task-step-plugin [ 21715 ] | |
Component/s | Original: pipeline [ 21692 ] |
Component/s | Original: workflow-durable-task-step-plugin [ 21715 ] | |
Labels | New: windows |
Seen again today. This time the slave was Windows Server 2012 R2. There is no evidence of abnormal termination (e.g. an exception) inside the script, but the slave was restarted during the job. There is no message in the console output saying that the job was resuming (would there be?). There is no evidence of the script running on the slave. Maybe this is another resume problem (like JENKINS-33761).
The job thread dump is similar:
Thread #28
at DSL.bat(awaiting process completion in C:\j\w\<folder>\<job>\<repo>@tmp\durable-df209aae on <slave>)
at WorkflowScript.run(WorkflowScript:87)
at DSL.dir(Native Method)
at WorkflowScript.run(WorkflowScript:77)
at DSL.withEnv(Native Method)
at WorkflowScript.run(WorkflowScript:73)
at DSL.stage(Native Method)
at WorkflowScript.run(WorkflowScript:62)
at DSL.node(running on jagent-win11)
at WorkflowScript.run(WorkflowScript:25)
The scripts in the durable-df209aae folder are still present.