Bug
Resolution: Unresolved
Critical
jenkins 2.19.4
durable-task 1.12
Linux server
Windows slave
java 1.8.0_51
Pipeline jobs occasionally hang on bat steps. This looks similar to JENKINS-34150, but we are using durable-task 1.12, which already includes the fix for that issue.
Thread dump from the job (after 18 hours):
Thread #26
at DSL.bat(awaiting process completion in C:\j\w\<folder>\<job>@tmp\durable-56b1eae1 on <slave>)
at WorkflowScript.run(WorkflowScript:349)
at DSL.withEnv(Native Method)
at WorkflowScript.run(WorkflowScript:242)
at DSL.stage(Native Method)
at WorkflowScript.run(WorkflowScript:156)
at DSL.node(running on <slave>)
at WorkflowScript.run(WorkflowScript:34)
The bat step is running a batch file:
cmd /c call test.bat ....
That batch file in turn runs a Python script which, in this case, throws an exception (I can see this from the log files on the slave). On the slave, the "durable-56b1eae1" folder is present and contains jenkins-log.txt, jenkins-main.bat and jenkins-wrap.bat. There is no sign of the batch process on the slave, so I presume it has completed. The build continues to occupy an executor slot. There are also several flyweight tasks from the matrix plugin on the same slave.
Please let me know if there is anything else I can do to help diagnose this.
I believe I hit the same problem reported here. As noted above, if you run a pipeline with a bat step that takes a while to complete, like:
bat "ping 127.0.0.1 -n 50"
and you kill the Jenkins agent while the ping command is still executing, you'll get an exception (expected) and a hang (unexpected):
2017-05-31 15:26:52 -0500 [Build-head] Reply from 127.0.0.1: bytes=32 time<1ms TTL=128
2017-05-31 15:26:53 -0500 [Build-head] Reply from 127.0.0.1: bytes=32 time<1ms TTL=128
2017-05-31 15:27:03 -0500 [Build-head] Cannot contact delcastillo: java.lang.InterruptedException
I added a try/catch block around the bat command, but that didn't help: the command is still stuck. I'm not sure why, but it looks like another, unrelated issue.
Regarding the original issue, from what I can tell, the bat step expects a return value, which is written to a file (jenkins-result). If the agent reboots, the file is never created and Jenkins thinks the process is still running. The sh step also stores the pid of the process in a file, which I believe is used to determine whether the process has died. So when the master loses its connection to the agent, on reconnect the presence or absence of the pid file (and of the jenkins-result file) determines whether the process has ended.
It seems to me that the bat step should implement a mechanism similar to the one in sh and store the process pid.
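To make the idea concrete, here is a minimal sketch of the kind of check I have in mind. It is written in Python for brevity and is not the durable-task plugin's actual code. The file names (`jenkins-result` as mentioned above, plus a hypothetical `pid` file), the `Status` enum and the `pid_alive` callback are all assumptions for illustration. The liveness test is passed in as a callback because how you ask the OS "is this pid alive?" differs between Windows and Linux.

```python
from enum import Enum
from pathlib import Path
from typing import Callable

RESULT_FILE = "jenkins-result"  # name as used in this thread; real name may differ
PID_FILE = "pid"                # hypothetical: bat does not write one today


class Status(Enum):
    RUNNING = "running"
    FINISHED = "finished"
    LOST = "lost"  # process is gone and never wrote a result


def check_status(control_dir: Path, pid_alive: Callable[[int], bool]) -> Status:
    """Decide the state of a durable task from the files in its control dir."""
    # A result file means the wrapper script ran to completion.
    if (control_dir / RESULT_FILE).exists():
        return Status.FINISHED

    pid_file = control_dir / PID_FILE
    if not pid_file.exists():
        # No pid recorded: nothing to check against, so we can only assume
        # the process is still running (this is where a hang can come from).
        return Status.RUNNING

    # With a recorded pid we can tell a slow process from a dead one.
    pid = int(pid_file.read_text().strip())
    return Status.RUNNING if pid_alive(pid) else Status.LOST
```

With something like this, a reconnecting master could fail the step when the status is `LOST` instead of waiting forever for a result file that will never appear.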