• Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Component: durable-task-plugin
    • Environment: Jenkins 2.19.4
      durable-task 1.12
      Linux server
      Windows slave
      Java 1.8.0_51

      Pipeline jobs occasionally hang on bat steps. This seems similar to JENKINS-34150 but we are using durable-task 1.12 which has the fix for that.

      Thread dump from the job (after 18 hours):

      Thread #26
      	at DSL.bat(awaiting process completion in C:\j\w\<folder>\<job>@tmp\durable-56b1eae1 on <slave>)
      	at WorkflowScript.run(WorkflowScript:349)
      	at DSL.withEnv(Native Method)
      	at WorkflowScript.run(WorkflowScript:242)
      	at DSL.stage(Native Method)
      	at WorkflowScript.run(WorkflowScript:156)
      	at DSL.node(running on <slave>)
      	at WorkflowScript.run(WorkflowScript:34)
      

      The bat step is running a batch file:

      cmd /c call test.bat ....
      

      which in turn is running a python script which (in this case) is throwing an exception (I can see from inspecting log files on the slave). Looking on the slave the "durable-56b1eae1" folder is present with jenkins-log.txt, jenkins-main.bat and jenkins-wrap.bat inside of it. There is no sign of the batch process on the slave so I presume that it has completed. The build continues to occupy a slot on the executor. There are also several flyweight tasks from the matrix plugin on the same slave.
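
      For reference, a quick way to see what the wrapper left behind on the slave (just a sketch, run from a separate job on the same node; the control-directory path is the one from the thread dump above, with the placeholders left in):

      node('<slave>') {
          // Hypothetical diagnostic: list the durable-task control directory and dump
          // the wrapper log, to show whether an exit-code file ever appeared next to
          // jenkins-log.txt. Replace the <...> placeholders with the real values.
          def controlDir = 'C:\\j\\w\\<folder>\\<job>@tmp\\durable-56b1eae1'
          bat "dir \"${controlDir}\""
          bat "type \"${controlDir}\\jenkins-log.txt\""
      }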

      Please let me know if there is anything else I can do to help diagnose this.

          [JENKINS-41482] Pipeline bat step hangs (after restart)

          Alejandro del Castillo added a comment -

          I believe I hit the same problem reported here. As noted above, if you run a pipeline with a bat step that takes a while to complete, like:

          bat "ping 127.0.0.1 -n 50"

          and you kill the Jenkins agent while the ping command is still executing, you'll get an exception (expected) and a hang (unexpected):

          2017-05-31 15:26:52 -0500 [Build-head] Reply from 127.0.0.1: bytes=32 time<1ms TTL=128
          2017-05-31 15:26:53 -0500 [Build-head] Reply from 127.0.0.1: bytes=32 time<1ms TTL=128
          2017-05-31 15:27:03 -0500 [Build-head] Cannot contact delcastillo: java.lang.InterruptedException

          I added a try/catch block around the bat command, but that didn't help; the command is stuck. Not sure why, but it looks like another, unrelated issue....
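
          Roughly what I tried (just a sketch; the echo is illustrative):

          try {
              bat "ping 127.0.0.1 -n 50"
          } catch (err) {
              echo "bat step failed: ${err}"   // never reached; the step just stays stuck
          }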

          Regarding the original issue, from what I can tell the bat step expects a return value, which is written to a file (jenkins-result). If the agent reboots, the file is never created and Jenkins thinks the process is still running. In the sh step, the pid of the process is stored in a file as well, which I believe is used to determine whether the process died. So, in case the master loses the connection to the agent, on reconnect the presence/absence of the pid file (and the jenkins-result file) determines whether the process ended or not.

          Seems to me like the bat step should implement a similar mechanism to sh and store the process pid.
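
          Something along these lines is what I have in mind for the resume path (just a sketch; jenkins-result is the file mentioned above, while the pid file and the tasklist-based liveness check are hypothetical):

          // Hypothetical resume-time check, assuming the bat wrapper also recorded its
          // pid the way the sh wrapper does (it does not today).
          def controlDir = new File('C:\\jenkins\\ws\\job@tmp\\durable-xxxxxxxx') // placeholder path
          def resultFile = new File(controlDir, 'jenkins-result.txt') // named in the comment above; exact name may differ
          def pidFile    = new File(controlDir, 'jenkins-pid.txt')    // hypothetical

          // Crude Windows liveness check via tasklist; illustrative only.
          boolean alive(String pid) {
              def out = ['tasklist', '/FI', "PID eq ${pid}"].execute().text
              return out.contains(' ' + pid + ' ')
          }

          if (resultFile.exists()) {
              println "process finished, exit code ${resultFile.text.trim()}"
          } else if (pidFile.exists() && !alive(pidFile.text.trim())) {
              println 'process gone but no result file: fail the step (e.g. return -1)'
          } else {
              println 'still running: keep waiting'
          }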


          Jesse Glick added a comment -

          Of course the hang is not expected, but neither is an exception: killing and restarting the agent should not cause the bat step to end. It should finish normally, as soon as the process completes. Perhaps whatever killed the agent is somehow killing the external process too? If so, that would be JENKINS-27617 (to avoid the killing) plus JENKINS-25053 (to make the step return -1 if it is killed anyway). If not, something else is broken, TBD.


          Alejandro del Castillo added a comment -

          Re-reading my comment I realized that it was probably not clear that when I said 'kill the Jenkins agent' I meant "restart the machine that has the Jenkins agent running". In that scenario, I would expect that on re-connect Jenkins would determine that the process is no longer running, then report an error. JENKINS-27617 might encompass my use case...do let me know if that's the case so I can track 27617 instead.


          Jesse Glick added a comment -

          If you meant restarting the computer, then JENKINS-25053 is what would let Jenkins determine that the process is gone and abort the step properly.


          Alejandro del Castillo added a comment -

          Yes, that's what I meant. Thanks, that clarifies things.


          Jens Beyer added a comment -

          I noticed that I should probably add another observation.

          As said, I am using a Master-Only setup without slaves, but multiple executors for the "Master node". As soon as one of those batch tasks hangs, everything on the other executors runs well until they also reach batch tasks, which then also hang upon completion.


          Jesse Glick added a comment -

           I am using a Master-Only setup without slaves

          Durability is not guaranteed for such a situation, since whatever restarts the Jenkins master might also kill child processes. (Depends on the specifics of your Jenkins installation.) Use an agent for all builds and set master to have zero executors.
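
          (For example, one way to do that from the script console; the same setting is on the master node's configuration page. A sketch only:)

          // Take the built-in master node out of the build rotation.
          import jenkins.model.Jenkins
          Jenkins.instance.setNumExecutors(0)
          Jenkins.instance.save()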


          Jens Beyer added a comment -

          Noted, thanks. Did as specified: Master with 0 executors and, just for testing, 2 slaves with 2 executors each. The error still happened (all jobs hang at finishing any bat steps). Interesting fact: if I kill the Master now, and let the Slave live, as soon as the master comes up the build continues as if there had been no issue at all. So this "bat hangs" problem is probably not an issue with executing or finishing the task itself, but with how it is managed inside Jenkins (remoting is probably involved somehow, too). Or is my observation totally off?

          Another observation (but this might be pure coincidence): It seems to happen more often when the queue is large (>30).


          Jens Beyer added a comment -

          Sorry, it's me again, but I guess I may have made an important observation.

          Out of sheer luck it occurred to me that the "hang" happened roughly at full hours, so I checked what this machine was doing every hour (or so), and I found the ThinBackup Jenkins plugin, which was configured to run a differential backup every hour and to set the "shutdown mode" after two hours if it cannot run because jobs are running. Originally I thought the "shutdown mode" occurred because the jobs hung, but... I removed the hourly differential setting, keeping the full backup once at night. This was three days ago, with partially massive load on the machine and a huge queue (around 80) over hours, and the hang hasn't happened since.

          So, if you are having "hangs" and have a plugin which tinkers with the "shutdown mode" like ThinBackup does, please check whether deactivating or reconfiguring this plugin helps. If so, it might be a bug either in ThinBackup or, more realistically, in the "shutdown mode" in Jenkins?


          Jesse Glick added a comment -

          Well if Jenkins is set to be in quiet mode, no more operations will be performed in a Pipeline build. That should not cause a “hang” per se—it should resume once Jenkins restarts, or quiet mode is cancelled. But I doubt this has been tested explicitly.


            Assignee: Unassigned
            Reporter: Russell Gallop
            Votes: 3
            Watchers: 5