The pipeline batch command fails 3 times out of 4, hanging mostly after a long-running command. The master and the slave node both appear to be waiting for each other. I'm not sure it's the same problem others have reported, but here's what I have:

      • Jenkins 2.51
      • Windows 10 slave
      • Linux Master (CentOS 7)
      • pipeline script from SCM
      • Build trigger is Poll SCM (manually triggered builds do not show this behavior and complete successfully)
      • Mercurial SCM
      • The session is locked while the job is executing (the user is still logged on and the slave is still available)
      • It seems to always happen on long batch commands (short ones don't show this behavior, or maybe it's just less likely)
      • The project is parameterized for the pipeline script repository and revision (default values are provided and the proper checkout is made).
      • The command seems to complete successfully: I see the final output in the log, but the master/slave doesn't seem to know that the batch command has terminated
      • I use the following syntax:

       

      bat returnStatus: false, script: 'msbuild ...'
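
A possible diagnostic variant of the step above, as a sketch only (it is not part of the original report). It assumes a scripted pipeline, a hypothetical agent label `windows`, and an arbitrary 30-minute limit. With `returnStatus: true` the `bat` step returns the exit code instead of failing the build, so the code can be echoed if the step ever returns. Since the hung build reportedly cannot even be cancelled, it is not certain that the `timeout` would fire either.

```groovy
// Diagnostic sketch only; the node label and the timeout value are assumptions.
node('windows') {                          // hypothetical agent label
    timeout(time: 30, unit: 'MINUTES') {   // assumed limit; the logged build took under 16 minutes
        // returnStatus: true returns the exit code instead of failing the build on non-zero
        def status = bat(returnStatus: true, script: 'msbuild ...')
        echo "msbuild exit code: ${status}"
    }
}
```

If the `echo` line never appears in the console log after "Build succeeded.", that would be consistent with the step never seeing the command's exit status.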

      I cannot stop/cancel the build. I have to restart the master to unjam the slave and master (killing the slave client doesn't do anything either).

      Here are the last lines of the console log:

      18:00:58 
      18:00:58 Build succeeded.
      18:00:58     0 Warning(s)
      18:00:58     0 Error(s)
      18:00:58 
      18:00:58 Time Elapsed 00:15:41.55

      which is correct and indicates to me that the msbuild command finished properly.

      This is a total showstopper: we cannot run CI with this behavior because we always have to restart the master. It makes us wonder whether we should start looking for an alternative (I have already reported this issue in the forum thread 3 times without any answer). Judging by the bug listing, the batch command seems to hang for many people. We all have different systems and setups, but the reports all relate to the batch command, which seems to be a nightmare for hangs. Some are marked as resolved and many are still open.

          [JENKINS-42988] Batch command hang upon completion

          Jesse Glick added a comment -

          Again I suspect the problem is not the output, but the exit status file.

          Jerome Godbout added a comment -

          If this helps: manually triggered builds never show this behavior, only scheduled ones. I have no clue what is different behind the scenes inside Jenkins. The only thing I can say about the Windows slave node when the trigger fires is that the following power management settings are in place (based on the High performance plan):

          • HD turns off after 20 min
          • Display turns off after 15 min
          • Sleep after: never
          • Allow hybrid sleep: on
          • Hibernate after: never
          • Allow wake timers: enabled
          • The computer is on the login screen but the user is still logged in

          It seems that the problem occurs when the build is triggered while the slave is idle (about 66% of the time), on 2 different machines with the exact same setup. Manually triggered builds haven't shown this behavior in more than 20 builds with the same batch command. Something seems to prevent the batch return code from being seen under those circumstances.

          I'm not sure it's the right track; maybe it just exposes the problem more.

          Jesse Glick added a comment -

          Hmm. I cannot think offhand of any reason why the build trigger method would have anything to do with this. Might be more about the time of day and thus machine load?

          Jerome Godbout added a comment -

          I did try during the day, when the user session is logged in and active. It seems to happen less often, but it has still happened with periodic polling. When the user's screen is locked, it seems to happen more often (not sure whether the two are related or it's just pure random luck, but it's almost 66% of the time). I tried 3 different hours of the day (9 PM, 3 AM, 4 AM), all with the same deadlock nearly every day with polling and the session locked.

          But when the session is active and the build is triggered manually, it seems to never happen.

          I think it's more related to the user session. We run the slave in a user session since we need the GUI/OpenGL context for our unit tests. So it seems that triggering the build by polling while the session is locked makes a difference. As stated before, this machine doesn't go into hibernation or real sleep; only the HD and monitor are affected.

          Here's what I will try:

          1. I will try to remove the HD sleep, even though the batch command is HD intensive (it's compiling; the MSBuild batch command returns, completes fully, and outputs the build result into the master log).
          2. I will try to prevent the session from locking and start a polling build.

          I will try to post the results of those 2 tests, just to figure this out.
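
For the first test (removing the HD sleep), one way to change the setting from a pipeline could look like the sketch below. This is an assumption-laden illustration, not something from the thread: the agent label `windows` is hypothetical, the machine is assumed to be on AC power, and the agent process is assumed to have enough rights to change the power plan. `powercfg /change` with a value of 0 sets the timeout to "never".

```groovy
// Sketch only: disables the disk and display timeouts on the agent (AC power assumed).
node('windows') {   // hypothetical agent label
    bat '''
        rem A value of 0 means "never" for powercfg timeouts
        powercfg /change disk-timeout-ac 0
        powercfg /change monitor-timeout-ac 0
    '''
}
```

The same two `powercfg` lines can also be run once by hand in a command prompt on the agent, which avoids touching the Jenkinsfile at all.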

          Jesse Glick added a comment -

          Might be helpful for you, but unlikely to lead to a fix.

          James Femia added a comment -

          I noticed something resembling this issue after upgrading LTS to 2.60.1: one of my agents fairly regularly completes a process in a batch step, then just hangs. I tried using Process Explorer to see what was keeping it open, but as soon as I interacted with the "dead" process in any way, it terminated.

          Terminating the batch executable on the agent side seems to allow the master to continue executing the job.

          Spent some time in AV exceptions etc, can't seem to find a way of debugging the issue.

          James Femia added a comment -

          Rolling back to LTS 2.46.3 resulted in batch steps no longer randomly hanging for me.

          Jerome Godbout added a comment -

          I just added the Support Plugin information in case it helps. I'm currently trying to reproduce the problem in another project, but reducing the scope (removing some variables from the Jenkinsfile, removing the instructions after the build, removing the email step inside the big try/catch) for some reason seems to avoid the problem so far. I'm trying to figure out what the difference is between the two projects (they do the exact same thing up to the point where the normal one hangs, on the same repository checkout). It sounds like something around the instruction makes it hang for that particular project, but so far I don't have a clue what it could be.

          Jerome Godbout added a comment -

          I added a new one. This one was produced with the reduced Jenkinsfile in the other project (the hang just seems to happen less often there, for some obscure reason). The reduced Jenkinsfile performs the same operations on the same repository, with just the email and other such things trimmed.

          The next step is to create a mini repository/code and see whether this can still happen, or whether I can reproduce it with a much simpler bat command (avoiding the whole msbuild thing).
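
A "much simpler bat command" for such a reproduction attempt might look like the sketch below. It is only a guess at a useful stand-in for msbuild: it prints one line roughly every second for about 15 minutes (close to the elapsed time in the log above) and then exits. The agent label `windows` and the loop count are assumptions, and nothing in the thread confirms that this would trigger the hang.

```groovy
// Minimal reproduction sketch; label and duration are assumptions.
node('windows') {   // hypothetical agent label
    bat returnStatus: false, script: '''
        @echo off
        rem Emit a line about once per second, 900 times (roughly 15 minutes)
        for /L %%i in (1,1,900) do (
            echo tick %%i
            ping -n 2 127.0.0.1 > nul
        )
    '''
    // If this line never shows up after "tick 900", the step hung after the command finished
    echo 'bat step returned'
}
```

Running it from a job triggered by Poll SCM with the session locked would match the conditions described earlier in the thread.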

          Jerome Godbout added a comment -

          One last one, where the Pipeline TestHang build #10 is hung while #11 is running just fine on another slave (this might help to compare the two).

            Unassigned Unassigned
            jerome_godbout Jerome Godbout