[JENKINS-42988] Batch command hangs upon completion

Type: Bug
Resolution: Unresolved
Priority: Critical
Component/s: durable-task-plugin
Labels:
- triaged-2018-11
- windows
Environment:
Windows 10
Jenkins 2.51

The pipeline batch command fails 3 out of 4 times: it hangs, mostly after a long command. The master and the slave node each seem to be waiting for the other. Not sure it's the same issue, but here's what I have:

- Jenkins 2.51
- Windows 10 slave
- Linux master (CentOS 7)
- Pipeline script from SCM
- Build trigger is Poll SCM (a manually triggered build does not show this behavior and completes successfully)
- Mercurial SCM
- The session is locked while the job is executing (the user is still logged on and the slave is still available)
- It seems to always happen with a long batch command (short ones don't show this behavior, or maybe it's just less likely)
- The project is parameterized for the pipeline script repository and revision (default values are provided and the proper checkout is made)
- The command seems to complete successfully: I see the final output in the log, but it looks like the master/slave doesn't know the batch command has terminated

I use the following syntax:

bat returnStatus: false, script: 'msbuild ...'

I cannot stop/cancel the build. I have to restart the master to unjam the slave and master (killing the slave client doesn't do anything either).

Here are the last lines in the console log:

18:00:58 
18:00:58 Build succeeded.
18:00:58     0 Warning(s)
18:00:58     0 Error(s)
18:00:58 
18:00:58 Time Elapsed 00:15:41.55

This is correct and indicates to me that the msbuild command finished properly.

This is a total showstopper: we cannot run CI anymore with this behavior, because we always have to restart the master. It makes us wonder whether we should start looking for an alternative (I have reported this issue in the forum thread 3 times already, without any answer). Judging by the bug listing, the batch command seems to hang for many people. We all have different systems and setups, but the reports are all related to the batch command, which seems to be a nightmare for hangs. Some are marked as resolved and many are still open.

Attachments:
- support_2017-07-13_12.38.26.zip (395 kB, 2017-07-13 12:39)
- support_2017-07-12_19.57.32.zip (275 kB, 2017-07-12 19:58)
- support_2017-07-12_13.30.09.zip (307 kB, 2017-07-12 14:00)
- System Information [Jenkins].pdf (120 kB, 2017-03-22 13:57)
- Thread dump [Jenkins].pdf (223 kB, 2017-03-22 13:57)
- Log [Jenkins].pdf (130 kB, 2017-03-22 13:57)

Jerome Godbout added a comment - 2017-03-22 13:57

I have attached the master log, system info and thread dump. There's nothing special in the master host's dmesg.

Jesse Glick added a comment - 2017-04-25 19:17

Probably a duplicate of one of the existing issues in this component; awaiting steps to reproduce from scratch and/or a Windows expert.

Jerome Godbout added a comment - 2017-04-27 13:42

Still happens on 2.56.

Yes, it is probably a duplicate of one of the many issues about Windows master/slave communication hangs, some of which were opened a long time ago.

Maybe a quick workaround would be a deadlock checker (are the slave and the master both waiting for each other?) that stops/cancels the build, at least until the problem is really resolved. Right now the hang leaves a slave in a busy state where it can no longer be used, and it jams the CI completely, which renders the system useless. I have to reboot the master every day.
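
A minimal sketch of what such a watchdog could look like in plain Java, assuming the step is an ordinary child process. This is not Jenkins or durable-task code; the class name `StepWatchdog` and the method `runWithTimeout` are hypothetical, and the sketch needs Java 9 or later.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.OptionalInt;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only; not part of Jenkins.
final class StepWatchdog {

    /**
     * Runs a command and waits at most the given time for it to finish.
     * Returns the exit code, or an empty value if the process had to be killed.
     */
    static OptionalInt runWithTimeout(List<String> command, long timeout, TimeUnit unit) {
        try {
            Process proc = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    // Discard output so a full pipe buffer can never block the child.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (proc.waitFor(timeout, unit)) {
                return OptionalInt.of(proc.exitValue());
            }
            // Timed out: kill child processes first, then the process itself.
            proc.descendants().forEach(ProcessHandle::destroyForcibly);
            proc.destroyForcibly().waitFor();
            return OptionalInt.empty();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the command", e);
        }
    }
}
```

A caller would pass the command as a list, for example `StepWatchdog.runWithTimeout(List.of("cmd", "/c", "msbuild"), 30, TimeUnit.MINUTES)`, and treat an empty result as a hung step. A time limit only detects that nothing finished in time; it does not prove that both sides are actually deadlocked.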

Jerome Godbout added a comment - 2017-04-27 13:48

Related issues could be (still open as critical or major):

https://issues.jenkins-ci.org/browse/JENKINS-28759 (since 2015/06)

https://issues.jenkins-ci.org/browse/JENKINS-33164 (since 2016/02)

They all seem to be related to the batch command's return. Either the slave doesn't catch it properly, or it doesn't communicate it properly, or the master doesn't handle the answer properly. It also seems to happen with very long batch commands, if that helps.

Jerome Godbout added a comment - 2017-04-27 13:49

I can reproduce it at will; if you want some additional information, just ask.

Jesse Glick added a comment - 2017-04-27 16:14

Well I can tell you what Jenkins is waiting for: an exit status in the file in the control directory. Somehow this is not getting written, perhaps just due to my novice batch scripting skills (i.e., copying and pasting from stackoverflow.com).

A PowerShell implementation is in the works, which might be more reliable.
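
To illustrate the mechanism described above, here is a minimal sketch of waiting for an exit status file with a deadline instead of waiting forever. The class and method names, and the file name passed in by the caller, are assumptions made for illustration; this is not the plugin's actual implementation.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.OptionalInt;

// Hypothetical helper for illustration only; not part of durable-task-plugin.
final class ExitStatusPoller {

    /** Polls controlDir/resultFileName until it holds an integer or the timeout expires. */
    static OptionalInt awaitExitStatus(Path controlDir, String resultFileName,
                                       Duration timeout, Duration pollInterval) {
        Path result = controlDir.resolve(resultFileName);
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            OptionalInt status = readStatus(result);
            if (status.isPresent()) {
                return status;
            }
            if (System.nanoTime() - deadline >= 0) {
                return OptionalInt.empty(); // the wrapper never wrote a status
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return OptionalInt.empty();
            }
        }
    }

    /** Reads the status file; a missing, locked, empty or half-written file counts as "not ready". */
    static OptionalInt readStatus(Path result) {
        if (!Files.isRegularFile(result)) {
            return OptionalInt.empty();
        }
        try {
            String text = new String(Files.readAllBytes(result), StandardCharsets.UTF_8).trim();
            return text.isEmpty() ? OptionalInt.empty() : OptionalInt.of(Integer.parseInt(text));
        } catch (IOException | NumberFormatException e) {
            return OptionalInt.empty();
        }
    }
}
```

With this shape, a missing status file shows up as an empty result after the timeout, which the caller can report as a hung step rather than blocking indefinitely.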

Jerome Godbout added a comment - 2017-04-27 17:03

OK, good to know. I guess the source for this is here:

https://github.com/jenkinsci/jenkins/blob/d111e2ac1658c8fa5fb768e7d1233613b4b9992d/core/src/main/java/hudson/tasks/BatchFile.java

I might take a look (I'm no Java expert nor Windows guru), but I have written some process/fork code.

Jesse Glick added a comment - 2017-04-27 17:13

Right. At this point I would recommend just waiting for the PowerShell version which will probably have its own set of bugs, but I hope not this one.

Jerome Godbout added a comment - 2017-04-27 17:16 - edited

First I would look at the Runtime process handling in Java: the length of the output might be a problem (which would match my case: I have a long output when this occurs, building the whole solution). It seems possible that an output buffer fills up.

http://www.javaworld.com/article/2071275/core-java/when-runtime-exec---won-t.html

This might be why: the I/O doesn't get drained fast enough and the command deadlocks.

JDK documentation, relevant to Runtime.exec():

Because some native platforms only provide limited buffer size for standard input and output streams, failure to promptly write the input stream or read the output stream of the subprocess may cause the subprocess to block, and even deadlock.

As in every doc, by 'some native platforms' they mean Windows. This seems like a good lead.

Section 4.7 gives a good example (taken from the link above):

import java.util.*;
import java.io.*;
class StreamGobbler extends Thread
{
    InputStream is;
    String type;
    OutputStream os;
    
    StreamGobbler(InputStream is, String type)
    {
        this(is, type, null);
    }
    StreamGobbler(InputStream is, String type, OutputStream redirect)
    {
        this.is = is;
        this.type = type;
        this.os = redirect;
    }
    
    public void run()
    {
        try
        {
            PrintWriter pw = null;
            if (os != null)
                pw = new PrintWriter(os);
                
            InputStreamReader isr = new InputStreamReader(is);
            BufferedReader br = new BufferedReader(isr);
            String line=null;
            while ( (line = br.readLine()) != null)
            {
                if (pw != null)
                    pw.println(line);
                System.out.println(type + ">" + line);    
            }
            if (pw != null)
                pw.flush();
        } catch (IOException ioe)
            {
            ioe.printStackTrace();  
            }
    }
}
public class GoodWinRedirect
{
    public static void main(String args[])
    {
        if (args.length < 1)
        {
            System.out.println("USAGE java GoodWinRedirect <outputfile>");
            System.exit(1);
        }
        
        try
        {            
            FileOutputStream fos = new FileOutputStream(args[0]);
            Runtime rt = Runtime.getRuntime();
            Process proc = rt.exec("java jecho 'Hello World'");
            // any error message?
            StreamGobbler errorGobbler = new 
                StreamGobbler(proc.getErrorStream(), "ERROR");            
            
            // any output?
            StreamGobbler outputGobbler = new 
                StreamGobbler(proc.getInputStream(), "OUTPUT", fos);
                
            // kick them off
            errorGobbler.start();
            outputGobbler.start();
                                    
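            // wait for the process to exit, then for both gobblers to drain
            // their streams, so the file is not closed while still being written
            int exitVal = proc.waitFor();
            errorGobbler.join();
            outputGobbler.join();
            System.out.println("ExitValue: " + exitVal);
            fos.flush();
            fos.close();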
        } catch (Throwable t)
          {
            t.printStackTrace();
          }
    }
}

Maybe this helps, maybe not, but it sure seems like a good lead to test.

Jesse Glick added a comment - 2017-04-27 17:19

Irrelevant here. Unlike with freestyle builds, we are forking the wrapper script process ultimately with ProcessBuilder, but that is not expected to produce any output. Actual user process output is redirected to a file which Jenkins tails.
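
The arrangement described above, where the process writes to a file and the file is read separately, can be sketched roughly as follows. This is an illustrative assumption about the general technique, not the plugin's code; `LogTailSketch`, `launch` and `readNewOutput` are hypothetical names.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper for illustration only; not part of durable-task-plugin.
final class LogTailSketch {

    /** Starts the command with stdout and stderr appended to logFile, so no pipe can fill up. */
    static Process launch(List<String> command, Path logFile) {
        try {
            return new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.appendTo(logFile.toFile()))
                    .start();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the bytes written to logFile since offset, at most 64 KiB per call. */
    static byte[] readNewOutput(Path logFile, long offset) {
        try (RandomAccessFile raf = new RandomAccessFile(logFile.toFile(), "r")) {
            long length = raf.length();
            if (length <= offset) {
                return new byte[0]; // nothing new yet
            }
            byte[] chunk = new byte[(int) Math.min(length - offset, 64 * 1024)];
            raf.seek(offset);
            raf.readFully(chunk);
            return chunk;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The caller keeps its own offset and adds the length of each returned chunk to it. Because the child writes to a file rather than a pipe, a slow reader cannot block the child, which is consistent with the point that output handling is not the cause of this hang.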

Jerome Godbout added a comment - 2017-04-27 17:26

The tail seems to work fine: I see the end result of the command in the log. So the wrapped ProcessBuilder must have a problem. Sorry, it seems I don't know enough about Jenkins and Java details to be of any help here. Thanks for the feedback.

Jesse Glick added a comment - 2017-04-27 17:36

Again I suspect the problem is not the output, but the exit status file.

Jerome Godbout added a comment - 2017-05-30 17:16

If this helps: manually triggered builds never show this behavior, only scheduled ones. I have no clue what is different behind the scenes inside Jenkins. The only thing I can say about the Windows slave node when the trigger fires is that the following power management settings are in effect (based on the High performance plan):

- Hard disk turns off after 20 min
- Display turns off after 15 min
- Sleep after: never
- Allow hybrid sleep: on
- Hibernate after: never
- Allow wake timers: enabled
- The computer is on the login screen but the user is still logged in

It seems that when the build is triggered on the slave while the slave is idle, this causes the problem (it occurs often, around 66% of the time) on 2 different machines with the exact same setup. Manually triggered builds haven't shown this behavior in more than 20 builds with the same batch command. Something seems to prevent the batch return code from being seen under those circumstances.

Not sure it's the right track; maybe it just exposes the problem more.

Jesse Glick added a comment - 2017-05-30 19:03

Hmm. I cannot think offhand of any reason why the build trigger method would have anything to do with this. Might be more about the time of day and thus machine load?

Jerome Godbout added a comment - 2017-05-30 19:28

I did try during the day, when the user session is logged in and active. It seems to happen less often, but it still happens with periodic polling. When the user's screen is locked, it seems to happen more often (not sure whether the two are related or it's just pure random luck, but it's almost 66% of the time). I tried 3 different hours of the day (9 PM, 3 AM, 4 AM), all with the same deadlock nearly every day with polling and a locked session.

But when the session is active and the build is triggered manually, it seems to never happen.

I think it's more related to the user session: we run the slave in a user session because we need the GUI/OpenGL context for our unit tests. So triggering the build by polling while the session is locked seems to make a difference. As stated before, this machine goes into neither hibernation nor real sleep; the power settings only affect the HD and monitor.

Here's what I will try:

- Remove the HD sleep, even though the batch command is HD-intensive (it is compiling; the MSBuild batch command completes fully and outputs the build result into the master log).
- Prevent the session from locking and start a polling build.

I will try to post the results of those 2 tests.

Jesse Glick added a comment - 2017-05-30 19:48

Might be helpful for you, but unlikely to lead to a fix.

James Femia added a comment - 2017-07-05 21:03

I noticed something resembling this issue after upgrading LTS to 2.60.1: one of my agents fairly regularly completes a process in a batch step and then just hangs. I tried using Process Explorer to see what was keeping it open, but as soon as I interact with the "dead" process in any way, it terminates.

Terminating the batch executable on the agent side seems to allow the master to continue executing the job.

I spent some time on AV exceptions etc., but can't seem to find a way of debugging the issue.

James Femia added a comment - 2017-07-12 11:00

Rolling back to LTS 2.46.3 resulted in batch steps no longer randomly hanging for me.

Jerome Godbout added a comment - 2017-07-12 14:44

I just attached the Support Plugin information in case it helps. I'm currently trying to reproduce the hang in another project, but reducing the scope (removing some variables from the Jenkinsfile, removing the instructions after the build, removing the email in the big try/catch) for some reason seems to avoid the problem so far. I'm trying to figure out what the difference is between the two projects (they do the exact same thing up to the point where the normal one hangs, on the same repository checkout). It sounds like something around the instruction makes it hang for that particular project, but so far I don't have a clue what this could be.

Jerome Godbout added a comment - 2017-07-12 20:01

I added a new one; this one was captured with the reduced Jenkinsfile in the other project (the hang just seems to happen less often there, for some obscure reason). The reduced Jenkinsfile does the same operations on the same repository, with just the email and other stuff like that trimmed.

The next step is to create a mini repository/code and see whether this can still happen, or whether I can reproduce it with a much simpler bat command (avoiding the whole msbuild thing).

Jerome Godbout added a comment - 2017-07-13 12:40

One last one, where Pipeline TestHang build #10 is hung while #11 is running just fine on another slave (this might help to compare the two).

Assignee: Unassigned
Reporter: Jerome Godbout
Votes: 1
Watchers: 5
Created: 2017-03-21 19:17
Updated: 2018-11-16 17:50
