Type: Bug
Resolution: Unresolved
Priority: Critical
Environment: Windows 10, Jenkins 2.51
The pipeline batch command fails 3 out of 4 times and hangs, mostly after a long command. Both the master and the slave node are waiting for each other. Not sure it's the same issue, but here's what I have:
- Jenkins 2.51
- Windows 10 slave
- Linux Master (CentOS 7)
- pipeline script from SCM
- Build trigger is Poll SCM (a manually triggered build does not show this behavior and completes successfully)
- Mercurial SCM
- The session is locked while the job is executing (the user is still logged on and the slave is still available)
- Seems to always happen on long batch commands (short ones don't show this behavior, or maybe it's just less likely)
- The project is parameterized for the pipeline script repo and revision (default values are provided and the proper checkout is made).
- The command seems to complete successfully: I see the final output in the log, but it looks like the master/slave doesn't know the batch command has terminated
- I use the following syntax:
bat returnStatus: false, script: 'msbuild ...'
I cannot stop/cancel the build. I have to restart the master to unjam the slave and master (killing the slave client doesn't do anything either).
Here are the last lines in the console log:
18:00:58
18:00:58 Build succeeded.
18:00:58     0 Warning(s)
18:00:58     0 Error(s)
18:00:58
18:00:58 Time Elapsed 00:15:41.55
which is correct and indicates to me that the msbuild command finished properly.
This is a total show stopper: we cannot run CI any more with this behavior, since we always have to restart the master. It makes us wonder if we should start looking for an alternative (I have reported this issue in the forum thread 3 times already, without any answer). Judging by the bug listing, the batch command seems to hang for many people. We all have different systems and setups, but the reports are all related to the batch command, which seems to be a nightmare for hangs. Some are marked as resolved and many are still open.
[JENKINS-42988] Batch command hang upon completion
Probably a duplicate of one of the existing issues in this component; awaiting steps to reproduce from scratch and/or a Windows expert.
Still happens on 2.56.
Yeah, probably a duplicate of one of the many Windows master/slave communication hangs; some of them were opened a long time ago.
Maybe a quick workaround is to have a deadlock checker (are both slave and master waiting for each other?) that stops/cancels the build, at least until the problem is resolved for real. Right now it puts a slave into a busy state in which it can no longer be used, and it jams the CI totally, which renders the system useless. I have to reboot the master every day.
Related issues could be (still open as Critical or Major):
https://issues.jenkins-ci.org/browse/JENKINS-28759 (since 2015/06)
https://issues.jenkins-ci.org/browse/JENKINS-33164 (since 2016/02)
They all seem to be related to the batch command return. Either the slave doesn't catch it properly, doesn't communicate it properly, or the master doesn't handle the answer properly. It also seems to happen with very long batch commands, if that might help.
I can reproduce it at will; if you want some additional information, just ask.
Well I can tell you what Jenkins is waiting for: an exit status in the file in the control directory. Somehow this is not getting written, perhaps just due to my novice batch scripting skills (i.e., copying and pasting from stackoverflow.com).
A PowerShell implementation is in the works, which might be more reliable.
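To make the mechanism above concrete, here is a minimal sketch of waiting for an exit status file with a timeout. It is not the actual Jenkins code: the class name `ExitStatusPoller`, the file name `result.txt`, and the timeout and poll values are all assumptions made for illustration. It only shows why a missing or empty status file leaves the waiting side with nothing to act on, even when the command's log output is complete.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.OptionalInt;

// Hypothetical illustration only; names and file layout are assumptions.
class ExitStatusPoller {

    // Returns the exit code if the file exists and holds an integer, else empty.
    static OptionalInt readExitStatus(Path resultFile) throws IOException {
        if (!Files.exists(resultFile)) {
            return OptionalInt.empty();
        }
        String text = new String(Files.readAllBytes(resultFile), StandardCharsets.UTF_8).trim();
        if (text.isEmpty()) {
            return OptionalInt.empty(); // file created but status not written yet
        }
        try {
            return OptionalInt.of(Integer.parseInt(text));
        } catch (NumberFormatException e) {
            return OptionalInt.empty(); // partially written or garbled content
        }
    }

    // Polls until a status appears or the timeout expires; empty means "gave up".
    static OptionalInt waitForExitStatus(Path resultFile, long timeoutMillis, long pollMillis)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            OptionalInt status = readExitStatus(resultFile);
            if (status.isPresent()) {
                return status;
            }
            if (System.nanoTime() >= deadline) {
                return OptionalInt.empty();
            }
            Thread.sleep(pollMillis);
        }
    }

    public static void main(String[] args) throws Exception {
        Path controlDir = Files.createTempDirectory("control");
        Path resultFile = controlDir.resolve("result.txt"); // assumed file name
        // Nothing ever writes the file here, so the poller gives up after the timeout.
        OptionalInt status = waitForExitStatus(resultFile, 500, 100);
        System.out.println(status.isPresent() ? "exit status " + status.getAsInt() : "no exit status written");
    }
}
```

Without the deadline the loop would wait forever, which matches the symptom reported here: the build output is finished but the step never returns.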
OK, nice to know. I guess the source for this is there:
I might take a look (I'm no Java expert nor Windows guru), but I have written some process/fork code.
Right. At this point I would recommend just waiting for the PowerShell version which will probably have its own set of bugs, but I hope not this one.
First I would look at the Runtime process handling in Java; the length of the output might be a problem (which would match my case: I have long output when this occurs, building the whole solution). A full output buffer seems possible.
http://www.javaworld.com/article/2071275/core-java/when-runtime-exec---won-t.html
This might be why: the I/O doesn't get drained fast enough and the command deadlocks.
Because some native platforms only provide limited buffer size for standard input and output streams, failure to promptly write the input stream or read the output stream of the subprocess may cause the subprocess to block, and even deadlock.
Like every doc that says 'some native platforms', they mean Windows. Seems like a good trail there.
The section 4.7 gives a good example (taken from the link above):
import java.util.*;
import java.io.*;

class StreamGobbler extends Thread {
    InputStream is;
    String type;
    OutputStream os;

    StreamGobbler(InputStream is, String type) {
        this(is, type, null);
    }

    StreamGobbler(InputStream is, String type, OutputStream redirect) {
        this.is = is;
        this.type = type;
        this.os = redirect;
    }

    public void run() {
        try {
            PrintWriter pw = null;
            if (os != null)
                pw = new PrintWriter(os);

            InputStreamReader isr = new InputStreamReader(is);
            BufferedReader br = new BufferedReader(isr);
            String line=null;
            while ( (line = br.readLine()) != null) {
                if (pw != null)
                    pw.println(line);
                System.out.println(type + ">" + line);
            }
            if (pw != null)
                pw.flush();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
    }
}

public class GoodWinRedirect {
    public static void main(String args[]) {
        if (args.length < 1) {
            System.out.println("USAGE java GoodWinRedirect <outputfile>");
            System.exit(1);
        }

        try {
            FileOutputStream fos = new FileOutputStream(args[0]);
            Runtime rt = Runtime.getRuntime();
            Process proc = rt.exec("java jecho 'Hello World'");
            // any error message?
            StreamGobbler errorGobbler = new StreamGobbler(proc.getErrorStream(), "ERROR");

            // any output?
            StreamGobbler outputGobbler = new StreamGobbler(proc.getInputStream(), "OUTPUT", fos);

            // kick them off
            errorGobbler.start();
            outputGobbler.start();

            // any error???
            int exitVal = proc.waitFor();
            System.out.println("ExitValue: " + exitVal);
            fos.flush();
            fos.close();
        } catch (Throwable t) {
            t.printStackTrace();
        }
    }
}
Maybe this can help or not, but it sure seems like a good trail to test.
Irrelevant here. Unlike with freestyle builds, we are forking the wrapper script process ultimately with ProcessBuilder, but that is not expected to produce any output. Actual user process output is redirected to a file which Jenkins tails.
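For readers unfamiliar with the distinction, the following is a rough sketch of launching a process with its output redirected to a file through `ProcessBuilder`, so that no pipe has to be drained by the parent. This is a generic Java example under stated assumptions, not the code Jenkins uses: the class name `RedirectToFile`, the method `runWithLog`, and the sample command are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the Jenkins wrapper implementation.
class RedirectToFile {

    // Runs the command with stdout and stderr both written to the given log file.
    // Because the child writes to a file rather than a pipe, the parent does not
    // need reader threads to keep the child from blocking on a full buffer.
    static int runWithLog(List<String> command, File log) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true);                       // merge stderr into stdout
        pb.redirectOutput(ProcessBuilder.Redirect.to(log)); // send the merged stream to the file
        Process process = pb.start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        File log = File.createTempFile("build", ".log");
        // Sample command: the running JVM's own launcher, chosen only because it is always present.
        String javaBin = Paths.get(System.getProperty("java.home"), "bin", "java").toString();
        int exit = runWithLog(Arrays.asList(javaBin, "-version"), log);
        System.out.println("exit=" + exit + ", log bytes=" + log.length());
    }
}
```

With this arrangement the pipe-buffer deadlock described in the JavaWorld article cannot occur for the redirected streams, which is consistent with the log showing complete output while the step still hangs.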
The tail seems to work fine: I see the end result of the command in the log. So the wrapped ProcessBuilder must have a problem. Sorry, I don't know enough about Jenkins and Java details here to be of any help, it seems. Thanks for the feedback.
Again I suspect the problem is not the output, but the exit status file.
If this might help, manually triggered builds never show this behavior, only scheduled ones. I have no clue what is different behind the scenes inside Jenkins; the only thing I can say about the Windows slave node when the trigger fires is that the following power management settings are in effect (based on the High performance plan):
- HD turns off after 20 min
- Display turns off after 15 min
- Sleep after: never
- Allow hybrid sleep: on
- Hibernate after: never
- Allow wake timers: enabled
- The computer is on the login screen but the user is still logged in
It seems that when the build is triggered on the slave while the slave is idle, this causes the problem (it occurs about 66% of the time) on 2 different machines with the exact same setup. Manually triggered builds haven't shown this behavior in more than 20 builds with the same batch command. Something seems to prevent the batch return code from being seen under those circumstances.
Not sure it's the right track; maybe it just exposes it more.
Hmm. I cannot think offhand of any reason why the build trigger method would have anything to do with this. Might be more about the time of day and thus machine load?
I did try during the day when the user session is logged in and active. It seems to happen less often, but it has still happened with periodic polling. When the user's screen is locked, it seems to happen more often (not sure whether the two are related or it's just pure random luck, but it's almost 66% of the time). I tried 3 different hours of the day (9PM, 3AM, 4AM), all with the same deadlock nearly every day with polling and a locked session.
But when the session is active and the build is triggered manually, it seems to never happen.
I think it's more related to the user session; we run the slave in a user session since we need the GUI/OpenGL context for our unit tests. So it seems that triggering the build by polling while the session is locked makes a difference. As stated before, this machine goes into neither hibernation nor real sleep; power management only affects the HD and monitor.
Here's what I will try:
- I will try to disable the HD sleep, even though the batch command is HD intensive (compiling; the MSBuild batch command completes fully and outputs the build result into the master log).
- I will try to prevent the session from locking and start a poll.
I will post the results of those 2 tests, just to figure this out.
I noticed something resembling this issue after upgrading LTS to 2.60.1: one of my agents fairly regularly completes a process in a batch step, then just hangs. I tried using Process Explorer to see what was keeping it open, but as soon as I interact with the "dead" process in any way, it terminates.
Terminating the batch executable on the agent side seems to allow the master to continue executing the job.
Spent some time in AV exceptions etc, can't seem to find a way of debugging the issue.
Rolling back to LTS 2.46.3 resulted in batch steps no longer randomly hanging for me.
I just added the Support Plugin information in case this might help. I'm currently trying to reproduce it in another project, but reducing the scope (removing some variables from the Jenkinsfile, removing instructions after the build, removing the email in the big try/catch) for some reason seems to avoid the problem so far. I'm trying to figure out what the difference is between the two projects (they do the exact same thing up to the point where the normal one hangs, on the same repo checkout). It sounds like something around the instruction makes it hang for that particular project, but so far I don't have a clue what this could be.
Added a new one; this one was performed with the reduced Jenkinsfile and the other project (it just seems to happen less often, for some obscure reason). The reduced Jenkinsfile does the same operations on the same repo; I just trimmed the email and other stuff like that.
Next step is to create a mini repo/code and see whether this can still happen, or whether I can reproduce it with a much simpler bat command (avoiding the whole msbuild thing).
One last one, where Pipeline TestHang build #10 is hung while #11 is running just fine on another slave (this might help to compare both).
I have attached the master log, system info and thread dump. There's nothing special in the master host dmesg.