- Bug
- Resolution: Fixed
- Blocker
- Jenkins 2.0RC1, Jenkins 1.6* LTS
- Windows 2008r2, Windows 2012r2, Windows 7, Windows 10
When a batch task is included in a Pipeline job, the job hangs on completion of the batch task. I can see in Task Manager that the task starts up, and it logs data to jenkins-log.txt. The batch completes and I can see in Task Manager that it is no longer running, but Jenkins is still waiting for the task to complete. I do not see jenkins-result.txt written to the workspace tmp durabletask directory. If I create the file manually or run the workflow-wrap.bat manually, the task completes. This is an intermittent bug: a task might work 3 times, then fail 5 times, then work 8 times, with no change to the system during this time. I am setting the job to run every minute to see what the stats look like over a longer run.
job:
node {
    bat 'ping 127.0.0.1 -n 10'
    echo 'batch completed'
}
It could be any command you want; ping is just an easy one that takes a little time and requires nothing else installed on the machine.
I see many other tasks behave like this. I have tested on several different machines using a base install of Jenkins.
- How.png (321 kB)
- durable-task.hpi (41 kB)
- durable-task.hpi (41 kB)
duplicates
- JENKINS-50025 On some Windows Build Agents, Batch Steps Hang (Closed)
is related to
- JENKINS-33164 Pipeline bat stuck on Windows Server 2012 (Open)
- JENKINS-33749 jenkins pipeline dsl.bat does not return (Resolved)
- JENKINS-32000 durable-task 1.7 breaks workflow bat steps (Open)
- JENKINS-33904 Jenkinsfile batch steps randomly hang when complete (Open)
- JENKINS-33456 durable-task 1.8 breaks workflow bat steps (Resolved)
- JENKINS-50025 On some Windows Build Agents, Batch Steps Hang (Closed)
[JENKINS-34150] Pipeline Batch hangs
It is a bug in the durable task plugin.
I created a test to reproduce it and submitted a fix to resolve the issue.
Both neatly bundled as a pull request: https://github.com/jenkinsci/durable-task-plugin/pull/21
After some hours trying combinations (and installing Windows VMs), I've not been able to reproduce it. Tried on:
- Jenkins 2.0 RC1 (running on Ubuntu, Durable Task 1.9) + Agent on Windows Server 2008R2 (JDK8)
- Jenkins 2.0 RC1 (running on OSX, Durable Task 1.9) + Agent on Windows Server 2012R2 (JDK8)
My test script is:
node ('windows') {
    bat 'ping 192.168.1.108 -n 10' // this is the Jenkins master IP in my local network
    echo 'batch completed'
}
I executed up to 10 concurrent builds. All of them finished successfully.
I developed the test case and the fix on a Windows 10 development box.
I encountered the issue on a Windows 2012 R2 server.
On both systems the hanging happened consistently. However, on Windows 2012 I never encountered the problem when only one script was executing; starting with two scripts, there was a fair chance of one of them hanging. My test case on Windows 10 caused the hang in 100% of the tests, even with a single running task.
My best guess is that it is a timing issue: the test case I provided queries the exit status as fast as possible and triggers the hang this way. The problem is that the batch script does not create the file while the plugin is checking the file system for the presence of the file. I do not know why this happens. In my mind the file creation should be unrelated to the file presence check, but it seems on Windows systems it is not.
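To make the mechanism under discussion concrete, here is a minimal Python sketch of a controller that polls a control directory for jenkins-result.txt. It is not the plugin's code: the function name, timeout and polling interval are assumptions chosen for illustration, and only the file name comes from this thread.

```python
import os
import tempfile
import time


def wait_for_exit_code(control_dir, timeout=5.0, interval=0.1):
    """Poll control_dir for jenkins-result.txt and return the exit code in it.

    Returns None if the file has not appeared before the timeout, which is
    roughly what a hung bat step looks like from the controller's side.
    (Hypothetical helper, not part of the durable-task plugin.)
    """
    result_path = os.path.join(control_dir, "jenkins-result.txt")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(result_path):
            with open(result_path) as fh:
                text = fh.read().strip()
            if text:  # the file may exist briefly before it has content
                return int(text)
        time.sleep(interval)
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as control_dir:
        # Nothing writes the file, so the poller gives up.
        print(wait_for_exit_code(control_dir, timeout=0.3))
        # Simulate the wrapper's final "echo %ERRORLEVEL% > ..." line.
        with open(os.path.join(control_dir, "jenkins-result.txt"), "w") as fh:
            fh.write("0 \n")
        print(wait_for_exit_code(control_dir, timeout=0.3))
```

The real step has no such timeout, which is why a missing result file shows up as a build that waits forever.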
nitram Thanks for the info. I'm going to move the master to a Windows box too as it's the only thing I see my environment differing from yours.
The issue is reproducible only if two batch durable tasks run concurrently on the master node.
For some reason the result file is not being created, so this line must be failing to execute (but I cannot find any log):
echo %ERRORLEVEL% > "[JENKINS_HOME]\my-job@tmp\durable-acc8d5a4\jenkins-result.txt"
While the build was hanging, I manually executed jenkins-wrap.bat and that made the execution finish (as the result file was created).
Still trying to see why the result file is not being written.
Wild guess: Spaces in JENKINS_HOME path when installing using the installer?
No. It's reproducible for me without spaces in the path (and didn't use the installer but direct java -jar mode).
Not a regression in core at least, reproduced in 1.651.1 (Pipeline Durable Task Step 2.0 + Durable Task 1.9)
Antonio, if you apply my pull request (20) you will see the logs for jenkins-wrap.bat include the echo command being run, but you will no longer experience the error, just as you will not encounter the error if you apply Martin's pull request (21). The main thing both change is that a reference to Launcher.ProcStarter ps is maintained after doLaunch() is called. To me this suggests a possible GC issue where the Proc is destroyed before the task completes, which causes the wrapper not to finish. But when I watch Performance Monitor in Windows while the job is running, I don't see the command prompts being killed early. So it may be a locking issue, as Martin mentioned earlier, where checking for the existence of the result file prevents the result from being created. But that does not explain why keeping a reference to the PS instance makes the bug go away. That points more to something happening due to GC. If this were C I would call it a use-after-free error, where PS is no longer referenced but is expected to continue to do things.
dpd_30: Actually it does explain why maintaining the reference to the process causes the bug to disappear. My pull request works because it does not monitor the presence of the file; it waits until the process has terminated, and only after the process is no longer present does it look for the file. This way there is no file checking done as long as the process is active, and the batch file is able to create the result file without any problems.
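The "wait for the process, then look for the file" idea can be sketched as follows. This is a Python illustration only, not the code from pull request 21; the function name is hypothetical, and the child command stands in for the wrapper batch file.

```python
import os
import subprocess
import sys
import tempfile


def run_and_collect(command, control_dir):
    """Run command, wait for it to exit, and only then read the result file.

    Hypothetical helper: no file-system checks happen while the child
    process is still alive.
    """
    result_path = os.path.join(control_dir, "jenkins-result.txt")
    proc = subprocess.Popen(command)
    proc.wait()  # block until the child has terminated
    if not os.path.exists(result_path):
        return None  # the child exited without writing a result
    with open(result_path) as fh:
        return int(fh.read().strip())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as control_dir:
        target = os.path.join(control_dir, "jenkins-result.txt")
        # Stand-in for the wrapper: a child that writes exit code 0 to the file.
        child = [sys.executable, "-c",
                 "import sys; open(sys.argv[1], 'w').write('0')", target]
        print(run_and_collect(child, control_dir))
```

The trade-off is that the caller must hold a live handle to the process, so this shape cannot survive a restart of the calling side.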
This way there is no file checking done as long as the process is active
Right. As I noted in the PR, keeping that reference is just what this plugin is trying to avoid.
I think the GC theory is the most likely explanation. Perhaps if we keep a transient instance field, private transient Launcher.ProcStarter ps, in WindowsBatchScript, it is prevented from being collected. I'm currently testing this option.
This issue is really annoying. I tried to track it with the SysInternals Process Monitor. As soon as the monitor runs, the issue does not happen any more.
I also tried to alter the wrapping batch file to check whether the file was created after writing the error code and, if not, try again. This does not resolve the issue. It seems like the batch file "sees" the jenkins-result.txt during its execution.
Okay, the reason why the loop in the batch file did not work is that something is killing the entire batch file structure before it finishes. I have no idea why this happens and I can't track it, because Process Monitor seems to do something that stops this from happening.
I was able to get it running by changing the script so that the wrapper is executed by an additional "start" command. This causes command line windows to pop up all over my desktop, but it allowed the entire thing to execute properly. This approach has the massive disadvantage that there is no way to terminate the script process if the script itself hangs, because it runs fully detached and there is no reference to the process.
On the other hand, the purpose of this plugin is to allow scripts to run across restarts of Jenkins. So the script has to run as a process detached from Jenkins, so that the JVM doesn't tear it down along with itself, but we need a serializable reference to the process so it's possible to locate it again, just in case it has to be terminated after a Jenkins reboot.
It's related to https://issues.jenkins-ci.org/browse/JENKINS-33164. I'm just attaching a simple repro from that bug:
Pipeline code is:
node('master') {
    for (int i = 0; i < 100; ++i) {
        bat('echo "Hello from batch file."' + i.toString())
    }
}
Click "build now" 5 times.
All 5 jobs got stuck on Windows. Please help. Thanks.
I was not able to reproduce the issue in a debug session and did not manage to diagnose why jenkins-wrap.bat is not fully executed (so jenkins-result.txt is not created) and the bat step never finishes. If someone with more Windows background can shed some light here, it would be great.
Perhaps the additional start command proposed by nitram is the least ugly fix. What do others think? jglick?
In the meantime, the workaround is to use a build agent (other than master), even if it is on the same physical machine.
I was able to track down that the batch process is forcefully terminated.
If you run the wrapper batch by hand and close the command line window before the process finishes, you get exactly the same behaviour. The main and the child processes are terminated and no files are created. The only thing that is actually attached to the command line is Java, so the termination has to come from there.
The thing I wonder is: can all this even work across a Jenkins restart? If Java terminates its child processes, this would kill any command line execution no matter what.
I think there are solutions to work around this using powershell or the scripting host. But those may be blocked on the host system.
Just gave it a try. "workaround is to use a build agent (other than master), even being in the same physical machine." really works! Thanks. At least I can proceed.
BTW, I notice that for the same machine, the master node reports "Windows Server 2012 R2 (x86)", while the client reports "Windows Server 2012 R2 (amd64)". I wonder whether the bug is triggered by a corner case related to machine architecture.
Can all this even work across a jenkins restart?
If you are using a master executor, not generally. (On Unix, it works under some conditions but not others.)
You are strongly recommended to use an agent rather than master executors in general. In particular, if you have any kind of layered security on your Jenkins installation—whereby people configuring jobs (or permitted to edit build scripts in SCM) are not Jenkins administrators—you must not have a master executor, or any pretense at security is gone. Even if only one physical computer is available, you must configure a separate service account for builds.
All that said, if the problem can be fixed—or at least clearly diagnosed and reported—without breaking anything for the more general use case of an agent on another machine, obviously we want to apply a fix.
This hanging batch bug has bitten me heavily too. Failing batch jobs always terminated, successful jobs never did.
A working solution for me is to explicitly return an exit code from any batch call. I'm on Windows 7 Professional.
So:
{{
echo 'Successful step'
bat '''dir
exit /B %ERRORLEVEL%'''
echo 'Failing step'
bat '''find /c "_no.file" "_no.file"
exit /B %ERRORLEVEL%'''
echo 'Never execute step'
bat '''dir
exit /B %ERRORLEVEL%'''
}}
successfully terminates step one and returns from the batch execution, the second step fails the build and the third step never gets executed.
Can somebody please confirm that this solution works for them too?
Looks like I was too optimistic. The solution always worked with short-running batch jobs, like dir, but it didn't with long-running jobs, like a NAnt build.
Is it possible that there is a race condition? Some state is checked very quickly after a task is started; a simple "dir" is quick enough to deliver the result in time and a slower task isn't?
That would explain the inconsistent results during my tests. Weird.
Anyway, I've encountered this on both failed and successful jobs.
I have the same issue on Windows Server 2008 R2. Hope it will get fixed soon (or a plain maven command would be useful as well).
node {
    mvn 'clean package -DskipTests=true'
}

def mvn(args) {
    bat "${tool 'Maven 3.3.9'}/bin/mvn ${args}"
}
I have mentioned this in other related tickets, but I just want to point out that I have seen this behaviour with a Linux master and multiple Windows agents (with 5-10 executors on each). You could try this example code. I haven't tested it, but it's similar to our Jenkinsfile, where we have branches each doing a chunk of a test. Cancelling the job in the middle of execution sometimes puts the agents in a weird state too.
def branches = [:]
for (int i = 0; i < 64; i++) {
    def id = "branch-${i}"
    branches[id] = {
        node ('windows') {
            for (int j = 0; j < 8; j++) {
                bat 'ping 127.0.0.1 -n 10'
            }
        }
    }
}
parallel branches
I have the exact same issue on a Windows Server 2012 R2 machine with a Jenkins 2.1 installation and all plugins fully updated.
I have installed the GitHub Organization Folder plugin to scan my organization.
My Jenkinsfile has a simple batch job that uses MSBuild to build our project and a batch job for Sonar analysis.
The job randomly hangs after one of these two batch jobs.
Is there any progress on this issue?
I have this issue also. My batch invocation is:
bat '''
call %BUILD_CONFIG_PATH%
setenv.cmd
perl <custom build manager script that typically runs for 90 minutes>
'''
This has broken the job completely. I cannot terminate the job, nor can I start another (this is an incremental build with a fixed workspace location, so I don't want to run concurrently). If I restart Jenkins, the job restarts and immediately hangs again.
Are there any workarounds short of a full uninstall/reinstall of Jenkins to recover this job setup?
I encounter this issue too. I'm running a maven job using
bat "${mavenHome}\\bin\\mvn clean package"
But the job always hangs at the end and never exits.
Is there any workaround?
I found a workaround (better than reinstalling, at least).
1. Shut down the Jenkins service.
2. Go to <install>/jobs/<jobname>/ and delete the <jobnumber> folder.
3. Restart Jenkins.
There is probably a flag somewhere in that folder you can set to prevent the job from "restarting" after a restart.
Since I've hit this bug in about 80% of the runs I've tried so far, this workaround is unusable in any kind of production environment. At least you can run the job again, though. I guess another workaround would be to allow multiple instances of the job to run and clean up the zombies once a day or something?
Also, it seems this happens more often when I view the console output via the Jenkins UI while the job is running.
Another workaround that doesn't involve deleting jobs (but also isn't a long-term solution) is to realise that batch steps create two batch files in the @tmp directory (which is relative to where the batch is run, so it might be in the workspace if you've changed the directory, or just outside it): a jenkins-main.bat and a jenkins-wrap.bat. The main bat file contains your commands. The wrap bat file runs the main bat, pipes output to a log file and finally writes a result file.
The bug concerns the wrap bat file not completing, so the result file is never written. You can search for the file and run the final line manually (it looks like echo %errorlevel% > ...\jenkins-result.txt), or run the wrap batch file again if you don't mind it performing the same operation again.
cmd /c ""<script> > ".../jenkins-log.txt"" 2>&1
echo %ERRORLEVEL% > "...\..@tmp\durable-cf7a3b23\jenkins-result.txt"
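For anyone applying this workaround by hand, a small script can locate the control directories that are candidates for a stuck step. The sketch below is in Python and assumes the layout described in this thread (directories named durable-<id> containing jenkins-log.txt and, once finished, jenkins-result.txt); the function name and the idea of scanning from a root directory are my own assumptions.

```python
import os
import tempfile


def find_stuck_control_dirs(tmp_root):
    """Return durable-* directories that have a log file but no result file.

    Hypothetical helper. A step that is still legitimately running also
    matches, so check that the batch process is gone before acting on a hit.
    """
    stuck = []
    for dirpath, _dirnames, filenames in os.walk(tmp_root):
        if not os.path.basename(dirpath).startswith("durable-"):
            continue
        if "jenkins-log.txt" in filenames and "jenkins-result.txt" not in filenames:
            stuck.append(dirpath)
    return sorted(stuck)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Build a fake @tmp layout: one unfinished and one finished control dir.
        for name, files in [("durable-aaa", ["jenkins-log.txt"]),
                            ("durable-bbb", ["jenkins-log.txt", "jenkins-result.txt"])]:
            os.makedirs(os.path.join(root, name))
            for f in files:
                open(os.path.join(root, name, f), "w").close()
        print(find_stuck_control_dirs(root))
```

Once a directory is confirmed stuck, writing the exit code into jenkins-result.txt there is the manual step described above.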
Perhaps using "call" will work better. That will use the current cmd shell to execute the batch; I've found it more stable than launching a second cmd from a batch file. "start" is another option. That gives you a subshell that is detached from the main shell, although there are parameters that prevent that. Other bonuses of start are the ability to set process priorities, etc.
If you know offhand which jar/class I need to hack to make this change, I'll try to give it a shot today.
I am probably going to end up grabbing the Jenkins source at some point, but that will probably be later next week, if then.
There is a pull request being worked on by Martin Karing you might want to look at and comment on. Link is attached to this issue https://github.com/jenkinsci/durable-task-plugin/pull/21
First off, I found a better workaround. Go to the console screen for the job and click the abort button in the upper right area. Scroll to the bottom and wait about 10 seconds. You will see a link allowing you to force kill the job.
Started At: 05-06-2016 20:00:08
Ended At: 05-06-2016 20:02:19
Build Lasted: 2 minutes 10 seconds
Highest Error Code: 0
<hang here>
Aborted by admin
Sending interrupt signal to process
Click here to forcibly terminate running steps
Terminating bat
[Pipeline] }
[Pipeline] // node
[Pipeline] End of Pipeline
Finished: ABORTED
Unfortunately I'm seeing an almost 100% chance of hangs on my machine, so this is still pretty useless.
-------------------------
I acquired Karing's code and tested it. It terminates my bat build steps after about 3-5 seconds no matter what the state is. I didn't dig too much into that.
I reverted to baseline and tried my suggestions. None of them work. In fact, I can't prove that the script call from jenkins-wrap.bat ever returns...
Here is my current attempt at jenkins-wrap.bat:
cmd /c ""...\jenkins-main.bat"" > "...\jenkins-log.txt" 2>&1
:retry
echo writing jenkins.results >> "...\jenkins-log.txt"
echo %ERRORLEVEL% > "...\jenkins-result.txt"
if not exist "...\jenkins-result.txt" goto retry
It is supposed to jackhammer that result file until it is created. I see no "writing jenkins.results" in the logs, therefore the wrapper script is terminating early. The same thing happens if I replace cmd /c with call or start.
(Note: "..." is a placeholder for my real paths, not some kind of relative path thing. Sorry for the confusion.)
I believe I have a 100% reproducible test case for this issue.
It seems that when the parent process has been killed (e.g. the slave dies), then although the script terminates successfully and the wrapper terminates successfully (checked with Process Monitor), there is no attempt to create the result file.
Where the parent process has not been killed I never see this issue.
All I needed to do to fix the issue I was observing was to add @echo off as the first line of the wrapper script. Basically I believe it is trying to echo the commands to be run before running them, and as there is no longer anything consuming the wrapper's input/output, echoing the command is doomed to fail (but strangely not with an exit code that implies something died!).
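As a loose analogy for the hypothesis above (output written after the consumer has gone away), the Python sketch below writes to a pipe whose read end has already been closed. It does not reproduce cmd.exe or the wrapper script; it only shows the general failure mode, and the function name is made up for the illustration.

```python
import os


def write_after_reader_gone():
    """Write to a pipe whose read end is closed; return the error, if any.

    Illustrative analogy only: the closed read end plays the role of the
    parent process that used to consume the wrapper's output.
    """
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # the consumer goes away
    try:
        os.write(write_fd, b"echo %ERRORLEVEL%\n")
        return None
    except OSError as exc:  # BrokenPipeError on POSIX systems
        return exc
    finally:
        os.close(write_fd)


if __name__ == "__main__":
    print(repr(write_after_reader_gone()))
```

If the hypothesis is right, @echo off helps simply because the wrapper then has nothing to write before it reaches the line that creates the result file.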
I'm also having a build hang with simple pipeline that uses bat script to run Gradle tasks:
node {
    timeout(time: 10, unit: 'MINUTES') {
        timestamps {
            stage 'Checkout'
            git ...
            stage 'Tests'
            bat 'gradlew test'
            step([$class: 'JUnitResultArchiver', testResults: 'build/test-results/*.xml'])
        }
    }
}
For anyone observing the issue you can try installing the build from PR24 or PR21 and see if this resolves your issue. (I would start with PR24 first as it is a much smaller change, but then I am biased!)
Code changed in jenkins
User: James Nord
Path:
src/main/java/org/jenkinsci/plugins/durabletask/WindowsBatchScript.java
http://jenkins-ci.org/commit/durable-task-plugin/d156ebfbcdb70666757ff48127d0597bd5891a61
Log:
JENKINS-34150 Fixes my observed issue.
I have a reproducable tests case in a propratary implementation using this
code that is 100% reproducable.
The simple "@echo off" fixes the failing test for me.
It seems that in the case that the parent process has been killed (e.g.
the slave dies) then all though the script terminates successfully and the
wrapper terminates successfully (checked with process monitor) there is no
attempt to create the result file.
where the parent process has not been killed I never see this issue.
All I needed to do to fix the issue I was observing is add @echo off as
the first line of the wrapper script. Basically I believe it is trying to
echo the commands to be run before running the commands and as there is no
longer anything consuming the wrappers input/output when echoing the
command it is doomed to fail (but strangely not with an exit code that
implies something died!??!
Code changed in jenkins
User: Jesse Glick
Path:
src/main/java/org/jenkinsci/plugins/durabletask/WindowsBatchScript.java
http://jenkins-ci.org/commit/durable-task-plugin/8a2537cf28c826ad91d4ce14cd657712364c8953
Log:
Merge pull request #24 from jtnord/jenkins-34150
[FIXED JENKINS-34150] Fixes my observed issue.
Compare: https://github.com/jenkinsci/durable-task-plugin/compare/0f09bb54a1b7...8a2537cf28c8
The test job below used to consistently get stuck at the 8mins stage. After upgrading the Durable Task Plugin to 1.10, the job passed successfully (once so far).
node {
    stage '1min'
    bat '''
    @ECHO OFF
    FOR /L %%A IN (0,1,60) DO (
        ECHO %%A
        PING 192.0.2.1 -n 1 -w 1000 >NUL
    )
    EXIT /B 0
    '''
    stage '5mins'
    bat '''
    @ECHO OFF
    FOR /L %%A IN (0,1,300) DO (
        ECHO %%A
        PING 192.0.2.1 -n 1 -w 1000 >NUL
    )
    EXIT /B 0
    '''
    stage '8mins'
    bat '''
    @ECHO OFF
    FOR /L %%A IN (0,1,480) DO (
        ECHO %%A
        PING 192.0.2.1 -n 1 -w 1000 >NUL
    )
    EXIT /B 0
    '''
    stage '10mins'
    bat '''
    @ECHO OFF
    FOR /L %%A IN (0,1,600) DO (
        ECHO %%A
        PING 192.0.2.1 -n 1 -w 1000 >NUL
    )
    EXIT /B 0
    '''
    stage '30mins'
    bat '''
    @ECHO OFF
    FOR /L %%A IN (0,1,1800) DO (
        ECHO %%A
        PING 192.0.2.1 -n 1 -w 1000 >NUL
    )
    EXIT /B 0
    '''
    stage '1hr'
    bat '''
    @ECHO OFF
    FOR /L %%A IN (0,1,3600) DO (
        ECHO %%A
        PING 192.0.2.1 -n 1 -w 1000 >NUL
    )
    EXIT /B 0
    '''
}
Initial testing here shows the issue is resolved.
Just getting back from vacation so nice to see this resolved. Thanks all for the good work.
Thanks guys. Great job. Updating to the newest version of the durable-task-plugin solved the issue for me as well!
Hello,
Since version 1.20 of the Durable Task plugin we have encountered the same behavior described above.
Our worker nodes run Win2012 and Win2018 environments.
A quick solution was to downgrade the plugin to version 1.18.
Then everything works again in the DSL Jenkinsfiles with the bat(ch) step.
stevenfoster Does it work if you set 'returnStdout: true'? If so, I have an attached hotfix for you to try. Please let us know if this resolves it. durable-task.hpi
lidl Does it work if you set 'returnStdout: true'? If so, please try the attached hotfix and let us know if this resolves it.
returnStdout: true has the same result. I confirmed the process has finished on the machine.
stevenfoster In the control directory in the build agent's workspace for this job (where there's a jenkins-main.bat and jenkins-wrap.bat) do you see a jenkins-result.txt file, a jenkins-result.txt.tmp, or both?
I'm trying to figure out what actually triggered this because it did not fail any of our unit tests that should have explicitly covered this functionality.
stevenfoster Okay, that means the "move" operation failed to rename the file, which should basically never happen. Would you be able to hop on #jenkins IRC briefly to discuss (I'm svanoort there)? It should be a quick and trivial fix but since we can't reproduce the issue in our own environment, it would be super-helpful to be able to see a case where it happened.
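The question about jenkins-result.txt.tmp and the failed "move" suggests a write-then-rename pattern for the result file. A minimal Python sketch of that pattern follows; it is an assumption about the general technique, not the plugin's wrapper script, and both function names are hypothetical.

```python
import os
import tempfile


def write_result_atomically(control_dir, exit_code):
    """Write the exit code to a .tmp file, then rename it into place."""
    final_path = os.path.join(control_dir, "jenkins-result.txt")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w") as fh:
        fh.write(str(exit_code))
    # After the rename, a reader sees either no result file or a complete one.
    os.replace(tmp_path, final_path)
    return final_path


def diagnose(control_dir):
    """Report (result exists, tmp exists), mirroring the question in the thread."""
    names = set(os.listdir(control_dir))
    return ("jenkins-result.txt" in names, "jenkins-result.txt.tmp" in names)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as control_dir:
        print(diagnose(control_dir))
        write_result_atomically(control_dir, 0)
        print(diagnose(control_dir))
```

A control directory where only the .tmp file exists would correspond to the rename step not having happened, which is the situation being diagnosed here.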
svanoort: I cannot test it again, as it is a corporate Jenkins and I cannot upgrade and downgrade the plugins on the fly.
How else can I help you?
lidl Don't worry about it, stevenfoster was thankfully available to help debug in an environment where this is reproducible (thanks!).
I'm attaching one more hotfix version for him to try out which should fully resolve the issue. durable-task.hpi
This issue duplicates symptoms of JENKINS-50025 but the root causes are significantly different, so that is being tracked separately.
lidl stevenfoster I'm closing THIS issue because while the result is the same this has a different cause and is resolved in JENKINS-50025.
lidl stevenfoster Released fix after review and testing as durable-task-plugin 1.21
I agree with Antonio that this is a durable task issue. While troubleshooting this I reverted a logging change in durable task and the issue could no longer be replicated. I suspected a locking issue with the result file, where the durable task was locking the result file so that the batch could not create it. But the current 1.9 version of the durable task plugin does not capture the output of the wrapper batch file. When I enable the logging of that information, the issue no longer shows up. A true heisenbug: the act of looking for the cause of the problem causes it to no longer exist. I submitted a pull request to the durable task plugin with the 3-line change that I made, reverting to the older logging code in that project. https://github.com/jenkinsci/durable-task-plugin/pull/20
While this resolves the issue, I do not feel that it is a fix. Logging or not logging should not cause batch to work or not work. It is masking the underlying issue where the result file does not get created. But without the logging you can't see why the result file is not created.
Multiple executions is one of the ways that I have triggered it also.