[JENKINS-34150] Pipeline Batch hangs

Antonio Muñiz added a comment - 2016-04-18 09:50

nitram Thanks for the info. I'm going to move the master to a Windows box too, as that is the only difference I can see between my environment and yours.

Antonio Muñiz added a comment - 2016-04-18 10:33

The issue is reproducible only if two batch durable tasks run concurrently on the master node.

Antonio Muñiz added a comment - 2016-04-18 11:04

For some reason the result file is not being created; this line must be failing to execute (but I cannot find any log):

echo %ERRORLEVEL% > "[JENKINS_HOME]\my-job@tmp\durable-acc8d5a4\jenkins-result.txt"

While the build was hanging, I manually executed jenkins-wrap.bat and that made the execution finish (as the result file was created).

Trying to see why the result file is not being written.
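
To make the failure mode concrete, here is a minimal Java sketch of the polling idea discussed in this thread. It is an illustration under assumptions, not the plugin's actual code: the class ResultFilePoller and its methods are hypothetical names. The step counts as finished only once jenkins-result.txt exists and holds an integer, so a wrapper that never reaches its final echo line leaves the caller polling forever.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.OptionalInt;

// Hypothetical helper for illustration; not part of the durable-task plugin.
final class ResultFilePoller {
    private ResultFilePoller() {}

    /** Parses what "echo %ERRORLEVEL% > file" leaves behind, e.g. "0 \r\n". */
    static OptionalInt parseExitCode(String fileContent) {
        String text = fileContent.trim();
        if (text.isEmpty()) {
            return OptionalInt.empty(); // file created but nothing written yet
        }
        return OptionalInt.of(Integer.parseInt(text));
    }

    /** One polling attempt: empty means "not finished yet, ask again later". */
    static OptionalInt poll(Path resultFile) {
        try {
            byte[] bytes = Files.readAllBytes(resultFile);
            return parseExitCode(new String(bytes, StandardCharsets.UTF_8));
        } catch (NoSuchFileException e) {
            return OptionalInt.empty(); // wrapper has not written the file (yet)
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If the wrapper is killed before its last line, poll() keeps returning empty, which matches the hang described above.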

Daniel Beck added a comment - 2016-04-18 11:09

Wild guess: Spaces in JENKINS_HOME path when installing using the installer?

Antonio Muñiz added a comment - 2016-04-18 11:15

No. It's reproducible for me without spaces in the path (and I didn't use the installer, just plain java -jar mode).

Antonio Muñiz added a comment - 2016-04-18 11:43

Not a regression in core at least, reproduced in 1.651.1 (Pipeline Durable Task Step 2.0 + Durable Task 1.9)

Daniel Daugherty added a comment - 2016-04-18 12:18

Antonio, if you apply my pull request (20) you will see that the logs for jenkins-wrap.bat include the echo command being run, but you will no longer experience the error. Likewise, you will not encounter the error if you apply Martin's pull request (21). The main thing both changes do is keep a reference to Launcher.ProcStarter ps after doLaunch() is called. To me this points to a possible GC issue where the Proc is destroyed before the task completes, which causes the wrapper to not finish. But when I watch Performance Monitor in Windows while the job is running, I don't see the command prompts being killed early. So it may be a locking issue, as Martin mentioned earlier, where checking for the existence of the result file prevents the result from being created. But that does not explain why keeping a reference to the PS instance makes the bug go away. That points more to something happening due to GC. If this were C I would call it a use-after-free error, where PS is no longer referenced but is still expected to keep doing things.

Martin Karing added a comment - 2016-04-18 15:03

dpd_30: Actually it does explain why maintaining the reference to the process causes the bug to disappear. My pull request works because it does not monitor the presence of the file; instead it waits until the process has terminated, and only after the process is no longer present does it look for the file. This way there is no file checking done as long as the process is active, and the batch file is able to create the result file without any problems.
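
The ordering described here (wait for the process first, touch the file only afterwards) can be sketched in Java as follows. This is an illustrative sketch and not the code of either pull request; ExitThenRead and its parameters are hypothetical names, and the process and the file are abstracted as suppliers so the ordering itself is visible.

```java
import java.util.OptionalInt;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Hypothetical illustration of "check the process first, the file second".
final class ExitThenRead {
    private ExitThenRead() {}

    /**
     * Returns empty while the process is still alive, without touching the
     * result file at all. Only once the process is gone is the file read.
     */
    static OptionalInt check(BooleanSupplier processAlive, Supplier<OptionalInt> readResultFile) {
        if (processAlive.getAsBoolean()) {
            return OptionalInt.empty(); // no file access while the batch is running
        }
        return readResultFile.get();
    }
}
```

With a java.lang.Process p, a caller could pass p::isAlive as the first argument. The cost is that the caller must hold on to a live process handle for the whole run.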

Antonio Muñiz added a comment - 2016-04-18 15:27

This way there is no file checking done as long as the process is active

Right. As I noted in the PR, keeping that reference is just what this plugin is trying to avoid.

I think GC is probably the culprit. Perhaps if we keep a transient instance field private transient Launcher.ProcStarter ps in WindowsBatchScript, it will be prevented from being collected. I'm currently testing this option.

Antonio Muñiz added a comment - 2016-04-18 15:35

No, it does not work.

Martin Karing added a comment - 2016-04-18 16:45

This issue is really annoying. I tried to track it with the SysInternals Process Monitor. As soon as the monitor runs the issue does not happen any more.

Also, I tried to alter the wrapping batch file to check whether the file was created after writing the error code and, if not, to try again. This does not resolve the issue. It seems like the batch file "sees" the jenkins-result.txt during its execution.

Martin Karing added a comment - 2016-04-18 18:22

Okay, the reason why the loop in the batch file did not work is that something is killing the entire batch file structure before it finishes. I have no idea why this happens and I can't track it because process monitor seems to do something that stops this from happening.

I was able to get it running by changing the script so that the wrapper is executed through an additional "start" command. This causes command line windows to pop up all over my desktop, but it allowed the entire thing to execute properly. This approach has the massive disadvantage that there is no way to terminate the script process if the script itself hangs, because it runs fully detached and there is no reference to the process.

On the other hand the purpose of this plugin is to allow scripts to run across restarts of Jenkins. So it has to run as a detached process from Jenkins, so the JVM doesn't tear it down along with it, but we need a serializable reference to the process so it's possible to locate it again. Just in case it is required to terminate it after a Jenkins reboot.
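
One way to picture a "serializable reference to the process" is to persist only the PID and look the process up again later. The sketch below is hypothetical (DetachedProcessRef is not a plugin class) and assumes Java 9 or later for ProcessHandle, which was not available on the Java versions Jenkins ran on at the time of this thread.

```java
import java.io.Serializable;
import java.util.Optional;

// Hypothetical sketch; assumes Java 9+ (ProcessHandle).
final class DetachedProcessRef implements Serializable {
    private static final long serialVersionUID = 1L;

    private final long pid; // the only state that has to survive a restart

    DetachedProcessRef(long pid) {
        this.pid = pid;
    }

    long pid() {
        return pid;
    }

    /** Looks the process up again by PID; empty if no such process exists. */
    Optional<ProcessHandle> locate() {
        return ProcessHandle.of(pid);
    }

    boolean isAlive() {
        return locate().map(ProcessHandle::isAlive).orElse(false);
    }

    /** Requests termination; false if the process was not found or not signalled. */
    boolean terminate() {
        return locate().map(ProcessHandle::destroy).orElse(false);
    }
}
```

A PID alone is a weak identity, since the operating system can reuse it for an unrelated process after the original one exits, so a real implementation would need an additional check before terminating anything. Note also that ProcessHandle.destroy() refuses to act on the current JVM's own process.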

KK G added a comment - 2016-04-18 18:27

It's related to https://issues.jenkins-ci.org/browse/JENKINS-33164. Just attaching a simple repro from that bug.
Pipeline code is:
node('master') {
    for (int i = 0; i < 100; ++i) {
        bat('echo "Hello from batch file."' + i.toString())
    }
}
Click "build now" 5 times.
All 5 jobs get stuck on Windows. Please help. Thanks.

Antonio Muñiz added a comment - 2016-04-19 08:15

I was not able to reproduce the issue in a debug session and did not manage to diagnose why jenkins-wrap.bat is not fully executed (so jenkins-result.txt is not created) and the bat step never finishes. If someone with more Windows background can shed some light here, it would be great.

Perhaps the additional start command proposed by nitram is the least ugly fix. What do others think? jglick?

In the meantime, the workaround is to use a build agent (other than master), even being in the same physical machine.

Martin Karing added a comment - 2016-04-19 08:35

I was able to track down that the batch process is forcefully terminated.
If you run the wrapper batch by hand and close the command line window before the process finishes you get exactly the same behaviour. The main and the child processes are terminated and no files are created. The only thing that is attached to the command line actually is java. So the termination has to come from there.

The thing I wonder is: Can all this even work across a jenkins restart? If Java terminates its child processes, this would kill the command line execution no matter what.

I think there are solutions to work around this using powershell or the scripting host. But those may be blocked on the host system.

KK G added a comment - 2016-04-19 17:08

Just gave a try. "workaround is to use a build agent (other than master), even being in the same physical machine." really works! Thanks. At least, I can proceed.

BTW, I notice that on the same machine the master node reports "Windows Server 2012 R2 (x86)", while the client reports "Windows Server 2012 R2 (amd64)". I wonder whether the bug is triggered by a corner case related to machine architecture.

Jesse Glick added a comment - 2016-04-26 15:24

Can all this even work across a jenkins restart?

If you are using a master executor, not generally. (On Unix, it works under some conditions but not others.)

You are strongly recommended to use an agent rather than master executors in general. In particular, if you have any kind of layered security on your Jenkins installation—whereby people configuring jobs (or permitted to edit build scripts in SCM) are not Jenkins administrators—you must not have a master executor, or any pretense at security is gone. Even if only one physical computer is available, you must configure a separate service account for builds.

All that said, if the problem can be fixed—or at least clearly diagnosed and reported—without breaking anything for the more general use case of an agent on another machine, obviously we want to apply a fix.

Lübbe Onken added a comment - 2016-04-28 08:17

This hanging batch bug has bitten me heavily too. Failing batch jobs always terminated, successful jobs never did.
A working solution for me is to explicitly return an exit code from any batch call. I'm on Windows 7 Professional.

So:
echo 'Successful step'
bat '''dir
exit /B %ERRORLEVEL%'''

echo 'Failing step'
bat '''find /c "_no.file" "_no.file"
exit /B %ERRORLEVEL%'''

echo 'Never execute step'
bat '''dir
exit /B %ERRORLEVEL%'''

successfully terminates step one and returns from the batch execution, the second step fails the build and the third step never gets executed.

Can somebody please confirm that this solution works for them too?

Lübbe Onken added a comment - 2016-04-28 08:35 - edited

Looks like I was too optimistic. The solution always worked with short running batch jobs, like dir, but it didn't with long running jobs, like a NAnt build.
Is it possible that there is a race condition? Some state is checked very quickly after a task is started. A simple "dir" is quick enough to deliver the result in time and a slower task isn't?

Christophe Carpentier added a comment - 2016-04-28 08:58

That would explain the inconsistent results during my tests. Weird.
Anyway, I've encountered this both on failed and successful jobs.

Rens Hoskens added a comment - 2016-04-28 11:44 - edited

I have the same issue on Windows Server 2008 R2. Hope it will get fixed soon (or a plain maven command would be useful as well).

node {
    mvn 'clean package -DskipTests=true'
}

def mvn(args) {
    bat "${tool 'Maven 3.3.9'}/bin/mvn ${args}"
}

Nick Sonneveld added a comment - 2016-05-03 11:30 - edited

I have mentioned this in other related tickets, but I just want to point out that I have seen this behaviour with a Linux master and multiple Windows agents (with 5-10 executors on each). You could try this example code. I haven't tested it, but it's similar to our Jenkinsfile, where we have branches doing chunks of a test. Cancelling the job in the middle of execution sometimes puts the agents in a weird state too.

def branches = [:]

for (int i = 0; i < 64; i++) {
	def id = "branch-${i}"
	branches[id] = {
		node ('windows') {
			for (int j = 0; j < 8; j++) {
			    bat 'ping 127.0.0.1 -n 10' 
			}
		}
	}
}

parallel branches

Gijs Kuijer added a comment - 2016-05-04 13:07

I have the exact same issue on Windows Server 2012 R2 with a Jenkins 2.1 installation and all plugins fully updated.
I have installed the GitHub Organization Folder plugin to scan my organization.

My Jenkinsfile has a simple batch step that uses MSBuild to build our project and a batch step for Sonar analysis.
The job randomly hangs after one of these two steps.

Is there any progress on this issue?

maaltan natlaam added a comment - 2016-05-05 13:38

I have this issue also. My batch invocation is:

bat '''
call %BUILD_CONFIG_PATH%
setenv.cmd
perl <custom build manager script that typically runs for 90 minutes>
'''

This has broken the job completely. I cannot terminate the job, nor can I start another (this is an incremental build with a fixed workspace location, so I don't want to run concurrently). If I restart Jenkins, the job restarts and immediately hangs again.

Is there any workaround short of a full uninstall/reinstall of Jenkins to recover this job setup?

Wilson Tian added a comment - 2016-05-05 14:10 - edited

I encountered this issue too. I'm running a Maven job using:

bat "${mavenHome}\\bin\\mvn clean package"

But the job always hangs at the end and never exits.
Is there any workaround?

maaltan natlaam added a comment - 2016-05-05 21:36

I found a workaround (better than reinstalling at least).

1. Shut down the Jenkins service.
2. Go to <install>/jobs/<jobname>/ and delete the <jobnumber> folder.
3. Restart Jenkins.

There is probably a flag somewhere in that folder you can set to prevent the job from "restarting" after restart.

Since I've hit this bug in about 80% of the runs I've tried so far, this workaround is unusable in any kind of production environment. At least you can run the job again, though. I guess another workaround would be to allow multiple instances of the job to run and clean up the zombies once a day or something?

Also, it seems this happens more often when I view the console output via the Jenkins UI while the job is running.

Nick Sonneveld added a comment - 2016-05-06 02:39 - edited

Another workaround that doesn't involve deleting jobs (but also isn't a long term solution) is to realise that batch steps create two batch files in the @tmp directory (which is relative to where batch is run, so it might be in the workspace if you've changed the directory, or just outside it): a jenkins-main.bat and a jenkins-wrap.bat. The main bat file contains your commands. The wrap bat file will run the main bat, pipe output to a log file and finally writes a result file.

The bug concerns the wrap bat file not completing so the result file is never written. You can search for the file and run the final line manually (looks like echo %errorlevel% > ...\jenkins-result.txt), or run the wrap batch file again if you don't mind it performing the same operation again.
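
To find which steps are affected before applying this by hand, something like the following Java sketch could list control directories that have no result file yet. It is a hypothetical helper, not part of Jenkins; the durable-* directory pattern and the jenkins-result.txt file name are taken from the paths quoted earlier in this thread. A directory without a result file may also belong to a step that is simply still running, so the output is only a list of candidates.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Jenkins or the plugin.
final class StuckStepFinder {
    private StuckStepFinder() {}

    /** Lists durable-* directories under an @tmp directory that lack jenkins-result.txt. */
    static List<Path> controlDirsWithoutResult(Path tmpDir) {
        List<Path> candidates = new ArrayList<>();
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(tmpDir, "durable-*")) {
            for (Path dir : dirs) {
                boolean hasResult = Files.exists(dir.resolve("jenkins-result.txt"));
                if (Files.isDirectory(dir) && !hasResult) {
                    candidates.add(dir); // still running, or stuck as described above
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return candidates;
    }
}
```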

maaltan natlaam added a comment - 2016-05-06 13:48

cmd /c ""<script> > ".../jenkins-log.txt"" 2>&1
echo %ERRORLEVEL% > "...\..@tmp\durable-cf7a3b23\jenkins-result.txt"

Perhaps using "call" will work better. That will use the current cmd shell to execute the batch file. I've found it more stable than launching a second cmd from a batch file. "start" is another option. That gives you a subshell that is detached from the main shell, though there are parameters that prevent that. Other bonuses of start are the ability to set process priorities, etc.

If you know offhand which jar/class I need to hack to make this change, I'll try to give it a shot today.

I am probably going to end up grabbing the Jenkins source at some point, but that will probably be later next week, if then.

Nick Sonneveld added a comment - 2016-05-06 15:28

There is a pull request being worked on by Martin Karing you might want to look at and comment on. Link is attached to this issue https://github.com/jenkinsci/durable-task-plugin/pull/21

maaltan natlaam added a comment - 2016-05-09 14:12

First off, I found a better workaround. Go to the console screen for the job and click the abort button in the upper right area. Scroll to the bottom and wait about 10 seconds. You will see a link allowing you to force kill the job.

Started At: 05-06-2016 20:00:08
Ended At: 05-06-2016 20:02:19
Build Lasted: 2 minutes 10 seconds
Highest Error Code: 0
<hang here>
Aborted by admin
Sending interrupt signal to process
Click here to forcibly terminate running steps
Terminating bat
[Pipeline] }
[Pipeline] // node
[Pipeline] End of Pipeline
Finished: ABORTED

Unfortunately I'm seeing an almost 100% chance of hangs on my machine so still pretty useless.

-------------------------

I acquired Karing's code and tested it. It terminates my bat build steps after about 3-5 seconds no matter what the state is. I didn't dig too much into that.

I reverted to baseline and tried my suggestions. None of them work. In fact, I can't prove that the script call from jenkins-wrap.bat ever returns...

Here is my current attempt at jenkins-wrap.bat:
cmd /c ""...\jenkins-main.bat"" > "...\jenkins-log.txt" 2>&1
:retry
echo writing jenkins.results >> "...\jenkins-log.txt"
echo %ERRORLEVEL% > "...\jenkins-result.txt"
if not exist "...\jenkins-result.txt" goto retry

It is supposed to jackhammer that results file until it is created. I see no "writing jenkins.results" in the logs, therefore the wrapper script is terminating early. The same thing happens if I replace cmd /c with call or start.
(Note: "..." is a placeholder for my real paths, not some kind of relative path thing. Sorry for the confusion.)

James Nord added a comment - 2016-05-19 12:45

I believe I have a 100% reproducible test-case for this issue

It seems that when the parent process has been killed (e.g. the slave dies), then although the script terminates successfully and the wrapper terminates successfully (checked with Process Monitor), there is no attempt to create the result file.
Where the parent process has not been killed I never see this issue.

All I needed to do to fix the issue I was observing was to add @echo off as the first line of the wrapper script. Basically, I believe it is trying to echo the commands to be run before running them, and as there is no longer anything consuming the wrapper's input/output, echoing the command is doomed to fail (but strangely not with an exit code that implies something died!).
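
As an illustration of that change, the Java sketch below renders a wrapper script with @echo off as its first line. It is a guess at the shape only: the two command lines are modelled on the jenkins-wrap.bat contents quoted earlier in this thread, and WrapperScript.render is a hypothetical name, not the plugin's API or the content of any pull request.

```java
// Hypothetical generator for illustration; not the plugin's actual code.
final class WrapperScript {
    private WrapperScript() {}

    /** Renders a wrapper that silences command echoing before doing anything else. */
    static String render(String mainScript, String logFile, String resultFile) {
        return String.join("\r\n",
                // Stop cmd from echoing each command to a console nobody may be reading.
                "@echo off",
                // Run the user's script, sending stdout and stderr to the log file.
                "cmd /c \"\"" + mainScript + "\"\" > \"" + logFile + "\" 2>&1",
                // Record the exit code; the step is only considered done once this exists.
                "echo %ERRORLEVEL% > \"" + resultFile + "\"",
                "");
    }
}
```

The paths passed in are placeholders supplied by the caller; nothing here depends on a particular directory layout.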

Dmitry Vyazelenko added a comment - 2016-05-19 13:09

I'm also seeing a build hang with a simple pipeline that uses a bat step to run Gradle tasks:

node {
    timeout(time: 10, unit: 'MINUTES') {
        timestamps {
            stage 'Checkout'
            git ...
            
            stage 'Tests'
            bat 'gradlew test'
            step([$class: 'JUnitResultArchiver', testResults: 'build/test-results/*.xml'])
        }
    }
}

James Nord added a comment - 2016-05-19 13:21 - edited

For anyone observing the issue you can try installing the build from PR24 or PR21 and see if this resolves your issue. (I would start with PR24 first as it is a much smaller change, but then I am biased!)

Christophe Carpentier added a comment - 2016-05-19 14:23

PR24 fixes my particular issue.

SCM/JIRA link daemon added a comment - 2016-05-19 15:15

Code changed in jenkins
User: James Nord
Path:
src/main/java/org/jenkinsci/plugins/durabletask/WindowsBatchScript.java
http://jenkins-ci.org/commit/durable-task-plugin/d156ebfbcdb70666757ff48127d0597bd5891a61
Log:
JENKINS-34150 Fixes my observed issue.

I have a reproducible test case in a proprietary implementation using this
code that is 100% reproducible.
The simple "@echo off" fixes the failing test for me.

It seems that when the parent process has been killed (e.g. the slave
dies), then although the script terminates successfully and the wrapper
terminates successfully (checked with Process Monitor), there is no
attempt to create the result file.
Where the parent process has not been killed, I never see this issue.
All I needed to do to fix the issue I was observing was add @echo off as
the first line of the wrapper script. Basically, I believe it is trying to
echo the commands before running them, and as there is no longer anything
consuming the wrapper's input/output, echoing the command is doomed to
fail (but strangely not with an exit code that implies something died!).

SCM/JIRA link daemon added a comment - 2016-05-19 15:15

Code changed in jenkins
User: Jesse Glick
Path:
src/main/java/org/jenkinsci/plugins/durabletask/WindowsBatchScript.java
http://jenkins-ci.org/commit/durable-task-plugin/8a2537cf28c826ad91d4ce14cd657712364c8953
Log:
Merge pull request #24 from jtnord/jenkins-34150

[FIXED JENKINS-34150] Fixes my observed issue.

Compare: https://github.com/jenkinsci/durable-task-plugin/compare/0f09bb54a1b7...8a2537cf28c8

Sam Van Oort added a comment - 2016-05-19 17:55

Most important single-line-of-code change I've seen recently.

Junichi Kimura added a comment - 2016-05-19 23:04 - edited

The test job below used to consistently get stuck at the 8mins stage. After upgrading the Durable Task Plugin to 1.10, the job passed successfully (once so far).

node {
   stage '1min'
   bat '''
@ECHO OFF
FOR /L %%A IN (0,1,60) DO (
  ECHO %%A
  PING 192.0.2.1 -n 1 -w 1000 >NUL
)
EXIT /B 0
'''
   stage '5mins'
   bat '''
@ECHO OFF
FOR /L %%A IN (0,1,300) DO (
  ECHO %%A
  PING 192.0.2.1 -n 1 -w 1000 >NUL
)
EXIT /B 0
'''
   stage '8mins'
   bat '''
@ECHO OFF
FOR /L %%A IN (0,1,480) DO (
  ECHO %%A
  PING 192.0.2.1 -n 1 -w 1000 >NUL
)
EXIT /B 0
'''
   stage '10mins'
   bat '''
@ECHO OFF
FOR /L %%A IN (0,1,600) DO (
  ECHO %%A
  PING 192.0.2.1 -n 1 -w 1000 >NUL
)
EXIT /B 0
'''
   stage '30mins'
   bat '''
@ECHO OFF
FOR /L %%A IN (0,1,1800) DO (
  ECHO %%A
  PING 192.0.2.1 -n 1 -w 1000 >NUL
)
EXIT /B 0
'''
   stage '1hr'
   bat '''
@ECHO OFF
FOR /L %%A IN (0,1,3600) DO (
  ECHO %%A
  PING 192.0.2.1 -n 1 -w 1000 >NUL
)
EXIT /B 0
'''
}

Gijs Kuijer added a comment - 2016-05-20 05:39

Great job! Totally solves my issues!

Daniel Daugherty added a comment - 2016-05-20 20:30

Initial testing here shows the issue is resolved.
Just getting back from vacation, so it's nice to see this resolved. Thanks all for the good work.

Marc Rufer added a comment - 2016-05-27 09:44

Thanks guys. Great job. Updating to the newest version of the durable-task-plugin solved the issue for me as well!

Sven Brosi added a comment - 2018-03-08 12:54

Hello,

Since version 1.20 of the Durable Task plugin we have been seeing the same behavior described above.

Our worker nodes run Win2012 and Win2018 environments.

A quick solution was to downgrade the plugin to version 1.18.

After that, the bat(ch) step works again in our DSL Jenkinsfiles.

Steven Foster added a comment - 2018-03-08 14:00

I'm encountering the same issue after updating to 1.20 from 1.18

Sam Van Oort added a comment - 2018-03-08 14:41

stevenfoster Does it work if you set 'returnStdout: true'? If so, I have an attached hotfix for you to try. Please let us know if this resolves it. durable-task.hpi

Sam Van Oort added a comment - 2018-03-08 14:47

lidl Does it work if you set 'returnStdout: true'? If so, please try the attached hotfix and let us know if this resolves it.

Steven Foster added a comment - 2018-03-08 15:06

returnStdout: true has the same result. I confirmed that the process has finished on the machine.

Sam Van Oort added a comment - 2018-03-08 15:14

stevenfoster In the control directory in the build agent's workspace for this job (where there's a jenkins-main.bat and jenkins-wrap.bat) do you see a jenkins-result.txt file, a jenkins-result.txt.tmp, or both?

I'm trying to figure out what actually triggered this because it did not fail any of our unit tests that should have explicitly covered this functionality.

Steven Foster added a comment - 2018-03-08 15:16

just the .tmp

Sam Van Oort added a comment - 2018-03-08 15:22

stevenfoster Okay, that means the "move" operation failed to rename the file, which should basically never happen. Would you be able to hop on #jenkins IRC briefly to discuss (I'm svanoort there)? It should be a quick and trivial fix but since we can't reproduce the issue in our own environment, it would be super-helpful to be able to see a case where it happened.
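
The diagnosis above relies on a write-to-temporary-then-rename pattern. The following Python sketch illustrates that pattern and the three states a watcher can observe in the control directory. It is only an illustration under stated assumptions, not the plugin's implementation (which lives in a batch wrapper generated from Java): the function names are hypothetical, and only the file names jenkins-result.txt and jenkins-result.txt.tmp come from this thread.

```python
import os
import tempfile

RESULT = "jenkins-result.txt"
RESULT_TMP = "jenkins-result.txt.tmp"


def write_result(control_dir, exit_code):
    """Write the exit code to the .tmp file, then rename it into place."""
    tmp_path = os.path.join(control_dir, RESULT_TMP)
    with open(tmp_path, "w") as handle:
        handle.write(f"{exit_code}\n")
    # The rename is the step that, per the comment above, did not happen
    # when only the .tmp file was left behind.
    os.replace(tmp_path, os.path.join(control_dir, RESULT))


def diagnose(control_dir):
    """Classify the control directory the way the question above does."""
    if os.path.exists(os.path.join(control_dir, RESULT)):
        return "finished"
    if os.path.exists(os.path.join(control_dir, RESULT_TMP)):
        return "result written but rename did not complete"
    return "no result written yet"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as control_dir:
        print(diagnose(control_dir))
        write_result(control_dir, 0)
        print(diagnose(control_dir))
```

The point of the rename is that a watcher polling for jenkins-result.txt never reads a half-written file; the cost is that a failed rename leaves the step looking unfinished, which matches the hang reported here.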

Sven Brosi added a comment - 2018-03-08 16:02 - edited

svanoort: I cannot test it again, as this is a corporate Jenkins and I cannot upgrade and downgrade plugins on the fly.

How else can I help you?

Sam Van Oort added a comment - 2018-03-08 16:38

lidl Don't worry about it, stevenfoster was thankfully available to help debug in an environment where this is reproducible (thanks!).

I'm attaching one more hotfix version for him to try out which should fully resolve the issue. durable-task.hpi

Sam Van Oort added a comment - 2018-03-08 17:08

This issue duplicates symptoms of JENKINS-50025 but the root causes are significantly different, so that is being tracked separately.

Sam Van Oort added a comment - 2018-03-08 17:10

lidl stevenfoster I'm closing THIS issue because while the result is the same this has a different cause and is resolved in JENKINS-50025.

Sam Van Oort added a comment - 2018-03-08 17:29

lidl stevenfoster Released fix after review and testing as durable-task-plugin 1.21
