Type: Bug
Resolution: Unresolved
Priority: Critical
Environment: CentOS 6, JRE 7 or 8; Windows Server 2008 R2, JRE 7 or 8
Similar to JENKINS-22641, Jenkins no longer cleans up child processes when a job is stopped.
To reproduce: start any job with a surefire execution; once the tests start running, stop the job through the Jenkins interface. The surefire process continues to run.
Just like JENKINS-22641, this behavior started in 1.553, was resolved in 1.565.2, then started again in 1.587.
Rating this as critical because processes build up, eventually causing issues with running out of memory or hitting the nproc limit.
Links:
- is blocking: JENKINS-22641 Jenkins no longer kills running processes after job fails (Closed)
- is duplicated by: JENKINS-28968 Aborting builds does not kill surefire sub-process (Reopened)
- is related to: JENKINS-38807 Jenkins 2.7.4 seems to leave behind Java processes (on Windows agent) if the build is aborted/agent loses connection (Open)
[JENKINS-26048] Jenkins no longer cleaning up child processes when build stopped - as of 1.587
When you write it started in 1.587, do you mean that it worked in 1.586? Or do you not know when between 1.565.2 and 1.587 it started?
The problem is NOT present pre 1.553, and also NOT present from 1.565.2 through version 1.586.
So yes, the problem is NOT occurring in 1.586, which is the version we reverted to.
The problem IS present in 1.553 until fixed in 1.565.2, then it emerged again in 1.587 and continues in the current version.
If you could help pinpoint this issue, that'd help a lot.
- Is it only Maven or also Freestyle projects? Test the latter.
- Is it occurring when building on master, or on slave nodes? Both?
If you can reliably reproduce this, and know how to build Jenkins, a git bisect would help in pinpointing the responsible commit. Otherwise, a really idiot(me)-proof step-by-step instruction how to reproduce and test for this when compiling e.g. Jenkins itself would be great.
The issue happens on both master and slaves, no difference there.
For the freestyle test - Would it tell us what we need to know if we created a freestyle project, but configure it to perform the same maven steps?
Yes, that would be ideal.
With the Holiday break, it will be a few weeks before I can work on this more.
Has anyone else encountered this problem? Please chime in with details.
One of our maven jobs aborts because of a timeout during a test but the process on the slave is never stopped.
I created a freestyle job with an 'Invoke top-level Maven targets' build step. When the job aborts the process on the slave is also stopped.
During the holidays we updated from 1.574 to 1.594
I'm not sure if this is related:
I have configured my jobs to fail on timeout. This works in the freestyle project but not in the Maven project.
The Maven type build is marked as 'aborted'
Happens to me with a freestyle project with 1.598, Windows master, Linux slave (debian wheezy).
When I abort a job, it appears as aborted in Jenkins, but keeps running.
The following exception appears in the log.
Project #362 aborted
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at hudson.remoting.Request.call(Request.java:146)
at hudson.remoting.Channel.call(Channel.java:751)
at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:179)
at com.sun.proxy.$Proxy47.isAlive(Unknown Source)
at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:984)
at hudson.plugins.xshell.XShellBuilder.perform(XShellBuilder.java:140)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:761)
at hudson.model.Build$BuildExecution.build(Build.java:199)
at hudson.model.Build$BuildExecution.doRun(Build.java:160)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:536)
at hudson.model.Run.execute(Run.java:1718)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:89)
at hudson.model.Executor.run(Executor.java:240)
Added XShell Plugin to components list, as it is mentioned in the stack trace.
Here is my stack trace. I'm executing a freestyle matrix job. My master is on a Mac, and this matrix job runs on multiple heterogeneous slaves (hence the use of the XShell plugin), but on this particular occasion it was a slave instance on a Mac that we wanted to abort. This is what appeared in the Jenkins log (with project name and slave name edited):
INFO: TestMatrixMyProjectName/SLAVE=MyMacSlave #308 aborted
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at hudson.remoting.Request.call(Request.java:146)
at hudson.remoting.Channel.call(Channel.java:742)
at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:168)
at com.sun.proxy.$Proxy54.isAlive(Unknown Source)
at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:961)
at hudson.plugins.xshell.XShellBuilder.perform(XShellBuilder.java:140)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:756)
at hudson.model.Build$BuildExecution.build(Build.java:198)
at hudson.model.Build$BuildExecution.doRun(Build.java:159)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:529)
at hudson.model.Run.execute(Run.java:1706)
at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:232)
The processes on the Mac carried on running, even though it appeared from the Jenkins dashboard that the job had been aborted successfully.
I have now changed my jobs so that they don't use the XShell plugin (instead they use conditional build steps to run either a Windows BAT file or a Unix shell script). This has solved the problem with the child processes not being cleaned up. HOWEVER, I still get a stack trace in the Jenkins log file. This suggests to me that there are two separate problems here. The failure to clean up the child processes is definitely connected with the XShell plugin. The stacktrace is separate.
Here's an example of the stacktrace I get now (this particular one is from a different slave):
INFO: TestMatrixMyProjectName/SLAVE=MyUbuntuSlave #311 aborted
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at hudson.remoting.Request.call(Request.java:146)
at hudson.remoting.Channel.call(Channel.java:742)
at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:168)
at com.sun.proxy.$Proxy73.join(Unknown Source)
at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:956)
at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:137)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:97)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:66)
at org.jenkinsci.plugins.conditionalbuildstep.BuilderChain.perform(BuilderChain.java:71)
at org.jenkins_ci.plugins.run_condition.BuildStepRunner$2.run(BuildStepRunner.java:110)
at org.jenkins_ci.plugins.run_condition.BuildStepRunner$Fail.conditionalRun(BuildStepRunner.java:154)
at org.jenkins_ci.plugins.run_condition.BuildStepRunner.perform(BuildStepRunner.java:105)
at org.jenkinsci.plugins.conditionalbuildstep.ConditionalBuilder.perform(ConditionalBuilder.java:133)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:756)
at hudson.model.Build$BuildExecution.build(Build.java:198)
at hudson.model.Build$BuildExecution.doRun(Build.java:159)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:529)
at hudson.model.Run.execute(Run.java:1706)
at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:232)
I just marked JENKINS-28968 as a duplicate of this. Over there, I have a simple procedure for creating an offending Maven build type job that has this problem. One thing I noticed was that one process is killed after aborting.
Just before aborting:
$ ps aux | grep sure
[user] 4220 0.0 0.0 113120 1188 ? S 10:12 0:00 /bin/sh -c cd /home/ussuser/jenkins/workspace/sleeptest && /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.51-2.4.5.5.el7.x86_64/jre/bin/java -jar /home/[user]/jenkins/workspace/sleeptest/target/surefire/surefirebooter449566822541979931.jar /home/[user]/jenkins/workspace/sleeptest/target/surefire/surefire7684357083774633779tmp /home/[user]/jenkins/workspace/sleeptest/target/surefire/surefire_01856690741005869733tmp
[user] 4222 13.0 0.1 6102248 31668 ? Sl 10:12 0:00 java -jar /home/[user]/jenkins/workspace/sleeptest/target/surefire/surefirebooter449566822541979931.jar /home/[user]/jenkins/workspace/sleeptest/target/surefire/surefire7684357083774633779tmp /home/[user]/jenkins/workspace/sleeptest/target/surefire/surefire_01856690741005869733tmp
After aborting:
$ ps aux | grep sure
[user] 4222 5.2 0.1 6102248 31612 ? Sl 10:12 0:00 java -jar /home/[user]/jenkins/workspace/sleeptest/target/surefire/surefirebooter449566822541979931.jar /home/[user]/jenkins/workspace/sleeptest/target/surefire/surefire7684357083774633779tmp /home/[user]/jenkins/workspace/sleeptest/target/surefire/surefire_01856690741005869733tmp
I had a chance to look at it a little more. I don't understand it fully, but I think the big difference is that the process tree is not referenced in the maven job. As such, we kill the maven instance itself but child processes (such as surefire) aren't stopped. I'm not sure why maven isn't taking care of that itself – does Jenkins kill -9 maven?
I think the other question is why Jenkins isn't using the process tree to kill the entire tree in maven jobs. It seems like it would be easy for a maven job to spin off a child (like surefire), which could exec another child.
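The orphaning described above can be demonstrated with a small shell experiment (a sketch, not Jenkins or Maven code): kill only the parent process, the way killing just the maven JVM would, and observe that its child keeps running.

```shell
# Parent spawns a background child (sleep) and waits for it,
# roughly like maven spawning a surefire JVM.
sh -c 'sleep 30 & wait' &
PARENT=$!
sleep 1                               # let the child start

CHILD=$(pgrep -P "$PARENT")           # PID of the sleep child
kill -9 "$PARENT"                     # kill only the parent
sleep 1

# The child is orphaned (re-parented to PID 1 or a subreaper) but still alive.
SURVIVED=no
if kill -0 "$CHILD" 2>/dev/null; then
  SURVIVED=yes
  kill "$CHILD"                       # clean up the demo process
fi
echo "child survived parent: $SURVIVED"
```

This is exactly why killing a single PID is not enough and the whole process tree (or process group) has to be targeted.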
This has been a massive impediment for our team for almost a year now. We have a story branch workflow with 10+ jobs in parallel on average. Every time a build is aborted I have to manually log onto the server via SSH and kill Surefire or Failsafe instances along with their child PhantomJS instances because otherwise we run out of memory quickly.
Please fix the bug and in the meantime offer a workaround for Maven jobs, if possible.
We get bitten by this too. A bunch of our jobs have a pre build step that basically does a:
ps aux | grep [the-thing-you-want-to-kill] | grep -v grep | awk '{ print $2 }'
and passes those PIDs to kill -9.
That kills every "zombie/parent-less" process BEFORE the build runs. This ensures the build will run in a relatively clean environment.
Just be careful what you kill
This is only a workaround, but it works for us.
Hope this helps!
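For reference, the pre-build cleanup step described above can be sketched as follows. For safety (and determinism) this sketch parses a captured `ps aux` snapshot instead of live output; in a real pre-build step you would replace the snapshot with `ps aux` and the final `echo` with `kill -9`. The snapshot contents and the `surefirebooter` pattern are illustrative.

```shell
# Captured 'ps aux' output standing in for the live command (illustrative).
SNAPSHOT='jenkins   4222  5.2  0.1 6102248 31612 ?  Sl  10:12  0:00 java -jar surefirebooter449566822541979931.jar
jenkins   4310  0.0  0.0  112812   972 ?  S+  10:13  0:00 grep sure'

# Same pipeline as in the comment above: match the pattern, drop the grep
# process itself, and pull out the PID column.
PIDS=$(echo "$SNAPSHOT" | grep surefirebooter | grep -v grep | awk '{ print $2 }')

# In a real pre-build step this would be: kill -9 $PIDS
echo "Would kill: $PIDS"
```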
After a bit of investigation, we found out that all of the processes that remain after a job has been terminated "abnormally" have init (PID 1) as their parent process. So if you kill the process tree that starts with a java process whose parent is init, you should be safe (and in my opinion safer than grepping for strings).
You could, for example, create a cron job for this task.
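A minimal sketch of that heuristic (leftover processes have been re-parented to PID 1). For determinism it scans a captured `ps` snapshot and only prints candidates; a real cleanup script would scan live `ps -eo pid,ppid,comm` output and kill the matches, for instance from a cron entry like `*/10 * * * * /path/to/cleanup.sh` (path illustrative).

```shell
# Captured 'ps -eo pid,ppid,comm' output (illustrative). PID 4222 is a java
# process whose parent is 1 (init), i.e. its original parent already died.
SNAPSHOT='  PID  PPID COMMAND
    1     0 init
 4222     1 java
 4500  4100 java
 4600     1 cron'

# Skip the header line, then keep java processes whose parent PID is 1.
ORPHANS=$(echo "$SNAPSHOT" | awk 'NR > 1 && $2 == 1 && $3 == "java" { print $1 }')

# A real cleanup script would 'kill' these; here we just report them.
echo "Orphaned java PIDs: $ORPHANS"
```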
I've kept my company on Jenkins version 1.586 this whole time. It does not suffer from this bug and cleans up all surefire and phantomjs processes nicely.
I have a simple repro case with a freestyle project and "Execute shell" build steps. I'm on Jenkins 1.625.3, Linux master, Linux slave.
I hit this because we have freestyle projects where several build processes execute in parallel, but use flock to gate access to a shared resource. Sometimes, when a job is aborted, one or more of these child processes persist and prevent future jobs on the slave from acquiring the shared lock. The flock stuff below isn't necessary to repro - just sleep gives similar behavior - but it matches my use case and makes it easy to identify affected processes with fuser.
Create a freestyle project with two build steps:
- Execute shell
#!/bin/bash -ex
nohup flock /var/lock/mylockfile sleep 1h &
- Execute shell
#!/bin/bash -ex
sleep 1h
Then abort the job (manually or by timeout). flock and its child sleep process persist, and continue to hold the lock.
This is the simplest project configuration I could construct. In all of these cases, the child processes are killed as expected:
- Omitting the second "Execute shell."
- Combining them into a single "Execute shell."
- Failing by means other than abort, e.g. /bin/false in the second "Execute shell."
Sample results below. While the job is running, the lock is in use as expected:
$ fuser /var/lock/mylockfile
22733 22734
$ ps -p 22733,22734 -o pid,ppid,stat,lstart,args
  PID  PPID STAT                  STARTED COMMAND
22733     1 S    Wed Feb 24 00:57:51 2016 flock /var/lock/mylockfile sleep 1h
22734 22733 S    Wed Feb 24 00:57:51 2016 sleep 1h
Then abort the job:
[experimental_jenkins_26048] $ /bin/bash -ex /tmp/hudson8042917752397215577.sh
+ nohup flock /var/lock/mylockfile sleep 1h
[experimental_jenkins_26048] $ /bin/bash -ex /tmp/hudson4924658810125221857.sh
+ sleep 1h
Build timed out (after 3 minutes). Marking the build as aborted.
Build was aborted
Finished: ABORTED
Afterwards, the processes are still alive:
$ ps -p 22733,22734 -o pid,ppid,stat,lstart,args
  PID  PPID STAT                  STARTED COMMAND
22733     1 S    Wed Feb 24 00:57:51 2016 flock /var/lock/mylockfile sleep 1h
22734 22733 S    Wed Feb 24 00:57:51 2016 sleep 1h
BUILD_ID is unchanged, so ProcessTreeKiller should find them:
$ strings /proc/22733/environ | grep BUILD_ID
BUILD_ID=17
$ strings /proc/22734/environ | grep BUILD_ID
BUILD_ID=17
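For context, ProcessTreeKiller identifies leftover processes on Linux by scanning each process's environment for the build's BUILD_ID, much like the `strings /proc/.../environ` check above. A rough sketch of that scan (assuming a Linux /proc filesystem; the demo BUILD_ID value is made up):

```shell
# Start a throwaway child carrying a distinctive BUILD_ID, standing in for a
# leftover build process; a real Jenkins build exports BUILD_ID itself.
BUILD_ID=demo26048 sleep 30 &
CHILD=$!
sleep 1

# Scan every process's environment (NUL-separated) for an exact match,
# roughly what ProcessTreeKiller does before killing the matches.
TARGET="BUILD_ID=demo26048"
MATCHES=$(for dir in /proc/[0-9]*; do
  if tr '\0' '\n' < "$dir/environ" 2>/dev/null | grep -qx "$TARGET"; then
    echo "${dir#/proc/}"
  fi
done)

echo "matched PIDs: $MATCHES"
kill "$CHILD"        # clean up the demo process
```

Since the transcript above shows the orphaned PIDs still carrying BUILD_ID=17, this scan would find them, which is what makes their survival after abort surprising.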
I faced the same bug with Jenkins 2.9 and some shell jobs started via pipelines. It does not always happen so I don't really know what is causing it.
Possibly related discussion: https://github.com/jenkinsci/docker/issues/54
Hi, I have the same problem on my windows server when build running on the slave machine.
Right now I have resolved by running builds criticism on the master.
I use Jenkins 2.11.
I have also seen Jenkins not killing off subprocesses when a job is aborted (regular, multiphase, and the new Pipeline plugin job types), using Jenkins 2.15, on Windows 10 systems. I'm happy to provide further machine specs & repro steps.
My test case was running 'python -c "import time;time.sleep(100000)"' and aborting/terminating the job (both) from Jenkins. Using pipeline.
This can leave file handles open and block subsequent attempts to build in the same workspace, causing failures.
claudiocurzi What is the build criticism on master you are referring to?
I have only builds using "Execute Windows Batch Command".
For all my builds that run on the slave server and are aborted by the timeout settings, the log shows the abort but the process on the slave server continues to run.
For us it seems to work fine when jobs are run on master, but processes are not killed when job runs on slave (Maven job type).
I recently encountered this issue in the context of using the build-timeout plugin to stop a maven build; I don't have an actual fix for the bug in the maven module, but was able to put together some code to make the build timeout plugin kill off appropriate child processes. See my comment on JENKINS-28125 if you think that code might be of use to you.
We are seeing this issue.
Linux master, Windows slaves.
No Maven. Just regular pipeline jobs, starting a background process like this:
bat "start MyProc.exe"
From my understanding the process should be killed when the job is done? But it keeps running.
tib, currently Pipeline sh/bat steps kill all processes when the build is interrupted, but not when the main process simply exits on its own. Could be filed under durable-task-plugin. Unrelated to this issue.
Just a summary of more or less recent changes on this front, on Windows:
- In Jenkins 2.34 (WinP 1.24) there were fixes to the process management logic on Windows: JENKINS-20913 and JENKINS-24453. Before this version, process termination is not reliable.
- In Jenkins 2.50+ (WinSW 2.0) there were many changes improving the process termination logic when the service gets terminated/restarted. Changelog: https://github.com/kohsuke/winsw/blob/master/CHANGELOG.md
- In Jenkins 2.50+ (WinSW 2.0) I added the Runaway Process Killer extension, which kills spawned processes if the Windows Service Wrapper executable gets aborted. In some cases it may also cause build processes to run away.
What is known to NOT work:
- Termination of 64-bit processes on Windows if your Jenkins master/agent runs on a 32-bit JRE/JDK. Won't fix, though it needs to be documented somewhere.
I feel for Windows the issue is more or less resolved. Or it needs retesting at least.
Regarding Unix OS, we need to understand if it still happens in modern Jenkins versions.
I am closing the issue according to the comment above. If you still experience the issue, please feel free to create follow-up issues for your cases.
Verified this is still happening in latest jenkins version on CentOS 6.
For the test I'm running java tests with the failsafe plugin, which launches multiple jvms for concurrent execution and many phantomjs processes.
1.586 - When build is stopped, all child processes are stopped
1.587 - When build is stopped, all child processes keep running
2.66.1 - When build is stopped, all child processes keep running
dbogardus Are you able to reproduce this issue with a simpler environment? What are the complete steps to reproduce from scratch?
When you write 2.66.1, do you mean 2.46.1?
danielbeck, I used the latest weekly version, 2.66.1 (it's just what came down from yum).
I am able to reproduce it with the simple java/mvn/surefire scenario created by rddesmond :
With this small example, 1.586 cleans up the surefire process, and 2.66.1 does not.
dbogardus, I propose to move the discussion back to the JENKINS-28968 thread. The Maven Project plugin is a very specific case in terms of architecture, so I would rather vote for handling it separately until we confirm there is a generic issue in the core. So far I am not convinced.
oleg_nenashev Acknowledged. I did try a few things to reproduce the bug outside of the maven/surefire scenario but was unable to.
Hi,
We're facing a similar issue - but we're using gradle and not using maven or surefire.
We use pipeline and call gradle using the bat step (windows 64 bit slaves, master is on windows too).
This happens on 2 separate projects in which gradle spawns child processes for testing using various other tools. Sometimes 'gradle test' hangs due to bugs in the test code/scenarios and when we stop the build - there are always java processes left up which we have to kill manually.
Verified that ProcessTreeKiller is not disabled.
Using Jenkins 2.73.3, pipeline suite 2.5.
Apparently the 32bit OS support was partially broken in WinP: https://github.com/kohsuke/winp/issues/48
Does anybody see the issue on 64bit systems?
I will have no time to work on it anytime soon, please see https://groups.google.com/d/msg/jenkinsci-dev/uc6NsMoCFQI/AIO4WG1UCwAJ for the context. I will unassign it so that somebody else can work on it
Hi,
I'm going to fix this bug; it is quite critical for my daily work with Jenkins.
oleg_nenashev it seems like it doesn't appear in 64 bit slaves.
Is there a solution in sight? Or is it solved by running a slave at 64 bit?
Is this specific to the Maven project type? Or to freestyle?