Jenkins / JENKINS-26048

Jenkins no longer cleaning up child processes when build stopped - as of 1.587

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Components: core, maven-plugin
    • Environment: CentOS 6, JRE 7 or 8
      Windows Server 2008 R2, JRE 7 or 8

      Similar to JENKINS-22641, Jenkins is no longer cleaning up child processes when a job is stopped.

      To reproduce: start any job with a surefire execution; once the tests start running, stop the job through the Jenkins interface. The surefire process continues to run.

      Just like JENKINS-22641, this behavior started in 1.553, was resolved in 1.565.2, then started again in 1.587.

      Rating this as critical because processes build up, eventually causing the machine to run out of memory or hit the nproc limit.
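
      For anyone triaging: after aborting such a build, the leaked processes are easy to spot because they get reparented to init. A check along these lines (a sketch; the pattern matches the surefire repro above) lists them:

        # List surviving surefire forks after the abort; an orphaned fork
        # shows PPID 1 because it has been reparented to init.
        ps -eo pid,ppid,args | grep '[s]urefire'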

          [JENKINS-26048] Jenkins no longer cleaning up child processes when build stopped - as of 1.587

          Daniel Beck added a comment -

          Is this specific to the Maven project type? Or to freestyle?


          Don Bogardus added a comment -

          Only tested it with Maven projects.


          Daniel Beck added a comment -

          When you write it started in 1.587, do you mean that it worked in 1.586? Or do you not know when between 1.565.2 and 1.587 it started?


          Don Bogardus added a comment -

          The problem is NOT present pre-1.553, and also NOT present from 1.565.2 through 1.586.

          So yes, the problem is NOT occurring in 1.586, which is the version we reverted to.

          The problem IS present from 1.553 until fixed in 1.565.2; it then emerged again in 1.587 and continues in the current version.


          Daniel Beck added a comment -

          If you could help pinpoint this issue, that'd help a lot.

          • Is it only Maven or also Freestyle projects? Test the latter.
          • Is it occurring when building on master, or on slave nodes? Both?

          If you can reliably reproduce this, and know how to build Jenkins, a git bisect would help in pinpointing the responsible commit. Otherwise, a really idiot(me)-proof step-by-step instruction on how to reproduce and test for this when building e.g. Jenkins itself would be great.
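
          For reference, such a bisect would look roughly like the sketch below, assuming the jenkinsci/jenkins release tags and a manual repro check at each step:

            # Hypothetical bisect between the last known-good and first known-bad releases.
            git clone https://github.com/jenkinsci/jenkins.git && cd jenkins
            git bisect start jenkins-1.587 jenkins-1.586    # <bad> <good>
            # At each step: build the war, run the surefire repro against it,
            # then mark the result so git picks the next commit to test.
            mvn -pl war -am -DskipTests package
            git bisect good    # or: git bisect bad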


          Don Bogardus added a comment -

          The issue happens on both master and slaves, no difference there.

          For the freestyle test - would it tell us what we need to know if we created a freestyle project, but configured it to perform the same maven steps?


          Daniel Beck added a comment -

          For the freestyle test - would it tell us what we need to know if we created a freestyle project, but configured it to perform the same maven steps?

          Yes, that would be ideal.


          Don Bogardus added a comment -

          With the Holiday break, it will be a few weeks before I can work on this more.

          Has anyone else encountered this problem? Please chime in with details.


          Clifford Sanders added a comment -

          One of our maven jobs aborts because of a timeout during a test, but the process on the slave is never stopped.

          I created a freestyle job with an 'Invoke top-level Maven targets' build step. When that job aborts, the process on the slave is also stopped.

          During the holidays we updated from 1.574 to 1.594.


          Daniel Beck added a comment -

          Comment indicates this issue is specific to the Maven project type.


          Clifford Sanders added a comment -

          I'm not sure if this is related:

          I have configured my jobs to fail on timeout. This works in the freestyle project but not in the Maven project.
          The Maven-type build is marked as 'aborted'.


          Orgad Shaneh added a comment -

          Happens to me with a freestyle project with 1.598, Windows master, Linux slave (Debian Wheezy).

          When I abort a job, it appears as aborted in Jenkins, but keeps running.

          The following exception appears in the log.

          Project #362 aborted
          java.lang.InterruptedException
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:146)
          	at hudson.remoting.Channel.call(Channel.java:751)
          	at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:179)
          	at com.sun.proxy.$Proxy47.isAlive(Unknown Source)
          	at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:984)
          	at hudson.plugins.xshell.XShellBuilder.perform(XShellBuilder.java:140)
          	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
          	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:761)
          	at hudson.model.Build$BuildExecution.build(Build.java:199)
          	at hudson.model.Build$BuildExecution.doRun(Build.java:160)
          	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:536)
          	at hudson.model.Run.execute(Run.java:1718)
          	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
          	at hudson.model.ResourceController.execute(ResourceController.java:89)
          	at hudson.model.Executor.run(Executor.java:240)


          Sarah Woodall added a comment -

          Added XShell Plugin to components list, as it is mentioned in the stack trace.


          Sarah Woodall added a comment -

          Here is my stack trace. I'm executing a freestyle matrix job. My master is on a Mac, and this matrix job runs on multiple heterogeneous slaves (hence the use of the XShell plugin), but on this particular occasion it was a slave instance on a Mac that we wanted to abort. This is what appeared in the Jenkins log (with project name and slave name edited):
          INFO: TestMatrixMyProjectName/SLAVE=MyMacSlave #308 aborted
          java.lang.InterruptedException
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:146)
          	at hudson.remoting.Channel.call(Channel.java:742)
          	at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:168)
          	at com.sun.proxy.$Proxy54.isAlive(Unknown Source)
          	at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:961)
          	at hudson.plugins.xshell.XShellBuilder.perform(XShellBuilder.java:140)
          	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
          	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:756)
          	at hudson.model.Build$BuildExecution.build(Build.java:198)
          	at hudson.model.Build$BuildExecution.doRun(Build.java:159)
          	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:529)
          	at hudson.model.Run.execute(Run.java:1706)
          	at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
          	at hudson.model.ResourceController.execute(ResourceController.java:88)
          	at hudson.model.Executor.run(Executor.java:232)

          The processes on the Mac carried on running, even though it appeared from the Jenkins dashboard that the job had been aborted successfully.


          Sarah Woodall added a comment -

          I have now changed my jobs so that they don't use the XShell plugin (instead they use conditional build steps to run either a Windows BAT file or a Unix shell script). This has solved the problem with the child processes not being cleaned up. HOWEVER, I still get a stack trace in the Jenkins log file. This suggests to me that there are two separate problems here. The failure to clean up the child processes is definitely connected with the XShell plugin. The stacktrace is separate.

          Here's an example of the stacktrace I get now (this particular one is from a different slave):
          INFO: TestMatrixMyProjectName/SLAVE=MyUbuntuSlave #311 aborted
          java.lang.InterruptedException
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:146)
          	at hudson.remoting.Channel.call(Channel.java:742)
          	at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:168)
          	at com.sun.proxy.$Proxy73.join(Unknown Source)
          	at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:956)
          	at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:137)
          	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:97)
          	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:66)
          	at org.jenkinsci.plugins.conditionalbuildstep.BuilderChain.perform(BuilderChain.java:71)
          	at org.jenkins_ci.plugins.run_condition.BuildStepRunner$2.run(BuildStepRunner.java:110)
          	at org.jenkins_ci.plugins.run_condition.BuildStepRunner$Fail.conditionalRun(BuildStepRunner.java:154)
          	at org.jenkins_ci.plugins.run_condition.BuildStepRunner.perform(BuildStepRunner.java:105)
          	at org.jenkinsci.plugins.conditionalbuildstep.ConditionalBuilder.perform(ConditionalBuilder.java:133)
          	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
          	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:756)
          	at hudson.model.Build$BuildExecution.build(Build.java:198)
          	at hudson.model.Build$BuildExecution.doRun(Build.java:159)
          	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:529)
          	at hudson.model.Run.execute(Run.java:1706)
          	at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
          	at hudson.model.ResourceController.execute(ResourceController.java:88)
          	at hudson.model.Executor.run(Executor.java:232)


          Ryan Desmond added a comment - - edited

          I just marked JENKINS-28968 as a duplicate of this. Over there, I have a simple procedure for creating an offending Maven build type job that has this problem. One thing I noticed was that one process is killed after aborting.

          Just before aborting:

          $ ps aux | grep sure
          [user] 4220 0.0 0.0 113120 1188 ? S 10:12 0:00 /bin/sh -c cd /home/ussuser/jenkins/workspace/sleeptest && /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.51-2.4.5.5.el7.x86_64/jre/bin/java -jar /home/[user]/jenkins/workspace/sleeptest/target/surefire/surefirebooter449566822541979931.jar /home/[user]/jenkins/workspace/sleeptest/target/surefire/surefire7684357083774633779tmp /home/[user]/jenkins/workspace/sleeptest/target/surefire/surefire_01856690741005869733tmp
          [user] 4222 13.0 0.1 6102248 31668 ? Sl 10:12 0:00 java -jar /home/[user]/jenkins/workspace/sleeptest/target/surefire/surefirebooter449566822541979931.jar /home/[user]/jenkins/workspace/sleeptest/target/surefire/surefire7684357083774633779tmp /home/[user]/jenkins/workspace/sleeptest/target/surefire/surefire_01856690741005869733tmp

          After aborting:

          $ ps aux | grep sure
          [user] 4222 5.2 0.1 6102248 31612 ? Sl 10:12 0:00 java -jar /home/[user]/jenkins/workspace/sleeptest/target/surefire/surefirebooter449566822541979931.jar /home/[user]/jenkins/workspace/sleeptest/target/surefire/surefire7684357083774633779tmp /home/[user]/jenkins/workspace/sleeptest/target/surefire/surefire_01856690741005869733tmp


          Ryan Desmond added a comment -

          I had a chance to look at it a little more. I don't understand it fully, but I think the big difference is that the process tree is not referenced in the maven job. As such, we kill the maven instance itself, but child processes (such as surefire) aren't stopped. I'm not sure why maven isn't taking care of that itself – does Jenkins kill -9 maven?

          I think the other question is why Jenkins isn't using the process tree to kill the entire tree in maven jobs. It seems like it would be easy for a maven job to spin off a child (like surefire), which could exec another child.


          Alexander Kriegisch added a comment -

          This has been a massive impediment for our team for almost a year now. We have a story-branch workflow with 10+ jobs in parallel on average. Every time a build is aborted I have to manually log onto the server via SSH and kill Surefire or Failsafe instances along with their child PhantomJS instances, because otherwise we run out of memory quickly.

          Please fix the bug and, in the meantime, offer a workaround for Maven jobs, if possible.


          Alex Gray added a comment - - edited

          We get bitten by this too. A bunch of our jobs have a pre-build step that basically does a:

          ps aux | grep [the-thing-you-want-to-kill] | grep -v grep | awk "{ print \$2 }"
          

          and passes those PIDs to kill -9.

          That kills every "zombie/parent-less" process BEFORE the build runs. This ensures the build will run in a relatively clean environment.
          Just be careful what you kill.

          This is only a workaround, but it works for us.
          Hope this helps!
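
          A complete pre-build step implementing the above might look like the following sketch ("surefirebooter" is an example pattern; substitute whatever your builds leave behind):

            #!/bin/bash
            # Reap orphaned surefire forks left over from previously aborted builds,
            # so this build starts in a clean environment. Adjust the pattern with care.
            ps aux | grep '[s]urefirebooter' | awk '{ print $2 }' | xargs -r kill -9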


          Steffen Breitbach added a comment -

          After a bit of investigation, we found out that all of the processes that stay behind after a job has been terminated "abnormally" have init as their parent process. So if you kill the process tree that starts with a java process whose parent is init, you should be safe (and in my opinion safer than grepping for strings).

          You could, for example, create a cronjob for this task.
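          A cron entry along those lines might look like this sketch (review the match before using it - long-running java daemons are also parented to init):

            # Hypothetical /etc/cron.d/reap-jenkins-orphans: every 10 minutes, kill
            # java processes owned by the build user whose parent is init (PPID 1).
            */10 * * * * jenkins ps -u jenkins -o pid=,ppid=,comm= | awk '$2 == 1 && $3 == "java" { print $1 }' | xargs -r kill -9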

          Don Bogardus added a comment -

          I've kept my company on Jenkins version 1.586 this whole time. It does not suffer from this bug and cleans up all surefire and PhantomJS processes nicely.


          Patrick Mihelich added a comment -

          I have a simple repro case with a freestyle project and "Execute shell" build steps. I'm on Jenkins 1.625.3, Linux master, Linux slave.

          I hit this because we have freestyle projects where several build processes execute in parallel, but use flock to gate access to a shared resource. Sometimes, when a job is aborted, one or more of these child processes persist and prevent future jobs on the slave from acquiring the shared lock. The flock stuff below isn't necessary to repro - just sleep gives similar behavior - but it matches my use case and makes it easy to identify affected processes with fuser.

          Create a freestyle project with two build steps:

          1. Execute shell
            #!/bin/bash -ex
            nohup flock /var/lock/mylockfile sleep 1h &
            
          2. Execute shell
            #!/bin/bash -ex
            sleep 1h
            

          Then abort the job (manually or by timeout). flock and its child sleep process persist, and continue to hold the lock.

          This is the simplest project configuration I could construct. In all of these cases, the child processes are killed as expected:

          • Omitting the second "Execute shell."
          • Combining them into a single "Execute shell."
          • Failing by means other than abort, e.g. /bin/false in the second "Execute shell."

          Sample results below. While the job is running, the lock is in use as expected:

          $ fuser /var/lock/mylockfile
          22733 22734
          
          $ ps -p 22733,22734 -o pid,ppid,stat,lstart,args
            PID  PPID STAT                  STARTED COMMAND
          22733     1 S    Wed Feb 24 00:57:51 2016 flock /var/lock/mylockfile sleep 1h
          22734 22733 S    Wed Feb 24 00:57:51 2016 sleep 1h
          

          Then abort the job:

          [experimental_jenkins_26048] $ /bin/bash -ex /tmp/hudson8042917752397215577.sh
          + nohup flock /var/lock/mylockfile sleep 1h
          [experimental_jenkins_26048] $ /bin/bash -ex /tmp/hudson4924658810125221857.sh
          + sleep 1h
          Build timed out (after 3 minutes). Marking the build as aborted.
          Build was aborted
          Finished: ABORTED
          

          Afterwards, the processes are still alive:

          $ ps -p 22733,22734 -o pid,ppid,stat,lstart,args
            PID  PPID STAT                  STARTED COMMAND
          22733     1 S    Wed Feb 24 00:57:51 2016 flock /var/lock/mylockfile sleep 1h
          22734 22733 S    Wed Feb 24 00:57:51 2016 sleep 1h
          

          BUILD_ID is unchanged, so ProcessTreeKiller should find them:

          $ strings /proc/22733/environ | grep BUILD_ID
          BUILD_ID=17
          $ strings /proc/22734/environ | grep BUILD_ID
          BUILD_ID=17
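
          For comparison, ProcessTreeKiller's matching can be emulated by hand with a scan like this (a sketch assuming the BUILD_ID=17 value above):

            # Print every PID whose environment still carries the aborted build's
            # BUILD_ID -- roughly what ProcessTreeKiller looks for when reaping.
            for pid in $(ls /proc | grep -E '^[0-9]+$'); do
                if tr '\0' '\n' < "/proc/$pid/environ" 2>/dev/null | grep -qx 'BUILD_ID=17'; then
                    echo "$pid"
                fi
            done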
          


          Patrick Mihelich added a comment -

          Adding core component, since I can repro without using plugins.


          Sorin Sbarnea added a comment -

          I faced the same bug with Jenkins 2.9 and some shell jobs started via pipelines. It does not always happen, so I don't really know what is causing it.


          Daniel Beck added a comment -

          Possibly related discussion: https://github.com/jenkinsci/docker/issues/54


          Claudio Curzi added a comment - - edited

          Hi, I have the same problem on my Windows server when the build runs on the slave machine.

          For now I have worked around it by running the critical builds on the master.

          I use Jenkins 2.11.


          Jordan Stefanelli added a comment - - edited

          I have also seen Jenkins not killing off subprocesses when a job is aborted (regular, multiphase, and the new Pipeline plugin job types), using Jenkins 2.15 on Windows 10 systems. I'm happy to provide further machine specs and repro steps.

          My test case was running 'python -c "import time;time.sleep(100000)"' and aborting/terminating the job (both) from Jenkins. Using Pipeline.

          This can leave file handles open and block subsequent attempts to build in the same workspace, causing failures.

          claudiocurzi What do you mean by running the critical builds on the master?


          Claudio Curzi added a comment -

          I only have builds using "Execute Windows Batch Command".

          Whenever one of my builds runs on the slave server and is aborted by the timeout settings, the process on the slave server continues to run.


          Piotr Paczyński added a comment -

          For us it seems to work fine when jobs are run on master, but processes are not killed when the job runs on a slave (Maven job type).


          jchatham added a comment -

          I recently encountered this issue in the context of using the build-timeout plugin to stop a Maven build. I don't have an actual fix for the bug in the maven module, but I was able to put together some code to make the build-timeout plugin kill off the appropriate child processes. See my comment on JENKINS-28125 if you think that code might be of use to you.


          Timmy Brolin added a comment - - edited

          We are seeing this issue.
          Linux master, Windows slaves.
          No Maven. Just regular pipeline jobs, starting a background process like this:

          bat "start MyProc.exe"

          From my understanding, the process should be killed when the job is done, but it keeps running.


          Jesse Glick added a comment -

          tib currently Pipeline sh/bat steps kill all processes when the build is interrupted, but not when the main process simply exits on its own. Could be filed under durable-task-plugin. Unrelated to this issue.


          Oleg Nenashev added a comment -

          Just a summary of more or less recent changes on this front, on Windows:

          • In Jenkins 2.34 (WinP 1.24) there were fixes to the process management logic on Windows: JENKINS-20913 and JENKINS-24453. Before this version, process termination is not reliable.
          • In Jenkins 2.50+ (WinSW 2.0) there were many changes improving the process termination logic when the service gets terminated/restarted. Changelog: https://github.com/kohsuke/winsw/blob/master/CHANGELOG.md
          • In Jenkins 2.50+ (WinSW 2.0) I added the Runaway Process Killer extension, which kills spawned processes if the Windows Service Wrapper executable gets aborted. In some cases it may also cause build processes to run away.

          What is known to NOT work:

          • Termination of 64-bit processes on Windows if your Jenkins master/agent runs on a 32-bit JRE/JDK. Won't fix, though it needs to be documented somewhere.

          I feel the issue is more or less resolved on Windows, or at least needs retesting.

          Regarding Unix, we need to understand whether it still happens in modern Jenkins versions.


          Oleg Nenashev added a comment -

          I am closing the issue according to the comment above. If you still experience the issue, please feel free to create follow-up issues for your cases.


          Don Bogardus added a comment - - edited

          Verified this is still happening in the latest Jenkins version on CentOS 6.

          For the test I'm running Java tests with the failsafe plugin, which launches multiple JVMs for concurrent execution and many PhantomJS processes.

          1.586 - When the build is stopped, all child processes are stopped.

          1.587 - When the build is stopped, all child processes keep running.

          2.66.1 - When the build is stopped, all child processes keep running.


          Daniel Beck added a comment -

          dbogardus Are you able to reproduce this issue with a simpler environment? What are the complete steps to reproduce from scratch?

          When you write 2.66.1, do you mean 2.46.1?


          Don Bogardus added a comment - - edited

          danielbeck, I used the latest weekly version, 2.66.1. (It's just what came down from yum.)

          I am able to reproduce it with the simple java/mvn/surefire scenario created by rddesmond:

          https://issues.jenkins-ci.org/browse/JENKINS-28968?focusedCommentId=230600&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-230600

          With this small example, 1.586 cleans up the surefire process and 2.66.1 does not.


          Oleg Nenashev added a comment -

          dbogardus I propose to move the discussion back to the JENKINS-28968 thread. The Maven Project plugin is a very specific case in terms of architecture, so I would rather vote for handling it separately until we confirm there is a generic issue in the core. So far I am not convinced.


          Don Bogardus added a comment -

          oleg_nenashev Acknowledged. I did try a few things to reproduce the bug outside of the maven/surefire scenario, but was unable to.


          Mor L added a comment -

          Hi,

          We're facing a similar issue - but we're using gradle and not using maven or surefire.

          We use pipeline and call gradle using the bat step (windows 64 bit slaves, master is on windows too).

          This happens on 2 separate projects in which gradle spawns child processes for testing using various other tools. Sometimes 'gradle test' hangs due to bugs in the test code/scenarios, and when we stop the build there are always java processes left running which we have to kill manually.

          Verified that ProcessTreeKiller is not disabled.

          Using Jenkins 2.73.3, pipeline suite 2.5.
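
          For reference, core's tree killing is switched off only by the hudson.util.ProcessTree.disable system property; on a Unix node the agent's command line can be checked like this (a sketch - on Windows, inspect the service's arguments instead):

            # ProcessTreeKiller is disabled only if the JVM was launched with
            # -Dhudson.util.ProcessTree.disable=true; look for it on the command line:
            ps -ef | grep '[j]ava' | grep -o 'hudson.util.ProcessTree.disable=[^ ]*'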


          Oleg Nenashev added a comment -

          Apparently 32-bit OS support was partially broken in WinP: https://github.com/kohsuke/winp/issues/48
          Does anybody see the issue on 64-bit systems?


          Oleg Nenashev added a comment -

          I will have no time to work on this anytime soon; please see https://groups.google.com/d/msg/jenkinsci-dev/uc6NsMoCFQI/AIO4WG1UCwAJ for context. I will unassign it so that somebody else can work on it.


          Piotr Zolnacz added a comment -

          Hi,

          I'm going to fix this bug; it is quite critical for my daily work with Jenkins.

          oleg_nenashev It seems like it doesn't appear on 64-bit slaves.


          Johannes Schmieder added a comment -

          Is there a solution in sight? Or is it solved by running the slave as 64-bit?


            Assignee: Piotr Zolnacz (pi0tras)
            Reporter: Don Bogardus (dbogardus)
            Votes: 39
            Watchers: 39