
ssh-agent in pipeline leaves defunct processes on swarm client

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: ssh-agent-plugin
    • Labels: None
    • Environment: Jenkins ver. 2.150.1, SSH Agent Plugin 1.17, Self-Organizing Swarm Plug-in Modules 3.15

      We run build nodes via Docker and the Swarm Plugin. After a while, defunct processes start to pile up on the nodes:

       

      1000 9386 9371 1 Dec18 ? 00:10:49 java -jar /usr/share/jenkins/swarm-client-3.15-jar-with-dependencies.jar -fsroot /var/jenkins-node_home -master https://jenkins.cosmos.local -username XXX -password XXX -executors 10 -mode exclusive -labels linux basic build -name basic-node-dc -disableSslVerification -description Basic node

      [root@XXX:/root]# ps -ef | grep defu 
      1000 2489 9386 0 Dec18 ? 00:00:00 [ssh-agent] <defunct> 
      1000 2514 9386 0 Dec18 ? 00:00:00 [ssh-agent] <defunct> 
      1000 2544 9386 0 Dec18 ? 00:00:00 [ssh-agent] <defunct> 
      1000 2618 9386 0 Dec18 ? 00:00:00 [ssh-agent] <defunct> 

      ...

       

      We run ssh-agent frequently across many scripted Pipelines, so it is hard to trace the problem back to a specific Pipeline, but this behavior shouldn't occur in the first place.

       

          [JENKINS-55256] ssh-agent in pipeline leaves defunct processes on swarm client

          Philipp Moeller added a comment - edited

          I just confirmed that this happens during normal execution of code like:

          node('basic') {
              sshagent(['XXXXX']) {
                  sh "echo foo"
              }
          }


          Johannes Meixner added a comment -

          Hi,

          We're seeing these defunct processes independently of Swarm, with Docker 1.13.1 on RHEL7.

          Basil Crow added a comment -

          This bug doesn't seem to be specific to the SSH slaves plugin or the Swarm Plugin. I'm reassigning this to the SSH Agent plugin component.


          Christian Wehrli added a comment -

          We're also seeing this in an OpenShift environment:
          OpenShift 3.10
          Jenkins 2.164.3
          SSH Agent Plugin 1.17

          The job runs without errors, but every time a job finishes, a new zombie process is born on the underlying host.

          Jesse Glick added a comment -

          If true, could perhaps be reproduced using something like https://github.com/jenkinsci/durable-task-plugin/blob/28ad9826c25f57d58f8ded28b727f357c838d12a/src/test/java/org/jenkinsci/plugins/durabletask/BourneShellScriptTest.java#L526-L565

          Yacine added a comment - edited

          Running the Jenkins agent in a Kubernetes pod.

          At some point I am not able to start new processes in a pipeline because of this issue, so I tried listing the zombie processes every now and then in a test pipeline that makes a lot of sh and ssh-agent calls:

          [2021-10-06T12:21:49.842Z] + echo 'Number of Zombie Processes:'
          [2021-10-06T12:21:49.842Z] Number of Zombie Processes:
          [2021-10-06T12:21:49.842Z] + ps axo pid=,stat=
          [2021-10-06T12:21:49.842Z] + awk '$2~/^Z/ { print }'
          [2021-10-06T12:21:49.842Z] + wc -l
          [2021-10-06T12:21:49.842Z] 1173
          [2021-10-06T12:21:49.842Z] + ps axo pid=,stat=,ppid=,command=
          [2021-10-06T12:21:49.842Z] + awk '$2~/^Z/ { print }'
          [2021-10-06T12:21:49.842Z]   152 Z        1 [sh] <defunct>
          [2021-10-06T12:21:49.842Z]   166 Z        1 [sh] <defunct>
          [2021-10-06T12:21:49.842Z]   289 Z        1 [sh] <defunct>
          [2021-10-06T12:21:49.842Z]   297 Z        1 [sh] <defunct>
          [2021-10-06T12:21:49.842Z]   308 Zs       1 [ssh-agent] <defunct>
          [2021-10-06T12:21:49.842Z]   314 Z        1 [sh] <defunct>
          [2021-10-06T12:21:49.842Z]   326 Z        1 [sh] <defunct>
          [2021-10-06T12:21:49.842Z]   338 Zs       1 [ssh-agent] <defunct>
          [2021-10-06T12:21:49.842Z]   344 Z        1 [sh] <defunct>
          [2021-10-06T12:21:49.842Z]   355 Z        1 [sh] <defunct>
          [2021-10-06T12:21:49.842Z]   367 Zs       1 [ssh-agent] <defunct>
          [2021-10-06T12:21:49.842Z]   373 Z        1 [sh] <defunct>
          [2021-10-06T12:21:49.842Z]   384 Z        1 [sh] <defunct>
          [2021-10-06T12:21:49.842Z]   396 Zs       1 [ssh-agent] <defunct>
          ... 

          sh and ssh-agent are leaving behind a lot of defunct processes (1173 zombies in the log above), and all of them have the same parent process, PID 1, which I can't kill.

          Is there a way to solve this, other than starting a new (k8s-pod) agent node?
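
          For reference, the commands traced in the log above (the lines prefixed with +) boil down to the following shell snippet for counting and listing zombie processes; the grouping into one snippet is a reconstruction, the ps/awk invocations themselves are taken directly from the trace:

          # Count processes whose state starts with Z (zombie/defunct)
          echo 'Number of Zombie Processes:'
          ps axo pid=,stat= | awk '$2 ~ /^Z/ { print }' | wc -l

          # List PID, state, parent PID and command for each zombie
          ps axo pid=,stat=,ppid=,command= | awk '$2 ~ /^Z/ { print }'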


          Stephen added a comment - edited

          I have noticed this myself running the Jenkins agent as a k8s pod, not using the ssh-agent plugin at all; I am just adding my findings here for anyone who comes across this.

          You see in the logs:

          java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached

          Although it says "out of memory", memory does not seem to be the real problem; it is process related (as the end of the message suggests), and it appears to be expected behaviour. You can reproduce it by setting up a pod running the agent and a very simple pipeline:

          node('test') {
              stage('ls') {
                  sh 'ls'
              }
          }

          If you have Jenkins run this on your newly set-up agent, you will see the defunct sh process. It is the same whether you use a scripted or a declarative pipeline, but it doesn't happen if you use a freestyle project.

          I am pretty sure it is down to the durable-task plugin (which is a dependency of Pipeline: Nodes and Processes, amongst others: https://plugins.jenkins.io/durable-task/dependencies/). Its changelog mentioned back in 2019: "This means that there is an expectation of orphaned-child cleanup (i.e. zombie-reaping) within the underlying environment." Java runs as PID 1 on the inbound agent, and the JVM does not do any cleanup of zombie processes. If you look at the Jenkins controller, PID 1 is tini; the author of tini explains why here: https://github.com/krallin/tini/issues/8, and it is also mentioned in the documentation for the Docker official images: https://github.com/docker-library/official-images?tab=readme-ov-file#init

          So it seems this is a known issue, as the Jenkins controller image has been using tini for at least 9 years; I am not sure why the agent image isn't.

          You can see it being used as the entrypoint for the controller image here: https://hub.docker.com/layers/jenkins/jenkins/2.492.1-lts-jdk17/images/sha256-d09e6172a0c88f41c56a7d98bbc1817aeb8d3086e70e8bd2b2640502ceb30f3b

          So I changed my inbound agent image to include tini and use

          ENTRYPOINT ["/usr/local/bin/tini", "--", "jenkins-agent"]

          and this solves the issue.
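
          For anyone wanting to apply the same workaround, here is a minimal Dockerfile sketch (not the official image): it assumes a Debian/Ubuntu-based jenkins/inbound-agent base image and the distro tini package, which installs to /usr/bin/tini; adjust the tag and path to your setup (the comment above used /usr/local/bin/tini).

          # Sketch only: add tini to a custom inbound agent image so PID 1 reaps zombies
          FROM jenkins/inbound-agent:latest

          USER root
          # Debian/Ubuntu-based images can install tini from the distro packages
          RUN apt-get update \
              && apt-get install -y --no-install-recommends tini \
              && rm -rf /var/lib/apt/lists/*
          USER jenkins

          # Run the agent under tini so orphaned children (defunct/zombie processes) get reaped
          ENTRYPOINT ["/usr/bin/tini", "--", "jenkins-agent"]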


          Basil Crow added a comment -

          rebelinblue Makes sense and thanks for tracking this down. Can you please file a PR (or at least an issue) at https://github.com/jenkinsci/docker-agent to follow up on this discovery?


          Stephen added a comment -

          basil

          Interestingly, looking at that, there was a PR which was rejected (https://github.com/jenkinsci/docker-agent/issues/714), and https://github.com/jenkinsci/docker-agent/issues/325#issuecomment-1329520608 implies the Java agent process should deal with it now.


          Basil Crow added a comment -

          rebelinblue That rejection seems incorrect to me.


            Assignee: Unassigned
            Reporter: Philipp Moeller (pmr)
            Votes: 5
            Watchers: 10