  Jenkins / JENKINS-55256

ssh-agent in pipeline leaves defunct processes on swarm client


Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • ssh-agent-plugin
    • None
    • Jenkins ver. 2.150.1, SSH Agent Plugin 1.17, Self-Organizing Swarm Plug-in Modules 3.15

    Description

      We run build nodes via Docker and the Swarm Plugin. After a while, defunct processes start to pile up on the nodes:

       

      1000 9386 9371 1 Dec18 ? 00:10:49 java -jar /usr/share/jenkins/swarm-client-3.15-jar-with-dependencies.jar -fsroot /var/jenkins-node_home -master https://jenkins.cosmos.local -username XXX -password XXX -executors 10 -mode exclusive -labels linux basic build -name basic-node-dc -disableSslVerification -description Basic node

      [root@XXX:/root]# ps -ef | grep defu 
      1000 2489 9386 0 Dec18 ? 00:00:00 [ssh-agent] <defunct> 
      1000 2514 9386 0 Dec18 ? 00:00:00 [ssh-agent] <defunct> 
      1000 2544 9386 0 Dec18 ? 00:00:00 [ssh-agent] <defunct> 
      1000 2618 9386 0 Dec18 ? 00:00:00 [ssh-agent] <defunct> 

      ...

       

      We run ssh-agent often, across many scripted pipelines, so it is hard to track this down to a specific pipeline, but this behavior shouldn't occur to begin with.
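To narrow down where the defunct processes come from, it can help to group them by parent PID and command name. The following is a minimal sketch, not part of the original report: the function name `zombies_by_parent` is hypothetical, and it assumes a procps-style `ps` that accepts `-eo pid=,ppid=,stat=,comm=`.

```shell
# Group defunct (zombie) processes by parent PID and command name.
# Reads lines of the form "PID PPID STAT COMMAND ..." on stdin and prints
# "COUNT PPID COMMAND", highest count first.
# zombies_by_parent is a hypothetical helper name, not an existing tool.
zombies_by_parent() {
  awk '$3 ~ /^Z/ { count[$2 " " $4]++ }
       END { for (k in count) print count[k], k }' |
    sort -k1,1nr -k2,2n -k3,3
}

# Usage on a live node (assumes a procps-style ps):
#   ps -eo pid=,ppid=,stat=,comm= | zombies_by_parent
```

A parent PID that dominates the output (9386, the swarm client, in the listing above) is the process that is not collecting its exited children.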

       

        Activity

          pmr Philipp Moeller added a comment - - edited

          I just confirmed that this happens during normal execution of code like:

          node('basic') {
            sshagent(['XXXXX']) {
              sh "echo foo"
            }
          }
          

          xmj Johannes Meixner added a comment -

          Hi,

          We're seeing these defunct processes independently of Swarm, with Docker 1.13.1 on RHEL7.

          basil Basil Crow added a comment -

          This bug doesn't seem to be specific to the SSH slaves plugin or the Swarm Plugin. I'm reassigning this to the SSH Agent plugin component.


          christian_wehrli Christian Wehrli added a comment -

          We're also seeing this in an OpenShift environment:
          OpenShift 3.10
          Jenkins 2.164.3
          SSH Agent Plugin 1.17

          The job runs without error, but every time a job finishes, a new zombie process appears on the underlying host.

          jglick Jesse Glick added a comment -

          If true, this could perhaps be reproduced using something like https://github.com/jenkinsci/durable-task-plugin/blob/28ad9826c25f57d58f8ded28b727f357c838d12a/src/test/java/org/jenkinsci/plugins/durabletask/BourneShellScriptTest.java#L526-L565

          ysmaoui Yacine added a comment - - edited

          Running in a k8s pod (Jenkins agent).

          At some point I can no longer start some processes in a pipeline because of this issue, so I periodically listed the zombie processes in a test pipeline that makes a lot of sh and ssh-agent calls:

          [2021-10-06T12:21:49.842Z] + echo 'Number of Zombie Processes:'
          [2021-10-06T12:21:49.842Z] Number of Zombie Processes:
          [2021-10-06T12:21:49.842Z] + ps axo pid=,stat=
          [2021-10-06T12:21:49.842Z] + awk '$2~/^Z/ { print }'
          [2021-10-06T12:21:49.842Z] + wc -l
          [2021-10-06T12:21:49.842Z] 1173
          [2021-10-06T12:21:49.842Z] + ps axo pid=,stat=,ppid=,command=
          [2021-10-06T12:21:49.842Z] + awk '$2~/^Z/ { print }'
          [2021-10-06T12:21:49.842Z]   152 Z        1 [sh] <defunct>
          [2021-10-06T12:21:49.842Z]   166 Z        1 [sh] <defunct>
          [2021-10-06T12:21:49.842Z]   289 Z        1 [sh] <defunct>
          [2021-10-06T12:21:49.842Z]   297 Z        1 [sh] <defunct>
          [2021-10-06T12:21:49.842Z]   308 Zs       1 [ssh-agent] <defunct>
          [2021-10-06T12:21:49.842Z]   314 Z        1 [sh] <defunct>
          [2021-10-06T12:21:49.842Z]   326 Z        1 [sh] <defunct>
          [2021-10-06T12:21:49.842Z]   338 Zs       1 [ssh-agent] <defunct>
          [2021-10-06T12:21:49.842Z]   344 Z        1 [sh] <defunct>
          [2021-10-06T12:21:49.842Z]   355 Z        1 [sh] <defunct>
          [2021-10-06T12:21:49.842Z]   367 Zs       1 [ssh-agent] <defunct>
          [2021-10-06T12:21:49.842Z]   373 Z        1 [sh] <defunct>
          [2021-10-06T12:21:49.842Z]   384 Z        1 [sh] <defunct>
          [2021-10-06T12:21:49.842Z]   396 Zs       1 [ssh-agent] <defunct>
          ... 

          sh and ssh-agent are leaving behind a lot of defunct processes (1173 zombies in the log above, for example). All of them have the same parent process, PID 1, which I can't kill.

          Is there a way to solve this, other than having to start a new (k8s pod) agent node?
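A zombie is removed from the process table only when its parent collects its exit status with wait(), so zombies with PPID 1 stay until whatever runs as PID 1 in the container reaps them. The sketch below is an illustration rather than something confirmed in this thread: the function name `pid1_zombie_report` is hypothetical, and it assumes a procps-style `ps`. It reports which command runs as PID 1 and how many zombies sit directly under it.

```shell
# Report the command running as PID 1 and the number of zombies parented to it.
# Reads lines of the form "PID PPID STAT COMMAND ..." on stdin.
# pid1_zombie_report is a hypothetical helper name, not an existing tool.
pid1_zombie_report() {
  awk '$1 == 1 { init = $4 }
       $3 ~ /^Z/ && $2 == 1 { z++ }
       END {
         printf "pid1=%s zombies_under_pid1=%d\n", (init == "" ? "unknown" : init), z
       }'
}

# Usage inside the agent container (assumes a procps-style ps):
#   ps -eo pid=,ppid=,stat=,comm= | pid1_zombie_report
```

If PID 1 turns out to be the agent JVM or a shell rather than an init process, one commonly used mitigation, offered here as an assumption and not verified against this issue, is to start the container with a small init that reaps orphaned children, for example via Docker's `--init` flag.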


          People

            Assignee: Unassigned
            Reporter: pmr Philipp Moeller
            Votes: 5
            Watchers: 10