Jenkins / JENKINS-49097

Ssh-agent-plugin doesn't kill ssh-agent in top-level matrix jobs

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Component/s: ssh-agent-plugin
    • Labels: None
    • Environment: Jenkins 2.32.3, ssh-agent-plugin 1.15

    Description

      Ssh-agent-plugin starts, but does not kill ssh-agent processes in top-level matrix jobs.

      00:00:00.052 [ssh-agent] Looking for ssh-agent implementation...
      00:00:00.167 [ssh-agent]   Exec ssh-agent (binary ssh-agent on a remote machine)
      00:00:00.189 $ ssh-agent
      00:00:00.278 SSH_AUTH_SOCK=/tmp/ssh-T6i78P9tKd5A/agent.28069
      00:00:00.278 SSH_AGENT_PID=28071
      00:00:00.278 [ssh-agent] Started.
      00:00:00.389 $ ssh-add /home/tcwg-buildslave/workspace/tcwg-upstream-monitoring_tmp/private_key_1495902688254844701.key
      00:00:00.408 Identity added: /home/tcwg-buildslave/workspace/tcwg-upstream-monitoring_tmp/private_key_1495902688254844701.key (/home/tcwg-buildslave/workspace/tcwg-upstream-monitoring_tmp/private_key_1495902688254844701.key)
      00:00:00.520 [ssh-agent] Using credentials tcwg-buildslave (buildslave for TCWG machines)
      00:00:00.542 Set build name.
      00:00:00.543 Triggering TCWG Upstream Monitoring » gcc-master,tcwg-x86_64-build
      00:00:05.545 Configuration TCWG Upstream Monitoring » gcc-master,tcwg-x86_64-build is still in the queue: Waiting for next available executor on tcwg-x86_64-build
      06:43:08.741 TCWG Upstream Monitoring » gcc-master,tcwg-x86_64-build completed with result FAILURE
      06:43:08.902 Set build name.
      06:43:08.905 Unrecognized macro 'branch' in '${branch} #399'
      06:43:08.907 Finished: FAILURE
      

      Since a top-level matrix job only spawns child jobs, it doesn't really need access to ssh-agent keys (note that SCM clones/checkouts use their own interface to ssh-agent-plugin).  Therefore ssh-agent-plugin could either not start ssh-agent for top-level matrix jobs at all, or terminate it during cleanup.  It is not clear why the existing cleanup code does not trigger for top-level matrix jobs.

      This issue causes thousands of ssh-agent processes to accumulate on busy systems.  To clean up these processes, one needs to wait until the system is idle, to avoid killing the few ssh-agent processes that are still in use.  Busy systems, unfortunately, are rarely idle.
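
As a stopgap until the plugin is fixed, leaked agents could be picked out by age instead of waiting for an idle system. The following is only a sketch under stated assumptions: it assumes a Linux `ps` (procps) that supports the `etimes` column, and that any ssh-agent older than the longest possible build is a leftover. The threshold, the helper names and the sample rows (apart from PID 28071, taken from the log above) are illustrative. It only reports PIDs and kills nothing.

```python
import subprocess

# Illustrative rows in the format printed by `ps -eo pid=,etimes=,comm=`:
# PID, age in seconds, command name. Only PID 28071 comes from the log above.
SAMPLE = """\
28071 90000 ssh-agent
28100 120 ssh-agent
28200 95000 java
"""


def parse_ps(output):
    """Parse `ps -eo pid=,etimes=,comm=` output into (pid, age_seconds, command) tuples."""
    rows = []
    for line in output.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3:
            rows.append((int(parts[0]), int(parts[1]), parts[2].strip()))
    return rows


def stale_agents(rows, max_age_seconds):
    """Return PIDs of ssh-agent processes older than max_age_seconds."""
    return [pid for pid, age, comm in rows
            if comm == "ssh-agent" and age > max_age_seconds]


def list_processes():
    """Run ps on the local machine (assumes procps `ps` with the `etimes` column)."""
    result = subprocess.run(["ps", "-eo", "pid=,etimes=,comm="],
                            capture_output=True, text=True, check=True)
    return result.stdout


# Assumption: no build runs longer than 24 hours, so older agents are leftovers.
print(stale_agents(parse_ps(SAMPLE), 24 * 3600))  # prints [28071]
```

On a real node, `stale_agents(parse_ps(list_processes()), threshold)` would give the candidate PIDs. The threshold has to exceed the longest build (the log above shows one running for more than six hours), otherwise an agent that is still in use would be listed too.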

          Activity

            maxim_kuvyrkov Maxim Kuvyrkov created issue -
            maxim_kuvyrkov Maxim Kuvyrkov made changes -
            Field Original Value New Value
            Description When a job with the {{SSHAgentBuildWrapper}} enabled fails very early (for instance during SCM checkout), an {{ssh-agent}} process is left behind. The issue is that the {{SSHAgentEnvironment}} is instantiated very early (from {{preCheckout}}), but its {{tearDown}} method is only called if execution reaches {{BuildExecution.doRun}} (which comes after the SCM checkout phase in {{AbstractBuildExecution.run}}).

            Before {{ssh-agent-plugin 1.14}}, there was no {{ssh-agent}} process, so the issue of some {{SSHAgentEnvironment}} not being torn down was less visible (though there was probably already a less obvious resource leak, with {{AgentServer}} not being properly closed).

            This kind of issue, with some {{Environment}} not being properly torn down, can happen whenever an {{Environment}} is instantiated not from {{BuildWrapper.setUp}} but from an earlier phase (like {{BuildWrapper.preCheckout}} or {{RunListener.setUpEnvironment}}). As such, maybe it should be fixed in core (perhaps in {{AbstractBuildExecution.run}}) rather than specifically in the {{ssh-agent-plugin}}; I don't know.

            I've written and attached a "generic workaround" {{RunListener}}, which tries to detect this situation from {{onComplete}} and calls {{tearDown}} for every {{Environment}} if that has not been done already. I am not proposing it for inclusion; it is code to exhibit the issue. If an ssh-agent-specific fix is desirable, a similar approach might be an option (but targeting {{SSHAgentEnvironment}} only).
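
The lifecycle gap described above can be illustrated with a toy model. This is not Jenkins code: the class and function names are hypothetical, and Python is used only for brevity. It assumes, as the text states, that an environment created before checkout is torn down only when the main build phase is reached, and it shows how a completion hook that tears down leftovers would close the gap.

```python
class Environment:
    """Stand-in for a build environment that holds a resource such as an ssh-agent."""

    def __init__(self, name):
        self.name = name
        self.torn_down = False

    def tear_down(self):
        self.torn_down = True


def run_build(checkout_fails, listeners=()):
    """Toy build: the environment is created early, teardown happens only in the main phase."""
    environments = [Environment("ssh-agent")]  # created early, as from a pre-checkout hook
    try:
        if checkout_fails:
            raise RuntimeError("SCM checkout failed")
        # Main phase reached: the normal path tears every environment down.
        for env in reversed(environments):
            env.tear_down()
    except RuntimeError:
        pass  # the build is marked failed; nothing here tears the environment down
    finally:
        # Completion hooks always run, whatever the outcome of the build.
        for listener in listeners:
            listener(environments)
    return environments


def tear_down_leftovers(environments):
    """Workaround hook: tear down anything that is still alive at completion."""
    for env in reversed(environments):
        if not env.torn_down:
            env.tear_down()


leaked = run_build(checkout_fails=True)
fixed = run_build(checkout_fails=True, listeners=[tear_down_leftovers])
print(leaked[0].torn_down, fixed[0].torn_down)  # prints False True
```

Without the hook, the early failure leaves the environment alive, which corresponds to the leaked process; with the hook, the same failure ends with the environment torn down.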

            maxim_kuvyrkov Maxim Kuvyrkov added a comment - fabo, FYI.
            maxim_kuvyrkov Maxim Kuvyrkov added a comment - stephenconnolly, FYI.
            jglick Jesse Glick made changes -
            Link This issue duplicates JENKINS-44877 [ JENKINS-44877 ]
            jglick Jesse Glick made changes -
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Resolved [ 5 ]

            People

              Assignee: Unassigned
              Reporter: Maxim Kuvyrkov (maxim_kuvyrkov)
              Votes: 1
              Watchers: 4
