
ssh-agent-plugin leaking some ssh-agent processes

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component/s: core, ssh-agent-plugin
    • Labels: None
    • Environment: Jenkins 2.32.3, 2.190.2
      ssh-agent-plugin 1.15, 1.17
    • Released As: Jenkins 2.257

      When a job with the SSHAgentBuildWrapper enabled fails very early (for instance during SCM checkout), an ssh-agent process is left behind. The SSHAgentEnvironment is instantiated very early (from preCheckout), but its tearDown method is only called if execution reaches BuildExecution.doRun, which comes after the SCM checkout phase in AbstractBuildExecution.run.

      Before ssh-agent-plugin 1.14 there was no ssh-agent process, so an SSHAgentEnvironment not being torn down was less visible (though there was probably already a less obvious resource leak, with the AgentServer not being properly closed).

      This kind of issue, where an Environment is not properly torn down, can happen whenever the Environment is instantiated not from BuildWrapper.setUp but from an earlier phase (like BuildWrapper.preCheckout or RunListener.setUpEnvironment). As such, maybe this should be fixed in core (maybe in AbstractBuildExecution.run) rather than specifically in the ssh-agent-plugin; I don't know.

      I've written and attached a "generic workaround" RunListener, which tries to detect this situation from onCompleted and calls tearDown on every Environment for which that has not been done already. It's not something I propose for inclusion, but rather some code to exhibit the issue. If an ssh-agent-specific fix is desirable, a similar approach might be an option (but targeting SSHAgentEnvironment only).
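      To make the lifecycle problem easier to follow, here is a minimal, self-contained sketch of it. It does not use any Jenkins classes: Env, Build and onCompleted are hypothetical stand-ins for Environment, the build execution and the workaround listener, and the attached RunListener is not reproduced here. It only shows why a failure between the pre-checkout phase and doRun leaves a resource behind, and how a completion hook could clean it up.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of the leak; none of these types are real Jenkins classes. */
class EarlyFailureLeakSketch {

    /** Hypothetical stand-in for an Environment owning an external resource. */
    static class Env {
        boolean tornDown = false;
        void tearDown() { tornDown = true; }
    }

    /** Hypothetical build whose environments can be registered before checkout. */
    static class Build {
        final List<Env> environments = new ArrayList<>();
        final boolean checkoutFails;

        Build(boolean checkoutFails) { this.checkoutFails = checkoutFails; }

        void run() {
            try {
                environments.add(new Env());   // "preCheckout": the resource starts here
                if (checkoutFails) {
                    throw new IllegalStateException("checkout failed");
                }
                doRun();                       // only reached when checkout succeeded
            } catch (IllegalStateException ex) {
                // The build is marked as failed, but nothing tears the environments down.
            }
        }

        private void doRun() {
            // Build steps would run here; teardown happens only at the end of this phase.
            for (Env env : environments) {
                env.tearDown();
            }
        }
    }

    /** Workaround in the spirit of the description: clean up leftovers on completion. */
    static void onCompleted(Build build) {
        for (Env env : build.environments) {
            if (!env.tornDown) {
                env.tearDown();
            }
        }
    }

    /** Number of environments that were set up but never torn down. */
    static long leaked(Build build) {
        return build.environments.stream().filter(env -> !env.tornDown).count();
    }

    public static void main(String[] args) {
        Build failing = new Build(true);
        failing.run();
        System.out.println("leaked after early failure: " + leaked(failing));
        onCompleted(failing);
        System.out.println("leaked after workaround: " + leaked(failing));
    }
}
```

      In this model a build that passes checkout leaks nothing, while a build that fails during checkout keeps one environment alive until the completion hook runs.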

          [JENKINS-43889] ssh-agent-plugin leaking some ssh-agent processes

          Deepak Raut added a comment -

          I'm facing the same leftover ssh-agent process in version 1.15, but in a different scenario. In a multi-configuration project, the main (parent) job starts an ssh-agent at the beginning but does not stop it at the end. Each individual configuration starts its agent at the beginning and stops it at the end, but the same does not happen for the main job.

          Thomas de Grenier de Latour added a comment - edited

          I had kind of forgotten about this issue, because we've been using a RunListener to work around it, similar to the one I had already attached. If anyone is interested, the report is still relevant (just checked with the ssh-agent plugin code from master and Jenkins 2.190.2).

          Here is a failing test case one can try, to be added in SSHAgentBuildWrapperTest:

              @Issue("JENKINS-43889")
              @Test
              public void sshAgentStoppedOnEarlyBuildFailure() throws Exception {
                  List<String> credentialIds = new ArrayList<String>();
                  credentialIds.add(CREDENTIAL_ID);

                  // register the SSH key that the build wrapper will load into the agent
                  SSHUserPrivateKey key = new BasicSSHUserPrivateKey(CredentialsScope.GLOBAL, credentialIds.get(0), "cloudbees",
                          new BasicSSHUserPrivateKey.DirectEntryPrivateKeySource(getPrivateKey()), "cloudbees", "test");
                  SystemCredentialsProvider.getInstance().getCredentials().add(key);
                  SystemCredentialsProvider.getInstance().save();

                  FreeStyleProject job = r.createFreeStyleProject("I_will_die_during_SCM_checkout");
                  job.setAssignedNode(r.createSlave());

                  SSHAgentBuildWrapper sshAgent = new SSHAgentBuildWrapper(credentialIds, false);
                  job.getBuildWrappersList().add(sshAgent);

                  // make sure this job fails during SCM checkout
                  job.setScm(new FailingSCM());

                  Future<? extends FreeStyleBuild> build = job.scheduleBuild2(0);
                  r.assertBuildStatus(Result.FAILURE, build);
                  // the agent must be reported as started and as stopped, even though the build failed early
                  r.assertLogContains(Messages.SSHAgentBuildWrapper_Started(), build.get());
                  r.assertLogContains(Messages.SSHAgentBuildWrapper_Stopped(), build.get());
              }

              static class FailingSCM extends SCM {
                  @Override
                  public ChangeLogParser createChangeLogParser() {
                      return null;
                  }
                  // default implementation of checkout(...) method will fail, that's what we want
              }

          (You will then have some `ssh-agent` processes to kill after running this test.)

          I'm still not sure where this should get fixed:

          • either in core, by moving the Environment.tearDown calls up from BuildExecution.doRun to AbstractBuildExecution.run
          • or in the ssh-agent plugin itself, if what it does (i.e., adding an Environment to the build from its BuildWrapper.preCheckout implementation rather than from BuildWrapper.setUp, so that it's already set up during SCM checkout) is really bad/unsupported
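          As a rough illustration of the first option (teardown owned by the outer run method rather than by the later doRun phase), here is a small self-contained sketch. It is not the actual core code and not necessarily what a real patch would look like: Env and run are hypothetical names, and the failure is simulated with an exception.

```java
import java.util.ArrayList;
import java.util.List;

/** Model of teardown done by the outer run method; not actual Jenkins code. */
class OuterTearDownSketch {

    /** Hypothetical stand-in for an Environment. */
    static class Env {
        boolean tornDown;
        void tearDown() { tornDown = true; }
    }

    /** Runs a simulated build and returns the environments it created. */
    static List<Env> run(boolean checkoutFails) {
        List<Env> environments = new ArrayList<>();
        try {
            environments.add(new Env());   // set up before checkout
            if (checkoutFails) {
                throw new IllegalStateException("checkout failed");
            }
            // build steps would run here
        } catch (IllegalStateException ex) {
            // failure would be recorded on the build here
        } finally {
            // Tear down whatever was set up, in reverse order, on every exit path.
            for (int i = environments.size() - 1; i >= 0; i--) {
                environments.get(i).tearDown();
            }
        }
        return environments;
    }
}
```

          With the teardown in a finally block at this level, it no longer matters whether the failure happens before or after the phase that used to own the cleanup.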

          Thomas de Grenier de Latour added a comment -

          To be extra clear in my explanations, here is how the ssh-agent gets launched, starting from AbstractBuild.AbstractBuildExecution#run:

          • AbstractBuild.AbstractBuildExecution#run(...) - AbstractBuild.java#L498
          • SCMCheckoutStrategy#preCheckout(...) - SCMCheckoutStrategy.java#L76
          • SSHAgentBuildWrapper#preCheckout(...) - SSHAgentBuildWrapper.java#L228
          • SSHAgentBuildWrapper#createSSHAgentEnvironment(...) - SSHAgentBuildWrapper.java#L248
          • SSHAgentBuildWrapper.SSHAgentEnvironment#SSHAgentEnvironment(...) - SSHAgentBuildWrapper.java#L363

          And here is how it gets stopped (when it does), again starting from AbstractBuild.AbstractBuildExecution#run (a few lines below):

          • AbstractBuild.AbstractBuildExecution#run(...) - AbstractBuild.java#L504
          • Build#doRun(...) - Build.java#L174
          • SSHAgentBuildWrapper.SSHAgentEnvironment#tearDown(...) - SSHAgentBuildWrapper.java#L417

          Thomas de Grenier de Latour added a comment -

          Added "core" to Component/s, because I really don't know who's wrong here (the plugin code or Jenkins code).

          Thomas de Grenier de Latour added a comment -

          Revisiting this old issue again, I realize that SimpleBuildWrapper exposes a variation of what's done in the SSHAgentBuildWrapper, allowing an Environment to be set up before checkout: SimpleBuildWrapper#runPreCheckout

          If such a SimpleBuildWrapper provides a Disposer and is run as part of an AbstractBuild which fails early (in SCM checkout), then the disposer won't be called.

          This flaw is again not documented (just like Environment#tearDown says nothing about not always being called on build failure), and makes me lean toward the "it's a core bug" interpretation of this issue.

          I will submit a PR with a test case (using SimpleBuildWrapper), and a possible fix.

          Thomas de Grenier de Latour added a comment - PR submitted: https://github.com/jenkinsci/jenkins/pull/4517

          Oleg Nenashev added a comment -

          This ticket was mis-categorized in https://www.jenkins.io/changelog/#v2.257 as a Developer fix, while it looks like a real defect which would benefit from backporting.

          Oleg Nenashev added a comment - - edited

          Backporting note: the fix introduces new API. Maybe it makes sense to mark that API as restricted while backporting:

          rfe: Developer: new static utility method Result#combine(Result,Result) to get the worst of two (nullable) build results
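          For readers unfamiliar with the idea, a null-tolerant "worst of two results" helper can be sketched as below. This is only an illustration of the behaviour described in the changelog entry, not the Jenkins implementation: the Outcome enum and its ordering (declared from best to worst) are assumptions made for the example.

```java
/** Illustration of a null-tolerant "worst of two" helper; not the Jenkins code. */
class ResultCombineSketch {

    /** Hypothetical result scale, declared from best to worst. */
    enum Outcome { SUCCESS, UNSTABLE, FAILURE, NOT_BUILT, ABORTED }

    /** Returns the worse of the two outcomes, treating null as "no result yet". */
    static Outcome combine(Outcome a, Outcome b) {
        if (a == null) {
            return b;
        }
        if (b == null) {
            return a;
        }
        // Enum order encodes severity, so the later constant is the worse one.
        return a.compareTo(b) >= 0 ? a : b;
    }
}
```

          Such a helper is handy when a result computed by the build steps has to be merged with a failure raised later, for instance during teardown, without an explicit null check at every call site.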

          Oliver Gondža added a comment -

          Not relevant given the choice of LTS baseline: 2.263

            Assignee: tom_gl Thomas de Grenier de Latour
            Reporter: tom_gl Thomas de Grenier de Latour
            Votes: 4
            Watchers: 5
