[JENKINS-28182] Revisit use of $JENKINS_SERVER_COOKIE and Launcher.kill

      PlaceholderExecutable needs an environment variable to add to the context to ensure that Launcher.kill, called when the block exits, will clean up any stray external processes run on this node.

      Originally the name of this variable was chosen to be JENKINS_SERVER_COOKIE, which is set by Job (to a confidential hex key) and normally used by AbstractBuild for this purpose. Unlike in that case, for Workflow the value is overridden (to a random UUID) for each node block, since there may be several. I think the intent was to use the same variable name to suppress any unwanted kills coming from Jenkins core code, but there seem to be none. Anyway after the many refactorings of environment variable handling in Workflow, it turns out this is dead code!

      • FileMonitoringTask sets its own value for JENKINS_SERVER_COOKIE, to durable-<workspaceHash>, which overrides any other value, and is used by stop. So within a shell step, $JENKINS_SERVER_COOKIE is the one from durable-task, not PlaceholderExecutable.
      • env.JENKINS_SERVER_COOKIE picks up the value from Job, due to the precedence order in getEffectiveEnvironment (see the sketch below).
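
      For example (a minimal sketch only to illustrate the precedence described in the bullets above; the values in the comments are what the bullets imply, not verified output):

      node {
        sh 'echo "shell sees $JENKINS_SERVER_COOKIE"'   // durable-<workspaceHash>, set by FileMonitoringTask
        echo "Groovy sees ${env.JENKINS_SERVER_COOKIE}" // the confidential hex key set by Job
      }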

      So it seems that while the intention is still good (we would like a final check that any processes started inside node are killed when the block exits), the implementation is not right. Probably PlaceholderExecutable just needs to pick an unrelated environment variable for this purpose. And this logic needs to be tested; for example:

      def file = new File(...) // some agreed path on the agent, outside the control of the node block
      node {
        // background process should be reaped by Launcher.kill when the block exits
        sh "(sleep 5; touch $file) &"
      }
      sleep 10 // long enough for the stray process to have touched the file, had it survived
      assert !file.exists()
      


          Jesse Glick added a comment -

          JENKINS-28131 also discusses passing some other environment variables from PlaceholderExecutable, so would be easy to do at the same time.


          Jesse Glick added a comment -

          The use of JENKINS-25938 in the lts-609 branch also moves the Launcher.kill to the finish method (where it belonged to begin with), and moves the definition of COOKIE_VAR, so there could be merge conflicts if done before that branch is merged.


          Matthew Mitchell added a comment -

          This is a big issue in some deployments (Windows especially) that launch VC++ and C# compiler processes, since these sometimes want to leave processes around after execution. This can cause failures down the road. I'm looking into this issue now based on the info above.

          Jesse Glick added a comment -

          mmitche I think your comment is misplaced. Fixing this architectural / code-style issue should not change any user-visible behavior.

          Probably what you are asking for is accomplished simply by unsetting the environment variable for selected processes spawned by your script.


          Matthew Mitchell added a comment -

          Actually it does have user-visible behavior. I did a bunch of experimentation and instrumentation of Jenkins.

          The process killer is looking for the original value of JENKINS_SERVER_COOKIE (some hash), but all processes within the node get the durable-<hash> value, including the problematic ones I'm seeing. I understand the intention here: multiple node {} blocks running concurrently on the same node for the same pipeline job would be problematic if reaching the end of one node block killed the processes of the other blocks still running.

          However, what this means is that nothing is actually cleaned up except what is normally killed by parent processes exiting. On Windows this is more problematic because of the ease of breaking parent->child relationships.

          Jesse Glick added a comment -

          IOW, the killer logic is just broken. I would need to spend some time on it in a debugger to confirm.

          Matthew Mitchell added a comment -

          I can take this over the next day or two (it's necessary for us to roll this out). It might just mean building more logic around what the process environment variable match needs to be when dealing with pipeline jobs.

          Jesse Glick added a comment -

          I suggested the probable fix in the issue description, along with a sketch of a test.


          Matthew Mitchell added a comment -

          jglick your hunch was correct. Fixed in https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/39 and added a test case.

          Matthew Mitchell added a comment -

          Also, as an aside, the original motivation for this was to fix our workspace cleanup call at the end of node steps; we were having processes left around. However, after fixing this bug, it's clear that this won't fix that issue, since we actually need the cleanup to happen after killing child processes. Is there a hook (through a plugin or otherwise) that can ensure that WsCleanup runs after process cleanup?

          Jesse Glick added a comment -

          Well the process cleanup should happen at the end of the sh/bat step, in addition to or instead of at the end of the node block.


          Matthew Mitchell added a comment -

          Interesting. I don't think I was seeing it at the end of an sh/bat step; let me check that. But you're right, this should fix the overall issue if processes were killed at the end of each sh/bat step.

          Matthew Mitchell added a comment -

          jglick The kill here is in ExecutorStepExecution, which is scoped to the node level, not the step level, so the process killing happens at the end of the node block. There is no kill for DurableStep currently. My take is that this is the desired behavior, but I'm open to changing it. Killing at the step level would certainly solve the problem, but might have unintended consequences for some users' pipeline workflows.

          Jesse Glick added a comment -

          Hmm, probably FileMonitoringController.cleanup should be doing this.


          Reagan Elm added a comment -

          mmitche is there any way to opt out of Launcher.kill for a specific node? Some of our jobs intentionally start long-running scripts that need to survive after the node has completed.


          Jesse Glick added a comment -

          Clear the environment I suppose.


          Matthew Mitchell added a comment -

          The documented way was always to alter the environment variable that is being checked. For instance, in Freestyle jobs you'd do:

          JENKINS_SERVER_COOKIE=do_not_kill

          and in Pipeline jobs you'd do:

          JENKINS_NODE_COOKIE=do_not_kill
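
          For example, to keep a background process alive past the end of a node block in a Pipeline job, the cookie can be overridden just around the launching step (a minimal sketch; start_server.sh is only a placeholder for whatever you actually launch):

          node {
            // Overriding the cookie means the process killer no longer matches these processes.
            withEnv(['JENKINS_NODE_COOKIE=do_not_kill']) {
              sh 'nohup ./start_server.sh >server.log 2>&1 &' // start_server.sh is a hypothetical long-running script
            }
          }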

          Mykola Marzhan added a comment -

          Hi all,

          sorry for reopening, I just received a new Jenkins update...

          Is it possible not to reinvent variable names, but to use the stable, well-known approach from regular jobs?
          I am talking about the BUILD_ID=dontKillMe variable,
          documented here: https://wiki.jenkins.io/display/JENKINS/ProcessTreeKiller

          Everyone who uses Jenkins a lot knows about this variable, and it is extremely simple to find information about it and about process killing via searching.
          It is very strange and unusual to have different variables for old-school jobs and for pipelines.

          Matthew Mitchell added a comment -

          I don't think BUILD_ID was being used for killing processes before.

          Anyway, the classic approach doesn't work with pipelines. Two executors on the same machine could run parts of the same build in parallel, which means that when the process killer attempts to kill by BUILD_ID it will kill the other executor's processes.

          We could perhaps introduce an additional environment variable and check for both. Thoughts jglick?
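
          For example, in a sketch like the following (long_task_a.sh and long_task_b.sh are just placeholders), both branches may be scheduled on the same machine and share a single BUILD_ID, so killing everything that matches BUILD_ID at the end of one branch would also hit the other branch's processes:

          parallel first: {
            node {
              // same BUILD_ID as the other branch; only a per-node-block cookie distinguishes them
              sh './long_task_a.sh &'
            }
          }, second: {
            node {
              sh './long_task_b.sh &'
            }
          }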

          Jesse Glick added a comment -

          Does not need to be reopened.

          I see no reason to look for BUILD_ID. As pointed out, this does not generalize well to parallel blocks.


          Mykola Marzhan added a comment -

          Hi jglick,
          the killer does not need to look only at BUILD_ID, but it could look at BUILD_ID as well;
          that is, kill a process only if both JENKINS_NODE_COOKIE and BUILD_ID are unchanged.

          I know that pipelines are a completely new thing, but Jenkins itself is not, and it is very important to keep backward compatibility.
          Many people have worked with Jenkins for many years, and these new variables are completely unexpected.

          Jesse Glick added a comment -

          Could be a follow-up RFE (linked issue).

