[JENKINS-28182] Revisit use of $JENKINS_SERVER_COOKIE and Launcher.kill

      PlaceholderExecutable needs an environment variable to add to the context to ensure that Launcher.kill, called when the block exits, will clean up any stray external processes run on this node.

      Originally the name of this variable was chosen to be JENKINS_SERVER_COOKIE, which is set by Job (to a confidential hex key) and normally used by AbstractBuild for this purpose. Unlike in that case, for Workflow the value is overridden (to a random UUID) for each node block, since there may be several. I think the intent was to use the same variable name to suppress any unwanted kills coming from Jenkins core code, but there seem to be none. Anyway after the many refactorings of environment variable handling in Workflow, it turns out this is dead code!

      • FileMonitoringTask sets its own value for JENKINS_SERVER_COOKIE, to durable-<workspaceHash>, which overrides any other value, and is used by stop. So within a shell step, $JENKINS_SERVER_COOKIE is the one from durable-task, not PlaceholderExecutable.
      • env.JENKINS_SERVER_COOKIE picks up the value from Job, due to the precedence order in getEffectiveEnvironment (see the sketch below).
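
      For example (a minimal sketch only to illustrate the precedence described in the bullets above; the values in the comments are what the bullets imply, not verified output):

      node {
        sh 'echo "shell sees $JENKINS_SERVER_COOKIE"'   // durable-<workspaceHash>, set by FileMonitoringTask
        echo "Groovy sees ${env.JENKINS_SERVER_COOKIE}" // the confidential hex key set by Job
      }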

      So it seems that while the intention is still good (we would like a final check that any processes started inside node are killed when the block exits), the implementation is not right. Probably PlaceholderExecutable just needs to pick an unrelated environment variable for this purpose. And this logic needs to be tested; for example:

      def file = new File(...) // some agreed path on the agent, outside the control of the node block
      node {
        // background process should be reaped by Launcher.kill when the block exits
        sh "(sleep 5; touch $file) &"
      }
      sleep 10 // long enough for the stray process to have touched the file, had it survived
      assert !file.exists()
      


          Jesse Glick added a comment -

          JENKINS-28131 also discusses passing some other environment variables from PlaceholderExecutable, so would be easy to do at the same time.


          Jesse Glick added a comment -

          The use of JENKINS-25938 in the lts-609 branch also moves the Launcher.kill to the finish method (where it belonged to begin with), and moves the definition of COOKIE_VAR, so there could be merge conflicts if done before that branch is merged.


          Matthew Mitchell added a comment -

          This is a big issue in some deployments (Windows especially) that launch VC++ and C# compiler processes, since these sometimes want to leave processes around after execution. This can cause failures down the road. I'm looking into this issue now based on the info above.

          Jesse Glick added a comment -

          mmitche I think your comment is misplaced. Fixing this architectural / code-style issue should not change any user-visible behavior.

          Probably what you are asking for is accomplished simply by unsetting the environment variable for selected processes spawned by your script.


          Matthew Mitchell added a comment -

          Actually it does have user-visible behavior. I did a bunch of experimentation and instrumentation of Jenkins.

          The process killer is looking for the original value of JENKINS_SERVER_COOKIE (some hash), but all processes within the node get the durable-<hash> value, including the problematic ones I'm seeing. I understand the intention here: multiple node {} blocks running concurrently on the same node for the same pipeline job would be problematic if reaching the end of one node block killed the processes of the other blocks still running.

          However, what this means is that nothing is actually cleaned up except what is normally killed by parent processes exiting. On Windows this is more problematic because of the ease of breaking parent->child relationships.

          Jesse Glick added a comment -

          IOW, the killer logic is just broken. I would need to spend some time on it in a debugger to confirm.

          Matthew Mitchell added a comment -

          I can take this over the next day or two (it's necessary for us to roll this out). It might just mean building more logic around what the process environment variable match needs to be when dealing with pipeline jobs.

          Jesse Glick added a comment -

          I suggested the probable fix in the issue description, along with a sketch of a test.


          Matthew Mitchell added a comment -

          jglick your hunch was correct. Fixed in https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/39 and added a test case.

          Matthew Mitchell added a comment -

          Also, as an aside, the original motivation for this was to fix our workspace cleanup call at the end of node steps; we were having processes left around. However, after fixing this bug, it's clear that this won't fix that issue, since we actually need the cleanup to happen after killing child processes. Is there a hook (through a plugin or otherwise) that can ensure that WsCleanup runs after process cleanup?

          Jesse Glick added a comment -

          Well the process cleanup should happen at the end of the sh/bat step, in addition to or instead of at the end of the node block.


          Matthew Mitchell added a comment -

          Interesting. I don't think I was seeing it at the end of an sh/bat step; let me check that. But you're right, this should fix the overall issue if processes were killed at the end of each sh/bat step.

          Matthew Mitchell added a comment -

          jglick The kill here is in ExecutorStepExecution, which is scoped to the node level, not the step level, so the process killing happens at the end of the node block. There is no kill for DurableStep currently. My take is that this is the desired behavior, but I'm open to changing it. Killing at the step level would certainly solve the problem, but might have unintended consequences for some users' pipeline workflows.

          Jesse Glick added a comment -

          Hmm, probably FileMonitoringController.cleanup should be doing this.


          Reagan Elm added a comment -

          mmitche is there any way to opt out of Launcher.kill for a specific node? Some of our jobs intentionally start long-running scripts that need to survive after the node has completed.


          Jesse Glick added a comment -

          Clear the environment I suppose.


          Matthew Mitchell added a comment -

          The documented way was always to alter the environment variable that is being checked. For instance, in Freestyle jobs you'd do:

          JENKINS_SERVER_COOKIE=do_not_kill

          and in Pipeline jobs you'd do:

          JENKINS_NODE_COOKIE=do_not_kill
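
          For example, to keep a background process alive past the end of a node block in a Pipeline job, the cookie can be overridden just around the launching step (a minimal sketch; start_server.sh is only a placeholder for whatever you actually launch):

          node {
            // Overriding the cookie means the process killer no longer matches these processes.
            withEnv(['JENKINS_NODE_COOKIE=do_not_kill']) {
              sh 'nohup ./start_server.sh >server.log 2>&1 &' // start_server.sh is a hypothetical long-running script
            }
          }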

          Mykola Marzhan added a comment -

          Hi all,

          sorry for reopening, I just received a new Jenkins update...

          Is it possible not to reinvent variable names, but to use the stable, well-known approach from regular jobs?
          I am talking about the BUILD_ID=dontKillMe variable,
          documented here: https://wiki.jenkins.io/display/JENKINS/ProcessTreeKiller

          Everyone who uses Jenkins a lot knows about this variable, and it is extremely simple to find information about it and about process killing via searching.
          It is very strange and unusual to have different variables for old-school jobs and for pipelines.

          Matthew Mitchell added a comment -

          I don't think BUILD_ID was being used for killing processes before.

          Anyway, the classic approach doesn't work with pipelines. Two executors on the same machine could run parts of the same build in parallel, which means that when the process killer attempts to kill by BUILD_ID it will kill the other executor's processes.

          We could perhaps introduce an additional environment variable and check for both. Thoughts jglick?
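
          For example, in a sketch like the following (long_task_a.sh and long_task_b.sh are just placeholders), both branches may be scheduled on the same machine and share a single BUILD_ID, so killing everything that matches BUILD_ID at the end of one branch would also hit the other branch's processes:

          parallel first: {
            node {
              // same BUILD_ID as the other branch; only a per-node-block cookie distinguishes them
              sh './long_task_a.sh &'
            }
          }, second: {
            node {
              sh './long_task_b.sh &'
            }
          }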

          Jesse Glick added a comment -

          Does not need to be reopened.

          I see no reason to look for BUILD_ID. As pointed out, this does not generalize well to parallel blocks.


          Mykola Marzhan added a comment -

          Hi jglick,
          the killer does not need to look only at BUILD_ID, but it could look at BUILD_ID as well;
          that is, kill a process only if both JENKINS_NODE_COOKIE and BUILD_ID are unchanged.

          I know that pipelines are a completely new thing, but Jenkins itself is not, and it is very important to keep backward compatibility.
          Many people have worked with Jenkins for many years, and these new variables are completely unexpected.

          Jesse Glick added a comment -

          Could be a follow-up RFE (linked issue).

