[JENKINS-47821] vsphere plugin 2.16 not respecting slave disconnect settings

      Starting with vSphere Plugin 2.16, the behaviour at the end of a job is broken.

      I configure the node to disconnect after 1 build, and to shutdown at that point.  This, along with snapping back to the snapshot upon startup, gives me a guaranteed-clean machine at the start of every build.

      Starting in version 2.16, the plugin seems to opportunistically ignore the "disconnect after (1) builds" setting, and re-uses the node to run the next queued job without first reverting to the snapshot.  That next build then has high odds of failing or mis-building, as the node is unclean.
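
      For anyone trying to confirm what the plugin is doing on their own controller, here is a minimal script-console sketch (Groovy, core Jenkins API only; the output format is illustrative and nothing in it is specific to this setup) that dumps each agent's retention strategy and current state:

      import jenkins.model.Jenkins
      import hudson.model.Slave

      // List each agent's retention strategy and state, to see whether the
      // "disconnect after N builds" strategy is still attached and honoured.
      Jenkins.get().nodes.each { node ->
          def c = node.toComputer()
          def strategy = (node instanceof Slave) ? node.retentionStrategy?.class?.simpleName : 'n/a'
          println "${node.nodeName}: strategy=${strategy}, online=${c?.isOnline()}, busy=${c?.countBusy()}"
      }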

      WORKAROUND: Revert to plugin version 2.15, where the error does not occur.


          John Mellor created issue -
          pjdarton made changes -
          Assignee New: pjdarton [ pjdarton ]

          pjdarton added a comment -

          You're saying it's a regression since 2.15?  Hmm, ok...  I certainly hadn't intended to cause this behavior but I'll see if I can find the cause and fix it...

          If you can provide any further information then that'd greatly simplify the debugging process.

          e.g. what do you mean by "opportunistically"?  What's the scenario in which the VM gets re-used (when it shouldn't) vs being disposed of correctly?


          John Mellor added a comment -

          Yes, exactly. I never do incremental builds as they are a severely broken dev practice. I typically set up a node to disconnect after one build, and reset back to a VMware snapshot upon restart. That way I can easily debug a build problem, because the machine is left in the state where the build failed, and the next job does not start with artifacts left over from the previous build, such as dependency packages, config files or Docker images.

          However, I am now in a situation where sometimes a queued build runs on the node without going through the reset-back-to-snapshot step, breaking it.

          I have a crude workaround of powering the node down after every build, forcing it to go through the power-up steps which will then revert back to snapshot. However, this maximizes the downtime for the node between builds, and prevents some debugging actions because you lose the in-memory structures this way.
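
          If it helps anyone hitting the same problem, this crude workaround can also be scripted against core Jenkins APIs (no plugin-specific calls). The snippet below is only a sketch: "vsphere-agent-1" is a placeholder node name, and it assumes the node's vSphere launcher is configured to power the VM off / revert to snapshot when the agent is disconnected.

          import jenkins.model.Jenkins
          import hudson.slaves.OfflineCause

          // Force-disconnect the agent so that whatever the launcher is configured
          // to do on disconnect (power off / revert to snapshot) actually happens.
          def computer = Jenkins.get().getComputer('vsphere-agent-1')
          if (computer != null) {
              computer.disconnect(new OfflineCause.ByCLI('force revert to snapshot between builds'))
                      .get()   // block until the disconnect has completed
          }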


          pjdarton added a comment -

          I share your opinions - I never (intentionally) do incremental builds either

          If you can figure out a reproducible test case that I can follow here to reproduce the issue (i.e. see it reuse a node using plugin version 2.16 where it didn't on 2.15) then that'll greatly assist (and hence speed up) the diagnostic process and dramatically reduce the time to fix it.  Debugging something that happens "sometimes" is way more difficult than debugging something that happens "every time you do X".

          i.e. If you help me to help you, you'll get a solution a lot quicker

          pjdarton made changes -
          Issue Type Original: Improvement [ 4 ] New: Bug [ 1 ]
          pjdarton made changes -
          Status Original: Open [ 1 ] New: In Progress [ 3 ]

          pjdarton added a comment - edited

          alt_jmellor  I've spotted what might have been a race condition in the code, giving it an opportunity to go wrong where it didn't before, but without further information regarding your configuration I have no means to test whether or not my change fixes the issue.

          I've made some changes in vsphere-cloud PR#91 and you can download a built plugin from the ci.jenkins.io Jenkins server vsphere-cloud PR-91 CI build job (see "Last Successful Artifacts" - "vsphere-cloud.hpi").
          If you download that file you can then install it using "Manage Jenkins" -> "Manage Plugins" -> "Advanced" -> "Upload Plugin".

          Give that version of the plugin a try and see if it makes a difference. If it doesn't help, you'll have to go into way more detail about how you've got things set up so that I can reproduce the issue locally. If it does help then please let me know.


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Peter Darton
          Path:
          src/main/java/org/jenkinsci/plugins/vSphereCloudLauncher.java
          src/main/java/org/jenkinsci/plugins/vSphereCloudSlave.java
          src/main/java/org/jenkinsci/plugins/vSphereCloudSlaveTemplate.java
          src/main/java/org/jenkinsci/plugins/vsphere/RunOnceCloudRetentionStrategy.java
          src/main/java/org/jenkinsci/plugins/vsphere/VSphereOfflineCause.java
          src/main/resources/org/jenkinsci/plugins/Messages.properties
          src/main/resources/org/jenkinsci/plugins/vsphere/Messages.properties
          http://jenkins-ci.org/commit/vsphere-cloud-plugin/620868e4808f0df6772c11331dc86bd3ea8413eb
          Log:
          Merge pull request #91 from pjdarton/prevent-reuse-of-single-use-slaves

          JENKINS-47821 Prevent run-once slave from accepting more jobs.

          Compare: https://github.com/jenkinsci/vsphere-cloud-plugin/compare/6f78bb0aa164...620868e4808f
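
          For context only: the generic Jenkins mechanism for keeping further builds off an agent is Computer#isAcceptingTasks / SlaveComputer#setAcceptingTasks. The sketch below merely illustrates that core API with a placeholder agent name; it is not a description of what PR #91 actually changes.

          import jenkins.model.Jenkins
          import hudson.slaves.SlaveComputer

          // Illustration of the core API only: a computer that stops accepting
          // tasks will not be handed further queued builds.
          def c = Jenkins.get().getComputer('my-vsphere-agent')
          if (c instanceof SlaveComputer) {
              c.setAcceptingTasks(false)
              println "${c.name} acceptingTasks=${c.isAcceptingTasks()}"
          }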


          John Mellor added a comment -

          With the limited testing that I've been able to perform, it looks like this change is provisionally not working.
          If I queue up multiple jobs for a single node, and configure the node to do nothing upon end-of-job and to reset back to snapshot upon startup, then I do not see the expected reset-to-snapshot between jobs. It looks like the plugin just starts the next job on the already-polluted machine and skips the revert for some reason.
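
          As a possible reproducible test case for the above: a minimal Pipeline sketch that drops a marker file outside the workspace and fails if the marker already exists, so any build that lands on a VM that was not reverted to its snapshot fails immediately. The label and marker path are placeholders; queue the job several times against a single-use agent.

          // With "disconnect after 1 build" + revert-to-snapshot working, every
          // run of this job passes; if the agent is reused without a revert,
          // the second run fails on the marker file.
          node('vsphere-runonce') {
              sh '''
                  marker=/var/tmp/jenkins-dirty-marker
                  if [ -e "$marker" ]; then
                      echo "Agent was reused without reverting to snapshot" >&2
                      exit 1
                  fi
                  touch "$marker"
              '''
          }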


            Assignee: pjdarton
            Reporter: John Mellor (alt_jmellor)
            Votes: 4
            Watchers: 8
