[JENKINS-47821] vsphere plugin 2.16 not respecting slave disconnect settings

      Starting with vSphere Plugin 2.16, the behaviour at the end of a job is broken.

      I configure the node to disconnect after 1 build, and to shutdown at that point.  This, along with snapping back to the snapshot upon startup, gives me a guaranteed-clean machine at the start of every build.

      Starting in version 2.16, the plugin seems to opportunistically ignore the "disconnect after (1) builds" setting, and re-uses the node to run the next queued job without first reverting to the snapshot.  That next build then has high odds of failing or mis-building, as the node is unclean.
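
      For anyone trying to confirm what the plugin is doing on their own controller, here is a minimal script-console sketch (Groovy, core Jenkins API only; the output format is illustrative and nothing in it is specific to this setup) that dumps each agent's retention strategy and current state:

      import jenkins.model.Jenkins
      import hudson.model.Slave

      // List each agent's retention strategy and state, to see whether the
      // "disconnect after N builds" strategy is still attached and honoured.
      Jenkins.get().nodes.each { node ->
          def c = node.toComputer()
          def strategy = (node instanceof Slave) ? node.retentionStrategy?.class?.simpleName : 'n/a'
          println "${node.nodeName}: strategy=${strategy}, online=${c?.isOnline()}, busy=${c?.countBusy()}"
      }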

      WORKAROUND: Revert to plugin version 2.15, where the error does not occur.


          John Mellor created issue -
          pjdarton made changes -
          Assignee New: pjdarton [ pjdarton ]

          pjdarton added a comment -

          You're saying it's a regression since 2.15?  Hmm, ok...  I certainly hadn't intended to cause this behavior but I'll see if I can find the cause and fix it...

          If you can provide any further information then that'd greatly simplify the debugging process.

          e.g. what do you mean by "opportunistically"?  What's the scenario in which the VM gets re-used (when it shouldn't) vs being disposed of correctly?


          John Mellor added a comment -

          Yes, exactly. I never do incremental builds as they are a severely broken dev practice. I typically set up a node to disconnect after one build, and reset back to a VMware snapshot upon restart. That way I can easily debug a build problem, because the machine is left in the state where the build failed, and the next job does not start with artifacts left over from the previous build, such as dependency packages, config files or Docker images.

          However, I am now in a situation where sometimes a queued build runs on the node without going through the reset-back-to-snapshot step, breaking it.

          I have a crude workaround of powering the node down after every build, forcing it to go through the power-up steps which will then revert back to snapshot. However, this maximizes the downtime for the node between builds, and prevents some debugging actions because you lose the in-memory structures this way.
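
          If it helps anyone hitting the same problem, this crude workaround can also be scripted against core Jenkins APIs (no plugin-specific calls). The snippet below is only a sketch: "vsphere-agent-1" is a placeholder node name, and it assumes the node's vSphere launcher is configured to power the VM off / revert to snapshot when the agent is disconnected.

          import jenkins.model.Jenkins
          import hudson.slaves.OfflineCause

          // Force-disconnect the agent so that whatever the launcher is configured
          // to do on disconnect (power off / revert to snapshot) actually happens.
          def computer = Jenkins.get().getComputer('vsphere-agent-1')
          if (computer != null) {
              computer.disconnect(new OfflineCause.ByCLI('force revert to snapshot between builds'))
                      .get()   // block until the disconnect has completed
          }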


          pjdarton added a comment -

          I share your opinions - I never (intentionally) do incremental builds either

          If you can figure out a reproducible test case that I can follow here to reproduce the issue (i.e. see it reuse a node using plugin version 2.16 where it didn't on 2.15) then that'll greatly assist (and hence speed up) the diagnostic process and dramatically reduce the time to fix it.  Debugging something that happens "sometimes" is way more difficult than debugging something that happens "every time you do X".

          i.e. If you help me to help you, you'll get a solution a lot quicker

          pjdarton made changes -
          Issue Type Original: Improvement [ 4 ] New: Bug [ 1 ]
          pjdarton made changes -
          Status Original: Open [ 1 ] New: In Progress [ 3 ]

          pjdarton added a comment - edited

          alt_jmellor  I've spotted what might have been a race condition in the code, giving it an opportunity to go wrong where it didn't before, but without further information regarding your configuration I have no means to test whether or not my change fixes the issue.

          I've made some changes in vsphere-cloud PR#91 and you can download a built plugin from the ci.jenkins.io Jenkins server vsphere-cloud PR-91 CI build job (see "Last Successful Artifacts" - "vsphere-cloud.hpi").
          If you download that file you can then install it using "Manage Jenkins" -> "Manage Plugins" -> "Advanced" -> "Upload Plugin".

          Give that version of the plugin a try and see if it makes a difference. If it doesn't help, you'll have to go into way more detail about how you've got things set up so that I can reproduce the issue locally. If it does help then please let me know.


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Peter Darton
          Path:
          src/main/java/org/jenkinsci/plugins/vSphereCloudLauncher.java
          src/main/java/org/jenkinsci/plugins/vSphereCloudSlave.java
          src/main/java/org/jenkinsci/plugins/vSphereCloudSlaveTemplate.java
          src/main/java/org/jenkinsci/plugins/vsphere/RunOnceCloudRetentionStrategy.java
          src/main/java/org/jenkinsci/plugins/vsphere/VSphereOfflineCause.java
          src/main/resources/org/jenkinsci/plugins/Messages.properties
          src/main/resources/org/jenkinsci/plugins/vsphere/Messages.properties
          http://jenkins-ci.org/commit/vsphere-cloud-plugin/620868e4808f0df6772c11331dc86bd3ea8413eb
          Log:
          Merge pull request #91 from pjdarton/prevent-reuse-of-single-use-slaves

          JENKINS-47821 Prevent run-once slave from accepting more jobs.

          Compare: https://github.com/jenkinsci/vsphere-cloud-plugin/compare/6f78bb0aa164...620868e4808f
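
          For context only: the generic Jenkins mechanism for keeping further builds off an agent is Computer#isAcceptingTasks / SlaveComputer#setAcceptingTasks. The sketch below merely illustrates that core API with a placeholder agent name; it is not a description of what PR #91 actually changes.

          import jenkins.model.Jenkins
          import hudson.slaves.SlaveComputer

          // Illustration of the core API only: a computer that stops accepting
          // tasks will not be handed further queued builds.
          def c = Jenkins.get().getComputer('my-vsphere-agent')
          if (c instanceof SlaveComputer) {
              c.setAcceptingTasks(false)
              println "${c.name} acceptingTasks=${c.isAcceptingTasks()}"
          }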


          John Mellor added a comment -

          With the limited testing that I've been able to perform, it looks like this change is provisionally not working.
          If I queue up multiple jobs for a single node, and configure the node to do nothing upon end-of-job and to reset back to snapshot upon startup, then I do not see the expected reset-to-snapshot between jobs. It looks like the plugin just starts the next job on the already-polluted machine and skips the revert for some reason.
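
          As a possible reproducible test case for the above: a minimal Pipeline sketch that drops a marker file outside the workspace and fails if the marker already exists, so any build that lands on a VM that was not reverted to its snapshot fails immediately. The label and marker path are placeholders; queue the job several times against a single-use agent.

          // With "disconnect after 1 build" + revert-to-snapshot working, every
          // run of this job passes; if the agent is reused without a revert,
          // the second run fails on the marker file.
          node('vsphere-runonce') {
              sh '''
                  marker=/var/tmp/jenkins-dirty-marker
                  if [ -e "$marker" ]; then
                      echo "Agent was reused without reverting to snapshot" >&2
                      exit 1
                  fi
                  touch "$marker"
              '''
          }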


            Assignee: pjdarton
            Reporter: John Mellor (alt_jmellor)
            Votes: 4
            Watchers: 8
