Type: Bug
Resolution: Unresolved
Priority: Minor
Environment: Jenkins 2.86, Ubuntu-16.04-3
Starting in vSphere Plugin 2.16, the end-of-job behaviour is broken.
I configure the node to disconnect after 1 build and to shut down at that point. This, along with snapping back to the snapshot upon startup, gives me a guaranteed-clean machine at the start of every build.
Starting in version 2.16, the plugin seems to opportunistically ignore the "disconnect after (1) builds" setting and re-uses the node to run the next queued job without reverting to the snapshot. That next build then has high odds of failing or mis-building, because the node is unclean.
WORKAROUND: Revert back to plugin version 2.15, where the error does not occur.
[JENKINS-47821] vsphere plugin 2.16 not respecting slave disconnect settings
Yes, exactly. I never do incremental builds as they are a severely broken dev practice. I typically set up a node to disconnect after one build and reset back to a VMware snapshot upon restart. That way I can easily debug a build problem, because the machine is left in the state where the build failed, and the next job does not have artifacts left over from the previous build, such as dependency packages, config files or Docker images.
However I am now in a situation where sometimes a queued build runs on the node without going through the reset-back-to-snapshot step, breaking it.
I have a crude workaround of powering the node down after every build, forcing it to go through the power-up steps which will then revert back to snapshot. However, this maximizes the downtime for the node between builds, and prevents some debugging actions because you lose the in-memory structures this way.
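For illustration, that crude workaround in scripted-pipeline form might look something like the sketch below; the 'vsphere-build' label, the shutdown delay, and the assumption that the build user is allowed to power the machine off are all placeholders rather than my exact setup.
// Sketch: power the VM off after every build so that the plugin's power-up sequence
// (and therefore the revert-to-snapshot) always runs before the next build starts.
node('vsphere-build') {
    try {
        // ... real build steps go here ...
    } finally {
        // Delay the shutdown slightly so this step can still report its result
        // before the agent connection is lost.
        if (isUnix()) {
            sh '(sleep 5; sudo poweroff) &'
        } else {
            bat 'shutdown /s /t 5'
        }
    }
}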
I share your opinions - I never (intentionally) do incremental builds either
If you can figure out a reproducible test case that I can follow here to reproduce the issue (i.e. see it reuse a node on plugin version 2.16 where it didn't on 2.15) then that'll greatly assist the diagnostic process and dramatically reduce the time-to-fix. Debugging something that happens "sometimes" is way more difficult than debugging something that happens "every time you do X".
i.e. If you help me to help you, you'll get a solution a lot quicker
alt_jmellor I've spotted what might have been a race condition in the code, giving it an opportunity to go wrong where it didn't before, but without further information regarding your configuration I have no way to test whether or not my change fixes the issue.
I've made some changes in vsphere-cloud PR#91 and you can download a built plugin from the ci.jenkins.io Jenkins server vsphere-cloud PR-91 CI build job (see "Last Successful Artifacts" - "vsphere-cloud.hpi").
If you download that file you can then install it using "Manage Jenkins" -> "Manage Plugins" -> "Advanced" -> "Upload Plugin".
Give that version of the plugin a try and see if it makes a difference. If it doesn't help, you'll have to go into way more detail about how you've got things set up so that I can reproduce the issue locally. If it does help then please let me know.
Code changed in jenkins
User: Peter Darton
Path:
src/main/java/org/jenkinsci/plugins/vSphereCloudLauncher.java
src/main/java/org/jenkinsci/plugins/vSphereCloudSlave.java
src/main/java/org/jenkinsci/plugins/vSphereCloudSlaveTemplate.java
src/main/java/org/jenkinsci/plugins/vsphere/RunOnceCloudRetentionStrategy.java
src/main/java/org/jenkinsci/plugins/vsphere/VSphereOfflineCause.java
src/main/resources/org/jenkinsci/plugins/Messages.properties
src/main/resources/org/jenkinsci/plugins/vsphere/Messages.properties
http://jenkins-ci.org/commit/vsphere-cloud-plugin/620868e4808f0df6772c11331dc86bd3ea8413eb
Log:
Merge pull request #91 from pjdarton/prevent-reuse-of-single-use-slaves
JENKINS-47821 Prevent run-once slave from accepting more jobs.
Compare: https://github.com/jenkinsci/vsphere-cloud-plugin/compare/6f78bb0aa164...620868e4808f
With the limited testing that I've been able to perform, it provisionally looks like this change is not working.
If I queue up multiple jobs for a single node, and configure the node to do nothing upon end-of-job and to reset back to the snapshot upon startup, then I do not see the expected reset between jobs. It looks like it just starts the next job on the already-polluted machine and skips the revert-to-snapshot for some reason.
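For what it's worth, a minimal job that makes the reuse visible might look something like the sketch below (the node label and marker path are placeholders, not my real setup). Queue several builds of it at once: if the VM really is reverted to the snapshot between builds, the marker file can never survive from one build to the next.
// Sketch of a reproduction job: a build fails only if the VM was reused
// without being reverted to its snapshot first.
node('vsphere-single-use') {
    if (isUnix()) {
        sh '''
            if [ -e /tmp/previous-build-was-here ]; then
                echo "VM was reused without a revert-to-snapshot!"
                exit 1
            fi
            touch /tmp/previous-build-was-here
        '''
    }
}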
In that case then I'm going to need you to describe your setup, as that's not what I see here (but then, I mostly use the plugin's "Cloud" functionality and am unfamiliar with its other functionality, which I'm guessing is what you're using).
If you can provide a description of how to set up a Jenkins server (that has the vSphere plugin installed) to reproduce this issue, I'll see if I can reproduce it. If I can reproduce it, there's a chance I might be able to fix it.
(FYI fixing bugs in this plugin is not my official day job, so the easier you can make it for me to see the issue for myself, the better the chances are that I can come up with a fix before my boss tells me to do something that is my official day job)
For some reason, I am unable to screenshot a typical config into this ticket.
When I configure a high-use build node, I generally set it up for:
Availability: Take this agent online when in demand, and offline when idle
Disconnect after limited builds: 1
What to do when the slave is disconnected: Revert and Restart
If it is a low-use node, then I instead configure for:
What to do when the slave is disconnected: Shutdown
Can we start with where you define the node?
FYI there's multiple ways the plugin can define a slave node, so how you get to the point where you make the choices you've described (can) make a difference.
I need instructions that start from "I've installed Jenkins and I've installed the plugin". I'm guessing that the next step would be to define a vSphere cloud and tell Jenkins the URL of vSphere and login details, and I presume that there will have to be some stuff in that vSphere server too, but I need to know what it consists of.
Name of this Cloud: QA Cluster
vSphere Host: https://vsphere.internal
Disable Certificate Verification: checked
Credentials: <valid non-interactive user/password in credentials>
Templates: <none>
FYI, there are several types of clouds configured at this site: google, vmware, k8s, etc.
The target vsphere cloud is running an esxi-5.5 cluster managed by vcenter-5.5, and using unshared local disks in RAID-6 as the VMFS volumes. Not sure what else I can give you.
How (by which method?) did you define the slave nodes in Jenkins?
(All my vSphere slaves are created from templates defined in the cloud section; I am aware it's possible to define non-cloud ones by a couple of routes but I've never done that myself)
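If it helps to answer that, running something like the following in "Manage Jenkins" -> "Script Console" should show which implementation classes each node is built from (a rough sketch; cloud-provisioned and statically-defined vSphere slaves use different classes, so the output tells us which route was used):
// Script Console sketch: print each node's implementation, launcher and
// retention-strategy classes to see how the slave was defined.
import jenkins.model.Jenkins
import hudson.slaves.Slave

Jenkins.instance.nodes.each { n ->
    if (n instanceof Slave) {
        println "${n.nodeName}: ${n.class.name}, " +
                "launcher=${n.launcher?.class?.name}, " +
                "retention=${n.retentionStrategy?.class?.name}"
    }
}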
Got the same issue here, as in slaves not respecting the "Disconnect after limited builds" setting (Jenkins 2.107.2, vSphere 2.17). Nodes have been defined via Jenkins -> Nodes -> "Slave virtual computer running under vSphere Cloud".
To add a bit of context, I'm running pipeline projects on those nodes and they do not seem to be treated as 'builds' per se, as no executed instances of those are displayed in the node's "Build History" section.
vmarin So you've got statically-defined slaves... How are they connecting to Jenkins? SSH? JNLP? If JNLP, which protocol version? And are you passing in a JNLP_SECRET, or are they allowed in unauthenticated? Also, what version of slave.jar are you using on the slave VMs?
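If it's quicker than checking each VM by hand, a Script Console sketch along these lines should report the remoting (slave.jar) version each connected agent claims to be running (offline agents are skipped):
// Script Console sketch: ask every connected agent for its remoting version.
import jenkins.model.Jenkins
import hudson.slaves.SlaveComputer

Jenkins.instance.computers.each { c ->
    if (c instanceof SlaveComputer && c.channel != null) {
        println "${c.name}: remoting ${c.slaveVersion}"
    }
}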
I've been tracing oddities in my own Jenkins build environment where slaves that start and then connect via JNLP often "stay online" (briefly) after they've gone offline due to a reboot-induced disconnection (long enough to start a new build job, which then fails because the slave had disconnected), but I've yet to get to the bottom of it (race-conditions are always difficult to debug). It may be that the issue I'm trying to track down and this issue are all related...
FYI I don't think that the lack of pipeline history is a vSphere plugin issue. I've got a pipeline job that reboots my static (non-VM) Windows slaves and that doesn't show up on their build history, so if a pipeline segment doesn't show up on a normal Jenkins slave's build history, I don't think we can expect it to show up on a vSphere slave's history either, as that'd be common code (the vSphere slave code "extends" the Jenkins core Slave code).
Slaves are connected via JNLP (windows service, while passing the JNLP secret), remoting version 3.17.
Found a ticket regarding build history and pipelines JENKINS-38877
Experiencing this still on 2.18, even though the text "Limited Builds is not currently used" no longer appears in the config help. Note this is combined with "Take this agent offline when not in demand...."
I've seen this issue also with versions 2.16 and 2.18 of the vSphere Cloud plugin; however, it seems it's not a problem in the plugin itself but a limitation of the "cloud" Jenkins interface that the plugin implements.
If you're trying to ensure a slave is always in a "clean" state when allocated, here's my workaround, after hours of painful google-search, trial and error:
1. Node configuration: fill the "Snapshot Name" field (eg "Clean")
2. Node configuration: Availability: "Take this agent online when in demand, and offline when idle"
3. Node configuration: What to do when the slave is disconnected: "Shutdown"
4. Pipeline job configuration: include the following code:
import jenkins.slaves.*
import jenkins.model.*
import hudson.slaves.*
import hudson.model.*

def SafelyDisposeNode() {
    print "Safely disposing node..."
    def slave = Jenkins.instance.getNode(env.NODE_NAME) as Slave
    if (slave == null) {
        error "ERROR: Could not get slave object for node!"
    }
    try {
        slave.getComputer().setTemporarilyOffline(true, null)
        if (isUnix()) {
            sh "(sleep 2; poweroff)&"
        } else {
            bat "shutdown -t 2 -s"
        }
        slave.getComputer().disconnect(null)
        sleep 10
    } catch (err) {
        print "ERROR: could not safely dispose node!"
    } finally {
        slave.getComputer().setTemporarilyOffline(false, null)
    }
    print "...node safely disposed."
    slave = null
}

def DisposableNode(String nodeLabel, Closure body) {
    node(nodeLabel) {
        try {
            body()
        } catch (err) {
            throw err
        } finally {
            SafelyDisposeNode()
        }
    }
}
5. When you want to ensure the node will NOT be used by another job (or another run of the same job), use a "DisposableNode" block instead of "node" block:
DisposableNode('MyNodeLabel') {
    // run your pipeline code here.
    // it will make sure the node is shutdown at the end of the block, even if it fails.
    // no other job or build will be able to use the node in its "dirty" state,
    // and vSphere plugin will revert to "clean" snapshot before starting the node again.
}
6. If other Jobs are using this node (or node label), they all must use the above workaround, to avoid leaving a "dirty" machine for each other.
7. As for the "why is it so important to have the node in a clean state?" question, my use case is integration tests of kernel-mode drivers (both Windows and Linux) that typically "break" the O/S and leave it in an unstable state (BSODs and kernel panics are common).
8. If your pipeline job is running under a Groovy sandbox, you will need to permit some classes (the job will fail and offer to whitelist a class; repeat carefully several times).
Any progress on this? I have just come up against what looks like the same issue. Statically defined Windows slaves connecting via JNLPv4.
They seem to completely ignore the 'Disconnect After Limited Builds' option which, re-reading the Wiki, seems to be the expected behaviour?
orenchapo your work-around doesn't seem to work for me, at least not when using it within declarative pipeline.
I modified the workaround to reset the vm in the pipeline itself.
Advantages:
- Shutdown activities are not required in the node configuration.
- The node is reset to the given snapshot before the pipeline body executes.
def ResettedNode(String vm, String serverName, String snapshotName, Closure body) {
    node(vm) {
        // Reset the computer in the context of the node to avoid running other jobs on this node in the meanwhile
        stage('Reset node') {
            def slave = Jenkins.instance.getNode(env.NODE_NAME) as Slave
            if (slave == null) {
                error "ERROR: Could not get slave object for node!"
            }
            try {
                slave.getComputer().setTemporarilyOffline(true, null)
                vSphere buildStep: [$class: 'PowerOff', vm: vm, evenIfSuspended: true, shutdownGracefully: false, ignoreIfNotExists: false], serverName: serverName
                vSphere buildStep: [$class: 'RevertToSnapshot', vm: vm, snapshotName: snapshotName], serverName: serverName
                vSphere buildStep: [$class: 'PowerOn', timeoutInSeconds: 240, vm: vm], serverName: serverName
                slave.getComputer().disconnect(null)
                sleep 10 // wait, while the agent on the slave is starting up
            } catch (err) {
                print "ERROR: could not reset node!"
            } finally {
                slave.getComputer().setTemporarilyOffline(false, null)
            }
            slave = null
        }
    }
    // Wait for node to come online again
    node(vm) {
        body()
    }
}

ResettedNode('vm', 'vCloud', 'clean') {
}
You're saying it's a regression since 2.15? Hmm, ok... I certainly hadn't intended to cause this behavior but I'll see if I can find the cause and fix it...
If you can provide any further information then that'd greatly simplify the debugging process.
e.g. what do you mean by "opportunistically"? What's the scenario in which the VM gets re-used (when it shouldn't) vs being disposed of correctly?