JENKINS-47821

vsphere plugin 2.16 not respecting slave disconnect settings

      Starting in vSphere Plugin 2.16, the behaviour at the end of a job is broken.

      I configure the node to disconnect after 1 build, and to shut down at that point.  This, along with snapping back to the snapshot upon startup, gives me a guaranteed-clean machine at the start of every build.

      Starting in version 2.16, the plugin seems to opportunistically ignore the "disconnect after (1) builds" setting, and re-uses the node to run the next queued job without reverting to the snapshot.  That next build then has high odds of failing or mis-building, as the node is unclean.

      WORKAROUND: Revert to plugin version 2.15, where the error does not occur.

          [JENKINS-47821] vsphere plugin 2.16 not respecting slave disconnect settings

          John Mellor added a comment -
          Name of this Cloud: QA Cluster
          vSphere Host: https://vsphere.internal
          Disable Certificate Verification: checked
          Credentials: <valid non-interactive user/password in credentials>
          Templates: <none>
          

          FYI, there are several types of clouds configured at this site: Google, VMware, k8s, etc.


          John Mellor added a comment -

          The target vSphere cloud is running an ESXi 5.5 cluster managed by vCenter 5.5, using unshared local disks in RAID-6 as the VMFS volumes. Not sure what else I can give you.


          pjdarton added a comment -

          How (by which method?) did you define the slave nodes in Jenkins?
          (All my vSphere slaves are created from templates defined in the cloud section; I am aware it's possible to define non-cloud ones by a couple of routes but I've never done that myself)


          Valentin Marin added a comment - edited

          Got the same issue here, as in slaves not respecting the "Disconnect After Limited Builds" setting (Jenkins 2.107.2, vSphere 2.17). Nodes have been defined via Jenkins -> Nodes -> "Slave virtual computer running under vSphere Cloud".

          To add a bit of context, I'm running pipeline projects on those nodes and they do not seem to be treated as 'builds' per se, as no executed instances of those are being displayed in the node's "Build History" section.
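          For reference, a quick way to double-check how each node is defined is to list the node classes from the Script Console; a minimal sketch (plain Jenkins API, nothing vSphere-specific):

          	import jenkins.model.Jenkins
          	
          	// Print each configured node's name and its implementing class.
          	// Statically-defined vSphere slaves should show up as the vSphere plugin's
          	// slave class (the exact class name depends on the plugin version),
          	// plain static slaves as hudson.slaves.DumbSlave.
          	Jenkins.instance.nodes.each { node ->
          		println "${node.nodeName}: ${node.class.name}"
          	}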


          pjdarton added a comment -

          vmarin So you've got statically-defined slaves...  How are they connecting to Jenkins?  SSH?  JNLP?  If JNLP, which protocol version?  And are you passing in a JNLP_SECRET, or are they allowed in unauthenticated?  Also, what version of slave.jar are you using on the slave VMs?

          I've been tracing oddities in my own Jenkins build environment where slaves that start and then connect via JNLP often "stay online" (briefly) after they've gone offline due to a reboot-induced disconnection (long enough to start a new build job, which then fails because the slave had disconnected), but I've yet to get to the bottom of it (race conditions are always difficult to debug).  It may be that the issue I'm trying to track down and this issue are related...

           

          FYI I don't think that the lack of pipeline history is a vSphere plugin issue.  I've got a pipeline job that reboots my static (non-VM) Windows slaves and that doesn't show up on their build history, so if a pipeline segment doesn't show up on a normal Jenkins slave's build history, I don't think we can expect it to show up on a vSphere slave's history either, as that'd be common code (the vSphere slave code "extends" the Jenkins core Slave code).


          Valentin Marin added a comment - edited

          Slaves are connected via JNLP (windows service, while passing the JNLP secret), remoting version 3.17.
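          For anyone else needing to check this, the remoting version can be read per agent from the Script Console; a rough sketch (I believe SlaveComputer exposes the agent's remoting version, but verify against your Jenkins core version):

          	import jenkins.model.Jenkins
          	import hudson.slaves.SlaveComputer
          	
          	// List the remoting (slave.jar / agent.jar) version reported by each agent.
          	Jenkins.instance.computers.each { c ->
          		if (c instanceof SlaveComputer) {
          			println "${c.name}: remoting ${c.slaveVersion ?: 'not connected'}"
          		}
          	}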


          Josiah Eubank added a comment - edited

          Found a ticket regarding build history and pipelines: JENKINS-38877

          Still experiencing this on 2.18, even though the text "Limited Builds is not currently used" no longer appears in the config help.  Note this is combined with "Take this agent offline when not in demand...."


          Oren Chapo added a comment -

          I've also seen this issue with versions 2.16 and 2.18 of the vSphere Cloud plugin; however, it seems it's not a problem in the plugin, but a limitation of the "cloud" Jenkins interface that the plugin implements.

          If you're trying to ensure a slave is always in a "clean" state when allocated, here's my workaround, after hours of painful Google searching and trial and error:
          1. Node configuration: fill in the "Snapshot Name" field (e.g. "Clean")
          2. Node configuration: Availability: "Take this agent online when in demand, and offline when idle"
          3. Node configuration: What to do when the slave is disconnected: "Shutdown"
          4. Pipeline job configuration: include the following code:

          	import jenkins.slaves.*
          	import jenkins.model.*
          	import hudson.slaves.*
          	import hudson.model.*
          	
          	def SafelyDisposeNode() {
          		print "Safely disposing node..."
          		def slave = Jenkins.instance.getNode(env.NODE_NAME) as Slave
          		if (slave == null) {
          			error "ERROR: Could not get slave object for node!"
          		}
          		try
          		{
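          			// Mark the node temporarily offline first, so Jenkins stops scheduling new builds on it while it shuts down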
          			slave.getComputer().setTemporarilyOffline(true, null)
          			if(isUnix()) {
          				sh "(sleep 2; poweroff)&"
          			} else {
          				bat "shutdown -t 2 -s"
          			}
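          			// Disconnect the agent from Jenkins before the OS finishes powering off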
          			slave.getComputer().disconnect(null)
          			sleep 10
          		} catch (err) {
          			print "ERROR: could not safely dispose node!"
          		} finally {
          			slave.getComputer().setTemporarilyOffline(false, null)
          		}
          		print "...node safely disposed."
          		slave = null
          	}
          	
          	def DisposableNode(String nodeLabel, Closure body) {
          		node(nodeLabel) {
          			try {
          				body()
          			} catch (err) {
          				throw err
          			} finally {
          				SafelyDisposeNode()
          			}
          		}
          	}
          
          

          5. When you want to ensure the node will NOT be used by another job (or another run of the same job), use a "DisposableNode" block instead of a "node" block:

          	DisposableNode('MyNodeLabel') {
          		// run your pipeline code here.
          		// it will make sure the node is shutdown at the end of the block, even if it fails.
          		// no other job or build will be able to use the node in its "dirty" state,
          		// and vSphere plugin will revert to "clean" snapshot before starting the node again.
          	}
          

          6. If other jobs are using this node (or node label), they must all use the above workaround, to avoid leaving a "dirty" machine for each other.
          7. As for the "why is it so important to have the node in a clean state?" question, my use case is integration tests of kernel-mode drivers (both Windows and Linux) that typically "break" the O/S and leave it in an unstable state (BSODs and kernel panics are common).
          8. If your pipeline job is running under the Groovy sandbox, you will need to approve some method signatures (the job will fail and offer to whitelist one; repeat carefully several times). A sketch of pre-approving them is shown below.
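          A rough sketch of that pre-approval from the Script Console follows; the exact signature strings are whatever the sandbox rejection messages (or the In-process Script Approval page) show, and the ones below are only examples of the kind of entries that end up approved:

          	import org.jenkinsci.plugins.scriptsecurity.scriptapproval.ScriptApproval
          	
          	// Example signatures only; copy the exact strings from the sandbox
          	// rejection messages rather than trusting this list.
          	def approval = ScriptApproval.get()
          	[
          		'staticMethod jenkins.model.Jenkins getInstance',
          		'method jenkins.model.Jenkins getNode java.lang.String',
          		'method hudson.model.Computer setTemporarilyOffline boolean hudson.slaves.OfflineCause',
          		'method hudson.model.Computer disconnect hudson.slaves.OfflineCause'
          	].each { sig ->
          		approval.approveSignature(sig)
          	}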


          James Telfer added a comment -

          Any progress on this?  I have just come up against what looks like the same issue.  Statically defined Windows slaves connecting via JNLPv4.

          They seem to completely ignore the 'Disconnect After Limited Builds' option, which, re-reading the Wiki, seems to be the expected behaviour?

          orenchapo your workaround doesn't seem to work for me, at least not when using it within a declarative pipeline.
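          Roughly, the way I'd expect it to be wired in is something like this (untested sketch; it assumes the SafelyDisposeNode() function from the earlier comment is defined above the pipeline block, or comes from a trusted shared library):

          	// Untested sketch: SafelyDisposeNode() is the scripted helper from the
          	// earlier comment, defined above this pipeline block (or loaded from a
          	// shared library).
          	pipeline {
          		agent { label 'MyNodeLabel' }
          		stages {
          			stage('Build') {
          				steps {
          					echo 'build steps on the vSphere slave go here'
          				}
          			}
          		}
          		post {
          			always {
          				// post { always } runs on the same agent, so the node can
          				// still be taken offline and shut down even when the build fails
          				script {
          					SafelyDisposeNode()
          				}
          			}
          		}
          	}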


          Werner Müller added a comment -

          I modified the workaround to reset the VM in the pipeline itself.

          Advantages:

          • Shutdown activities are not required in the node configuration.
          • The node is reset to the given snapshot before the pipeline body executes.

           

          def ResettedNode(String vm, String serverName, String snapshotName, Closure body) {
              node(vm) {
                  // Reset the computer in the context of the node to avoid running other jobs on this node in the meanwhile
                  stage('Reset node') {
                      def slave = Jenkins.instance.getNode(env.NODE_NAME) as Slave
                      if (slave == null) {
                          error "ERROR: Could not get slave object for node!"
                      }
                      try
                      {
                          slave.getComputer().setTemporarilyOffline(true, null)
                          vSphere buildStep: [$class: 'PowerOff', vm: vm, evenIfSuspended: true, shutdownGracefully: false, ignoreIfNotExists: false], serverName: serverName
                          vSphere buildStep: [$class: 'RevertToSnapshot', vm: vm, snapshotName: snapshotName], serverName: serverName
                          vSphere buildStep: [$class: 'PowerOn', timeoutInSeconds: 240, vm: vm], serverName: serverName
                          slave.getComputer().disconnect(null)
                          sleep 10 // wait, while the agent on the slave is starting up
                      } catch (err) {
                          print "ERROR: could not reset node!"
                      } finally {
                          slave.getComputer().setTemporarilyOffline(false, null)
                      }
                      slave = null
                  }
              }
              // Wait for node to come online again
              node(vm) {
                  body()
              }
          }
          
          ResettedNode('vm', 'vCloud', 'clean') {
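              // pipeline steps that need the freshly reverted VM go here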
          
          }
          

           


            Assignee: pjdarton
            Reporter: John Mellor (alt_jmellor)
            Votes: 4
            Watchers: 8