[JENKINS-30183] Openstack Cloud Plugin does not wait until the user data script completes

      I created a script in the "Openstack User Data" section of Jenkins management that is supposed to load some files onto the new node before any Jenkins job can be launched.
      It's a shell script that calls a Python script at the end.

      Sometimes a Jenkins job starts on the new node BEFORE the init script completes, which is not acceptable.

      Are there any examples of how to make sure the "Openstack User Data" script completes before any jobs start on that node?

          Oliver Gondža added a comment -

          There is a way to do this using a JNLP launch initiated as the last step of your cloud-init. I know of no way to wait for cloud-init completion when connecting over SSH.

          Andrew Grimberg added a comment -

          What we do is have our Jenkins user created as the very last step of the user data scripts.

          That keeps Jenkins from connecting to the node until it's actually ready.
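
A minimal sketch of that ordering, not anyone's actual script from this thread. The function name, the `jenkins` user name, the setup script path, and the `ADD_USER_CMD` override are all assumptions made for illustration; only `useradd --create-home` is a standard command.

```shell
#!/bin/bash
# Runs the provisioning command first and creates the login account only
# after it has succeeded, so SSH logins as that user fail until then.
# ADD_USER_CMD is a hypothetical override so the ordering can be tried
# without root.
ADD_USER_CMD="${ADD_USER_CMD:-useradd --create-home}"

provision_then_add_user() {
    local user="$1"
    shift
    # "$@" is the provisioning work (install packages, load files, ...)
    "$@" || return 1
    # Deliberately last: nobody can log in as $user before this line runs
    $ADD_USER_CMD "$user"
}

# Example (hypothetical user name and setup script):
# provision_then_add_user jenkins /opt/setup/load-files.sh
```

A real script would also have to install the SSH key for the new user after creating it; that step is left out here.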

          Joshua Harlow added a comment -

          So over at GoDaddy we are moving towards this plugin, and our existing solution has been to spin up a tiny web server (either Apache or Python, with something like the following; this would happen as the last thing cloud-init does via the user data script):

          $ python -m SimpleHTTPServer 8888    # Python 2; on Python 3 the equivalent is: python3 -m http.server 8888

          Then have the code that provisioned that server (i.e., the OpenStack cloud plugin here) poll the above port for a given period of time. If the port doesn't become accessible after some configurable amount of time, assume the server is busted (and the VM is dead); otherwise assume everything worked out.

          Another option (and I'm not really sure how/if it could work) is to take advantage of the cloud-init feature to call back to a URL as the last thing it does when running (i.e., after all the other user data scripts have run): example @ https://github.com/number5/cloud-init/blob/master/doc/examples/cloud-config-phone-home.txt#L3

          If that could be used to somehow post back to the Jenkins master that the 'slave is now ready for usage', that'd be even better (less polling required). Anyone know which approach people would be OK with? Both would require a version of cloud-init installed (which is in most/all images). The latter would be nice, but I'm unsure if Jenkins has such an ability in the first place (how does it know slaves are ready to be used?).
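
The polling side of the first approach could be sketched roughly as follows. This is an illustration only, not code from the plugin: the function name and the default timeout are assumptions, and the `/dev/tcp` redirection is a bash feature rather than a real device.

```shell
#!/bin/bash
# Poll until HOST:PORT accepts a TCP connection or TIMEOUT seconds elapse.
# Returns 0 when the port opened, 1 when the deadline passed.
wait_for_port() {
    local host="$1" port="$2" timeout="${3:-600}"
    local deadline=$(( $(date +%s) + timeout ))
    while [ "$(date +%s)" -lt "$deadline" ]; do
        # bash-specific: opening /dev/tcp/HOST/PORT attempts a TCP connect
        if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
            return 0
        fi
        sleep 1
    done
    return 1
}

# Example (hypothetical address, port from the comment above):
# wait_for_port 10.0.0.15 8888 1200 || echo "node never became ready"
```

Note that a connect attempt to a host that silently drops packets can block for longer than one second, so the effective timeout is approximate.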

          Oliver Gondža added a comment - - edited

          I have considered the phone home feature, though there are several caveats:

          • The cloud-init script can be user-written, so the directive might be missing. Having the plugin add it at the end of the script sounds problematic.
          • Cloud-init user data does not have to use the #cloud-config format; it can be a bash, PowerShell, or other script. I do not think the plugin should manipulate any of them.
          • Not all images have that feature enabled. If user data is sent to an image that does not support it, nothing really happens. If Jenkins waited for a signal to arrive, that would be a problem for such images.

          Oliver Gondža added a comment -

          Why do you need a web server running on your slaves ahead of time? Please elaborate further.

          Oliver Gondža added a comment -

          My current thinking is to use marker entries in the OpenStack server console log. If and only if OPENSTACK_USER_DATA_START is logged, the plugin would wait for OPENSTACK_USER_DATA_DONE. Existing scripts would work as before, since the first marker would never appear. The same goes for images/setups without cloud-init/cloud-config. Anyone doing a complicated setup would log two extra lines to the output from whichever kind of script they prefer.
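
To make the marker idea concrete, here is a rough sketch of both sides in shell. It is not the plugin's implementation: the function names, the `CONSOLE` variable, the assumption that writing to `/dev/console` ends up in the OpenStack console log, and the choice to log the DONE marker only on success are all mine.

```shell
#!/bin/bash
# Wrap the real setup in the two markers proposed in the comment above.
# CONSOLE is where the markers go; /dev/console is an assumption about
# where the image routes its serial console output.
CONSOLE="${CONSOLE:-/dev/console}"

run_with_markers() {
    echo "OPENSTACK_USER_DATA_START" >> "$CONSOLE"
    "$@"
    local status=$?
    # Logged only on success, so a failed setup never looks finished
    [ "$status" -eq 0 ] && echo "OPENSTACK_USER_DATA_DONE" >> "$CONSOLE"
    return "$status"
}

# Reader side of the same rule: ready unless START appears without DONE.
user_data_finished() {
    local log="$1"
    grep -q "OPENSTACK_USER_DATA_START" "$log" || return 0
    grep -q "OPENSTACK_USER_DATA_DONE" "$log"
}
```

A log with no START marker counts as finished, which is what keeps existing scripts and images without cloud-init working unchanged.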

          Joshua Harlow added a comment -

          Agreed on the caveats, btw. I'm one of the cloud-init developers (one of the like three)!

          There is a feature folks have used called multi-part messages into cloud-init; this would allow u to combine user-provided scripts and Jenkins-provided ones. So that could handle having both scripts and ...

          http://cloudinit.readthedocs.io/en/latest/topics/format.html#mime-multi-part-archive

          As for the question: it's not running ahead of time, it's installed via a cloud-init script and then started by cloud-init as close to being done as it can be.

          The other option that I've seen (below in ansible), and I guess it's similar to what Jenkins is doing (when trying to set up the slave?), is to wait for SSH to get started, and if that never happens, assume the server is dead.

          https://github.com/kubespray/kargo-cli/blob/master/src/kargo/cloud.py#L382 (for example)

          It'd be nice to pick a strategy, or just use the existing Jenkins SSH retries and such (and/or make that more robust)?

          Joshua Harlow added a comment -

          So the other option IMHO is to just depend on JNLP or SSH as the detection mechanism (I don't know much about JNLP): if u can't connect to the server over those mechanisms after X period of time, the slave isn't going to work anyway, so u might as well declare the instance dead (because Jenkins won't be able to use it anyway if those don't work).

          Oliver Gondža added a comment -

          JNLP is easy, as the connection is initiated from the slave side. It is enough to start the agent from cloud-init once the setup is done.

          The SSH connection is what causes this problem: SSH starts responding before cloud-init completes, so jobs can observe an instance that is not fully set up. Which is something you can help us clarify: is there a reasonable way to tell cloud-init to reject connections until it is done (not like https://github.com/jenkinsci/acceptance-test-harness/blob/master/src/test/resources/openstack_plugin/cloud-init#L16)? That would narrow the problem down to properly waiting for the SSH connection to start working.
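
For the JNLP case, a user data script might be structured along these lines. This is a sketch under assumptions: the function name, the setup script path, the jar location, the Jenkins URL, and the secret are placeholders, and the `-jnlpUrl` and `-secret` options in the commented example are those of the Jenkins agent jar as I understand them, so check them against your Jenkins version.

```shell
#!/bin/bash
# Run the provisioning script first; start the agent only if it succeeds.
# Everything after the first argument is the agent launch command, so
# nothing connects to Jenkins before provisioning is finished.
launch_agent_when_ready() {
    local setup_script="$1"
    shift
    if ! bash "$setup_script"; then
        echo "provisioning failed, not starting the agent" >&2
        return 1
    fi
    "$@"
}

# Example call (all values are placeholders):
# launch_agent_when_ready /opt/setup/load-files.sh \
#     java -jar /opt/jenkins/slave.jar \
#     -jnlpUrl "https://jenkins.example.com/computer/$(hostname)/slave-agent.jnlp" \
#     -secret "<AGENT_SECRET>"
```

Because the node dials in to Jenkins itself, no polling or marker is needed on the Jenkins side in this setup.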

          Joshua Harlow added a comment -

          Let me see what I can dig up. Part of this feels like an out-of-order system daemon configuration (where sshd is starting up before cloud-init), which may just be a packaging issue. Let me look that over and get back to u.

          Joshua Harlow added a comment -

          Do u know what image u were trying this on? An image with systemd?

          Knowing that will help me figure out how to adjust the module order in cloud-init to better suit what u want to do.

          Donald Morton added a comment -

          I have run into this problem as well. It takes about 20 minutes for our init script to finish running. I was able to work around it by using the jenkins-cli.jar to reconnect the node at the end of the script. Here is an example:

          curl --insecure https://<JENKINS_URL>/jnlpJars/jenkins-cli.jar > jenkins-cli.jar 2>> ${LOGFILE}

          #Fix SSL Cert so jenkins-cli works properly
          openssl x509 -in <(openssl s_client -connect <JENKINS_URL>:443 -prexit 2>/dev/null) > jenkins.cer 2>> ${LOGFILE}
          /opt/java/jdk1.8.0_25/bin/keytool -keystore /opt/java/jdk1.8.0_25/jre/lib/security/cacerts -import -file jenkins.cer -alias <JENKINS_URL> -storepass changeit -noprompt >> ${LOGFILE} 2>&1

          #Tell Jenkins master to connect to the slave
          runuser -l cloud-user -c "/opt/java/jdk1.8.0_25/bin/java -jar jenkins-cli.jar -s https://<JENKINS_URL> connect-node `hostname`" >> ${LOGFILE} 2>&1

          Joshua Harlow added a comment -

          Right, I've got another way to do this also, where basically u create the user that is used to log in really late in the cloud-init process.

          Then the external SSH connection won't be possible until the final user adding has actually happened (which is after all the other stuff I want to occur has finished). It's not perfect, and as one of the cloud-init authors I can say this is a known thing; and since this plugin is not really tying itself tightly to cloud-init, the solutions become a little bit harder (especially, say, with Windows).

          Donald Morton added a comment -

          How do you control when cloud-init creates a user?

          Andrew Grimberg added a comment -

          d1morto, here's how we do it in our various environments.

          We pass a user-data script that does what is explained in the README in the following location.

          https://github.com/zephyrproject-rtos/ci-management/tree/master/jenkins-scripts

          I'll note we do this for several different Jenkins systems that we support.

          Joshua Harlow added a comment -

          Ya, that idea is similar to what I'm doing at godaddy as well.

          It's not ideal, but basically using a shell script to add the user (and making sure the user is added last) will at least ensure Jenkins can't log in until the rest of the user script is finished...

          Best I've seen so far (though IMHO far from the best it could be).

          Oliver Gondža added a comment -

          harlowja, no good news from cloud-init land?

          Andrew Grimberg added a comment -

          harlowja, it works pretty well actually. The one I linked is one of our simpler setups. Our most complex one is at:

          https://github.com/opendaylight/releng-builder/tree/master/jenkins-scripts

          Though I think the README is a little outdated, since we're using the jenkins-init-script.sh pattern that was mentioned in the previous one I linked.

          It gives us some extra flexibility in our instance management, so we don't have to be regularly respinning packer builds for small configuration tweaks.

          Oliver Gondža added a comment - - edited

          Here is an idea for implementing this in a way that is easy to integrate with the plugin and portable to all flavors of user data, including the case where there is none. The OpenStack plugin would define one more macro: a file path inside the remoteFs directory. The author of the user data script would write a value there depending on progress (running/completed/failed?), and the plugin would wait for it to say completed before it starts launching. The plugin can tell whether there is setup logic to wait for by detecting the variable in the user data (no matter what language it uses), and checking the file content in an OS-independent way should not be much of a problem either (though it would require intercepting the ssh-slaves launching logic a bit). This would work even when nodes cannot (easily) reach the master due to network restrictions or SSL verification.
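
The waiting half of this proposal could look roughly like the sketch below. It is illustrative only: the function name, the default timeout, and the return codes are assumptions, and the plugin would read the file over its SSH connection rather than from a local path as done here.

```shell
#!/bin/bash
# Poll a status file until it says "completed" or "failed", or until
# TIMEOUT seconds pass. The user data script is expected to write
# "running" at its start and one of the two final values at its end.
wait_for_status_file() {
    local file="$1" timeout="${2:-1800}"
    local deadline=$(( $(date +%s) + timeout ))
    while [ "$(date +%s)" -lt "$deadline" ]; do
        case "$(cat "$file" 2>/dev/null)" in
            completed) return 0 ;;
            failed)    return 1 ;;
        esac
        sleep 1
    done
    return 2    # still "running" or file missing when the deadline passed
}

# Node side (hypothetical path inside the remote FS root):
#   echo running   > /home/jenkins/.user-data-status
#   ... setup ...
#   echo completed > /home/jenkins/.user-data-status
```

Distinguishing "failed" from a timeout lets the caller discard a broken node immediately instead of waiting out the full deadline.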

          Andrew Grimberg added a comment -

          olivergondza, that sounds like folks would have to modify their current scripts (if they even have anything working currently). The phone home bit that was suggested would be a binary toggle (default false) that someone would set as an option in the template configuration.

          IIRC the JClouds plugin itself implements this pattern now. Doing so for the OpenStack plugin would leave enabling it to the discretion of the admin configuring the instance template.

          If you do add something like what you're proposing, I would hope that it's a toggle to enable the feature.

          Oliver Gondža added a comment -

          I do not quite see how you would toggle either of the two features from the plugin. As I understand it, the plugin would need to manipulate the cloud-init scripts on the fly with something like https://cloudinit.readthedocs.io/en/latest/topics/examples.html#call-a-url-when-finished? The jclouds plugin does not seem to do that, so users are required to turn it on by modifying their cloud-init scripts anyway. Am I missing something, tykeal?

          I still have a mild preference for a marker file over a ping URL. The latter would be more familiar to users, though...

          Andrew Grimberg added a comment -

          I would have to set up a test system with JClouds installed again, but IIRC the toggle was on a per-template basis, and enabling it would cause that particular template not to attempt a connection to the instance until it had received the callback to a "phone home" URL. How that phone home gets triggered is left up to the admin creating the actual image being launched, not the Jenkins system.

          Here's the info from the JClouds plugin: https://wiki.jenkins-ci.org/display/JENKINS/JClouds+Plugin#JCloudsPlugin-Usingthenewphone-homefeature

          You'll notice that they aren't saying that they're manipulating your cloud-init; they're stating that they delay the connection setup until the POST is received.

          Here's the webhook code for it: https://github.com/jenkinsci/jclouds-plugin/blob/master/jclouds-plugin/src/main/java/jenkins/plugins/jclouds/internal/PhoneHomeWebHook.java

          As to the latter being more familiar to users, I'm not so certain on that. Since most folks are likely using cloud-init, and the phone_home feature has been in there for a fairly long time now, it's just a matter of things being written to take advantage of it.

          Oliver Gondža added a comment -

          I thought so after reading the code. The thing is, the machine needs to initiate the ping, and there is no way for the plugin to get that arranged on its own. You need to change your cloud-init scripts or images.

          "As to the latter being more familiar to users. I'm not so certain on that. Since most folks are likely using cloud-init to work with, the phone_home feature has been in there for a fairly long time now it's just a matter of things being written to take advantage of it."

          So we seem to be on the same page here.
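
For images or scripts that do not use the #cloud-config phone_home directive, the machine-initiated ping could be approximated from a plain shell user data script. This is a hedged sketch: the function name, the retry count, and the callback URL are hypothetical, and posting a `hostname` field is an assumption about what the receiving end expects.

```shell
#!/bin/bash
# Last step of a user data script: POST the hostname to a callback URL,
# retrying a few times in case the network is not fully up yet.
phone_home() {
    local url="$1" tries="${2:-10}" attempt=1
    while [ "$attempt" -le "$tries" ]; do
        # --data-urlencode makes curl send a POST with a form-encoded body
        if curl --silent --fail --max-time 10 \
                --data-urlencode "hostname=$(hostname)" "$url" >/dev/null; then
            return 0
        fi
        attempt=$(( attempt + 1 ))
        sleep 1
    done
    return 1
}

# Example (hypothetical callback URL):
# phone_home "https://jenkins.example.com/phone-home-endpoint/" 10
```

As noted in the thread, this only works when the node can reach the Jenkins master and trusts its certificate.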

            olivergondza Oliver Gondža
            alskor Alex Java
            Votes: 3
            Watchers: 7