
Design (JEP) the Evergreen snapshotting data safety system

    • Evergreen - Milestone 1

      I need to explore the idea suggested by sag47 of using a Git repository on-disk for checking in .xml files before we run an upgrade process.

      In addition, the approach should roll back if there is a failure, such as Jenkins failing to come up properly.
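A minimal sketch of the idea above, assuming `git` is on the PATH and JENKINS_HOME is a plain directory. The function names, the commit identity, and the excluded directory names are all hypothetical choices made for illustration; none of them come from Evergreen itself.

```python
import subprocess
from pathlib import Path

# Assumption: directories we do not want to snapshot (illustrative only).
EXCLUDED_DIRS = {".git", "secrets", "workspace"}


def config_files(home):
    """Return the *.xml files under home (relative paths), skipping excluded dirs."""
    found = []
    for path in Path(home).rglob("*.xml"):
        rel = path.relative_to(home)
        if not EXCLUDED_DIRS.intersection(rel.parts):
            found.append(rel)
    return sorted(found)


def git(home, *args):
    """Run a git command inside home and return its trimmed stdout."""
    result = subprocess.run(
        ["git", "-C", str(home),
         "-c", "user.name=snapshot", "-c", "user.email=snapshot@localhost",
         *args],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def snapshot(home, message):
    """Commit the current XML configuration and return the commit id."""
    home = Path(home)
    if not (home / ".git").exists():
        git(home, "init")
    files = config_files(home)
    if files:
        git(home, "add", "--", *[str(f) for f in files])
    # --allow-empty keeps the call safe when nothing changed since last time.
    git(home, "commit", "--allow-empty", "-m", message)
    return git(home, "rev-parse", "HEAD")


def upgrade_with_rollback(home, upgrade, is_healthy):
    """Snapshot, run upgrade(), and restore the snapshot if Jenkins is unhealthy."""
    before = snapshot(home, "pre-upgrade snapshot")
    try:
        upgrade()
        ok = is_healthy()
    except Exception:
        ok = False
    if not ok:
        # Restores tracked files only; files newly created by the upgrade
        # stay on disk and would need separate cleanup.
        git(home, "reset", "--hard", before)
    return ok
```

`upgrade` and `is_healthy` are passed in as callables so that the snapshot logic stays independent of how the upgrade is performed and of how "Jenkins came up properly" is decided.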

          [JENKINS-49406] Design (JEP) the Evergreen snapshotting data safety system

          Sam Gleske added a comment - - edited

          Examples from my Jenkins RPM package:

          • preUninstall.sh script running dailycommit.sh to save a copy of the configuration before a package upgrade.
          • Example gitignore used for my JENKINS_HOME.
          • Contents of dailycommit.sh.

          That repository supports packaging Jenkins and plugins into multiple formats.

          ./gradlew buildRpm
          ./gradlew buildDeb
          ./gradlew buildTar
          #or package all three with ./gradlew packages

          #docker requires buildTar
          docker build -t jenkins .

          Additional notes

          • One of the challenges I discussed with rtyler was setting workspaces for jobs building on master outside of JENKINS_HOME. Otherwise, you encounter weird issues with Git repositories inside of other Git repositories when they're not submodules. In general, we know it's bad practice for people to build on the master, but it still gets done.
          • The gitignore file I linked intentionally does not track secret.key or the secrets directory. The intention here is that secrets get backed up separately from the encrypted configuration. However, this may not matter to some organizations.
          • Eventually, I want to completely rewrite the service scripts I copied from jenkins-packaging, mainly because I have a different style of Bash writing. I will propose my changes back.

          R. Tyler Croy added a comment -

          I'm going to assign this to batmat. Feel free to spin up some separate tickets as necessary to explore additional avenues of experimentation.

          I would expect that the end-result of the prototype/experiment phase would be a JEP document.


          Jesse Glick added a comment -

          For inspiration: etckeeper


          Jesse Glick added a comment -

          Also think carefully about compatibleSinceVersion.


          Baptiste Mathus added a comment -

          > Also think carefully about compatibleSinceVersion.

          jglick, to be honest I didn't plan anything specific using this metadata. Yes, we could probably do some optimizations on this front, for instance not reverting to the previous version if compatibleSinceVersion stayed the same. But as you also said yesterday, IIUC, I suspect this practice is not currently used often enough or carefully enough to be really usable automatically?

          But agreed, this might be something we can improve over time while defining the efforts and requirements a given plugin has to comply with to be able to enter the set of plugins delivered/used in Essentials. WDYT?

          (Should we rather take this to a dedicated thread on the ML, BTW? I plan one anyway, so maybe we'll get back to it there very soon.)

          Baptiste Mathus added a comment - As discussed yesterday, first draft submitted for review to the dev list: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/jenkinsci-dev/XdXuMFLXKPw (=> https://github.com/batmat/jep/pull/1 )

          Baptiste Mathus added a comment -

          We had a chat today with rarabaolaza and he raised a quite important point about something we might want to do, IMO: to reduce the risk of creating more things than necessary when Jenkins starts again after an upgrade, putting it in quiet start mode could help.

          Only once the evergreen client has performed the upgrade and judged Jenkins healthy would it automatically cancel quiet mode.

          R. Tyler Croy added a comment -

          That's an interesting idea batmat!

          I wonder if starting in quiet mode would result in us missing any potential errors? If not, then I say let's do it!


          Baptiste Mathus added a comment - - edited

          > I wonder if starting in quiet mode would result in us missing any potential errors? If not, then I say let's do it!

          Definitely. And jglick already had a similar comment reviewing https://github.com/batmat/jep/pull/1
          But I think it would still be interesting to triage the potential issue causes, with a slightly more progressive process.

          Roughly, would/could be:

          • set to start in quiet mode next time, and restart
          • check Jenkins is healthy [1]
          • if yes, cancel quiet [EDIT: or better, write some plugin that would *only* allow our smoke testing job, on the next bullet point, to run]
          • start some kind of smoke testing build
          • if success, then \o/, if not, roll back.

          [1] rtyler, about that: for a few days I have been thinking that we probably need a dedicated JIRA/JEP to design how "evergreen-client decides if Jenkins is healthy [enough] or not", i.e. whether to trigger a rollback or not. Do we have something like this? WDYT?
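The rough flow listed above can be sketched as a small driver. This is only an illustration under stated assumptions: every step is a hypothetical callable supplied by the caller, since the thread does not define how evergreen-client would restart Jenkins, judge its health, or run a smoke test.

```python
from enum import Enum


class Outcome(Enum):
    UPGRADED = "upgraded"
    ROLLED_BACK = "rolled back"


def progressive_upgrade(restart_quiet, is_healthy, cancel_quiet,
                        smoke_test, roll_back):
    """Run the staged flow; roll back as soon as one stage fails.

    All five arguments are hypothetical hooks provided by the caller.
    """
    restart_quiet()          # set quiet mode for next start, then restart
    if not is_healthy():     # first gate: is Jenkins healthy at all?
        roll_back()
        return Outcome.ROLLED_BACK
    cancel_quiet()           # let builds run again
    if not smoke_test():     # second gate: does a real build succeed?
        roll_back()
        return Outcome.ROLLED_BACK
    return Outcome.UPGRADED
```

Keeping the two gates separate makes it possible to tell a startup failure apart from a failure that only shows up once builds are allowed to run, which is the triage the comment asks for.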


          R. Tyler Croy added a comment -

          batmat, regarding a JEP for determining Jenkins healthiness for Jenkins Essentials, I think that's a good idea and will be a useful design document to discuss with the broader development community.

          Will you file a ticket for that and drop it into Milestone 1?


          Raul Arabaolaza added a comment -

          I fully agree also.

          Just for openness, and even if this has already been added to other sources, these are the meeting notes of my conversation with batmat yesterday:

          RAUL: This is intended for development time, not for deployment validation.
          The idea is: try an upgrade, test that everything works properly, perform a rollback, and test again that everything is working.

          BAPTISTE: We are likely to be able to reuse the “health check” logic that will have to be developed for evergreen-client itself in production, to check if Jenkins is running fine.
          RAUL: critical: we need to test the health check

          QUESTION: Should we try to implement synthetic transactions (ST) here or go with ATH, which already exists?

          PROPOSALS for rollback testing:

          • Make sure there is enough coverage that all possible rollback paths are covered
          • Create a quality bar for rollbacks
            • Make sure you are including some failing scenarios in the quality bar
            • Not only test the happy path, for example:
              • Make a failed upgrade, test that we are able to detect the upgrade as a failure, roll back, and test that the instance is working perfectly
              • Make a failed upgrade, test that we are able to detect the upgrade as a failure, make a failed rollback, and test that we are able to detect that the rollback failed
            • Make sure that in case of different chained rollback strategies we test each and every one of them
          • Create a healthcheck URL to be invoked via curl, for example
            • We can create a plugin that provides that healthcheck URL and integrate it with ST
            • Maybe some work from the metrics plugin can be reused

          Some possible testing flows:

          • Upgrade, run health check (ST), rollback, ST again, and ATH?
            • No work yet on ST that I am aware of, but ST can be reused later for deployment testing
          • Run ATH, rollback, ATH again
            • Some work already done, but ATH is maybe too heavy, and coverage is pretty poor and based on individual plugins, not on coherent sets of them
              This should be done in the “pre canary, staging, or whatever is named” instances because we want to catch any possible degradation or problems in long running instances
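To make the meeting notes more concrete, here is a hedged sketch of two of the ideas: polling a healthcheck URL, and classifying one upgrade/rollback test run. The function names and result labels are hypothetical, and the healthcheck URL would be whatever a future plugin exposes; nothing here reflects an existing Jenkins endpoint.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except OSError:
        return None  # connection refused, timeout, DNS failure, ...


def wait_until_healthy(url, attempts=30, delay=2.0,
                       fetch=fetch_status, sleep=time.sleep):
    """Poll url until it answers 200; give up after `attempts` tries."""
    for _ in range(attempts):
        if fetch(url) == 200:
            return True
        sleep(delay)
    return False


def rollback_scenario(upgrade, rollback, healthy):
    """Run one upgrade/rollback cycle and classify the result."""
    upgrade()
    if healthy():
        return "upgrade ok"
    rollback()
    if healthy():
        return "rolled back ok"
    return "rollback failed"
```

The `fetch` and `sleep` parameters exist so the polling loop can be tested without a running Jenkins, which matches the note that the health check itself needs to be tested. The three labels returned by `rollback_scenario` correspond to the happy path and the two failing scenarios listed in the proposals.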

          Baptiste Mathus added a comment -

          FTR, meeting added in the repo as we'll do for all of them in the future: https://github.com/jenkins-infra/evergreen/tree/master/docs/meetings/2018-03-18-JENKINS-49406-quality-bar

          Baptiste Mathus added a comment -

          Tried to start implementing the root separation by changing the "builds" and "workspace" directories as described in https://github.com/batmat/jep/blob/a3d70917b1095ee27c292c029593f79913ff186a/jep/302/README.adoc#segregate-job-configuration-and-build-data using CasC to also test/prototype this part of the proposal, but this proved impossible. See https://github.com/jenkinsci/configuration-as-code-plugin/issues/151

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Baptiste Mathus
          Path:
          jep/0000/README.adoc
          http://jenkins-ci.org/commit/jep/6773edbc06488de4c2fa7371f54c79df38672861
          Log:
          JENKINS-49406 Evergreen snapshotting data safety system JEP

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: R. Tyler Croy
          Path:
          jep/302/README.adoc
          jep/README.adoc
          http://jenkins-ci.org/commit/jep/949cbdb6bb2823a0a780e1005cf86a9b815f48b6
          Log:
          Merge pull request #67 from batmat/JENKINS-49406-JEP-submission

          JENKINS-49406 Evergreen snapshotting data safety system JEP

          Compare: https://github.com/jenkinsci/jep/compare/b5b57a9f1c93...949cbdb6bb28

          Baptiste Mathus added a comment -

          See JENKINS-50958 for usage of this specification

            batmat Baptiste Mathus
            rtyler R. Tyler Croy