[JENKINS-49406] Design (JEP) the Evergreen snapshotting data safety system

    • Evergreen - Milestone 1

      I need to explore the idea suggested by sag47 of using a Git repository on-disk for checking in .xml files before we run an upgrade process.

      In addition, the approach should roll back if there is a failure, such as Jenkins failing to come up properly.
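
      To make the idea concrete, here is a minimal sketch of snapshotting the .xml files into an on-disk Git repository before an upgrade, and restoring them afterwards. It is only an illustration under assumptions, not the design itself: the function names, the commit identity and the commit message are hypothetical, and it assumes the Jenkins home contains at least one .xml file and that the git CLI is installed.

```python
import subprocess
from pathlib import Path


def git(home, *args, run=subprocess.run):
    """Run one git command inside `home` and return its trimmed stdout."""
    result = run(["git", "-C", str(home), *args],
                 check=True, capture_output=True, text=True)
    return result.stdout.strip()


def snapshot(home, label, run=subprocess.run):
    """Commit every *.xml file under `home` and return the commit id."""
    if not (Path(home) / ".git").exists():
        git(home, "init", run=run)
    # The '*.xml' pathspec also matches files in subdirectories.
    git(home, "add", "--all", "--", "*.xml", run=run)
    # Hypothetical committer identity, so the commit does not depend on
    # the global git configuration of the machine.
    git(home, "-c", "user.name=evergreen",
        "-c", "user.email=evergreen@localhost",
        "commit", "--allow-empty", "-m", f"snapshot before {label}", run=run)
    return git(home, "rev-parse", "HEAD", run=run)


def rollback(home, commit, run=subprocess.run):
    """Restore every file recorded in `commit`; newer files are left alone."""
    git(home, "checkout", commit, "--", ".", run=run)
```

      A caller would keep the id returned by snapshot() and pass it to rollback() if Jenkins fails to come up after the upgrade. The `run` parameter exists only so the commands can be exercised without touching a real repository.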


          Baptiste Mathus added a comment -

          We had a chat today with rarabaolaza and he raised something quite important that we might want to do, IMO: to reduce the risk of creating more things than necessary when Jenkins starts again after an upgrade, putting it in quiet start mode could help.

          Only once the evergreen client has performed the upgrade and judged Jenkins healthy would it automatically cancel quiet mode.

          R. Tyler Croy added a comment -

          That's an interesting idea batmat!

          I wonder if starting in quiet mode would result in us missing any potential errors? If not, then I say let's do it!


          Baptiste Mathus added a comment - - edited

          I wonder if starting in quiet mode would result in us missing any potential errors? If not, then I say let's do it!

          Definitely. And jglick already had a similar comment reviewing https://github.com/batmat/jep/pull/1
          But I think it would still be interesting to triage the potential issue causes, with a slightly more progressive process.

          Roughly, the process would/could be:

          • set to start in quiet mode next time, and restart
          • check Jenkins is healthy [1]
          • if yes, cancel quiet [EDIT: or better, write some plugin that would *only* allow our smoke testing job, on the next bullet point, to run]
          • start some kind of smoke testing build
          • if success, then \o/, if not, roll back.

          [1] rtyler about that, for a few days I have been thinking that we probably need a dedicated JIRA/JEP to design how "evergreen-client decides if Jenkins is healthy [enough] or not", i.e. whether to trigger a rollback or not... Do we have something like this? WDYT?
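
          The progressive process listed above could be sketched roughly as follows. This is only an illustration under assumptions: every step is passed in as a callable with a hypothetical name, and rolling back when the health check fails is an assumption, since the list only spells out the rollback after a failed smoke test.

```python
from enum import Enum


class Outcome(Enum):
    UPGRADED = "upgraded"
    ROLLED_BACK = "rolled back"


def progressive_upgrade(restart_in_quiet_mode, is_healthy, cancel_quiet_mode,
                        run_smoke_test, roll_back):
    """Drive one upgrade attempt; all five arguments are hypothetical hooks."""
    # Step 1: set Jenkins to start in quiet mode next time, and restart.
    restart_in_quiet_mode()

    # Step 2: check Jenkins is healthy (assumption: unhealthy means rollback).
    if not is_healthy():
        roll_back()
        return Outcome.ROLLED_BACK

    # Step 3: healthy, so builds are allowed again.
    cancel_quiet_mode()

    # Steps 4 and 5: run a smoke testing build; roll back if it fails.
    if not run_smoke_test():
        roll_back()
        return Outcome.ROLLED_BACK

    return Outcome.UPGRADED
```

          Passing the steps as callables keeps the ordering logic separate from how quiet mode, the health check and the rollback end up being implemented, which is still open in this discussion.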


          R. Tyler Croy added a comment -

          batmat, regarding a JEP for determining Jenkins healthiness for Jenkins Essentials, I think that's a good idea and will be a useful design document to discuss with the broader development community.

          Will you file a ticket for that and drop it into Milestone 1?


          Raul Arabaolaza added a comment -

          I fully agree as well.

          For the sake of openness, and even though this has already been added to other sources, these are the meeting notes of my conversation with batmat yesterday:

          RAUL: This is intended for development time, not for deployment validation.
          The idea is: try an upgrade, test that everything works properly, perform a rollback, and test again that everything is working.

          BAPTISTE: We are likely to be able to reuse the “health check” logic that will have to be developed for evergreen-client itself in production, to check if Jenkins is running fine.
          RAUL: critical: we need to test the health check

          QUESTION: Should we try to implement synthetic transactions (ST) here or go with ATH, which already exists?

          PROPOSALS for rollback testing:

          • Make sure there is enough coverage that all possible rollback paths are covered
          • Create a quality bar for rollbacks
            • Make sure you are including some failing scenarios in the quality bar
            • Do not only test the happy path, for example:
              • Make a failed upgrade, test that we are able to detect the upgrade as a failure, roll back, and test that the instance is working perfectly
              • Make a failed upgrade, test that we are able to detect the upgrade as a failure, make a failed rollback, and test that we are able to detect that the rollback failed
            • Make sure that in case of different chained rollback strategies we test each and every one of them
          • Create a healthcheck URL to be invoked via curl, for example
            • We can create a plugin that provides that healthcheck URL and integrate it with ST
            • Maybe some work from the metrics plugin can be reused

          Some possible testing flows:

          • Upgrade, run health check (ST), rollback, ST again, and ATH?
            • No work yet on ST that I am aware of, but ST can later be reused for deployment testing
          • Run ATH, rollback, ATH again
            • Some work already done, but ATH is maybe too heavy, and its coverage is pretty poor and based on individual plugins rather than on coherent sets of them
              This should be done in the “pre canary, staging, or whatever it is named” instances, because we want to catch any possible degradation or problems in long-running instances
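
          As an illustration of the healthcheck URL proposal in the notes above, here is a minimal sketch of a probe that a client could call after an upgrade or a rollback. It is written under assumptions: no such endpoint is defined yet, so the URL in the usage note is hypothetical, and "healthy" is reduced here to "the URL answers with HTTP 200".

```python
import time
import urllib.request


def is_healthy(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True only if `url` answers with HTTP 200 within `timeout`."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts and HTTP error statuses
        # (urllib's URLError and HTTPError are subclasses of OSError).
        return False


def wait_until_healthy(url, attempts=30, delay=2.0,
                       probe=is_healthy, sleep=time.sleep):
    """Poll `url` up to `attempts` times, sleeping `delay` seconds between."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

          For example, wait_until_healthy("http://localhost:8080/health") would poll a hypothetical endpoint after a restart, and the same probe could be run once after the upgrade and once after the rollback, as in the testing flows above.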

          Baptiste Mathus added a comment -

          FTR, meeting notes added in the repo, as we'll do for all of them in the future: https://github.com/jenkins-infra/evergreen/tree/master/docs/meetings/2018-03-18-JENKINS-49406-quality-bar

          Baptiste Mathus added a comment -

          Tried to start implementing the root separation by changing the "builds" and "workspace" directories, as described in https://github.com/batmat/jep/blob/a3d70917b1095ee27c292c029593f79913ff186a/jep/302/README.adoc#segregate-job-configuration-and-build-data, using CasC to also test/prototype this part of the proposal, but this proved impossible. See https://github.com/jenkinsci/configuration-as-code-plugin/issues/151

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Baptiste Mathus
          Path:
          jep/0000/README.adoc
          http://jenkins-ci.org/commit/jep/6773edbc06488de4c2fa7371f54c79df38672861
          Log:
          JENKINS-49406 Evergreen snapshotting data safety system JEP

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: R. Tyler Croy
          Path:
          jep/302/README.adoc
          jep/README.adoc
          http://jenkins-ci.org/commit/jep/949cbdb6bb2823a0a780e1005cf86a9b815f48b6
          Log:
          Merge pull request #67 from batmat/JENKINS-49406-JEP-submission

          JENKINS-49406 Evergreen snapshotting data safety system JEP

          Compare: https://github.com/jenkinsci/jep/compare/b5b57a9f1c93...949cbdb6bb28

          Baptiste Mathus added a comment -

          See JENKINS-50958 for usage of this specification

            batmat Baptiste Mathus
            rtyler R. Tyler Croy