
    Details

    • Sprint:
      pannonian, iapetus

      Description

      The frequency of ATH failures has gone through the roof. All the failures I've seen look (unconfirmed) like flaky tests, since the failures differ when you run the same stuff twice.

      Analyse the failures and see if there are assertions, waits, etc. that we can make less flaky.


          Activity

          michaelneale Michael Neale added a comment - - edited

          I slightly changed that test, but the same thing just happens elsewhere that "waitForJobRun.." is called: it never gets the callback, so it subscribes and waits indefinitely.

          Cliff Meyers well, that is the point: that callback specifically waits for the SSE event before it fires. And it doesn't consistently fail to fire. Locally for me it mostly seems to work. Well, depends on the day of the week. Mood etc.

          michaelneale Michael Neale added a comment -

          Tom FENNELLY can we have a look at the event source stuff (if it is the cause of the recent badness) urgently? I can only work so many 16-hour days running the ATH for people by hand.

          michaelneale Michael Neale added a comment -

          So I spent way too long mucking around with things, and settled on this:

          https://github.com/jenkinsci/blueocean-acceptance-test/pull/117

          This works around the missing events (by just waiting for the result; optional, but it means it doesn't hang).
          Also, nightwatch has a retry option that is not the suite retry: http://nightwatchjs.org/guide#command-line-options
          This allows the flaky ones to be retried without the whole thing being retried.

          Thoughts?

          This has passed a few times in a row now.
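
          The "wait for the result so it doesn't hang" idea above can be sketched roughly as follows. This is only an illustration, not the code in the pull request: `waitForEventOrTimeout` and the `subscribe` callback are hypothetical names, and the real ATH helpers may look quite different. The point is that the wait settles either when the event arrives or when a bounded timeout expires, whichever comes first.

```javascript
// Hypothetical helper (not part of the ATH or nightwatch API).
// `subscribe` is assumed to take a listener and call it once the event arrives.
function waitForEventOrTimeout(subscribe, timeoutMs) {
  return new Promise((resolve) => {
    let settled = false;

    // Fallback: give up on the event after timeoutMs instead of hanging.
    const timer = setTimeout(() => {
      if (settled) return;
      settled = true;
      resolve({ timedOut: true, event: null });
    }, timeoutMs);

    // Normal path: the event arrived before the timeout.
    subscribe((event) => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      resolve({ timedOut: false, event: event });
    });
  });
}
```

          A caller could then fall back to checking the visible result (for example a plain CSS wait) when `timedOut` is true, rather than waiting indefinitely on a callback that may never fire.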

          tfennelly Tom FENNELLY added a comment -

          Made some fixes to the SSE client code: better handling of async SSE config errors and connection errors.

          Also did a fair bit of SSE load testing. Created a plugin with node scripts etc for this (so we can do it again later), allowing load testing in both the browser and with the headless SSE client (used by the ATH, which uses an EventSource polyfill). See https://github.com/jenkinsci/sse-gateway-plugin/tree/master/load-test. This did uncover the bugs mentioned above, but also seems to show that the SSE mechanism is solid in terms of not dropping/leaking messages under load over a long period (spanning reconnects with store and forward on the server etc).

          michaelneale Michael Neale added a comment - - edited

          Nice one - it has been much more stable. I am still not sold on the utility of using SSE events to know a job has finished (in theory, if we have assertions, it does tell us whether the problem is at the front end or the backend), since it makes for harder-to-diagnose failures vs a simple css wait failure.

          But it is also good that SSE is hardened, so this is great!

          The latest failure is due to a bad commit that caused a regression in a hidden feature that has a new test; it will be resolved shortly!


            People

            Assignee:
            tfennelly Tom FENNELLY
            Reporter:
            tfennelly Tom FENNELLY
            Votes:
            0
            Watchers:
            4

              Dates

              Created:
              Updated:
              Resolved: