JENKINS-42137

Don't fetch run data on an event if it is not already in the store

    • Type: Task
    • Resolution: Unresolved
    • Priority: Minor
    • Component: blueocean-plugin
    • None
    • b23

      This ticket started life as a bug, after we observed what was thought to be a regression.

      The Blue Ocean main pipeline screen (https://ci.blueocean.io/blue/pipelines) appears to be receiving events for running pipelines that are not visible, and then fetching data, creating more traffic than expected.

      On a regular basis, the backend sends events to the frontend detailing things that are running.

      To reproduce

      Expected behavior:

      • Should not fetch the run data if the job isn't even in the store.
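      In store terms, the expected guard might look something like the minimal sketch below. The handler shape and the store.hasRun / fetchRunDetails helpers are hypothetical, not the actual blueocean-core-js API; the event field names are taken from the payload under "Some details".

          // Hypothetical sketch of the expected guard; helper names are
          // illustrative, not the real blueocean-core-js API.
          function onJobRunEvent(event, store) {
              const pipeline = event.blueocean_job_pipeline_name; // e.g. "scratch/bwalding-alwaysBlue"
              const runId = event.jenkins_object_id;              // e.g. "2"

              // Only fetch if this run's pipeline is already tracked in the
              // store; otherwise drop the event without touching the backend.
              if (!store.hasRun(pipeline, runId)) {
                  return;
              }

              fetchRunDetails(event.blueocean_job_rest_url, runId)
                  .then(run => store.updateRun(pipeline, runId, run));
          }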

      Some details

      Event socket:
      https://ci.blueocean.io/sse-gateway/listen/jenkins-blueocean-core-js-1487311928759-80shtgg3w8q5hpelba9k9

      Event data:

      open  {"dispatcherId":"jenkins-blueocean-core-js-1487311928759-80shtgg3w8q5hpelba9k9","dispatcherInst":1238701885}  
      16:15:00.445
      job {"jenkins_object_type":"org.jenkinsci.plugins.workflow.job.WorkflowRun","jenkins_event_uuid":"6c1358f1-f887-41bb-beb9-3228dda99df8","sse_subs_dispatcher_inst":"1238701885","job_run_status":"SUCCESS","job_name":"scratch/bwalding-alwaysBlue","jenkins_org":"jenkins","job_run_queueId":"5246","jenkins_object_name":"#2","blueocean_job_rest_url":"/blue/rest/organizations/jenkins/pipelines/scratch/pipelines/bwalding-alwaysBlue/","jenkins_object_id":"2","jenkins_event":"job_run_ended","sse_subs_dispatcher":"jenkins-blueocean-core-js-1487311928759-80shtgg3w8q5hpelba9k9","blueocean_job_pipeline_name":"scratch/bwalding-alwaysBlue","jenkins_object_url":"job/scratch/job/bwalding-alwaysBlue/2/","jenkins_channel":"job"} 
      

      Once the frontend receives one of these events, it makes a call to retrieve details about the run:

      https://ci.blueocean.io/blue/rest/organizations/jenkins/pipelines/scratch/pipelines/bwalding-alwaysBlue/runs/2/
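      Note that the fetch URL is derivable directly from fields in the event payload above; roughly (the exact helper in core-js, if any, may differ):

          // Build the run-details REST URL from the SSE event payload shown above.
          // blueocean_job_rest_url: "/blue/rest/organizations/jenkins/pipelines/scratch/pipelines/bwalding-alwaysBlue/"
          // jenkins_object_id:      "2"
          function runDetailsUrl(event) {
              return `${event.blueocean_job_rest_url}runs/${event.jenkins_object_id}/`;
              // => "/blue/rest/organizations/jenkins/pipelines/scratch/pipelines/bwalding-alwaysBlue/runs/2/"
          }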

      However, the important thing to note is that the bwalding-alwaysBlue job is not visible on the front screen, so the frontend is querying the backend for details on jobs that will not change the state of the frontend (AFAIK).

      In a large instance that is running a lot of jobs all the time, this could be problematic.

          [JENKINS-42137] Don't fetch run data on an event if it is not already in the store

          Tom FENNELLY added a comment -

          Perhaps in future we could avoid the fetch if we know for sure it ain't needed

          That sounds perfectly reasonable, but in reality it is not easy to do because this is a SPA. If we receive an event about something that's in the data store (but not visible on the screen at that time), we need to decide how to handle that, i.e. update that data, invalidate it (so it will be retrieved again if it's needed after transitioning to a different route), etc. And invalidation of data probably isn't as clean/easy as you might think either, because it might mean you actually need to invalidate a lot more than just "that run" (for example) if the data is part of a collection. If you go down that road, we'd start seeing tickets about what appear to be unnecessary REST API calls to refetch data that "appears" to be getting invalidated for no good reason. There are lots of swings and roundabouts, tradeoffs, etc. It's rarely as clear-cut as it seems.
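          As a rough illustration of that trade-off, assuming a hypothetical store API (none of these calls are the actual Blue Ocean implementation):

              // Two ways an SSE handler could treat a store entry that exists
              // but isn't currently rendered. All store calls here are hypothetical.
              function handleRunEvent(event, store) {
                  const key = `${event.blueocean_job_pipeline_name}/${event.jenkins_object_id}`;

                  if (!store.has(key)) {
                      return; // nothing cached, nothing to do
                  }

                  // Option 1: eagerly refetch and update in place
                  // (costs an extra REST call right now).
                  // fetchRunDetails(event).then(run => store.set(key, run));

                  // Option 2: invalidate and refetch lazily on the next route
                  // that needs it. As noted above, this can cascade: if the run
                  // sits inside a cached activity list, the whole collection
                  // has to be invalidated too.
                  store.invalidate(key);
                  store.invalidateCollectionsContaining(key); // hypothetical cascade
              }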


          Ben Walding added a comment -

          I flagged this because in a small instance you will barely notice it.

          But it is making one request per running job every cycle (30s), so on a big instance this might start to be a real issue: e.g. 30 concurrent jobs with 30 users is 900 requests per cycle, i.e. 1800 requests/min (on top of the regular workload).

          (I don't have to worry about how to solve it!)


          Tom FENNELLY added a comment -

          bwalding you're dead right to flag it. Thanks for keeping an eye on these things Ben !!!! As I was saying to michaelneale, one small change in the wrong place could easily lead to disaster, so these things are something we need to keep a close eye on.


          Michael Neale added a comment -

          We had a discussion on this, and agreed that ideally it would check the store first to see if there is any interest in the event; since it is likely dropping the data on the floor anyway, the fetch should be able to be eliminated.


          Michael Neale added a comment -

          bumping this back to "critical" but it is really just "major". I think this is worth looking at as it may impact some users.


          Tom FENNELLY added a comment -

          https://github.com/jenkinsci/blueocean-plugin/pull/867

          Tom FENNELLY added a comment -

          This is a bit of a rat-hole. Talked with michaelneale and we decided to punt on trying to perform this optimisation for now.

          PR 867 "seemed" to do the trick, but the ATH uncovered a scenario where the fetch should still take place. If the RunDetails page is cold loaded for a run instance (e.g. directly from a link), the page loads the pipeline instance, steps data and a load of other stuff, but doesn't use the ActivityService or create a Pager instance. If it's a failed run and the user clicks on the Re-Run button a build will be triggered, causing SSE events, which in turn cause the DefaultSSEHandler to attempt to get the Pager instance from the PagerService etc etc.

          I spent some time studying all this and kinda got it working, but then ran into other issues with other components. Bottom line: I think this is something we should come back to post 1.0, because it seems like some other things need to be straightened out in order to make it work, and I don't think it's worth taking the risk of breaking things at this point.
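          A sketch of why the naive guard breaks in that cold-load scenario (DefaultSSEHandler, PagerService and Pager are the names used above; the guard logic itself is illustrative, not the actual PR 867 change):

              // Illustrative guard in the style of the handler described above.
              function defaultSSEHandler(event, pagerService) {
                  const pager = pagerService.getPager(/* key derived from event */);

                  if (!pager) {
                      // Naive optimisation: no Pager registered, so drop the event.
                      // But a cold-loaded RunDetails page never creates a Pager,
                      // so the SSE events triggered by its own Re-Run button would
                      // be dropped here even though that page needs them.
                      return;
                  }
                  // ...otherwise update the pager's data as before...
              }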


          Nick Campbell added a comment -

          Hey tfennelly - thanks for all of the background and context on this issue. I oversee the Developer Platform for a large technology company, and we currently have a centrally managed Jenkins deployment used by our entire developer community (roughly 2,000 engineers).

          Given the large number of concurrent sessions that our instance typically has open, this bug has been contributing to excessively high CPU utilization of the Jenkins master process, often to the point where the instance (and the physical box it runs on) grinds to a halt.

          Is there any estimate as to when this ticket will be assigned? Alternatively, is there a tactical work around? Given the impact, we would even consider disabling live status updates altogether if it is possible to do so.

          Thanks in advance for any and all replies. 


          Michael Neale added a comment -

          nbc204 interesting - this issue was parked as it was seen as a kind of "premature optimisation". We suspected it could be a problem but lacked the validation, so perhaps this is worth looking into now.

          I will cc vivek as he may consider scheduling this at some point (contributions always welcome; this will be a bit fiddly to test). There were some other performance fixes recently, so there may be appetite to pick this up. What do you think, vivek?


          Michał Woś added a comment -

          Not sure if related: https://issues.jenkins.io/browse/JENKINS-65060

            Assignee: tfennelly Tom FENNELLY
            Reporter: bwalding Ben Walding
            Votes: 0
            Watchers: 5