Jenkins / JENKINS-68116

Slow processing of multi branch events


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved

    Description

      I'm seeing delays of about an hour from when a multi branch event is created in GitHub to when Jenkins actually processes it.

      If I run this in the script console:

      jenkins.scm.api.SCMEvent.executorService
      

      I see:

      java.util.concurrent.ScheduledThreadPoolExecutor@8b0291b[Running, pool size = 10, active threads = 10, queued tasks = 2401, completed tasks = 12838]
      

      Relevant code appears to be around:
      https://github.com/jenkinsci/branch-api-plugin/blob/master/src/main/java/jenkins/branch/MultiBranchProject.java#L1179-L1198
      and

      https://github.com/jenkinsci/branch-api-plugin/blob/master/src/main/java/jenkins/branch/MultiBranchProject.java#L1385

      Running:

      Jenkins.get().getAllItems(jenkins.branch.MultiBranchProject.class).size()
      

      Gives:
      579 multi branch projects.

      From what I can see in the /var/jenkins_home/jenkins.branch.* logs:

      Events seem to be processed in 0-1 seconds each, but doing the maths on events processed over an hour, we were only handling 22 a minute, so some must be quite slow.
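Putting the two figures above together (2401 queued tasks from the executor dump, ~22 events processed per minute) gives a rough drain-time estimate consistent with the ~1 hour delays. A back-of-the-envelope sketch:

```java
// Rough backlog estimate from the numbers reported above:
// 2401 queued tasks, ~22 events processed per minute.
public class BacklogEstimate {
    public static void main(String[] args) {
        int queued = 2401;          // queued tasks seen in the executor dump
        double perMinute = 22;      // observed processing rate
        double minutes = queued / perMinute;
        // prints: Estimated drain time: 109 minutes (~1.8 hours)
        System.out.printf("Estimated drain time: %.0f minutes (~%.1f hours)%n",
                minutes, minutes / 60.0);
    }
}
```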

      We only get to keep about 15 minutes' worth of logs, as the file max size is set to 33 kilobytes for some reason:
      https://github.com/jenkinsci/branch-api-plugin/blob/master/src/main/java/jenkins/branch/MultiBranchProject.java#L1138-L1142

      Context:
      Large organisation with 1.7K repositories, and an organisation webhook pointing at Jenkins.

      A few ideas so far:
      1. Throw more threads at it

      https://github.com/jenkinsci/scm-api-plugin/blob/master/src/main/java/jenkins/scm/api/SCMEvent.java#L215

      Is hardcoded to 10 threads at a time

      2. Keep logs for longer

      3. See if there's something specific that could be holding this up?

      4. Global configuration to filter projects out of webhook processing? We have lots that don't need to be processed and will never be matched
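Idea 1 could be sketched as making the hardcoded pool size overridable via a system property. This is only a sketch: the property name below is an assumption, not an existing scm-api option.

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;

// Sketch of idea 1: size the event executor from a system property instead
// of the hardcoded 10. "scm.event.threads" is a hypothetical property name.
public class EventPool {
    static ScheduledThreadPoolExecutor createExecutor() {
        int threads = Integer.getInteger("scm.event.threads", 10);
        return new ScheduledThreadPoolExecutor(threads);
    }

    public static void main(String[] args) {
        ScheduledThreadPoolExecutor executor = createExecutor();
        System.out.println("pool size: " + executor.getCorePoolSize());
        executor.shutdown();
    }
}
```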

      Attachments

        1. events-trace.txt
          53 kB
        2. events-trace-90s.txt
          53 kB
        3. github-api-get-repos-sorted.txt
          0.4 kB
        4. over-10-seconds.txt
          7 kB
        5. over-10-seconds-29-3.txt
          19 kB
        6. over-1-second.txt
          21 kB
        7. threads.6.20220325093536.txt
          667 kB
        8. threads.6.20220325093550.txt
          670 kB
        9. threads.6.20220325093602.txt
          642 kB
        10. threads.6.20220325093615.txt
          642 kB
        11. threads.6.20220325093629.txt
          678 kB
        12. threads.6.20220325093641.txt
          689 kB
        13. threads.6.20220325093655.txt
          716 kB
        14. threads.6.20220325093706.txt
          696 kB
        15. threads.6.20220325093720.txt
          704 kB
        16. threads.6.20220325093732.txt
          698 kB
        17. threads.6.20220325093745.txt
          662 kB
        18. threads.6.20220325093759.txt
          671 kB

        Activity

          timja Tim Jacomb added a comment (edited)

          over-10-seconds-29-3.txt

          teilo any suggestions on what else to do / look at? If you see the attachment, there are lots that still take 30-60s to process.

          If I check the multi branch events for the repositories that were matched, they all seem to get processed in less than 1 second.

          I changed some of our orgs to topics but it doesn't look to have helped much, although some of the events that took a long time seem to be not so bad as they don't match anymore.

          I'm wondering (without knowing fully how this area works) if not all the jobs are held in memory so there's a lot of disk access required?

          (all the repositories are public in the attachment)

          timja Tim Jacomb added a comment

          events-trace.txt

          I added an ID to the logs so I could grep for all related logs.
          Attached are an example of a 30-second one and a 90-second one:

          events-trace-90s.txt

          timja Tim Jacomb added a comment

          At this point I think that Jenkins (scm-api + branch-api + github-branch-source) just doesn't scale with a large number of organization folders and using multi branch events.

          I mentioned earlier that we have 579 multi branch projects; those are sub-folders under organization folders, of which we have 68.

          Basically each team has 2 folders, one for regular builds and one for nightly builds,
          which works around the limitation of not being able to have multiple pipelines from one repo when using org folders...

          Each team having their own folder is really historical from before we used GitHub app credentials and had to spread the rate limit across multiple users.
          So I can likely collapse those down.

          I've debugged through the event processing and I can see that there's a minimum of ~250ms per organization folder due to each folder validating that the repo from the event exists and retrieving details from it here: https://github.com/jenkinsci/github-branch-source-plugin/blob/18d333ef73792f92b1ebc4d8e42ded8fa070fc50/src/main/java/org/jenkinsci/plugins/github_branch_source/GitHubSCMNavigator.java#L1390
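The ~250ms-per-folder measurement implies a hard serial floor: with 68 organization folders each validating the repository one after another, a single event cannot complete in less than roughly 17 seconds.

```java
// Serial floor implied by the measurement above: 68 organization folders,
// each spending ~250 ms validating the repository from the event.
public class SerialFloor {
    public static void main(String[] args) {
        int folders = 68;
        int perFolderMs = 250;
        // prints: 17.0 s serial minimum per event
        System.out.printf("%.1f s serial minimum per event%n",
                folders * perFolderMs / 1000.0);
    }
}
```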

          Event timings were produced by running this in github-branch-source:

          for (var i = 0; i < 100; i++) {
              long now = System.currentTimeMillis();
              
              GHOrganization org1 = getGhOrganization(github);
              GHRepository repo = org1.getRepository(sourceName);
              long end = System.currentTimeMillis();
          
              System.out.println(end - now);
              Thread.sleep(100L);
          }
          

          This is a sorted list with response times over 100 attempts:
          github-api-get-repos-sorted.txt

          $  cat github-api-get-repos-sorted.txt | jq -s '{minimum:min,maximum:max,average:(add/length),median:(sort|if length%2==1 then.[length/2|floor]else[.[length/2-1,length/2]]|add/2 end)}'
          {
            "minimum": 215,
            "maximum": 553,
            "average": 251.18,
            "median": 241
          }
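The jq one-liner can also be mirrored in plain Java, which makes the even-length median interpolation explicit. The sample values here are hypothetical stand-ins for the contents of github-api-get-repos-sorted.txt.

```java
import java.util.Arrays;

// Recomputes the same summary as the jq filter above (min, max, average,
// median with even-length interpolation). Sample values are placeholders.
public class TimingStats {
    static double median(long[] sorted) {
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2]
                          : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        long[] samples = {215, 230, 241, 260, 553};
        Arrays.sort(samples);
        System.out.println("min=" + samples[0]
                + " max=" + samples[samples.length - 1]
                + " avg=" + Arrays.stream(samples).average().orElse(0)
                + " median=" + median(samples));
    }
}
```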
          
          timja Tim Jacomb added a comment (edited)

          There's some time that still goes walkabout.
          A large number of events take 30s to process, which is 441ms per folder.

          And in some cases 60 seconds randomly disappears between folders; I haven't found anything pointing at why yet.

          Ideas currently:

          1. The event thread pool should be at least as large as the number of organization folders.
          2. Possibly process an event across all folders in parallel so that there's not a large delay in scheduling? Rather than submitting one task for it that iterates across all folders?
          3. Reduce number of folders
          4. Event filtering to drop events that will never match (I've added this to our reverse proxy monitoring for now to see if that helps at all)
          5. Use a quicker API than `getRepository`, unfortunately github-api doesn't support graphql https://github.com/hub4j/github-api/issues/521
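Idea 2 above could look something like the sketch below: instead of one queued task that visits every folder serially, fan one event out as one task per folder so a slow folder doesn't delay the rest. Folder names and the 250 ms sleep are stand-ins for the real per-folder repository validation.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of idea 2: process one event across all folders in parallel,
// rather than iterating the folders inside a single queued task.
public class ParallelFanOut {
    static List<String> processAll(List<String> folders) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(folders.size());
        try {
            return pool.invokeAll(folders.stream()
                    .map(f -> (Callable<String>) () -> {
                        Thread.sleep(250); // simulated repo lookup (~250 ms observed)
                        return f + ": processed";
                    }).toList())
                .stream().map(fut -> {
                    try { return fut.get(); }
                    catch (Exception e) { throw new RuntimeException(e); }
                }).toList();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Total wall time is ~250 ms rather than folders * 250 ms.
        processAll(List.of("team-a", "team-b", "team-c")).forEach(System.out::println);
    }
}
```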

          cc bitwiseman in case you have any input.

          teilo I assume you don't have anywhere near the number of org folders we're using

          basil Basil Crow added a comment

          When asking the question "where did the time go?" my weapon of choice is CPU flame graphs. With Jenkins I tend to use jvm-profiling-tools/async-profiler.


          People

            Assignee: Unassigned
            Reporter: timja Tim Jacomb
            Votes: 0
            Watchers: 4