• Type: Bug
    • Resolution: Unresolved
    • Priority: Minor

      I'm seeing delays of about an hour from when a multi branch event is created in GitHub to when Jenkins actually processes it.

      If I run the following in the script console:

      jenkins.scm.api.SCMEvent.executorService
      

      I see:

      java.util.concurrent.ScheduledThreadPoolExecutor@8b0291b[Running, pool size = 10, active threads = 10, queued tasks = 2401, completed tasks = 12838]
      
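      The fields in that toString can also be read directly, which makes it easier to watch the backlog drain over time; a minimal script console sketch (assuming the executor is reachable exactly as above):

      def es = jenkins.scm.api.SCMEvent.executorService
      println "queued=${es.queue.size()}, active=${es.activeCount}, completed=${es.completedTaskCount}"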

      Relevant code appears to be around:
      https://github.com/jenkinsci/branch-api-plugin/blob/master/src/main/java/jenkins/branch/MultiBranchProject.java#L1179-L1198
      and

      https://github.com/jenkinsci/branch-api-plugin/blob/master/src/main/java/jenkins/branch/MultiBranchProject.java#L1385

      Running:

      Jenkins.get().getAllItems(jenkins.branch.MultiBranchProject.class).size()
      

      Gives:
      579 multi branch projects.

      From what I can see in the /var/jenkins_home/jenkins.branch.* logs, events seem to be processed in 0-1 seconds, but doing the maths on events processed over an hour, we were only getting through 22 a minute, so some must be quite slow (with 10 threads, 22 events a minute works out to roughly 27 seconds per event on average).

      We only get to keep about 15 minutes' worth of logs, as the log file max size is set to 33 kilobytes for some reason:
      https://github.com/jenkinsci/branch-api-plugin/blob/master/src/main/java/jenkins/branch/MultiBranchProject.java#L1138-L1142

      Context:
      Large organisation with 1.7K repositories, and an organisation webhook pointing at Jenkins.

      A few ideas so far:
      1. Throw more threads at it (see the sketch after this list).

      https://github.com/jenkinsci/scm-api-plugin/blob/master/src/main/java/jenkins/scm/api/SCMEvent.java#L215

      This is hardcoded to 10 threads at a time.

      2. Keep logs for longer

      3. See if there's something specific that could be holding this up?

      4. Global configuration to filter projects out of webhook processing? We have lots that don't need to be processed and will never be matched.
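
      For idea 1, a minimal script console sketch of what bumping the pool could look like (assuming the executor is reachable as shown above; 30 is an illustrative value, and the change would not survive a restart):

      // ScheduledThreadPoolExecutor sizes its worker pool by core pool size,
      // so raising that is enough to get more event-processing threads
      jenkins.scm.api.SCMEvent.executorService.corePoolSize = 30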

        1. events-trace.txt
          53 kB
        2. events-trace-90s.txt
          53 kB
        3. github-api-get-repos-sorted.txt
          0.4 kB
        4. over-10-seconds.txt
          7 kB
        5. over-10-seconds-29-3.txt
          19 kB
        6. over-1-second.txt
          21 kB
        7. threads.6.20220325093536.txt
          667 kB
        8. threads.6.20220325093550.txt
          670 kB
        9. threads.6.20220325093602.txt
          642 kB
        10. threads.6.20220325093615.txt
          642 kB
        11. threads.6.20220325093629.txt
          678 kB
        12. threads.6.20220325093641.txt
          689 kB
        13. threads.6.20220325093655.txt
          716 kB
        14. threads.6.20220325093706.txt
          696 kB
        15. threads.6.20220325093720.txt
          704 kB
        16. threads.6.20220325093732.txt
          698 kB
        17. threads.6.20220325093745.txt
          662 kB
        18. threads.6.20220325093759.txt
          671 kB

          [JENKINS-68116] Slow processing of multi branch events

          Tim Jacomb added a comment -

          GitHub app is enabled. I can upload the logs but I didn't see anything useful there. It's been running for 1.5 hours with 30 threads and has nothing queued anymore.


          James Nord added a comment - - edited

          In that case, off the top of my head I am not sure.
          FWIW we have a similarly sized org (but more controllers); one controller has 120+ MBP jobs, and we have not observed any issues.

          The caching threadpool does not help here, as the CPU / alive time does not tell me whether a request has been stuck waiting for GH to respond for 30 seconds or 1ms...

          Have you only just started seeing this, or is it sporadic? GitHub has been flaky recently - even if their status page says otherwise.


          Tim Jacomb added a comment -

          > Have you only just started seeing this, is it sporadic? GitHub has been flaky recently - even if their status page says otherwise

          14th February was when it was first reported as far as I can tell.


          Tim Jacomb added a comment - - edited

          Attached event logs after deploying https://github.com/jenkinsci/branch-api-plugin/pull/304

          This is only 25 mins of logs or so

          These are the events over 1 second and over 10 seconds.

          All the ones that take over 10 seconds are ones that would never match any repository on this controller.

          Possibly better regex filters on org folders could mean less looking up, I guess.
          A 'Global configuration to filter projects out of webhook processing' would also solve it, by the looks of it.

          I'll try filtering the projects on the org folders on Monday.

          over-1-second.txt over-10-seconds.txt


          James Nord added a comment -

          > Possibly better regex filters on org folders could means less looking up I guess.
          >Also 'Global configuration to filter projects out of webhook processing' would also solve it by the looks of it

          Ahh - we use "Filter by repository topics" - that means maintaining the ugly regex is not needed, and we can add the topic to any repository without having to change the Jenkins config.


          Tim Jacomb added a comment - - edited

          over-10-seconds-29-3.txt

          teilo, any suggestions on what else to do / look at? If you see the attachment, there are lots that still take 30-60s to process.

          If I check the multi branch events for the repositories that were matched, they all seem to get processed in less than 1 second.

          I changed some of our orgs to topics but it doesn't look to have helped much, although some of the events that took a long time are not so bad now, as they no longer match.

          I'm wondering (without knowing fully how this area works) whether not all the jobs are held in memory, so a lot of disk access is required?

          (all the repositories are public in the attachment)


          Tim Jacomb added a comment -

          events-trace.txt

          I added an ID to the logs so I could grep for all related logs.
          Attached are an example of a 30 second one and a 90 second one.

          events-trace-90s.txt


          Tim Jacomb added a comment -

          At this point I think that Jenkins (scm-api + branch-api + github-branch-source) just doesn't scale with a large number of organization folders when using multi branch events.

          I mentioned earlier that we have 579 multi branch projects; those are sub-folders under organization folders, of which we have 68.

          Basically each team has 2 folders, one for regular builds and one for nightly builds, which works around the limitation of not being able to have multiple pipelines from one repo when using org folders...

          Each team having their own folder is really historical from before we used GitHub app credentials and had to spread the rate limit across multiple users.
          So I can likely collapse those down.

          I've debugged through the event processing and I can see that there's a minimum of ~250ms per organization folder due to each folder validating that the repo from the event exists and retrieving details from it here: https://github.com/jenkinsci/github-branch-source-plugin/blob/18d333ef73792f92b1ebc4d8e42ded8fa070fc50/src/main/java/org/jenkinsci/plugins/github_branch_source/GitHubSCMNavigator.java#L1390

          Event timings were produced by running this in github-branch-source:

          // Time 100 iterations of the lookup each organization folder does when
          // validating an event: fetch the GHOrganization, then the GHRepository.
          for (var i = 0; i < 100; i++) {
              long now = System.currentTimeMillis();

              GHOrganization org1 = getGhOrganization(github);
              GHRepository repo = org1.getRepository(sourceName);
              long end = System.currentTimeMillis();

              // elapsed milliseconds for one organization + repository lookup
              System.out.println(end - now);
              Thread.sleep(100L);
          }
          

          This is a sorted list with response times over 100 attempts:
          github-api-get-repos-sorted.txt

          $  cat github-api-get-repos-sorted.txt | jq -s '{minimum:min,maximum:max,average:(add/length),median:(sort|if length%2==1 then.[length/2|floor]else[.[length/2-1,length/2]]|add/2 end)}'
          {
            "minimum": 215,
            "maximum": 553,
            "average": 251.18,
            "median": 241
          }
          


          Tim Jacomb added a comment - - edited

          There's some time that still goes walkabout.
          A large number of events take 30s to process, which works out to ~441ms per folder (30,000ms across 68 folders).

          And in some cases 60 seconds randomly disappears between folders; I haven't found anything pointing at why yet.

          Ideas currently:

          1. The event queue should be at least as large as the number of organization folders.
          2. Possibly process an event across all folders in parallel, rather than submitting one task that iterates across all folders, so there isn't a large scheduling delay (see the sketch after this list).
          3. Reduce the number of folders.
          4. Event filtering to drop events that will never match (I've currently added this to our reverse proxy monitoring to see if that helps at all).
          5. Use a quicker API than `getRepository`; unfortunately github-api doesn't support GraphQL: https://github.com/hub4j/github-api/issues/521
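
          For idea 2, a rough script console sketch of the fan-out shape (illustrative only, not the plugin's actual dispatch code; the pool size and the work inside the closure are placeholders):

          import java.util.concurrent.Executors

          def folders = Jenkins.get().getAllItems(jenkins.branch.OrganizationFolder)
          def pool = Executors.newFixedThreadPool(16)
          folders.each { folder ->
              pool.submit {
                  // hypothetical per-folder event handling would go here, so one slow
                  // GitHub lookup no longer delays every other folder queued behind it
                  println "would dispatch event to ${folder.fullName}"
              }
          }
          pool.shutdown()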

          cc bitwiseman in case you have any input.

          teilo, I assume you don't have anywhere near the number of org folders we're using.


          Basil Crow added a comment -

          When asking the question "where did the time go?" my weapon of choice is CPU flame graphs. With Jenkins I tend to use jvm-profiling-tools/async-profiler.


            Assignee: Unassigned
            Reporter: Tim Jacomb (timja)
            Votes: 0
            Watchers: 5