Plugin version: 2.11.4

      Jenkins Version: 2.319.2

      We have topics filtering enabled. 

      We are using GitHub App authentication.

       

      On each scan, the plugin discovers only 100 (seemingly random) repositories. Some of the repositories with the configured topic are not discovered. During the next run, some of them are rediscovered, while other random repos are marked as orphaned.

       Example scan log (here 1 of the 100 discovered repositories is archived, hence 99 processed):

      Looking up repositories of organization ...
      Looking up repositories for topics: ...
      
      ...
      
      99 repositories were processed
      

      There are no other scan-related entries in the Jenkins logs, apart from the scan finishing.

       

      The previous plugin version did not exhibit this behavior.

       

          [JENKINS-67597] Plugin processing only 100 repositories

          Joseph Petersen added a comment - - edited

          Could this be related to the page size not being set when searching, and the scan running out of requests due to rate limiting?

          How many repos would you expect to be returned from the query?

          You could test it using the GitHub CLI:

          gh api -X GET search/repositories -f q="org:jenkinsci topic:jenkins-api-plugin" -f per_page="100"
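
          To see how many repositories the query actually matches, and whether GitHub flags the result set as incomplete, the same search can be reduced to just those two fields with the CLI's built-in --jq filter. A minimal sketch, assuming the gh CLI and a placeholder org/topic:

          # Count the repositories matched by the search and check the incomplete_results flag.
          # "myorg" and "mytopic" are placeholders; substitute your own organization and topic.
          gh api -X GET search/repositories -f q="org:myorg topic:mytopic" -f per_page="100" \
            --jq '{total_count: .total_count, incomplete_results: .incomplete_results}'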


          Joy Arackal added a comment - - edited

          We got 363 repositories from the query you shared (modified for our org and topic). Of those, 15 are archived, so a total of 348 should be processed, I believe. The org scan is processing 349 (per the logs), and this number is consistent in every run.

          However, the number of orphaned items varies every time, e.g. in one run we had 21 items and in another there were 31. The orphaned repos do have the correct topic and meet all the criteria to be processed, which is probably why some of them are randomly indexed back in subsequent runs (while a few others are randomly orphaned).

          Since the number of processed repos is consistent in every run, maybe we are not hitting any limits with the search query? Perhaps some rate limit when querying individual repos after that, or some issue while parsing the results?
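
          One way we could try to rule out rate limiting is to watch the remaining quota for both the search and core endpoints while a scan runs. A quick check, assuming the gh CLI is available:

          # Show the remaining quota for the Search API and the core REST API.
          # Run this while an organization scan is in progress to see whether either pool is exhausted.
          gh api rate_limit --jq '{search: .resources.search.remaining, core: .resources.core.remaining}'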


          Rob Hamilton added a comment -

          Just adding my view on this. We have found the GitHub Search API unreliable when querying against topics. Topics tend to be a "global" thing in GitHub, and we found jobs would be enabled/disabled depending on the latest scan.

          So that we could still use topics, we created a workaround using Job DSL that called the GitHub GraphQL API with the following query in order to dynamically create our own jobs (see the pagination sketch after the query):

          query Query($organization: String!, $cursor: String) {
            organization(login: $organization) {
              repositories(first: 100, after: $cursor) {
                pageInfo {
                  hasNextPage
                  endCursor
                }
                nodes {
                  name
                  isArchived
                  repositoryTopics(first: 10) {
                    nodes {
                      topic {
                        name
                      }
                    }
                  }
                }
              }
            }
          }
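
          For reference, a minimal shell sketch (not our actual Job DSL code) of how a query like this can be paginated with the GitHub CLI and jq. The org name and topic are placeholders, and query.graphql is assumed to contain the query above:

          #!/bin/bash
          # Paginate the GraphQL query above and print non-archived repositories carrying a given topic.
          # "myorg" and "mytopic" are placeholders; assumes gh and jq are on the PATH.
          query=$(cat query.graphql)
          cursor=""
          while true
          do
              if [ -z "$cursor" ]
              then
                  page=$(gh api graphql -f organization=myorg -f query="$query")
              else
                  page=$(gh api graphql -f organization=myorg -f cursor="$cursor" -f query="$query")
              fi
              # Keep only non-archived repositories that carry the requested topic.
              echo "$page" | jq -r '.data.organization.repositories.nodes[]
                  | select(.isArchived | not)
                  | select(any(.repositoryTopics.nodes[]; .topic.name == "mytopic"))
                  | .name'
              [ "$(echo "$page" | jq -r '.data.organization.repositories.pageInfo.hasNextPage')" = "true" ] || break
              cursor=$(echo "$page" | jq -r '.data.organization.repositories.pageInfo.endCursor')
          done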


          Allan BURDAJEWICZ added a comment -

          Indeed, this seems to be a problem with the GitHub Search API. Using a test job like the following, which queries the Search API by topic for an org (an org that has ~10 repos and only one with the requested topic):

          node ('built-in') {
              withCredentials([file(credentialsId: 'github-token-curl', variable: 'GITHUB_HEADERS')]) {
                  sh """
                      set +xe
                      while true
                      do
                          date
                          curl -s -L -H "Accept: application/vnd.github+json" -K \$GITHUB_HEADERS "https://api.github.com/search/repositories?q=org:myorg+topic:mytopic&per_page=100" | grep -i 'full_name'
                          sleep 30
                      done
                  """
              }
          }
          

          I do see that sometimes my repo is not returned although curl is successful:

          [...]
          Tue Mar 14 08:56:37 UTC 2023
                "full_name": "myorg/myrepo",
          Tue Mar 14 08:57:08 UTC 2023
          Tue Mar 14 08:57:38 UTC 2023
                "full_name": "myorg/myrepo",
          Tue Mar 14 08:58:08 UTC 2023
                "full_name": "myorg/myrepo",
          Tue Mar 14 08:58:39 UTC 2023
                "full_name": "myorg/myrepo",
          Tue Mar 14 08:59:09 UTC 2023
                "full_name": "myorg/myrepo",
          Tue Mar 14 08:59:39 UTC 2023
                "full_name": "myorg/myrepo",
          Tue Mar 14 09:00:10 UTC 2023
                "full_name": "myorg/myrepo",
          Tue Mar 14 09:00:40 UTC 2023
                "full_name": "myorg/myrepo",
          Tue Mar 14 09:01:10 UTC 2023
                "full_name": "myorg/myrepo",
          Tue Mar 14 09:01:41 UTC 2023
                "full_name": "myorg/myrepo",
          Tue Mar 14 09:02:11 UTC 2023
                "full_name": "myorg/myrepo",
          Tue Mar 14 09:02:41 UTC 2023
                "full_name": "myorg/myrepo",
          Tue Mar 14 09:03:12 UTC 2023
                "full_name": "myorg/myrepo",
          Tue Mar 14 09:03:42 UTC 2023
                "full_name": "myorg/myrepo",
          Tue Mar 14 09:04:12 UTC 2023
                "full_name": "myorg/myrepo",
          Tue Mar 14 09:04:43 UTC 2023
                "full_name": "myorg/myrepo",
          Tue Mar 14 09:05:13 UTC 2023
          Tue Mar 14 09:05:43 UTC 2023
                "full_name": "myorg/myrepo",
          Tue Mar 14 09:06:14 UTC 2023
          [...]
          


          Allan BURDAJEWICZ added a comment - - edited

          Reading through the Search API documentation:

          - the Search API has its own rate limit: https://docs.github.com/en/rest/search?apiVersion=2022-11-28#rate-limit. Though if we were hitting it, the issue would probably be more explicit.

          - you may get incomplete results if the request times out (per my understanding, a timeout on the server side while you still get a successful HTTP code):

          https://docs.github.com/en/rest/search?apiVersion=2022-11-28#timeouts-and-incomplete-results
          https://developer.github.com/changes/2014-04-07-understanding-search-results-and-potential-timeouts/


          Allan BURDAJEWICZ added a comment - - edited

          Here is a reproducer isolated from Jenkins:

          #!/bin/bash
          set +xe
          while true
          do
              date
              responseAndCode=$(curl -s -L -H "Accept: application/vnd.github+json" -K $GITHUB_HEADERS -w ",http_code:%{http_code}" "https://api.github.com/search/repositories?q=org:jenkinsci+topic:configuration-as-code&per_page=100") 
              httpCode=$(echo $responseAndCode | sed 's/.*,http_code:\(.*\)/\1/')
              echo $responseAndCode | sed 's/\,http_code.*//' | ./jq ". | {\"total_count\": .total_count, \"incomplete_results\": .incomplete_results, \"repo-names\": [(.items[] | .full_name)], \"httpCode\": \"$httpCode\"}" -c
              curl -s -L -H "Accept: application/vnd.github+json" -K $GITHUB_HEADERS "https://api.github.com/rate_limit" | ./jq '. | {"search": .resources.search.remaining, "rate": .rate.remaining}' -c
              sleep 10
          done
          

          Running this for long enough, it sometimes shows incomplete results, but judging from the date output it does not look like the request timed out at all:

          [...]
          Wed Apr 19 07:25:53 UTC 2023
          {"total_count":3,"incomplete_results":false,"repo-names":["jenkinsci/configuration-as-code-plugin","jenkinsci/scm-sync-configuration-plugin","jenkinsci/configuration-as-code-groovy-plugin"],"httpCode":"200"}
          {"search":28,"rate":5000}
          Wed Apr 19 07:26:04 UTC 2023
          {"total_count":2,"incomplete_results":true,"repo-names":["jenkinsci/scm-sync-configuration-plugin","jenkinsci/configuration-as-code-groovy-plugin"],"httpCode":"200"}
          {"search":27,"rate":5000}
          Wed Apr 19 07:26:15 UTC 2023
          {"total_count":3,"incomplete_results":false,"repo-names":["jenkinsci/configuration-as-code-plugin","jenkinsci/scm-sync-configuration-plugin","jenkinsci/configuration-as-code-groovy-plugin"],"httpCode":"200"}
          {"search":26,"rate":5000}
          [...]
          


          Jeremy added a comment -

          Understanding that this is an issue where GitHub has work to do on returning consistently complete results, should the workaround for the time being be to check whether the "incomplete_results" value in the response is true and ABORT processing orphaned child items for that scan?


          Allan BURDAJEWICZ added a comment -

          jwillaz Absolutely. I was actually thinking about some retry mechanism in github-api on incomplete results.


          Allan BURDAJEWICZ added a comment -

          After reaching out to GitHub support, their engineering team found out that this issue is caused by a timeout:

          For certain types of searches, such as this one, a timeout can happen in less than a second.

          and they are working on improving the search, so it may get better in the future.

          Still, we might benefit from a retry mechanism of some sort here.
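
          For illustration, a minimal shell sketch of such a retry, reusing the search call from the reproducer above (the org, topic, and retry cap are placeholders; an actual fix would live in github-api rather than in a script):

          #!/bin/bash
          # Retry the search while GitHub reports incomplete_results, up to a small cap.
          # "myorg" and "mytopic" are placeholders; GITHUB_HEADERS is the curl config file used above; assumes jq on the PATH.
          for attempt in 1 2 3
          do
              response=$(curl -s -L -H "Accept: application/vnd.github+json" -K $GITHUB_HEADERS \
                  "https://api.github.com/search/repositories?q=org:myorg+topic:mytopic&per_page=100")
              if [ "$(echo "$response" | jq -r '.incomplete_results')" = "false" ]
              then
                  break
              fi
              echo "incomplete_results on attempt $attempt, retrying..." >&2
              sleep 5
          done
          # Print whatever the last (hopefully complete) response contained.
          echo "$response" | jq -r '.items[].full_name'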


          Sam Gleske added a comment -

          This would be resolved by JENKINS-64016

