Resolution: Unresolved
Plugin version: 2.11.4
Jenkins Version: 2.319.2
We have topics filtering enabled.
We are using GitHub App authentication.
On each scan, the plugin discovers only 100 (random) repositories. Some of the repositories with the configured topic are not discovered. During the next run, some of them are rediscovered, while other random repos are marked as orphaned.
(here 1 repo is archived)
Looking up repositories of organization ...
Looking up repositories for topics: ...
99 repositories were processed
The Jenkins logs contain no other entries about the scan, apart from the message that it finished.
The previous plugin version did not show this behavior.
Attachments (uploaded by Vladyslav Miletskyi):
- image-2022-01-14-17-57-48-104.png (78 kB)
- image-2022-01-14-17-56-52-044.png (66 kB)
[JENKINS-67597] Plugin processing only 100 repositories
We got 363 repositories from the query you shared (modified for our org and topic). Of these, 15 are archived, so I believe a total of 348 should be processed. The org scan processes 349 (per the logs), and this number is consistent in every run.
However, the number of orphaned items varies every time; e.g. one run had 21 items and another had 31. The orphaned repos do have the correct topic and meet all the criteria to be processed, which is probably why some of them are randomly indexed again in subsequent runs (while a few others are randomly orphaned).
Since the number of processed repos is consistent in every run, maybe we are not hitting any limits with the search query? Perhaps there is some rate limit when querying individual repos after that, or some issue while parsing the results?
Just adding my view on this. We have found the GitHub Search API unreliable when querying against topics. Topics tend to be a "global" thing in GitHub, and we found jobs would be enabled or disabled depending on the latest scan.
So that we could still use topics, we created a hack using JobDSL that calls the GitHub GraphQL API with the following query to dynamically create our own jobs:
query Query($organization:String!, $cursor:String) {
  organization(login: $organization) {
    repositories(first:100, after:$cursor) {
      pageInfo {
        hasNextPage
        endCursor
      }
      nodes {
        name
        isArchived
        repositoryTopics(first:10) {
          nodes {
            topic {
              name
            }
          }
        }
      }
    }
  }
}
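The query above returns every repository together with its topics, so the topic filter has to be applied on the client. A minimal Python sketch of that filtering and cursor handling follows. It is not the commenter's JobDSL code: `repos_with_topic` and the `fetch_page` callable are hypothetical names, and `fetch_page(cursor)` is assumed to run the query and return the decoded JSON body.

```python
from typing import Callable, Iterator, Optional


def repos_with_topic(
    fetch_page: Callable[[Optional[str]], dict],
    topic: str,
) -> Iterator[str]:
    """Yield names of non-archived repositories tagged with `topic`.

    `fetch_page(cursor)` is a hypothetical callable that executes the
    GraphQL query with the given cursor and returns the decoded response.
    """
    cursor = None
    while True:
        repos = fetch_page(cursor)["data"]["organization"]["repositories"]
        for node in repos["nodes"]:
            # Only the first 10 topics are requested by the query.
            topics = {t["topic"]["name"] for t in node["repositoryTopics"]["nodes"]}
            if topic in topics and not node["isArchived"]:
                yield node["name"]
        page_info = repos["pageInfo"]
        if not page_info["hasNextPage"]:
            return
        # Continue from where the previous page ended.
        cursor = page_info["endCursor"]
```

Because this walks the full repository list instead of relying on the search index, it trades more API calls for a result that does not depend on the Search API.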
Indeed, this seems to be a problem with the Search API. I used a test job like the following, which calls the Search API by topic for an org (an org that has ~10 repos, only one of which has the requested topic):
node ('built-in') {
    withCredentials([file(credentialsId: 'github-token-curl', variable: 'GITHUB_HEADERS')]) {
        sh """
            set +xe
            while true
            do
                date
                curl -s -L -H "Accept: application/vnd.github+json" -K \$GITHUB_HEADERS "https://api.github.com/search/repositories?q=org:myorg+topic:mytopic&per_page=100" | grep -i 'full_name'
                sleep 30
            done
        """
    }
}
I do see that sometimes my repo is not returned although curl is successful:
[...]
Tue Mar 14 08:56:37 UTC 2023
"full_name": "myorg/myrepo",
Tue Mar 14 08:57:08 UTC 2023
Tue Mar 14 08:57:38 UTC 2023
"full_name": "myorg/myrepo",
Tue Mar 14 08:58:08 UTC 2023
"full_name": "myorg/myrepo",
Tue Mar 14 08:58:39 UTC 2023
"full_name": "myorg/myrepo",
Tue Mar 14 08:59:09 UTC 2023
"full_name": "myorg/myrepo",
Tue Mar 14 08:59:39 UTC 2023
"full_name": "myorg/myrepo",
Tue Mar 14 09:00:10 UTC 2023
"full_name": "myorg/myrepo",
Tue Mar 14 09:00:40 UTC 2023
"full_name": "myorg/myrepo",
Tue Mar 14 09:01:10 UTC 2023
"full_name": "myorg/myrepo",
Tue Mar 14 09:01:41 UTC 2023
"full_name": "myorg/myrepo",
Tue Mar 14 09:02:11 UTC 2023
"full_name": "myorg/myrepo",
Tue Mar 14 09:02:41 UTC 2023
"full_name": "myorg/myrepo",
Tue Mar 14 09:03:12 UTC 2023
"full_name": "myorg/myrepo",
Tue Mar 14 09:03:42 UTC 2023
"full_name": "myorg/myrepo",
Tue Mar 14 09:04:12 UTC 2023
"full_name": "myorg/myrepo",
Tue Mar 14 09:04:43 UTC 2023
"full_name": "myorg/myrepo",
Tue Mar 14 09:05:13 UTC 2023
Tue Mar 14 09:05:43 UTC 2023
"full_name": "myorg/myrepo",
Tue Mar 14 09:06:14 UTC 2023
[...]
Reading through the Search API:
- The Search API has a different rate limit: https://docs.github.com/en/rest/search?apiVersion=2022-11-28#rate-limit. Though if we were hitting it, the issue would probably be more explicit.
- You may get incomplete results if the request times out (per my understanding, this is a timeout on the server side, and you still get a successful HTTP code).
Here is a reproducer isolated from Jenkins:
#!/bin/bash
set +xe
while true
do
  date
  responseAndCode=$(curl -s -L -H "Accept: application/vnd.github+json" -K $GITHUB_HEADERS -w ",http_code:%{http_code}" "https://api.github.com/search/repositories?q=org:jenkinsci+topic:configuration-as-code&per_page=100")
  httpCode=$(echo $responseAndCode | sed 's/.*,http_code:\(.*\)/\1/')
  echo $responseAndCode | sed 's/\,http_code.*//' | ./jq ". | {\"total_count\": .total_count, \"incomplete_results\": .incomplete_results, \"repo-names\": [(.items[] | .full_name)], \"httpCode\": \"$httpCode\"}" -c
  curl -s -L -H "Accept: application/vnd.github+json" -K $GITHUB_HEADERS "https://api.github.com/rate_limit" | ./jq '. | {"search": .resources.search.remaining, "rate": .rate.remaining}' -c
  sleep 10
done
Running this for long enough, sometimes it shows incomplete results but judging from the date output, it does not look like it timed out at all:
[...]
Wed Apr 19 07:25:53 UTC 2023
{"total_count":3,"incomplete_results":false,"repo-names":["jenkinsci/configuration-as-code-plugin","jenkinsci/scm-sync-configuration-plugin","jenkinsci/configuration-as-code-groovy-plugin"],"httpCode":"200"}
{"search":28,"rate":5000}
Wed Apr 19 07:26:04 UTC 2023
{"total_count":2,"incomplete_results":true,"repo-names":["jenkinsci/scm-sync-configuration-plugin","jenkinsci/configuration-as-code-groovy-plugin"],"httpCode":"200"}
{"search":27,"rate":5000}
Wed Apr 19 07:26:15 UTC 2023
{"total_count":3,"incomplete_results":false,"repo-names":["jenkinsci/configuration-as-code-plugin","jenkinsci/scm-sync-configuration-plugin","jenkinsci/configuration-as-code-groovy-plugin"],"httpCode":"200"}
{"search":26,"rate":5000}
[...]
Understanding that this is an issue GitHub needs to fix by returning consistently complete results, should the workaround for the time being be to check whether the "incomplete_results" value in the response is true and, if so, abort processing orphaned child items for that scan?
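The guard proposed here could look roughly like the following Python sketch. It is only an illustration of the idea, not the plugin's implementation: `plan_scan` and its arguments are hypothetical, and the response is assumed to be the decoded JSON body of a Search API call with its `items` and `incomplete_results` fields.

```python
from typing import Iterable, Set, Tuple


def plan_scan(known_repos: Iterable[str], search_response: dict) -> Tuple[Set[str], Set[str]]:
    """Return (new_repos, orphaned_repos) for one scan.

    When the response is flagged incomplete, a missing repository may simply
    have been dropped by the search, so nothing is orphaned in that scan.
    """
    known = set(known_repos)
    found = {item["full_name"] for item in search_response.get("items", [])}
    new_repos = found - known
    if search_response.get("incomplete_results", False):
        orphaned = set()  # result cannot be trusted: skip orphan processing
    else:
        orphaned = known - found
    return new_repos, orphaned
```

Applied to the incomplete response in the log above, `jenkinsci/configuration-as-code-plugin` would stay in place instead of being orphaned until the next complete scan.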
jwillaz Absolutely. I was actually thinking about some retry mechanism in github-api on incomplete results.
After reaching out to GH support, their eng team found out that this issue is caused by a timeout:
For certain types of searches, such as this one, a timeout can happen in less than a second.
They are working on improving the search, so it may be improved in the future.
Still we might benefit from a retry mechanism of some sort here.
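As a rough illustration of such a retry, here is a Python sketch that repeats the search while the response is flagged incomplete. It does not reflect any existing github-api code: `search_with_retry`, the `search` callable, the attempt count, and the backoff delays are all assumptions chosen for the example.

```python
import time
from typing import Callable


def search_with_retry(
    search: Callable[[], dict],
    attempts: int = 3,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call `search()` until it returns a complete result or attempts run out.

    `search` is a hypothetical callable returning the decoded Search API body.
    The last response is returned even if it is still incomplete, so callers
    must still check `incomplete_results` themselves.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    response: dict = {}
    for attempt in range(attempts):
        response = search()
        if not response.get("incomplete_results", False):
            return response
        if attempt < attempts - 1:
            # Exponential backoff between attempts: 1x, 2x, 4x, ...
            sleep(delay_seconds * (2 ** attempt))
    return response
```

Each retry consumes one request from the Search API rate limit, so the number of attempts should stay small.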
Could this be related to the page size not being set when searching, so that the results run out due to rate limiting?
How many repos would you expect to be returned from the query?
You could test it using the GitHub CLI:
gh api -X GET search/repositories -f q="org:jenkinsci topic:jenkins-api-plugin" -f per_page="100"
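To rule out page size as a factor, the search can be paged explicitly with `per_page` and `page`. The Python sketch below collects all pages and carries over the `incomplete_results` flag; `collect_search_results` and the `fetch_page(page, per_page)` callable are hypothetical, with `fetch_page` assumed to return the decoded response body for one page.

```python
from typing import Callable, List


def collect_search_results(
    fetch_page: Callable[[int, int], dict],
    per_page: int = 100,
) -> dict:
    """Gather repository names across all result pages.

    `fetch_page(page, per_page)` is a hypothetical callable that performs one
    search request. The result is marked incomplete if any page was.
    """
    names: List[str] = []
    incomplete = False
    page = 1
    while True:
        body = fetch_page(page, per_page)
        items = body.get("items", [])
        names.extend(item["full_name"] for item in items)
        incomplete = incomplete or body.get("incomplete_results", False)
        # Stop on a short page or once everything reported has been collected.
        if len(items) < per_page or len(names) >= body.get("total_count", 0):
            break
        page += 1
    return {"repo_names": names, "incomplete_results": incomplete}
```

Comparing the number of collected names with the number of repositories expected for the topic shows whether repositories go missing inside the search itself.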