JENKINS-72047

scm-filter-jervis gives up after 1 API request to GitHub which can lead to missed webhooks

    • 2.0-66.vc21d0c1d936d

      Bug description

      Webhooks get received by Jenkins but do not create jobs or start builds. This only happens sometimes.

      Other info

      I noticed clock drift on GitHub servers but it wasn't a factor.

      I verified GitHub API servers have about a 12 second clock drift currently compared to time.gov.

      We've been having several webhook issues and I'm suspicious about the clock differences (I haven't nailed down a specific bug in code yet).

      For example, GitHub will send a webhook at 22:07:04 and Jenkins will process the hook payload with signature verification at 22:07:03. No builds trigger for this clock difference and the log is missing from the multibranch pipeline event log.

      However, if I close and re-open the pull request to trigger another webhook, its timestamps are in chronological order and it is processed successfully. Is it possible there's a clock drift bug in code? I'm still struggling to track it down with traces.
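To make the suspected ordering problem concrete, here is a minimal Java sketch that compares the time a webhook was sent with the time the receiver logged it. The class and method names are hypothetical and not part of Jenkins or any plugin. The date is a placeholder, since only the times of day appear in the example above.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for illustration only; not part of Jenkins or any plugin.
final class WebhookSkewCheck {
    // Positive result: the sender's timestamp is ahead of the receiver's clock.
    static Duration senderAhead(Instant sentAt, Instant receivedAt) {
        return Duration.between(receivedAt, sentAt);
    }

    // True when the payload claims to have been sent after it was received.
    static boolean arrivesFromTheFuture(Instant sentAt, Instant receivedAt) {
        return sentAt.isAfter(receivedAt);
    }

    public static void main(String[] args) {
        // Placeholder date; times of day taken from the example above.
        Instant sent = Instant.parse("2023-09-22T22:07:04Z");
        Instant received = Instant.parse("2023-09-22T22:07:03Z");
        System.out.println("sender ahead by " + senderAhead(sent, received).getSeconds() + "s");
        System.out.println("from the future: " + arrivesFromTheFuture(sent, received));
    }
}
```

A check like this only flags the symptom (a send time later than the receive time); it does not show which clock is wrong.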

      Custom loggers

      I installed the support-core plugin and created a custom logger named "GitHub webhooks debugging".

      I have logging enabled for the following classes currently (level ALL):

      com.cloudbees.jenkins.GitHubWebHook
      org.jenkinsci.plugins.github.webhook.WebhookManager
      org.jenkinsci.plugins.github.admin.GitHubHookRegisterProblemMonitor
      org.jenkinsci.plugins.github.webhook.subscriber.DefaultPushGHEventSubscriber
      org.jenkinsci.plugins.github.webhook.subscriber.PingGHEventSubscriber
      org.jenkinsci.plugins.github.webhook.GHEventHeader$PayloadHandler
      org.jenkinsci.plugins.github.webhook.GHEventPayload$PayloadHandler
      org.jenkinsci.plugins.github.webhook.GHWebhookSignature
      org.jenkinsci.plugins.github.webhook.RequirePostWithGHHookPayload$Processor
      org.jenkinsci.plugins.workflow.job.properties.PipelineTriggersJobProperty
      org.jenkinsci.plugins.github_branch_source.GitHubRepositoryEventSubscriber
      org.jenkinsci.plugins.github_branch_source.PushGHEventSubscriber
      org.jenkinsci.plugins.github_branch_source.PullRequestGHEventSubscriber
      org.jenkinsci.plugins.workflow.multibranch.WorkflowMultiBranchProject
      jenkins.branch.buildstrategies.basic.TagBuildStrategyImpl
      jenkins.branch.buildstrategies.basic.ChangeRequestBuildStrategyImpl
      jenkins.scm.api.SCMHeadEvent
      jenkins.branch.MultiBranchProject
      

      I'm able to trace webhook events from GitHub to Jenkins and inside of Jenkins: pull request event, payload received, signature verification succeeded.

      However, the trail stops at signature verification and there's no multibranch pipeline event log. If I retry it goes through all of the above and an event shows up in multibranch pipeline event log with a build being started.
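For reference, the same log levels could also be raised programmatically with plain java.util.logging. This is only a sketch under assumptions: the class name is hypothetical, it covers a subset of the loggers listed above, and it only sets levels. Where the records end up still depends on the handlers or log recorder configured in Jenkins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper for illustration only; not part of Jenkins or any plugin.
final class WebhookDebugLoggers {
    // A subset of the logger names listed above.
    static final String[] NAMES = {
        "com.cloudbees.jenkins.GitHubWebHook",
        "org.jenkinsci.plugins.github.webhook.GHWebhookSignature",
        "org.jenkinsci.plugins.github_branch_source.PushGHEventSubscriber",
        "jenkins.scm.api.SCMHeadEvent",
        "jenkins.branch.MultiBranchProject"
    };

    // Returns the loggers so the caller can keep strong references to them;
    // java.util.logging itself only holds loggers weakly.
    static List<Logger> enableAll() {
        List<Logger> held = new ArrayList<>();
        for (String name : NAMES) {
            Logger logger = Logger.getLogger(name);
            logger.setLevel(Level.ALL);
            held.add(logger);
        }
        return held;
    }

    public static void main(String[] args) {
        for (Logger logger : enableAll()) {
            System.out.println(logger.getName() + " -> " + logger.getLevel());
        }
    }
}
```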

      Sample job

      See attachment sample-job.xml

      Jenkins war and plugin versions

      See dependencies.gradle and the companion comment "How to reproduce" in the comments section of this issue.
       


          Sam Gleske added a comment -

          The temporary workaround

          Before I dive into details I found a temporary workaround. GitHub clocks being out of sync required delaying between payload processing and triggering multibranch pipeline builds. This was achieved via the following system property.

          -Dorg.jenkinsci.plugins.github_branch_source.GitHubSCMSource.eventDelaySeconds=22
          

          I had to restart Jenkins. I would like to change this property (specifically the static method getEventDelaySeconds()) to return the system property or fall back to the static value, so that it can be changed at runtime without a restart.

          Why does it work?

          GitHub servers were out of sync. Jenkins processed multibranch events BEFORE GitHub sent webhook payloads. This triggered a bug (I've yet to find in source but now I have an idea).

          By forcing a delay the Jenkins controller system clock has a chance to catch up to the payload event so that multibranch pipeline events are processed AFTER the hook payload timestamp.


          Sam Gleske added a comment -

          Sample logs

          I've narrowed down the issue to branch matchers. This gives me a specific avenue of source code to review.

          Failed webhook trace log

          [Fri Sep 22 14:41:34 GMT 2023] Received Push event for tag 1.0.146 in repository **REDACTED ORG**/**REDACTED REPO** CREATED event from **REDACTED IP ADDRESS** ⇒ https://jenkins-webhooks.REDACTED.net/github-webhook/ with timestamp Fri Sep 22 14:41:29 GMT 2023
          

          This means:

          • GitHub sent webhook.
          • Jenkins received webhook.
          • Jenkins processed payload and successfully verified the signature to return 200 status to GitHub.
          • Jenkins multibranch pipeline processed the event.
          • Nothing happened. No jobs created or builds started.

          Successful trace log after replay webhook

          [Fri Sep 22 16:20:03 GMT 2023] Received Push event for tag 1.0.146 in repository **REDACTED ORG**/**REDACTED REPO** CREATED event from **REDACTED IP ADDRESS** ⇒ https://jenkins-webhooks.REDACTED.net/github-webhook/ with timestamp Fri Sep 22 16:19:58 GMT 2023
          Found match against Reporting-Platform/fantastic-signals-sso (new branch 1.0.146)
          
          [Fri Sep 22 16:20:05 GMT 2023] Finished processing Push event for tag 1.0.146 in repository **REDACTED ORG**/**REDACTED REPO** CREATED event from **REDACTED IP ADDRESS** ⇒ https://jenkins-webhooks.REDACTED.net/github-webhook/ with timestamp Fri Sep 22 16:19:58 GMT 2023, processed in 1746ms. Matched 1.
          

          This means:

          • GitHub sent webhook.
          • Jenkins received webhook.
          • Jenkins processed payload and successfully verified the signature to return 200 status to GitHub.
          • Jenkins central multibranch pipeline processed the event. Found a match.
          • Jenkins central multibranch pipeline processed the event and notified the multibranch pipeline job.
          • Jenkins multibranch pipeline job successfully processed the event against branch matchers and created a Jenkins job for a GitHub tag.
          • Jenkins automatically started a build for the GitHub tag.


          Sam Gleske added a comment - - edited

          Code sleuthing

          CREATED event processing from branch-api-plugin

          https://github.com/jenkinsci/branch-api-plugin/blob/717130d4f81663f6bd1f1f7fc272d29ad833847e/src/main/java/jenkins/branch/MultiBranchProject.java#L1207C1-L1216

          Find a matchCount against org.jenkinsci.plugins.github_branch_source.GitHubSCMSource

          https://github.com/jenkinsci/branch-api-plugin/blob/717130d4f81663f6bd1f1f7fc272d29ad833847e/src/main/java/jenkins/branch/MultiBranchProject.java#L1266

          When the bug occurs, `event.isMatch(source)` returns false, so nothing is logged.

          Tags in GitHub Branch Source are handled as push events by org.jenkinsci.plugins.github_branch_source.PushGHEventSubscriber.

          If the event matches, it fires a CREATED event via fireLater of SCMHeadEvent.

          https://github.com/jenkinsci/github-branch-source-plugin/blob/a3028eb9fd2180459efc511b30e9dd46937d44fd/src/main/java/org/jenkinsci/plugins/github_branch_source/PushGHEventSubscriber.java#L126-L127

          https://github.com/jenkinsci/github-branch-source-plugin/blob/a3028eb9fd2180459efc511b30e9dd46937d44fd/src/main/java/org/jenkinsci/plugins/github_branch_source/PushGHEventSubscriber.java#L149

          This looks to be configurable with `GitHubSCMSource.getEventDelaySeconds()`

          https://github.com/jenkinsci/github-branch-source-plugin/blob/a3028eb9fd2180459efc511b30e9dd46937d44fd/src/main/java/org/jenkinsci/plugins/github_branch_source/GitHubSCMSource.java#L558-L565

          Looking at configuration to see if we can delay events long enough to prevent them from being affected by timestamps.

          Event delays are configurable up to 300 seconds and default to 5 seconds.

          https://github.com/jenkinsci/github-branch-source-plugin/blob/a3028eb9fd2180459efc511b30e9dd46937d44fd/src/main/java/org/jenkinsci/plugins/github_branch_source/GitHubSCMSource.java#L160-L162

          Unfortunately, because the property is read during static initialization, this integer is only set at JVM boot and stays fixed for the rest of runtime, so changing it requires a restart.
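A minimal sketch of the runtime-readable alternative, reading the system property on every call instead of caching it in a static field. This is not the plugin's code: the class name is hypothetical, the 5 second default and 300 second cap come from the notes above, and the lower bound of 0 is my assumption.

```java
// Hypothetical stand-in for illustration only; not the plugin's actual implementation.
final class EventDelay {
    // Property name taken from the workaround above.
    static final String PROPERTY =
        "org.jenkinsci.plugins.github_branch_source.GitHubSCMSource.eventDelaySeconds";
    static final int DEFAULT_SECONDS = 5; // default noted above
    static final int MAX_SECONDS = 300;   // upper limit noted above
    static final int MIN_SECONDS = 0;     // assumption: negative delays are clamped to zero

    // Looks up the property each time, so System.setProperty takes effect without a restart.
    // Missing or non-numeric values fall back to the default.
    static int getEventDelaySeconds() {
        int value = Integer.getInteger(PROPERTY, DEFAULT_SECONDS);
        return Math.max(MIN_SECONDS, Math.min(MAX_SECONDS, value));
    }

    public static void main(String[] args) {
        System.out.println(getEventDelaySeconds());
        System.setProperty(PROPERTY, "22");
        System.out.println(getEventDelaySeconds());
    }
}
```

With something like this, the value could be changed from the script console via System.setProperty, at the cost of a property lookup per event.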


          Sam Gleske added a comment - - edited

          Verification

          After applying system startup property

          -Dorg.jenkinsci.plugins.github_branch_source.GitHubSCMSource.eventDelaySeconds=22
          

          I verified it works the way I thought it would; I'll keep an eye out for more clues pending additional debugging.

          GitHub webhook debug log

          Path: /var/lib/jenkins/logs/custom/GitHub\ webhooks\ debugging.log

          GitHub sends payload at 2023-09-22 22:08:13 UTC

          Jenkins gets push event at

          2023-09-22 22:08:13.859+0000 [id=49]    FINE    o.j.p.g.w.GHEventHeader$PayloadHandler#parse: Header X-GitHub-Event -> push
          

          Jenkins finishes processing push event

          2023-09-22 22:08:13.859+0000 [id=49]    FINEST  o.j.p.g.w.GHWebhookSignature#matches: Signature: calculated=c2deae64d9400b359f10586fe027edb79ade8a79 provided=c2deae64d9400b359f10586fe027edb79ade8a79
          

          jenkins.branch.MultiBranchProject.log

          Path: /var/lib/jenkins/logs/jenkins.branch.MultiBranchProject.log

          I searched for time (trimming to second) "22:08:13"

          [Fri Sep 22 22:08:35 GMT 2023] Received Push event for tag 1.0.32 in repository **REDACTED ORG**/**REDACTED REPO** CREATED event from **REDACTED IP** ⇒ https://jenkins-webhooks.REDACTED.net/github-webhook/ with timestamp Fri Sep 22 22:08:13 GMT 2023
          Found match against jenkins-ng/jenkins-ng-cloudformation (new branch 1.0.32)
          
          [Fri Sep 22 22:08:37 GMT 2023] Finished processing Push event for tag 1.0.32 in repository **REDACTED ORG**/**REDACTED REPO** CREATED event from **REDACTED IP** ⇒ https://jenkins-webhooks.REDACTED.net/github-webhook/ with timestamp Fri Sep 22 22:08:13 GMT 2023, processed in 1790ms. Matched 1.
          

          22:08:35 - 22:08:13 = 22 seconds, which verifies the configured processing delay.
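The same arithmetic as a small Java sketch, using the payload and "Received" timestamps from the logs in this issue. The class name is hypothetical, and it assumes both times fall on the same day. The two earlier traces work out to 5 seconds each, which appears to line up with the default delay.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical helper for illustration only.
final class ProcessingDelay {
    // Seconds from the payload timestamp to the "Received" log timestamp (same day assumed).
    static long secondsBetween(String payloadTime, String receivedTime) {
        return Duration.between(LocalTime.parse(payloadTime), LocalTime.parse(receivedTime)).getSeconds();
    }

    public static void main(String[] args) {
        System.out.println(secondsBetween("22:08:13", "22:08:35")); // with eventDelaySeconds=22
        System.out.println(secondsBetween("14:41:29", "14:41:34")); // failed trace
        System.out.println(secondsBetween("16:19:58", "16:20:03")); // successful replay
    }
}
```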

          Next steps

          I want to review the github-branch-source plugin source code further to see if I can identify exactly where this timing bug could occur. This is caused by webhook timestamps being in the future relative to the Jenkins controller system time (the Jenkins clock is accurately synced and the GitHub server clocks are out of sync).


          Sam Gleske added a comment - - edited

          How to reproduce

          I have attached a dependencies.gradle file

          You can use https://github.com/samrocketman/jenkins-bootstrap-shared and copy dependencies.gradle into the clone of that repository.

          git clone https://github.com/samrocketman/jenkins-bootstrap-shared
          cd jenkins-bootstrap-shared/
          
          curl -sSfLo dependencies.gradle https://issues.jenkins.io/secure/attachment/61174/dependencies.gradle
          
          ./gradlew getjenkins getplugins
          
          ls *.war plugins/*.jpi
          

          It downloads the exact version of Jenkins and all open source plugins (minus any custom or proprietary patches which aren't relevant for reproducing)

          Set your system clock for the test controller about 1 minute into the future before processing webhook payloads.


          Sam Gleske added a comment -

          With GitHubSCMSource.eventDelaySeconds there doesn't appear to be a positive effect like I hoped; it still isn't processing payloads, and logging stops in jenkins.branch.MultiBranchProject.log where Jenkins mentions it processed the event but did nothing with it.


          Sam Gleske added a comment - - edited

          I am using a GitHub App with webhooks disabled in-app. I am relying on the github-plugin (Manage webhooks checked in Global settings) to set up the webhooks directly on the repository when the multibranch pipeline is created. The multibranch pipeline is configured with the github-branch-source plugin using the GitHub App as the clone credential.


          Sam Gleske added a comment -

          Added jenkins.branch.MultiBranchProject to debug logger to help narrow down where processing stops.


          Sam Gleske added a comment -

          I've narrowed the issue down to the scm-filter-jervis plugin.

          It is specifically an issue with this network API call https://github.com/jenkinsci/scm-filter-jervis-plugin/blob/f67873b6c47d8550a899f621aa06c78584ec4360/src/main/groovy/net/gleske/scmfilter/impl/trait/JervisFilterTrait.groovy#L172

          It tries only once and expects GitHub to respond successfully every time. Later it gracefully handles GraphQL errors by treating the branch as if it has no YAML file and so is not buildable. I consider this the bug and the reason webhooks randomly drop.

          Bug identified

          Here's the full range of the buggy code with it ending on "return true" (which means do not build or create a job in multibranch pipeline) https://github.com/jenkinsci/scm-filter-jervis-plugin/blob/f67873b6c47d8550a899f621aa06c78584ec4360/src/main/groovy/net/gleske/scmfilter/impl/trait/JervisFilterTrait.groovy#L172-L190

          Proposed fix

          It should retry with a random delay. I'll open a PR for enhancement. I'll also update the component to the buggy plugin.
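A minimal sketch of what "retry with a random delay" could look like. This is not the code from the plugin or from any PR: the class name, method signature, and parameters are hypothetical, and the simulated failure in main stands in for the GitHub API call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration only; not the plugin's implementation.
final class RetryWithJitter {
    // Calls the action up to maxAttempts times, sleeping a random 0..maxDelayMillis
    // between attempts, and rethrows the last failure if every attempt fails.
    static <T> T retry(Callable<T> action, int maxAttempts, long maxDelayMillis) throws Exception {
        if (maxAttempts < 1 || maxDelayMillis < 0) {
            throw new IllegalArgumentException("maxAttempts must be >= 1 and maxDelayMillis >= 0");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Random delay so retries from many events do not hit the API in lockstep.
                    Thread.sleep(ThreadLocalRandom.current().nextLong(maxDelayMillis + 1));
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated API call that fails twice before succeeding.
        String result = retry(() -> {
            if (++calls[0] < 3) throw new IllegalStateException("simulated API failure");
            return "ok";
        }, 5, 50);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The important design point is that only a failure that survives all attempts should be treated as "no YAML file, not buildable"; a single failed request should not be enough to drop the event.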


          Sam Gleske added a comment -

          https://github.com/jenkinsci/scm-filter-jervis-plugin/pull/20

            sag47 Sam Gleske