  JENKINS-54106

Long delay from github webhook to polling when polling threads all busy

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Labels: None
    • Environment:
      Jenkins 2.107.3
      git-plugin 3.9.0
      github-plugin 1.29.0
      github API 1.90

      I am seeing some long delays between GitHub webhook events and jobs polling for changes (e.g. as seen in the GitHub Hook Log). Note the almost 17 1/2 hour gap below between the event being received and the polling being performed.

       

      Started on Oct 16, 2018 2:46:44 PM
      Started by event from 43.148.32.90 ⇒ https://<jenkins>/github-webhook/ on Mon Oct 15 21:19:54 BST 2018
      [poll] Last Built Revision: Revision c7013e0bb447b77bf13e719201ce2acb44b073af (refs/remotes/origin/<branch>)
       > git --version # timeout=30
      using GIT_SSH to set credentials <creds>
       > git ls-remote -h <repo> # timeout=30
      Found 345 remote heads on <git_url>
      [poll] Latest remote head revision on refs/heads/<branch> is: c7013e0bb447b77bf13e719201ce2acb44b073af - already built by 34835
      Done. Took 3 sec
      No changes

       

      I have checked that GitHub is sending the webhook notifications and that they get an HTTP 200 response code.

      The Jenkins log reports that the PushEvents are being received and that my build job is being "Poked", but polling is not run for the job.

       

      Oct 16, 2018 3:29:50 PM FINEST org.jenkinsci.plugins.github.webhook.GHEventPayload$PayloadHandler parse
      Payload

      ...
      Oct 16, 2018 3:29:50 PM INFO org.jenkinsci.plugins.github.webhook.subscriber.DefaultPushGHEventSubscriber onEvent
      Received PushEvent for https://<github>/<user>/<repo> from <ip> ⇒ https://<jenkins>/github-webhook/
      Oct 16, 2018 3:29:51 PM FINE org.jenkinsci.plugins.github.webhook.subscriber.DefaultPushGHEventSubscriber$1 run
      Considering to poke my_build
      Oct 16, 2018 3:29:51 PM INFO org.jenkinsci.plugins.github.webhook.subscriber.DefaultPushGHEventSubscriber$1 run
      Poked my_build

       

      The agent that my_build runs on is not permanently busy.

      Any idea what is going on here? How can I debug this further?

      We have a lot of jobs (~1000) and possibly tens of jobs polling SCMs, but I can't imagine this would take 17 1/2 hours...
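
      One way to dig further is to look at what the SCM polling threads are doing. A minimal script console sketch (Groovy), assuming the pollers use the usual "SCMTrigger N" thread names and get renamed to "Waiting to acquire ..." while blocked on a workspace:

      // Minimal sketch: print each SCM polling thread, its state and the top of its stack.
      // Assumes the default thread naming used by Jenkins core SCM polling.
      Thread.getAllStackTraces().each { thread, stack ->
          if (thread.name.startsWith("SCMTrigger") || thread.name.startsWith("Waiting to acquire")) {
              println "${thread.name} [${thread.state}]"
              stack.take(6).each { frame -> println "    at ${frame}" }
          }
      }
      return null
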

          [JENKINS-54106] Long delay from github webhook to polling when polling threads all busy

          Russell Gallop added a comment - edited

          It looks like all 10 of the SCM polling threads were stuck processing another job, e.g.:

          SCMTrigger 3

          "SCMTrigger 3" Id=334 Group=main WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@7088b71d at sun.misc.Unsafe.park(Native Method) - waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@7088b71d at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

          Waiting to acquire C:\j\w\JobName : GitHubPushTrigger 3

          "Waiting to acquire C:\j\w\JobName : GitHubPushTrigger 3" Id=857002 Group=main WAITING on hudson.slaves.WorkspaceList@744f269b at java.lang.Object.wait(Native Method) - waiting on hudson.slaves.WorkspaceList@744f269b at java.lang.Object.wait(Object.java:502) at hudson.slaves.WorkspaceList.acquire(WorkspaceList.java:257) at hudson.slaves.WorkspaceList.acquire(WorkspaceList.java:236) at hudson.model.AbstractProject.pollWithWorkspace(AbstractProject.java:1405) at hudson.model.AbstractProject._poll(AbstractProject.java:1382) at hudson.model.AbstractProject.poll(AbstractProject.java:1293) at jenkins.triggers.SCMTriggerItem$SCMTriggerItems$Bridge.poll(SCMTriggerItem.java:143) at com.cloudbees.jenkins.GitHubPushTrigger$1.runPolling(GitHubPushTrigger.java:109) at com.cloudbees.jenkins.GitHubPushTrigger$1.run(GitHubPushTrigger.java:135) at hudson.util.SequentialExecutionQueue$QueueEntry.run(SequentialExecutionQueue.java:119) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Number of locked synchronizers = 1 - java.util.concurrent.ThreadPoolExecutor$Worker@4045db32


          Mark Waite added a comment -

          Any way of duplicating the problem that caused the polling threads to be stuck?


          Russell Gallop added a comment -

          I'm not sure whether it is easy to replicate, but I'm seeing it again and will try to describe the setup as closely as possible.

           

          This follows a migration of a repository from Bitbucket (using git/notifyCommit) to GitHub (with webhook notifications). The jobs that appear to be causing the problem are older Matrix jobs which use multi-SCM with git and the p4-plugin.

          That job has triggered several times overnight despite there not being any relevant changes. A polling log from overnight on that job looks like:

           

          Started on Oct 17, 2018 8:33:15 AM
          Started by event from 43.148.32.90 ⇒ https://<jenkins>/github-webhook/ on Wed Oct 17 03:09:33 BST 2018
          Polling SCM changes on <agent>
          Using strategy: Default
          [poll] Last Built Revision: Revision acc6ad05f1a8e98bc1cebec53ffdf695fad7fca2 (origin/<branch>)
           > git rev-parse --is-inside-work-tree # timeout=30
          Fetching changes from the remote Git repositories
           > git config remote.origin.url <repo> # timeout=30
          Fetching upstream changes from <repo>
           > git --version # timeout=30
          using GIT_SSH to set credentials <creds>
           > git fetch --tags --progress git@github.sie.sony.com:SIE-Private/cpu-toolchain-orbis.git +refs/heads/*:refs/remotes/origin/* # timeout=180
          Polling for changes in
           > git rev-parse "origin/<branch>^{commit}" # timeout=30
           > git rev-parse "<branch>^{commit}" # timeout=30
          P4: Polling on: <agent> with:<workspace name>
          Done. Took 1 hr 17 min
          Changes found

           

          This job requires "Polling ignores commits in certain paths", so polling has to use a workspace checkout. This is a large repository, so it takes a while to check out (~15 minutes).

          All 10 SCM polling threads appear to be busy on this job: "Waiting to acquire C:\j\w\<job name> : GitHubPushTrigger 10". It is not surprising that this backs up, as GitHub push notifications arrive on average about every 20 minutes and the poll operation took 1 hr 17 min.
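
          As a back-of-envelope check (a sketch only, using the two numbers above and assuming the polls for this job effectively run one at a time because they all need the same workspace):

          // Rough arithmetic: pushes roughly every 20 minutes, polls taking about
          // 77 minutes each and effectively serialised on the one workspace lock.
          def pollMinutes = 77
          def pushIntervalMinutes = 20
          def backlogGrowthPerHour = 60 / pushIntervalMinutes - 60 / pollMinutes
          printf("backlog grows by roughly %.1f queued polls per hour%n", backlogGrowthPerHour)
          // prints roughly 2.2, i.e. the trigger queue can only fall further behind
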

          This raises the following questions:

          1. Why is it taking so long?
          2. Why is it triggering a job when there are no relevant changes?

          We had a build of that job running from 8:29 am and a new one at 9:51 am, so it seems like the poll that started at 8:33 am may have been waiting for the build to finish. The matrix parent and one of the configurations ran on the same agent as the polling agent. That might explain question 1.

          So it seems to me like Jenkins takes git pushes in and allocates an SCM polling thread to each, which finds the best agent to poll with. That tends to be the last agent it polled with, which is currently busy running a build. More SCM notifications come in, the pattern repeats, and things back up, all waiting for the build to free the agent and the git checkout that it is using. I don't know what the queueing policy is between SCM triggers and builds, but it seems like if triggers come in quicker than builds finish then this will back up even if there are no relevant changes.

          I've noticed that I'm not using the "refs/heads/<branch>" form of the branch specifier in the configuration, so I'll try that to address question 2.
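
          A quick way to find jobs that still use the ambiguous form is to walk the job configurations. A script console sketch (Groovy), covering plain git-plugin jobs only (jobs using a multi-SCM wrapper would need their wrapped SCMs unwrapped as well):

          // Sketch: list git branch specifiers that are not in the refs/heads/ form.
          import hudson.model.AbstractProject
          import hudson.plugins.git.GitSCM
          import jenkins.model.Jenkins

          Jenkins.instance.getAllItems(AbstractProject).each { job ->
              def scm = job.scm
              if (scm instanceof GitSCM) {
                  scm.branches.each { b ->
                      if (!b.name.startsWith("refs/heads/")) {
                          println "${job.fullName}: '${b.name}'"
                      }
                  }
              }
          }
          return null
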

          I also haven't removed the old repository checkouts since migrating, so I'll try that as well (it is the same git data and I haven't seen this be a problem elsewhere, but it is probably best to be sure).

          I might be able to remove the git path filter, which could help as well (not ideal, but better than all triggers being delayed by hours!).

           

          Thanks


          Russell Gallop added a comment -

          A workaround is to identify the offending job by looking at the thread dump and disable that job. After that, and after killing the erroneously triggered builds, the queue of SCM triggers gets through.
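
          The same workaround can be scripted if the UI is slow to respond while the pollers are stuck. A sketch from the script console (Groovy), with "my_build" standing in for the offending job:

          // Sketch: disable the offending job and cancel any of its queued builds.
          // "my_build" is a placeholder for the job identified in the thread dump.
          import hudson.model.AbstractProject
          import jenkins.model.Jenkins

          def job = Jenkins.instance.getItemByFullName("my_build", AbstractProject)
          job.disable()
          Jenkins.instance.queue.items.findAll { it.task == job }.each { item ->
              Jenkins.instance.queue.cancel(item.task)
          }

          Running builds still have to be aborted separately, as described above.
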


          Using "refs/heads/<branch>" and removing the path filter doesn't prevent the erroneous triggering, but it does seem to be quicker (possibly as no builds are going on).

           

          Started on Oct 17, 2018 11:34:11 AM
          Started by event from <ip> ⇒ https://<jenkins>/github-webhook/ on Wed Oct 17 11:34:00 BST 2018
          Using strategy: Default
          [poll] Last Built Revision: Revision acc6ad05f1a8e98bc1cebec53ffdf695fad7fca2 (origin/<branch>)
           > git --version # timeout=30
          using GIT_SSH to set credentials <creds>
           > git ls-remote -h <repo> # timeout=30
          Found 347 remote heads on <repo>
          [poll] Latest remote head revision on refs/heads/<branch> is: acc6ad05f1a8e98bc1cebec53ffdf695fad7fca2 - already built by 186
          P4: Polling on: master with:<workspace>
          Done. Took 3.7 sec
          Changes found

           

           


          Russell Gallop added a comment -

          I have worked around this by adding a separate job which just does the polling and then triggers the Matrix + multi-SCM job. That seems to avoid the extraneous triggering.

          SCM Polling threads are a resource where the user is left to find the "correct" configuration. As such it would be useful if metrics were made available for things like:

          • How many SCM polling threads are busy
          • How long SCM poll tasks have had to wait
          • How long SCM poll tasks are taking

          That would help enormously in debugging problems such as this.
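
          In the meantime, a crude stand-in for the "how many polling threads are busy" metric can be scraped from a thread dump. A script console sketch (Groovy) that keys off the stack frames seen in the dumps earlier in this issue:

          // Heuristic only: count threads whose stacks are inside an SCM poll,
          // judged by frames such as AbstractProject.poll and GitHubPushTrigger.
          def polling = Thread.getAllStackTraces().findAll { thread, stack ->
              stack.any { frame ->
                  frame.toString().contains("AbstractProject.poll") ||
                  frame.toString().contains("GitHubPushTrigger")
              }
          }
          polling.each { thread, stack -> println "${thread.name} [${thread.state}]" }
          println "${polling.size()} thread(s) currently performing SCM polling"
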


            Assignee: Unassigned
            Reporter: Russell Gallop
            Votes: 1
            Watchers: 3

              Created:
              Updated: