  JENKINS-65262

High lock contention in Queue causes builds to not trigger


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Labels:
      None
    • Released As:
      github-branch-source-2.10.3

      Description

      We upgraded Jenkins from ~2.279 to 2.283 three weeks ago.

      For the last 1-2 weeks we've noticed builds often not being triggered when they should be: if you click 'Build now' it flashes for a second, but no job starts.

      I believe the job does start eventually, but minutes later.

      Attached are the results from 'collectPerformanceData.sh' from https://support.cloudbees.com/hc/en-us/articles/229795948-Required-Data-CJP-CJT-Hang-Issue-On-Linux

      Note that most of the tools the script wants aren't installed, as it's running in a Docker container.
      I took thread dumps every 5 seconds for 3 minutes.

      You can see most of the relevant info with:

      tar -xzf performanceData.7.output.tar.gz
      cd threads
      grep -A 10 -B 10 hudson.model.Queue.withLock *
      

      I'm not sure exactly what I'm looking for, but here's an excerpt:

      threads.7.20210330084345.txt-"Executor #-1 for master" #334602 daemon prio=5 os_prio=0 cpu=0.09ms elapsed=13.88s tid=0x00007f9e04604000 nid=0x5713 waiting on condition  [0x00007f9d4717f000]
      threads.7.20210330084345.txt-   java.lang.Thread.State: WAITING (parking)
      threads.7.20210330084345.txt-	at jdk.internal.misc.Unsafe.park(java.base@11.0.10/Native Method)
      threads.7.20210330084345.txt-	- parking to wait for  <0x000000046cdf2d10> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
      threads.7.20210330084345.txt-	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.10/LockSupport.java:194)
      threads.7.20210330084345.txt-	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.10/AbstractQueuedSynchronizer.java:885)
      threads.7.20210330084345.txt-	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.base@11.0.10/AbstractQueuedSynchronizer.java:917)
      threads.7.20210330084345.txt-	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@11.0.10/AbstractQueuedSynchronizer.java:1240)
      threads.7.20210330084345.txt-	at java.util.concurrent.locks.ReentrantLock.lock(java.base@11.0.10/ReentrantLock.java:267)
      threads.7.20210330084345.txt-	at hudson.model.Queue._withLock(Queue.java:1444)
      threads.7.20210330084345.txt:	at hudson.model.Queue.withLock(Queue.java:1304)
      threads.7.20210330084345.txt-	at hudson.model.Executor.run(Executor.java:347)
      threads.7.20210330084345.txt-
      threads.7.20210330084345.txt-   Locked ownable synchronizers:
      threads.7.20210330084345.txt-	- None
      threads.7.20210330084345.txt-
      threads.7.20210330084345.txt-"org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#19471]" #334603 daemon prio=5 os_prio=0 cpu=0.24ms elapsed=13.45s tid=0x00007f9ebc658000 nid=0x5714 waiting on condition  [0x00007f9ded3da000]
      threads.7.20210330084345.txt-   java.lang.Thread.State: TIMED_WAITING (parking)
      threads.7.20210330084345.txt-	at jdk.internal.misc.Unsafe.park(java.base@11.0.10/Native Method)
      threads.7.20210330084345.txt-	- parking to wait for  <0x0000000477c00dc8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
      threads.7.20210330084345.txt-	at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.10/LockSupport.java:234)
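      For illustration only (this is not Jenkins or plugin code; the LockContentionDemo class and thread names below are made up), here is a minimal standalone Java program that reproduces the state captured above: one thread holds a ReentrantLock while it waits on something slow, and every other thread that needs the same lock parks in ReentrantLock.lock(), just like the executor threads parked on the Queue lock's NonfairSync in the dump.

      import java.util.concurrent.locks.ReentrantLock;

      public class LockContentionDemo {
          private static final ReentrantLock queueLock = new ReentrantLock();

          public static void main(String[] args) throws InterruptedException {
              Thread holder = new Thread(() -> {
                  queueLock.lock();
                  try {
                      // Stand-in for slow work done while holding the lock,
                      // e.g. a GitHub API call waiting out a rate limit.
                      Thread.sleep(10_000);
                  } catch (InterruptedException ignored) {
                  } finally {
                      queueLock.unlock();
                  }
              }, "slow-lock-holder");

              Thread executor = new Thread(() -> {
                  queueLock.lock(); // parks here until the holder releases the lock
                  try {
                      System.out.println("would schedule builds now");
                  } finally {
                      queueLock.unlock();
                  }
              }, "Executor #-1 for master");

              holder.start();
              Thread.sleep(100); // let the holder grab the lock first
              executor.start();
              // A thread dump taken now shows the second thread WAITING (parking)
              // on the lock's NonfairSync, matching the excerpt above.
              holder.join();
              executor.join();
          }
      }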
      

        Attachments

          Activity

          timja Tim Jacomb added a comment -

          May be caused / impacted by JENKINS-64931 cc Olivier Lamy

          timja Tim Jacomb added a comment -

          Adding the checks-api and github-branch-source components, as jstack.review has found a deadlock.

          The deadlock can be found in threads.7.20210330083934.txt.

          Excerpt:

          "org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution [#2615]" #334426 daemon prio=5 os_prio=0 cpu=3.28ms elapsed=83.05s tid=0x00007f9ea4596000 nid=0x5085 waiting for monitor entry  [0x00007f9d4bfca000]
             java.lang.Thread.State: BLOCKED (on object monitor)
          	at org.jenkinsci.plugins.github_branch_source.Connector.connect(Connector.java:345)
          	- waiting to lock <0x00000004b6b2dbe0> (a java.util.HashMap)
          	at org.jenkinsci.plugins.github_branch_source.GitHubSCMSource.retrieve(GitHubSCMSource.java:1582)
          	at jenkins.scm.api.SCMSource.fetch(SCMSource.java:582)
          	at io.jenkins.plugins.checks.github.SCMFacade.findRevision(SCMFacade.java:156)
          	at io.jenkins.plugins.checks.github.GitHubSCMSourceChecksContext.resolveHeadSha(GitHubSCMSourceChecksContext.java:131)
          	at io.jenkins.plugins.checks.github.GitHubSCMSourceChecksContext.<init>(GitHubSCMSourceChecksContext.java:46)
          	at io.jenkins.plugins.checks.github.GitHubSCMSourceChecksContext.fromRun(GitHubSCMSourceChecksContext.java:24)
          	at io.jenkins.plugins.checks.github.GitHubChecksPublisherFactory.createPublisher(GitHubChecksPublisherFactory.java:42)
          	at io.jenkins.plugins.checks.api.ChecksPublisherFactory.lambda$fromRun$0(ChecksPublisherFactory.java:89)
          

          10 threads are waiting for the API rate limit checker to return:

          https://github.com/jenkinsci/github-branch-source-plugin/blob/github-branch-source-2.10.2/src/main/java/org/jenkinsci/plugins/github_branch_source/ApiRateLimitChecker.java#L287

          bitwiseman Liam Newman added a comment -

          Tim Jacomb
          That is where they are supposed to wait when there is rate limiting. If you look at your Jenkins logs you will probably see rate limiting happening.
          I think the behavior you're describing makes sense if limiting is occurring: you make the request to start a run, Jenkins tries to fetch the revision, and ends up waiting due to rate limiting.

          However, it seems odd that this would only start happening now. ... Hm, wait, I think I see a potential source of blocking: I added a call that can verify the connection during the connection lookup.

          https://github.com/jenkinsci/github-branch-source-plugin/blob/github-branch-source-2.10.2/src/main/java/org/jenkinsci/plugins/github_branch_source/Connector.java#L621

          If that has to wait for rate limiting, then all attempts to even look up a connection will block. I've created https://github.com/jenkinsci/github-branch-source-plugin/pull/405 to test removing that.
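          To make the shape of the problem concrete, here is a hedged sketch (not the plugin's actual code; the ConnectionLookupSketch class and its verifyConnection stand-in below are made up): if the lookup holds the shared connection map's monitor while it verifies the connection, and verification can wait for the rate limit to reset, then every other lookup blocks on the map for the whole wait.

          import java.util.HashMap;
          import java.util.Map;

          class ConnectionLookupSketch {
              private static final Map<String, Object> connections = new HashMap<>();

              static Object connect(String id) throws InterruptedException {
                  synchronized (connections) { // the HashMap monitor seen in the dump
                      Object connection = connections.computeIfAbsent(id, k -> new Object());
                      verifyConnection(connection); // may wait for the rate limit to reset
                      return connection;
                  }
              }

              private static void verifyConnection(Object connection) throws InterruptedException {
                  // Stand-in for a rate-limit wait: while this sleeps, every other
                  // connect() caller is BLOCKED on the connections monitor.
                  Thread.sleep(10_000);
              }

              public static void main(String[] args) throws InterruptedException {
                  connect("example-connection-id"); // hypothetical connection id
              }
          }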

          timja Tim Jacomb added a comment -

          The queue lock is being held there, though.
          There's over a minute where the queue is completely locked and no new builds can be scheduled.

          I would be a bit surprised if rate limiting was happening, as we are using a GitHub App, but I'll try to grep the logs tomorrow.

          timja Tim Jacomb added a comment -

          Looking at the 3 locks that are being held by the rate-limited thread:

          1. 0x0000000478e20b38 ThreadPoolExecutor$Worker
          https://github.com/jenkinsci/github-branch-source-plugin/blob/github-branch-source-2.10.2/src/main/java/org/jenkinsci/plugins/github_branch_source/GitHubBuildStatusNotification.java#L210

          Seems fine; the reason provided is to avoid holding the Queue lock.

          2. 0x00000004b6b2dbe0
          HashMap
          https://github.com/jenkinsci/github-branch-source-plugin/blob/github-branch-source-2.10.2/src/main/java/org/jenkinsci/plugins/github_branch_source/Connector.java#L345
          This is locking on any read or write to the connection map (Map<ConnectionId, GitHubConnection> connections).

          3. 0x00000004c2fdfd10
          Connector$GitHubConnection
          https://github.com/jenkinsci/github-branch-source-plugin/blob/github-branch-source-2.10.2/src/main/java/org/jenkinsci/plugins/github_branch_source/Connector.java#L673
          Appears to be an unneeded call to verifyConnection made when reading from the 'Map<ConnectionId, GitHubConnection> connections' here: https://github.com/jenkinsci/github-branch-source-plugin/blob/github-branch-source-2.10.2/src/main/java/org/jenkinsci/plugins/github_branch_source/Connector.java#L346

          Summary:

          2. This lock appears to be overly broad. Why do we need a read/write lock? Couldn't we use either a write lock only, or a lockless approach with a ConcurrentHashMap, accepting the possibility of more authentication API requests than are needed (see the sketch below)?
          3. Is addressed by Liam Newman in https://github.com/jenkinsci/github-branch-source-plugin/pull/405
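
          For point 2, one possible shape of the lockless alternative (an assumption, not the fix that was actually merged; ConnectionCacheSketch and the stub GitHubConnection below are illustrative only): a ConcurrentHashMap keyed by connection id, where racing callers might each authenticate once but never serialize behind a coarse monitor.

          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;

          class ConnectionCacheSketch {
              private final Map<String, GitHubConnection> connections = new ConcurrentHashMap<>();

              GitHubConnection connect(String id) {
                  GitHubConnection existing = connections.get(id);
                  if (existing != null) {
                      return existing;
                  }
                  // Built outside any lock, so two racing callers may both authenticate;
                  // that is the trade-off mentioned above: a few extra API requests
                  // instead of serializing every lookup behind one monitor.
                  GitHubConnection fresh = new GitHubConnection(id);
                  GitHubConnection raced = connections.putIfAbsent(id, fresh);
                  return raced != null ? raced : fresh;
              }

              /** Hypothetical stand-in for Connector.GitHubConnection. */
              static class GitHubConnection {
                  GitHubConnection(String id) {
                      // expensive authentication would happen here
                  }
              }
          }

          putIfAbsent keeps the map consistent when two callers race; the loser simply discards its freshly created connection.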

          bitwiseman Liam Newman added a comment -

          Tim Jacomb created https://github.com/jenkinsci/github-branch-source-plugin/pull/406 to address the issues he references above. We're collaborating on a full fix.

          I believe these changes will address this issue, but even if they don't, they are worth merging and releasing as they remove a potential source of deadlocks.

          ricardojdsilva87 Ricardo added a comment -

          Hello all,

          Thank you for your contributions on this issue.

          We have also started hitting the same issue with the ApiRateLimitChecker. It seems that the jobs in the queue stay locked up on the master node and everything gets stuck.

          We trigger the new jobs in separate pods on Kubernetes using the Kubernetes plugin, so the master node is basically only used for scheduling.

          We have also started configuring the GHE app on our GHE organisation pipelines and reconfigured the API checker to the "Throttle at near rate limit" option.

          I've seen that the PR is already merged; is there a release date already scheduled for a new version of the github-branch-source-plugin above 2.10.2 with this fix?

          Thanks again


            People

            Assignee:
             Unassigned
            Reporter:
            timja Tim Jacomb
            Votes:
             0
            Watchers:
             3

              Dates

              Created:
              Updated:
              Resolved: