[JENKINS-56480] Threads in TIMED_WAITING causing build agents to hold on to completed tasks

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • None
    • openjdk version "1.8.0_191"
      OpenJDK Runtime Environment (build 1.8.0_191-b12)
      OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
      OS: Amazon Linux 2018.03
      Jenkins: 2.138
      Bitbucket branch source: 2.2.12, 2.4.0, 2.4.2

      A few times a day the executors on our build agents fill up and do not release completed tasks. The job finishes and is marked as success or failure, but its executors are still listed as active on the node, causing other jobs to queue up waiting for a free slot.

      After tracing through this for a while, I think the problem is caused by the Bitbucket Branch Source plugin. This is supported by the fact that none of the jobs that have their code in GitHub display this behaviour.

      From a thread dump taken when this happens, I can see 10 threads stuck in TIMED_WAITING, for example: "com.cloudbees.jenkins.plugins.bitbucket.hooks.PullRequestHookProcessor$1 Thu Mar 07 15:45:43 UTC 2019 / jenkins.util.Timer 1" Id=20 Group=main TIMED_WAITING. I have attached a partial thread dump from when this happened yesterday (taken when we rolled back to plugin version 2.2.12).
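
For anyone who wants to check for the same symptom without reading a full thread dump, here is a minimal sketch that counts live threads by state and lists those in TIMED_WAITING whose name contains a given fragment. The class name `ThreadStateSummary` and its methods are hypothetical, and the sketch assumes it runs inside the JVM being inspected (it only sees threads of its own process).

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Jenkins or the Bitbucket plugin.
public final class ThreadStateSummary {
    private ThreadStateSummary() {}

    /** Counts the live threads of this JVM, grouped by Thread.State. */
    public static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(t.getState(), 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the names of live threads in the given state whose name contains nameFragment. */
    public static List<String> namesInState(Thread.State state, String nameFragment) {
        List<String> names = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getState() == state && t.getName().contains(nameFragment)) {
                names.add(t.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(countByState());
        // The fragment is taken from the thread name quoted in this issue.
        namesInState(Thread.State.TIMED_WAITING, "PullRequestHookProcessor")
                .forEach(System.out::println);
    }
}
```

A thread in TIMED_WAITING is not necessarily stuck (a sleeping timer thread is in the same state), so the count is only a hint; the stack trace in the dump is still needed to see what each thread is waiting on.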

      I'm not sure what is causing this behaviour or how to resolve it, so any help would be greatly appreciated. I have looked at the plugin source code referenced in the stack trace in the thread dump and can see that the wait is caused by an API rate limit. However, our jobs shouldn't be contacting Bitbucket more than two or three times per build (initial receipt of the webhook, checkout of the code, and a possible push of a tag). Not every task needs Bitbucket, yet each individual task is holding on to its executor. The attached image Screenshot-stuck-executors.png shows a section of the executor status when this is happening (some of our older build jobs have node blocks every few commands; the newer jobs have one node block wrapping the whole job, but they still hang in the executor list).

      This has been happening to us for over a month now, and I can't think of any config change we would have made other than the plugin version. I have tried rolling back to a previous version of the plugin to see if that would fix our issue, but nothing has made a difference so far.

      I can work around this by increasing the number of executors on the affected build agent to a really large number and waiting a few hours for the problem to resolve itself.


          Olivier Jacques added a comment -

          Although we are not using Bitbucket (we are a pure GitHub shop), we observe a similar symptom: tasks run to completion, but the worker never releases them.

          In the slave thread dump, we get:

          "jenkins WAITING on On Slaves: java.util.concurrent.locks.AbstractQueuedSynchronizer"

          So as not to pollute your issue, we have a separate one: JENKINS-56441

          Nikolas Falco added a comment -

          Is this issue present in the latest plugin version?


          Olivier Dagenais added a comment - edited

          I had fixed an issue like this via jenkinsci/bitbucket-branch-source-plugin#135: Restore timeouts in BitbucketServerAPIClient, which looks like it was released as early as version 2.2.13 (which would explain why it was seen in version 2.2.12), but I don't know if those changes were kept across other versions. Also, my testing was limited to Bitbucket Server/DC, as we weren't using Bitbucket Cloud. As my PR pointed out, the implementations had diverged on the HttpClient configurations.

          Looking at the source code for the most recent version, we see that lines 139-147 of src/main/java/com/cloudbees/jenkins/plugins/bitbucket/impl/client/AbstractBitbucketApi.java at 935.0.0 in jenkinsci/bitbucket-branch-source-plugin perform the same function as those I had added, also at a "choke point" that should configure all HttpClient instances appropriately:

                  int connectTimeout = Integer.getInteger("http.connect.timeout", 10);
                  int connectionRequestTimeout = Integer.getInteger("http.connect.request.timeout", 60);
                  int socketTimeout = Integer.getInteger("http.socket.timeout", 60);
          
                  RequestConfig config = RequestConfig.custom()
                          .setConnectTimeout(connectTimeout * 1000)
                          .setConnectionRequestTimeout(connectionRequestTimeout * 1000)
                          .setSocketTimeout(socketTimeout * 1000)
                          .build();
          

          I haven't performed a full audit of the source code to make sure that all HttpClient instances are created via the HttpClientBuilder instance created by setupClientBuilder(), but that would be a good place to start.
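
To make the units in the snippet above explicit, here is a minimal, standard-library-only sketch of how those three values are resolved: each is read in seconds from a JVM system property via `Integer.getInteger`, falls back to the default shown above, and is multiplied by 1000 to get milliseconds. The class `TimeoutPropertySketch` and its methods are hypothetical names for illustration and are not part of the plugin; only the property names and defaults come from the quoted source.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; mirrors the property names and defaults quoted above.
public final class TimeoutPropertySketch {
    private TimeoutPropertySketch() {}

    /**
     * Reads a timeout in seconds from a system property and converts it to milliseconds.
     * Integer.getInteger returns the default when the property is unset or not a valid integer.
     */
    public static int millisFromProperty(String property, int defaultSeconds) {
        return Integer.getInteger(property, defaultSeconds) * 1000;
    }

    /** Resolves the three timeouts with the defaults from the quoted snippet (10, 60, 60 seconds). */
    public static Map<String, Integer> effectiveTimeoutsMillis() {
        Map<String, Integer> timeouts = new LinkedHashMap<>();
        timeouts.put("http.connect.timeout", millisFromProperty("http.connect.timeout", 10));
        timeouts.put("http.connect.request.timeout", millisFromProperty("http.connect.request.timeout", 60));
        timeouts.put("http.socket.timeout", millisFromProperty("http.socket.timeout", 60));
        return timeouts;
    }

    public static void main(String[] args) {
        effectiveTimeoutsMillis().forEach(
                (name, millis) -> System.out.println(name + " = " + millis + " ms"));
    }
}
```

Because these are plain system properties, they could presumably be overridden with `-D` flags on the JVM that runs the plugin code (for example `-Dhttp.socket.timeout=120`), though I have not verified that every code path honours them.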


            nfalco Nikolas Falco
            bmcmath Bill McMath
            Votes: 1
            Watchers: 8