
Cloud Statistics Plugin(?) causing perpetually busy jenkins.util.Timer threads when cloud has problems

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major

      I use the OpenStack Cloud plugin to dynamically provision cloud VMs from a volume snapshot. As the OpenStack plugin relies on the Cloud Statistics plugin, that's installed too. Once in a while, the cloud I provision from is afflicted with some problem or other, causing provisioning to fail. When that happens, bizarre things start to occur on the Jenkins server.

      (Symptom #1) Any declarative Pipeline job running on master begins to occupy its executor indefinitely, even after the job has "ended" with a green tick. This happens even with a trivial job written in declarative syntax:

      pipeline {
          agent { label 'master' }
          stages {
              stage('test') {
                  steps {
                      echo "foobar"
                  }
              }
          }
      }
      

      Upon closer inspection, it seems some part of Jenkins still thinks the job is running, because it shows up as such under "Build Executor Status" (see attachment: zombie-jobs.png). But the job cannot be stopped or killed. Deleting it (via /doDelete) changes the deleted job's name under "Build Executor Status" to "Unknown Pipeline node step", but does not remove it from that view or free up the executor. The only way to free up the executor is to restart Jenkins.

      This symptom has also been reported in https://issues.jenkins-ci.org/browse/JENKINS-51568.


      (Symptom #2) Any non-declarative Pipeline job running on master will perpetually hang on calling the sleep() function made available by the Pipeline: Basic Steps plugin. In my case, the sleep() call is nested within a timeout() call, which doesn't time out the job under these circumstances. Thread-dumping the hung job yields:

      Thread #6
      	at DSL.sleep(java.util.concurrent.TimeoutException)
      	at WorkflowScript.create_clusters(WorkflowScript:134)
      	at WorkflowScript.run(WorkflowScript:42)
      	at DSL.withCredentials(java.util.concurrent.TimeoutException)
      	at WorkflowScript.run(WorkflowScript:42)
      	at DSL.timestamps(java.util.concurrent.TimeoutException)
      	at WorkflowScript.run(WorkflowScript:42)
      	at DSL.timeout(java.util.concurrent.TimeoutException)
      	at WorkflowScript.run(WorkflowScript:42)
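
      For context, a minimal scripted Pipeline with the same step nesting as the dump above would look something like this (a sketch only; the credential ID and durations are hypothetical placeholders):

      // Minimal scripted Pipeline matching the nesting in the dump above.
      // 'openstack-creds' and the durations are hypothetical placeholders.
      node('master') {
          timeout(time: 30, unit: 'MINUTES') {
              timestamps {
                  withCredentials([string(credentialsId: 'openstack-creds', variable: 'OS_TOKEN')]) {
                      sleep(time: 150, unit: 'SECONDS')   // hangs while the Timer threads are jammed
                  }
              }
          }
      }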
      

      After taking a cue from one of the comments in https://issues.jenkins-ci.org/browse/JENKINS-51568, I noticed that when these symptoms occur, all 10 jenkins.util.Timer threads on my Jenkins server's thread dump page are permanently in a "RUNNABLE" state and display:

      	Number of locked synchronizers = 1
      	- java.util.concurrent.ThreadPoolExecutor$Worker@6794c8b7
      

      The stack trace for each thread makes reference to the Cloud Statistics plugin:

      	at org.jenkinsci.plugins.cloudstats.CloudStatistics.save(CloudStatistics.java:272)
      	at org.jenkinsci.plugins.cloudstats.CloudStatistics.persist(CloudStatistics.java:277)
      	at org.jenkinsci.plugins.cloudstats.CloudStatistics.attach(CloudStatistics.java:266)
      	at org.jenkinsci.plugins.cloudstats.CloudStatistics$ProvisioningListener.onFailure(CloudStatistics.java:477)
      	at org.jenkinsci.plugins.cloudstats.CloudStatistics$ProvisioningListener.lambda$onFailure$1(CloudStatistics.java:464)
      	at org.jenkinsci.plugins.cloudstats.CloudStatistics$ProvisioningListener$$Lambda$407/181161531.run(Unknown Source)
      

      When Jenkins is healthy, the threads are usually in a "WAITING" state and their stack traces don't make reference to any plugins. This makes me suspect the Cloud Statistics plugin is keeping the threads busy and preventing other parts of Jenkins that rely on them from operating normally.
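
      As a quick check, the shared Timer pool's activity can be inspected from the Jenkins Script Console (a sketch; it assumes the executor behind Timer.get() is a java.util.concurrent.ThreadPoolExecutor subclass):

      // Script Console sketch: report how busy the shared jenkins.util.Timer pool is.
      // Assumes Timer.get() is backed by a ThreadPoolExecutor subclass.
      import java.util.concurrent.ThreadPoolExecutor
      def timer = jenkins.util.Timer.get()
      if (timer instanceof ThreadPoolExecutor) {
          println "active threads: ${timer.activeCount} / pool size: ${timer.poolSize}"
      } else {
          println "unexpected executor type: ${timer.getClass()}"
      }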

      I most recently saw symptom #2 at around 06:45 UTC on 18 December 2019. The last successful 2.5-minute sleep() in a non-declarative Pipeline job before that was at 06:11:59 the same day. The attached org.jenkinsci.plugins.cloudstats.CloudStatistics.zip archive contains my $JENKINS_HOME/org.jenkinsci.plugins.cloudstats.CloudStatistics.xml from earlier today, when symptom #2 was being witnessed.


          Arnie Alpenbeach added a comment - Additional information:

          Unlike the sleep() function made available by the Pipeline: Basic Steps plugin, java.lang.Thread.sleep() does not hang when the jenkins.util.Timer threads are jammed. The former looks to be implemented using jenkins.util.Timer: https://github.com/jenkinsci/workflow-basic-steps-plugin/blob/master/src/main/java/org/jenkinsci/plugins/workflow/steps/SleepStep.java. I suspect that's why.
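
          If that reading of SleepStep is right, the pattern is roughly as follows (a paraphrased sketch, not the plugin's actual code): the step schedules its own completion callback on the shared Timer pool, so when all ten Timer threads are tied up the callback never fires and sleep() hangs.

          // Paraphrased sketch of the SleepStep pattern (not the plugin's actual code).
          // 'context' stands in for the step's StepContext and 'remainingMillis' for
          // the computed sleep duration; both are placeholders here.
          import java.util.concurrent.TimeUnit
          jenkins.util.Timer.get().schedule({
              context.onSuccess(null)   // completes the sleep() step
          } as Runnable, remainingMillis, TimeUnit.MILLISECONDS)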

          Oliver Gondža added a comment - There are several problems reported here at once.

          • Those zombie jobs are actually legitimate runs spending their time in the POST_PRODUCTION state. Its visualization is a bit strange, as you have described, but this is normal unless jobs spend too much time in that state, as they do now.
          • You seem to be running into a known issue of cloud-stats bloating with dangling activities in the PROVISIONING state, presumably left behind when the Jenkins master restarts. The XML file being serialized in all 10 of those threads is 70 MB, which scales poorly.
          • I cannot comment on the Pipeline sleep() issue.

          So, as a hotfix, removing org.jenkinsci.plugins.cloudstats.CloudStatistics.xml (which would throw away the statistics) should reduce the time builds spend in POST_PRODUCTION.
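
          Before deleting it, the file's size can be checked from the Script Console (a sketch; assumes the default $JENKINS_HOME layout):

          // Script Console sketch: check the size of the cloud-stats persistence file.
          // Deleting it discards the collected statistics, per the comment above.
          def f = new File(jenkins.model.Jenkins.get().rootDir,
                           'org.jenkinsci.plugins.cloudstats.CloudStatistics.xml')
          println f.exists() ? String.format('size: %.1f MB', f.length() / 1024.0 / 1024.0) : 'file not found'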

          Arnie Alpenbeach added a comment - Thanks for your reply. So, if I'm understanding this correctly, those timer threads are getting stuck serializing an XML file that has grown larger than it should due to a known issue in the cloud-stats plugin? If so, do you have any idea why this manifests only when cloud provisioning starts failing? I don't see those threads getting stuck like that when cloud provisioning is working, even though the XML file is still there and as large as ever.

          Additional information:

          In my case, the jenkins.util.Timer threads did eventually become available again several hours after cloud-stats locked them, even without deleting org.jenkinsci.plugins.cloudstats.CloudStatistics.xml. Any Pipeline job functionality that makes use of jenkins.util.Timer, including the sleep() function provided by Pipeline: Basic Steps, then started working again too.

          Oliver Gondža added a comment - Good point. It appears the Timer thread pool (which is shared by numerous components, including cloud-stats and Pipeline: Basic Steps) has its maximum number of threads set to 10, which is exactly the number of threads busy writing the XML in your thread dump. I speculate this could be the cause of your sleep() problem.
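
          That pool size is easy to confirm from the Script Console (a sketch; assumes Timer.get() is backed by a ScheduledThreadPoolExecutor):

          // Script Console sketch: confirm the shared Timer pool's thread count.
          import java.util.concurrent.ScheduledThreadPoolExecutor
          def t = jenkins.util.Timer.get()
          if (t instanceof ScheduledThreadPoolExecutor) {
              println "corePoolSize: ${t.corePoolSize}"   // 10 on an unmodified core, per the comment above
          }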

            Assignee: Oliver Gondža
            Reporter: Arnie Alpenbeach
            Votes: 0
            Watchers: 2