  1. Jenkins
  2. JENKINS-60527

Cloud Statistics Plugin(?) causing perpetually busy jenkins.util.Timer threads when cloud has problems


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • None

      I use the OpenStack Cloud plugin to dynamically provision cloud VMs from a volume snapshot. As the OpenStack plugin relies on the Cloud Statistics plugin, that's installed too. Once in a while, the cloud I provision from is afflicted with some problem or other, causing provisioning to fail. When that happens, bizarre things start to occur on the Jenkins server.

      (Symptom #1) Any declarative Pipeline job running on master keeps occupying its executor indefinitely, even after the job has "ended" with a green tick. This happens even with a trivial declarative Pipeline job:

      pipeline {
          agent { label 'master' }
          stages {
              stage('test') {
                  steps {
                      echo "foobar"
                  }
              }
          }
      }
      

      Upon closer inspection, some part of Jenkins still seems to think the job is running, because it shows up as such under "Build Executor Status" (see attachment: zombie-jobs.png). The job cannot be stopped or killed. Deleting it (via /doDelete) changes the deleted job's name under "Build Executor Status" to "Unknown Pipeline node step", but does not remove it from that view or free up the executor. The only way to free up the executor is to restart Jenkins.

      This symptom has also been reported in https://issues.jenkins-ci.org/browse/JENKINS-51568.

       

      (Symptom #2) Any non-declarative Pipeline job running on master will perpetually hang on calling the sleep() function made available by the Pipeline: Basic Steps plugin. In my case, the sleep() call is nested within a timeout() call, which doesn't time out the job under these circumstances. Thread-dumping the hung job yields:

      Thread #6
      	at DSL.sleep(java.util.concurrent.TimeoutException)
      	at WorkflowScript.create_clusters(WorkflowScript:134)
      	at WorkflowScript.run(WorkflowScript:42)
      	at DSL.withCredentials(java.util.concurrent.TimeoutException)
      	at WorkflowScript.run(WorkflowScript:42)
      	at DSL.timestamps(java.util.concurrent.TimeoutException)
      	at WorkflowScript.run(WorkflowScript:42)
      	at DSL.timeout(java.util.concurrent.TimeoutException)
      	at WorkflowScript.run(WorkflowScript:42)
      

      After taking a cue from one of the comments in https://issues.jenkins-ci.org/browse/JENKINS-51568, I noticed that when these symptoms occur, all 10 jenkins.util.Timer threads on my Jenkins server's thread dump page are permanently in a "RUNNABLE" state and display:

      	Number of locked synchronizers = 1
      	- java.util.concurrent.ThreadPoolExecutor$Worker@6794c8b7
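
The same state check can also be done programmatically instead of reading the thread dump page. The following is a minimal sketch, not part of the original report: the class and method names are hypothetical, and the only detail taken from the report is the "jenkins.util.Timer" thread-name prefix. It tallies the states of all threads whose names start with a given prefix.

```java
import java.util.Collection;
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical helper: tallies thread states for threads matching a name prefix. */
final class TimerThreadStates {

    /** Counts, per Thread.State, the threads whose name starts with namePrefix. */
    static Map<Thread.State, Integer> countByState(Collection<Thread> threads, String namePrefix) {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread t : threads) {
            if (t.getName().startsWith(namePrefix)) {
                counts.merge(t.getState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Thread.getAllStackTraces() returns a snapshot of all live threads in this JVM.
        // The prefix is the thread name reported in this issue.
        System.out.println(countByState(Thread.getAllStackTraces().keySet(), "jenkins.util.Timer"));
    }
}
```

Run inside the Jenkins JVM, a result in which every matching thread is RUNNABLE would correspond to the unhealthy situation described here. Run in any other JVM, no thread matches the prefix and the map is empty.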
      

      The stack trace for each thread makes reference to the Cloud Statistics plugin:

      	at org.jenkinsci.plugins.cloudstats.CloudStatistics.save(CloudStatistics.java:272)
      	at org.jenkinsci.plugins.cloudstats.CloudStatistics.persist(CloudStatistics.java:277)
      	at org.jenkinsci.plugins.cloudstats.CloudStatistics.attach(CloudStatistics.java:266)
      	at org.jenkinsci.plugins.cloudstats.CloudStatistics$ProvisioningListener.onFailure(CloudStatistics.java:477)
      	at org.jenkinsci.plugins.cloudstats.CloudStatistics$ProvisioningListener.lambda$onFailure$1(CloudStatistics.java:464)
      	at org.jenkinsci.plugins.cloudstats.CloudStatistics$ProvisioningListener$$Lambda$407/181161531.run(Unknown Source)
      

      When Jenkins is healthy, the threads are usually in a "WAITING" state and their stack traces don't make reference to any plugins. This makes me suspect the Cloud Statistics plugin is keeping the threads busy and preventing other parts of Jenkins that rely on them from operating normally.
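
To illustrate why permanently busy timer threads could produce both symptoms, here is a minimal sketch of thread-pool starvation. It rests on assumptions, not on anything confirmed in this report: it assumes the jenkins.util.Timer pool behaves like a fixed-size java.util.concurrent scheduled thread pool and that steps such as sleep() and timeout() schedule their wake-ups on it. The class and method names are hypothetical, and a latch that is never released stands in for the CloudStatistics.save() calls that never return.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Illustration only: a saturated fixed-size scheduler cannot run newly due tasks. */
final class TimerStarvationSketch {

    /**
     * Occupies busyTasks workers of a scheduler with poolSize threads, then schedules
     * a short wake-up task. Returns true if the wake-up ran within waitMillis.
     */
    static boolean wakeUpRuns(int poolSize, int busyTasks, long waitMillis) {
        ScheduledExecutorService timer = Executors.newScheduledThreadPool(poolSize);
        CountDownLatch release = new CountDownLatch(1);         // never released while we wait
        CountDownLatch allBusy = new CountDownLatch(busyTasks); // signals that the workers are occupied
        CountDownLatch wokeUp = new CountDownLatch(1);
        try {
            for (int i = 0; i < busyTasks; i++) {
                timer.execute(() -> {
                    allBusy.countDown();
                    try {
                        release.await(); // stand-in for work that never completes
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            allBusy.await();
            // Stand-in for the wake-up a sleep or timeout would need.
            timer.schedule(wokeUp::countDown, 10, TimeUnit.MILLISECONDS);
            return wokeUp.await(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown();
            timer.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("all 10 workers busy: " + wakeUpRuns(10, 10, 300));
        System.out.println("one worker free:     " + wakeUpRuns(10, 9, 2000));
    }
}
```

With every worker occupied, the wake-up task stays queued and never fires, which would match a sleep() that never returns and a timeout() that never triggers. Leaving a single worker free is enough for the wake-up to run.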

      I most recently saw symptom #2 at around 06:45 UTC on 18 December 2019. The last successful 2.5-minute sleep() in a non-declarative Pipeline job before that was at 06:11:59 on the same day. The attached org.jenkinsci.plugins.cloudstats.CloudStatistics.zip archive contains my $JENKINS_HOME/org.jenkinsci.plugins.cloudstats.CloudStatistics.xml from earlier today, while symptom #2 was occurring.

        1. java.util.Thread-dumps.log
          56 kB
          Arnie Alpenbeach
        2. serialized-program-state-during-sleep-hang.xml
          320 kB
          Arnie Alpenbeach
        3. zombie-jobs.png
          26 kB
          Arnie Alpenbeach

            Assignee: Oliver Gondža (olivergondza)
            Reporter: Arnie Alpenbeach
            Votes: 0
            Watchers: 2