• Bug
    • Resolution: Fixed
    • Minor
    • cloud-stats-plugin
    • None
    • Operating System: RHEL 7.4 Linux 64 bit, Kernel 4.4.114-94.11-default
      JDK: OpenJDK 1.8.0_161-b14
      Jenkins version: 2.150.2
      cloud-stats-plugin version: 0.22
    • 2.23

      Hey,

      the cloud-stats-plugin keeps track of every provisioning activity in order to build the statistics that are accessible via <Jenkins-Base-URL>/cloud-stats/. The underlying data is persisted in <Jenkins-Home>/org.jenkinsci.plugins.cloudstats.CloudStatistics.xml. However, there seems to be no rotation mechanism, so that file grows continually.

      In our case that file reached 150 MB which caused various issues:

      1. Provisioning time took up to 15 minutes, depending on the number of currently requested agents to provision. There are about 50 short-lived (per build) agents connected at any time.
      2. CPU usage was very high. Jenkins (the master only) runs on a 24-core machine with 128 GB of RAM and used about 800%-2000% CPU.

      I assume that most of that CPU time is spent on XML serialization and on synchronization operations.

      Workaround: Manually delete the persistence file and restart Jenkins. After we pruned that persistence file, our CPU usage went down to 20%-200%. Provisioning times went down to about 10 seconds, regardless of load.

          [JENKINS-56863] Cloud stats persistence file grows continually

          To add to this: there is actually a default limit of 100 entries to be logged, and the persistence file respects it:

          xmlstarlet sel -t -v 'count(//org.jenkinsci.plugins.cloudstats.CloudStatistics/log/data/org.jenkinsci.plugins.cloudstats.ProvisioningActivity)' org.jenkinsci.plugins.cloudstats.CloudStatistics.xml.bak
          100
          

          However, there are lots of active provisionings in that 150MB file:

          xmlstarlet sel -t -v 'count(//org.jenkinsci.plugins.cloudstats.CloudStatistics/active/org.jenkinsci.plugins.cloudstats.ProvisioningActivity)' org.jenkinsci.plugins.cloudstats.CloudStatistics.xml.bak
          52680
          

          24h after purging the persistence file, the number of active provisionings already reached 3225.

          Dennis Keitzel added the comment above.
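The two xmlstarlet counts above can also be reproduced without extra tooling. The following is a minimal sketch in Python, assuming that org.jenkinsci.plugins.cloudstats.CloudStatistics is the root element of the persistence file (the XPath expressions in the report suggest this, but it is not confirmed here). The function name and the tiny SAMPLE document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Element name taken from the XPath expressions in the report.
ACTIVITY = "org.jenkinsci.plugins.cloudstats.ProvisioningActivity"


def count_activities(xml_text):
    """Return (archived, active) activity counts.

    Assumes the CloudStatistics element is the document root, so the
    paths below are relative to it.
    """
    root = ET.fromstring(xml_text.strip())
    archived = len(root.findall(f"./log/data/{ACTIVITY}"))
    active = len(root.findall(f"./active/{ACTIVITY}"))
    return archived, active


# Hypothetical, heavily trimmed stand-in for the real persistence file.
SAMPLE = """
<org.jenkinsci.plugins.cloudstats.CloudStatistics>
  <log><data>
    <org.jenkinsci.plugins.cloudstats.ProvisioningActivity/>
  </data></log>
  <active>
    <org.jenkinsci.plugins.cloudstats.ProvisioningActivity/>
    <org.jenkinsci.plugins.cloudstats.ProvisioningActivity/>
  </active>
</org.jenkinsci.plugins.cloudstats.CloudStatistics>
"""

archived, active = count_activities(SAMPLE)
print(f"archived={archived} active={active}")
```

For a real file, the text could be read from a copy of the persistence file rather than from the live one, so that Jenkins is not disturbed while it writes.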

          The problem seems to be that the activities are not completed as they should be, and hence are never considered for rotation. Which plugin does the provisioning? Can you attach one of the offending (//org.jenkinsci.plugins.cloudstats.CloudStatistics/log/data/org.jenkinsci.plugins.cloudstats.ProvisioningActivity) elements so I can investigate?

          Oliver Gondža added the comment above.

          The plugin used for provisioning is the yet-another-docker-plugin.

          Here is one //org.jenkinsci.plugins.cloudstats.CloudStatistics/log/data/org.jenkinsci.plugins.cloudstats.ProvisioningActivity entry:

          <org.jenkinsci.plugins.cloudstats.ProvisioningActivity>
            <id>
              <cloudName>Swarm</cloudName>
              <templateName>the-image:latest</templateName>
              <fingerprint>660965903</fingerprint>
            </id>
            <name>Swarm-65034ee5fc80</name>
            <progress class="java.util.Collections$SynchronizedMap" serialization="custom">
              <java.util.Collections_-SynchronizedMap>
                <default>
                  <m class="linked-hash-map">
                    <entry>
                      <org.jenkinsci.plugins.cloudstats.ProvisioningActivity_-Phase>PROVISIONING</org.jenkinsci.plugins.cloudstats.ProvisioningActivity_-Phase>
                      <org.jenkinsci.plugins.cloudstats.PhaseExecution>
                        <attachments class="java.util.concurrent.CopyOnWriteArrayList" />
                        <started>1552358629697</started>
                        <phase>PROVISIONING</phase>
                      </org.jenkinsci.plugins.cloudstats.PhaseExecution>
                    </entry>
                    <entry>
                      <org.jenkinsci.plugins.cloudstats.ProvisioningActivity_-Phase>LAUNCHING</org.jenkinsci.plugins.cloudstats.ProvisioningActivity_-Phase>
                      <org.jenkinsci.plugins.cloudstats.PhaseExecution>
                        <attachments class="java.util.concurrent.CopyOnWriteArrayList" />
                        <started>1552358639701</started>
                        <phase>LAUNCHING</phase>
                      </org.jenkinsci.plugins.cloudstats.PhaseExecution>
                    </entry>
                    <entry>
                      <org.jenkinsci.plugins.cloudstats.ProvisioningActivity_-Phase>OPERATING</org.jenkinsci.plugins.cloudstats.ProvisioningActivity_-Phase>
                      <org.jenkinsci.plugins.cloudstats.PhaseExecution>
                        <attachments class="java.util.concurrent.CopyOnWriteArrayList" />
                        <started>1552358643963</started>
                        <phase>OPERATING</phase>
                      </org.jenkinsci.plugins.cloudstats.PhaseExecution>
                    </entry>
                    <entry>
                      <org.jenkinsci.plugins.cloudstats.ProvisioningActivity_-Phase>COMPLETED</org.jenkinsci.plugins.cloudstats.ProvisioningActivity_-Phase>
                      <org.jenkinsci.plugins.cloudstats.PhaseExecution>
                        <attachments class="java.util.concurrent.CopyOnWriteArrayList" />
                        <started>1552358907655</started>
                        <phase>COMPLETED</phase>
                      </org.jenkinsci.plugins.cloudstats.PhaseExecution>
                    </entry>
                  </m>
                  <mutex class="java.util.Collections$SynchronizedMap" reference="../../.." />
                </default>
              </java.util.Collections_-SynchronizedMap>
            </progress>
          </org.jenkinsci.plugins.cloudstats.ProvisioningActivity>
          

          An entry in //org.jenkinsci.plugins.cloudstats.CloudStatistics/active/org.jenkinsci.plugins.cloudstats.ProvisioningActivity is basically identical, apart from the timestamps and the fingerprint IDs. In fact, every fingerprint ID is different: there are no duplicate fingerprint IDs among the provisioning activities, in case that helps.
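The started values in the entry above look like Unix epoch timestamps in milliseconds (an assumption based on their magnitude). Under that reading, a short Python sketch can turn them into per-phase durations. The XML below is a trimmed stand-in that keeps only the PhaseExecution elements and their values from the entry; the wrapper element and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

PHASE_EXECUTION = "org.jenkinsci.plugins.cloudstats.PhaseExecution"

# Trimmed stand-in: only the PhaseExecution elements of the entry above.
ENTRY = """
<progress>
  <org.jenkinsci.plugins.cloudstats.PhaseExecution>
    <started>1552358629697</started><phase>PROVISIONING</phase>
  </org.jenkinsci.plugins.cloudstats.PhaseExecution>
  <org.jenkinsci.plugins.cloudstats.PhaseExecution>
    <started>1552358639701</started><phase>LAUNCHING</phase>
  </org.jenkinsci.plugins.cloudstats.PhaseExecution>
  <org.jenkinsci.plugins.cloudstats.PhaseExecution>
    <started>1552358643963</started><phase>OPERATING</phase>
  </org.jenkinsci.plugins.cloudstats.PhaseExecution>
  <org.jenkinsci.plugins.cloudstats.PhaseExecution>
    <started>1552358907655</started><phase>COMPLETED</phase>
  </org.jenkinsci.plugins.cloudstats.PhaseExecution>
</progress>
"""


def phase_starts(xml_text):
    """Return [(phase, started_ms), ...] in document order."""
    root = ET.fromstring(xml_text.strip())
    return [
        (e.findtext("phase"), int(e.findtext("started")))
        for e in root.iter(PHASE_EXECUTION)
    ]


def phase_durations_ms(starts):
    """Duration of each phase: start of the next phase minus its own start."""
    return {
        name: next_start - start
        for (name, start), (_, next_start) in zip(starts, starts[1:])
    }


def to_utc(ms):
    """Convert epoch milliseconds to a UTC datetime (whole seconds only)."""
    return datetime.fromtimestamp(ms // 1000, tz=timezone.utc)


starts = phase_starts(ENTRY)
durations = phase_durations_ms(starts)
print(to_utc(starts[0][1]).isoformat())
for name, ms in durations.items():
    print(f"{name}: {ms} ms")
```

Read this way, the sample agent spent about 10 seconds provisioning, about 4 seconds launching, and a little under 4.5 minutes operating before it reached COMPLETED.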

          I guess you're hinting at provisioning activities that were not completed correctly. So I also compared the counts of PROVISIONING, LAUNCHING, OPERATING and COMPLETED within //org.jenkinsci.plugins.cloudstats.CloudStatistics/active/org.jenkinsci.plugins.cloudstats.ProvisioningActivity (shown here for COMPLETED: xmlstarlet sel -t -c '//org.jenkinsci.plugins.cloudstats.CloudStatistics/active/org.jenkinsci.plugins.cloudstats.ProvisioningActivity' org.jenkinsci.plugins.cloudstats.CloudStatistics.xml.copy | grep COMPLETED | wc -l):

          • PROVISIONING: 7232
          • LAUNCHING: 7232
          • OPERATING: 7229
          • COMPLETED: 7204

          This seems plausible, as the difference is explained by the number of active agents at that time.

          Let me know if you need more information, happy to help.

          Dennis Keitzel added the comment above.
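The grep-based counts above tally matching lines, so they depend on how the XML happens to be laid out. A per-activity check avoids that. The sketch below, in Python, counts phases per activity under active and lists the activities that already carry a COMPLETED phase. It assumes CloudStatistics is the document root; the sample document simplifies the nesting inside progress, and the agent names and helper names are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

ACTIVITY = "org.jenkinsci.plugins.cloudstats.ProvisioningActivity"
PHASE_EXECUTION = "org.jenkinsci.plugins.cloudstats.PhaseExecution"


def audit_active(xml_text):
    """Return (phase_counts, completed_names) for activities under <active>.

    Assumes the CloudStatistics element is the document root.
    """
    root = ET.fromstring(xml_text.strip())
    counts = Counter()
    completed = []
    for activity in root.findall(f"./active/{ACTIVITY}"):
        # iter() searches at any depth, so the exact nesting does not matter.
        phases = [e.findtext("phase") for e in activity.iter(PHASE_EXECUTION)]
        counts.update(phases)
        if "COMPLETED" in phases:
            completed.append(activity.findtext("name"))
    return counts, completed


def _activity(name, phases):
    """Build a simplified, hypothetical activity element for the sample."""
    executions = "".join(
        f"<{PHASE_EXECUTION}><phase>{p}</phase></{PHASE_EXECUTION}>"
        for p in phases
    )
    return f"<{ACTIVITY}><name>{name}</name><progress>{executions}</progress></{ACTIVITY}>"


SAMPLE = (
    "<org.jenkinsci.plugins.cloudstats.CloudStatistics><active>"
    + _activity("agent-a", ["PROVISIONING", "LAUNCHING", "OPERATING", "COMPLETED"])
    + _activity("agent-b", ["PROVISIONING", "LAUNCHING", "OPERATING"])
    + "</active></org.jenkinsci.plugins.cloudstats.CloudStatistics>"
)

counts, completed = audit_active(SAMPLE)
print(dict(counts))
print("completed but still active:", completed)
```

In the reported numbers, 7232 - 7204 = 28 activities had no COMPLETED phase yet, which is the part explained by agents that were still running. The remaining entries are the ones a check like this would list.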

          Thanks. Activities that have reached the COMPLETED phase are not expected to remain in active. I managed to reproduce that; it seems to be a recent regression.

          Oliver Gondža added the comment above.
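To illustrate why this matters for file size, here is a toy model in Python. It is not the plugin's implementation: the class, its methods, and the rotate flag are hypothetical and only mirror what the thread describes, namely a bounded log (default limit of 100 entries, per the report) and an active collection that completed activities are expected to leave.

```python
from collections import deque


class ActivityStore:
    """Toy model: a bounded log of finished activities plus an active set."""

    def __init__(self, log_limit=100):
        self.active = {}                    # activity id -> current phase
        self.log = deque(maxlen=log_limit)  # oldest entries drop off automatically

    def start(self, activity_id):
        self.active[activity_id] = "PROVISIONING"

    def complete(self, activity_id, rotate=True):
        """Mark an activity completed.

        rotate=False imitates the reported symptom: the activity reaches
        COMPLETED but is never moved out of the active collection.
        """
        self.active[activity_id] = "COMPLETED"
        if rotate:
            del self.active[activity_id]
            self.log.append(activity_id)


def simulate(total, rotate):
    """Run `total` short-lived activities and return (active, logged) sizes."""
    store = ActivityStore()
    for activity_id in range(total):
        store.start(activity_id)
        store.complete(activity_id, rotate=rotate)
    return len(store.active), len(store.log)


print("with rotation:   ", simulate(250, rotate=True))
print("without rotation:", simulate(250, rotate=False))
```

With rotation, the stored data stays bounded by the log limit no matter how many agents come and go. Without it, the active collection grows by one entry per agent, which matches the growth pattern in the report.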

          Fix proposed: https://github.com/jenkinsci/cloud-stats-plugin/pull/16. I would like to give it some internal test time before releasing.

          Oliver Gondža added the comment above.

          Thank you for taking care of this issue so quickly, much appreciated!

          Dennis Keitzel added the comment above.

          Resolved in 2.23. Thanks for the report!

          Oliver Gondža added the comment above.

            olivergondza Oliver Gondža
            seaspu Dennis Keitzel