• Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: core

      We are seeing repeatable, very heavy lock congestion with FingerPrint.save.

      Jenkins apparently degrades into a state where a lot of threads have locked, or are competing for a lock on, a FingerPrint instance and are competing for a lock on AnnotationMapper (a singleton). Everything grinds to a halt.

          [JENKINS-13154] Heavy thread congestion with FingerPrint.save

          Teppo Kurki added a comment -

          Clippings from a thread dump with evidence of the problem.

          Teppo Kurki added a comment -

          Suggestions for a fix

          • speed up serialization?
          • deferred write of the FingerPrint dependencies information
            • periodic flushing of dirty FingerPrints
            • create a copy of FingerPrint and queue that for writing, in effect making FingerPrint saving single threaded (which it sort of is already)
          • shard the XStream in FingerPrint: create a configurable-size array of XStream instances and use them in round-robin/random order, spreading the locks over the instances (a rough sketch of this idea follows below)
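
          A minimal, hypothetical sketch of the sharding idea above (class name, pool size, and selection strategy are all illustrative; this is not what was eventually committed):

          import com.thoughtworks.xstream.XStream;
          import java.util.concurrent.ThreadLocalRandom;

          /** Hypothetical sketch: spread fingerprint serialization over several XStream instances. */
          class ShardedXStream {
              private final XStream[] shards;

              ShardedXStream(int size) {
                  shards = new XStream[size];
                  for (int i = 0; i < size; i++) {
                      shards[i] = new XStream();       // each shard has its own mappers and locks
                  }
              }

              /** Serialize using a randomly chosen shard, spreading lock contention over 'size' instances. */
              String toXML(Object fingerprint) {
                  XStream shard = shards[ThreadLocalRandom.current().nextInt(shards.length)];
                  return shard.toXML(fingerprint);
              }
          }

          One drawback of this approach is that every shard would need identical configuration (aliases, converters, annotations), which may be part of why the fixes below target AnnotationMapper itself instead.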

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          xstream/src/java/com/thoughtworks/xstream/mapper/AnnotationMapper.java
          http://jenkins-ci.org/commit/xstream/bc2bd6df986cc54a90a013d014d4b1fa36a6e023
          Log:
          JENKINS-13154 removed hot lock contention

          annotatedTypes is used to keep track of classes whose annotations we've already processed. So on a sufficiently warm system, annotatedTypes.contains(type) should almost always return true, and moving that check out of the synchronization block eliminates the contention. For this, we use ConcurrentWeakHashMap instead of a regular WeakHashMap.

          The processTypes method still needs to run in a synchronized block so that concurrent calls of processAnnotation with the same class serialize their executions. It is safe to enter this synchronized block after the initial annotatedTypes.contains(type) check, as the processTypes method itself guards against repeated processing of the same type.
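
          A simplified sketch of the pattern this commit message describes (names are illustrative, not the actual AnnotationMapper code; the sketch also publishes a type only after processing completes, an ordering the follow-up commit below turns out to care about):

          import java.util.Set;
          import java.util.concurrent.ConcurrentHashMap;

          /** Illustrative sketch only: lock-free fast path over a concurrent set. */
          class AnnotationFastPathSketch {
              // Backed by a concurrent map so contains() is safe without any lock.
              // (The real fix used a weak concurrent collection; a plain one is used here for brevity.)
              private final Set<Class<?>> annotatedTypes = ConcurrentHashMap.newKeySet();

              void processAnnotations(Class<?> type) {
                  if (annotatedTypes.contains(type)) {
                      return;                        // warm system: almost always taken, no lock
                  }
                  synchronized (this) {              // cold path: one thread processes at a time
                      processTypes(type);
                  }
              }

              private void processTypes(Class<?> type) {
                  if (annotatedTypes.contains(type)) {
                      return;                        // guard against repeated processing
                  }
                  // ... scan annotations on 'type' and register converters (omitted) ...
                  annotatedTypes.add(type);          // publish only after processing completes
              }
          }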

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          changelog.html
          core/pom.xml
          http://jenkins-ci.org/commit/jenkins/2fbcc7300faf285e363e37c1b9c98074a1d910db
          Log:
          [FIXED JENKINS-13154] integrated the fix to XStream that removes the lock contention.

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          xstream/src/java/com/thoughtworks/xstream/mapper/AnnotationMapper.java
          http://jenkins-ci.org/commit/xstream/0903e75b981bc3ee91995890de91b800aebe11e6
          Log:
          JENKINS-13154 No, the previous fix was incomplete.

          The type gets added to annotatedTypes before its annotations are actually processed, so it is possible for one thread to be visiting those annotations while another thread comes in, finds the type already present, and carries on even though the first thread hasn't finished visiting annotations.

          So I added a second WeakHashSet to guard against this problem.

          The serializedClass field also needs to be concurrent-safe, as we don't seem to lock access to it.
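
          A sketch of the two-set arrangement described above (again with illustrative names; the real code uses weak collections inside AnnotationMapper): one concurrent set records fully processed types and is read lock-free, while a second, lock-guarded set ensures only one thread processes a given type.

          import java.util.HashSet;
          import java.util.Set;
          import java.util.concurrent.ConcurrentHashMap;

          /** Illustrative two-set sketch: lock-free "done" set plus a lock-guarded "started" set. */
          class TwoSetAnnotationSketch {
              // Added only AFTER a type is fully processed; read without a lock on the fast path.
              private final Set<Class<?>> processedTypes = ConcurrentHashMap.newKeySet();
              // Records types whose processing has started; only touched while holding the lock.
              private final Set<Class<?>> annotatedTypes = new HashSet<>();

              void processAnnotations(Class<?> type) {
                  if (processedTypes.contains(type)) {
                      return;                          // fully processed, nothing to do
                  }
                  synchronized (annotatedTypes) {
                      if (annotatedTypes.add(type)) {  // first thread to claim this type
                          // ... scan annotations and register converters (omitted) ...
                          processedTypes.add(type);    // publish completion for lock-free readers
                      }
                      // add() == false: an earlier holder of this lock already did the work
                  }
              }
          }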

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          core/pom.xml
          http://jenkins-ci.org/commit/jenkins/8dbdf8887dfcc8f478e5b6e9a34acdffd8e4d19b
          Log:
          [FIXED JENKINS-13154] bumping up further to -11.

          ... to fix a possible race condition.

          dogfood added a comment -

          Integrated in jenkins_main_trunk #1690
          [FIXED JENKINS-13154] integrated the fix to XStream that removes the lock contention. (Revision 2fbcc7300faf285e363e37c1b9c98074a1d910db)

          Result = UNSTABLE
          Kohsuke Kawaguchi : 2fbcc7300faf285e363e37c1b9c98074a1d910db
          Files :

          • changelog.html
          • core/pom.xml

          dogfood added a comment -

          Integrated in jenkins_main_trunk #1691
          [FIXED JENKINS-13154] bumping up further to -11. (Revision 8dbdf8887dfcc8f478e5b6e9a34acdffd8e4d19b)

          Result = UNSTABLE
          Kohsuke Kawaguchi : 8dbdf8887dfcc8f478e5b6e9a34acdffd8e4d19b
          Files :

          • core/pom.xml

          dogfood added a comment -

          Integrated in jenkins_ui-changes_branch #26
          [FIXED JENKINS-13154] integrated the fix to XStream that removes the lock contention. (Revision 2fbcc7300faf285e363e37c1b9c98074a1d910db)
          [FIXED JENKINS-13154] bumping up further to -11. (Revision 8dbdf8887dfcc8f478e5b6e9a34acdffd8e4d19b)

          Result = SUCCESS
          Kohsuke Kawaguchi : 2fbcc7300faf285e363e37c1b9c98074a1d910db
          Files :

          • core/pom.xml
          • changelog.html

          Kohsuke Kawaguchi : 8dbdf8887dfcc8f478e5b6e9a34acdffd8e4d19b
          Files :

          • core/pom.xml

          Mikko Peltonen added a comment -

          Jenkins 1.463 did not fix this problem for us. We're still seeing similar congestion, and many of our builds are stuck for tens of minutes in "Waiting for Jenkins to finish collecting data" at the end of the build.

          I'll attach another thread dump taken with 1.463.

          Mikko Peltonen added a comment -

          Thread dump taken with Jenkins 1.463, showing the problem still exists.

          Stefan Prietl added a comment -

          We experience the same problem with Jenkins 1.465 (and Maven 3.0.3). I also provided a thread dump and some traces. Don't know if this helps.

          When I have time (and fully understand the issue here), I'll also try to figure out what goes wrong.

          Stefan Prietl added a comment -

          Don't know if this is related to the issue but our "$JENKINS_HOME/fingerprints" folder contains 1.3GB of fingerprint data.

          kutzi added a comment -

          Stefan, are you sure your thread dumps are really from 1.465?
          For example, there's no add() method at line 699 - see
          https://github.com/jenkinsci/jenkins/blob/jenkins-1.465/core/src/main/java/hudson/model/Fingerprint.java

          Richard Mortimer added a comment -

          kutzi, there is an add method at line 699 of the URL you provide.

          kutzi added a comment -

          Sorry, you're right. Somehow GitHub displays the line numbers wrong in my browser.

          Stefan Prietl added a comment - edited

          Seems to me that the size of the "fingerprints" folder has quite an impact on the performance of "collecting data". Today I backed up and removed the fingerprint data (~1.3 GB) and restarted Jenkins. Data collection now takes ~10 seconds instead of 10-15 minutes.

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          xstream/src/java/com/thoughtworks/xstream/mapper/AnnotationMapper.java
          http://jenkins-ci.org/commit/xstream/d7ce9b30f56e0409c558dc1b34d0784e0755b818
          Log:
          JENKINS-13154

          Still seeing some reports of a contended lock on annotatedTypes after the initial fix landed in May 2012.

          This is really puzzling, as the processedTypes should warm up quickly
          and include all the relevant classes, so we shouldn't be getting to the
          synchronized block (unless WeakHashSet was silently dropping stuff in
          1.3.x code.)

          This probe will hopefully shed a bit more light on what's going on,
          and if we are lucky, XStream 1.4 removing "Weak" from these collections
          might resolve the problem entirely (fingers crossed.)
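
          Not the actual probe that was committed, just an illustration of the kind of instrumentation the log describes: count how often the slow path is entered and which types trigger it.

          import java.util.concurrent.atomic.AtomicLong;

          /** Hypothetical probe: a cheap counter around the contended slow path. */
          class SlowPathProbe {
              private static final AtomicLong MISSES = new AtomicLong();

              static void recordMiss(Class<?> type) {
                  long n = MISSES.incrementAndGet();
                  if (n % 1000 == 0) {   // log only occasionally to keep the probe cheap
                      System.err.println("AnnotationMapper slow path entered " + n
                              + " times; latest type: " + type.getName());
                  }
              }
          }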


          Mikko Peltonen added a comment -

          We are definitely still seeing this, and it's affecting us badly. The more simultaneous builds finish at roughly the same time, the worse the delay from "Waiting for Jenkins to finish collecting data" to "Finished: SUCCESS" gets (it may be tens of minutes in the worst case).

          Attached is a file with relevant parts of a thread dump taken with Jenkins 1.506, showing hundreds of blocked threads.

          Jesse Glick added a comment -

          Turns out that with a couple of printlns in the right places the problem becomes obvious.

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Jesse Glick
          Path:
          xstream/src/java/com/thoughtworks/xstream/mapper/AnnotationMapper.java
          http://jenkins-ci.org/commit/xstream/855b306cfcd03a07b873c59173d9e376a75c026a
          Log:
          JENKINS-13154 Heavy thread contention from AnnotationMapper.
          UnprocessedTypesSet.add will fail for java.** types.
          In this case we want to immediately cache the fact that this type has been considered.
          Otherwise we will constantly be acquiring the lock on annotatedTypes just to ignore e.g. java.lang.String.
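
          A hedged sketch of what that log message describes (simplified; the real UnprocessedTypesSet and surrounding AnnotationMapper code are more involved): once a java.** type has been rejected, remember that decision in the lock-free set so later lookups of, say, java.lang.String never take the lock again.

          import java.util.Set;
          import java.util.concurrent.ConcurrentHashMap;

          /** Illustrative sketch: cache the "nothing to process" decision for JDK types. */
          class SkipJdkTypesSketch {
              private final Set<Class<?>> annotatedTypes = ConcurrentHashMap.newKeySet();

              void processAnnotations(Class<?> type) {
                  if (annotatedTypes.contains(type)) {
                      return;                               // decision already cached, no lock
                  }
                  if (type.getName().startsWith("java.")) { // java.** types carry no XStream annotations
                      annotatedTypes.add(type);             // cache the decision immediately instead of
                      return;                               // re-acquiring the lock on every save
                  }
                  synchronized (this) {
                      // ... scan annotations and register converters (omitted) ...
                      annotatedTypes.add(type);
                  }
              }
          }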

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Jesse Glick
          Path:
          changelog.html
          core/pom.xml
          http://jenkins-ci.org/commit/jenkins/fdc090a3ae3830196b64535862425c6cd844d46b
          Log:
          [FIXED JENKINS-13154] AnnotationMapper bug was causing massive lock contention when saving fingerprints.

          Jesse Glick added a comment -

          And by the way, thank you @mp3 for your thread dump, which led to the fix.

          dogfood added a comment -

          Integrated in jenkins_main_trunk #2394
          [FIXED JENKINS-13154] AnnotationMapper bug was causing massive lock contention when saving fingerprints. (Revision fdc090a3ae3830196b64535862425c6cd844d46b)

          Result = SUCCESS
          Jesse Glick : fdc090a3ae3830196b64535862425c6cd844d46b
          Files :

          • changelog.html
          • core/pom.xml

          Eric Denman added a comment -

          FWIW, I just installed 1.510-SNAPSHOT (e837438c0138faf401bcb433450c8dfdc7ebbe6f) and our builds are still taking 5+ minutes on the "Waiting for Jenkins to finish collecting data" step, and the whole Jenkins app frequently stops responding to any requests. We have 228M of data in the "fingerprints" dir.

          I tried getting a thread dump via the /threadDump URL while it was in the "not responding" state, but that didn't work, so I used jstack -F. Attaching that dump.

          Eric Denman added a comment -

          @jglick: looks like my thread dump is blocked deep underneath Fingerprint.load, so it may not be the same issue. Should I re-open this ticket or file a new one?

          Jesse Glick added a comment -

          @edenman: separate issue please. (Can use JIRA’s “related” link if appropriate.) Your thread dump anyway does not show Jenkins being blocked, but rather excessively busy doing a couple different things: loading fingerprints; and loading historical build records (perhaps after having discarded them under memory pressure).

          Eric Denman added a comment -

          @jglick thanks! Filed JENKINS-17412

          Mikko Peltonen added a comment -

          We are still seeing lots of congestion with Jenkins 1.517, and our builds may still spend tens of minutes in the "Waiting for Jenkins to finish collecting data" phase.

          The place has changed, though; thread dumps now show lots of these:

          java.lang.Thread.State: BLOCKED (on object monitor)
              at java.util.Collections$SynchronizedMap.get(Collections.java:2031)
              - waiting to lock <0x000000070c6dcfa8> (a java.util.Collections$SynchronizedMap)
              at com.thoughtworks.xstream.core.DefaultConverterLookup.lookupConverterForType(DefaultConverterLookup.java:49)

          Mikko Peltonen added a comment -

          Attached relevant parts of the whole thread dump.

          Jesse Glick added a comment -

          @mp3 whatever you are seeing is a distinct bug. Please file separately and use JIRA’s “is related to” link as needed.

          (I would have suspected a regression from JENKINS-18775, but that is much newer than 1.517.)

            Assignee: Jesse Glick (jglick)
            Reporter: Teppo Kurki (t_kurki)
            Votes: 4
            Watchers: 12

              Created:
              Updated:
              Resolved: