[JENKINS-24763] Long running OldDataMonitor.doDiscard() results in thread starvation

      A very large set of OldData can take a long time to discard. While this is running, most Jenkins operations which save any data are blocked.

      https://gist.github.com/recampbell/9336d3a32270e75a9333

      doDiscard is synchronized, and hudson.diagnosis.OldDataMonitor#remove also needs the lock on OldDataMonitor. As a result, most threads eventually block when saving something, at least until doDiscard completes.
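The contention described above can be reproduced in miniature. The following is a minimal sketch, not the Jenkins code: `CoarseLockDemo` and its latches are hypothetical stand-ins, and the only thing taken from the report is that both `doDiscard` and `remove` synchronize on the same object. It shows that a thread calling `remove` cannot proceed while another thread is still inside a long-running `doDiscard`.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a monitor whose discard and remove share one lock.
public class CoarseLockDemo {
    private final CountDownLatch discardStarted = new CountDownLatch(1);
    private final CountDownLatch releaseDiscard = new CountDownLatch(1);

    // Holds the object's monitor for as long as the "discard" takes.
    public synchronized void doDiscard() throws InterruptedException {
        discardStarted.countDown();
        releaseDiscard.await(); // stands in for many slow save operations
    }

    // Called on the save path; needs the same monitor as doDiscard.
    public synchronized void remove() {
        // nothing to do in this sketch; acquiring the lock is the point
    }

    // Returns true if a saver thread was blocked until the discard finished.
    public static boolean saverBlockedWhileDiscarding() {
        try {
            CoarseLockDemo monitor = new CoarseLockDemo();
            Thread discarder = new Thread(() -> {
                try {
                    monitor.doDiscard();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            discarder.start();
            monitor.discardStarted.await(); // discard now holds the lock

            CountDownLatch removed = new CountDownLatch(1);
            Thread saver = new Thread(() -> {
                monitor.remove();
                removed.countDown();
            });
            saver.start();

            // While the discard runs, the saver should not get through.
            boolean finishedEarly = removed.await(200, TimeUnit.MILLISECONDS);
            monitor.releaseDiscard.countDown(); // let the discard finish
            boolean finishedAfter = removed.await(5, TimeUnit.SECONDS);

            discarder.join();
            saver.join();
            return !finishedEarly && finishedAfter;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("saver blocked during discard: " + saverBlockedWhileDiscarding());
    }
}
```

In a real instance the "saver" role would be played by any thread that saves data while the discard is in progress, which is why the effect spreads so widely.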


          Ryan Campbell created issue -
          Daniel Beck added a comment -

          Appears to be a rather minor issue (with the workaround being to wait for ODM to finish).

          In what way is this a livelock?

          Daniel Beck made changes -
          Priority Original: Major [ 3 ] New: Minor [ 4 ]

          Ryan Campbell added a comment -

          In the meantime, no build can finish, and no slaves can come online. In the incident observed, the discard was taking more than an hour and Jenkins could not complete builds or do basically any other operation.

          It's unacceptable for Jenkins to be inoperable for minutes at a time, hence a higher severity.

          I suppose this is really just "starvation" instead of a livelock per se. Will update the description.

          Ryan Campbell made changes -
          Summary Original: Long running OldDataMonitor.doDiscard() results in livelocks New: Long running OldDataMonitor.doDiscard() results in thread starvation

          Ryan Campbell added a comment -

          Pull request: https://github.com/jenkinsci/jenkins/pull/1399

          Jesse Glick added a comment -

          What makes you think this is really a problem in OldDataMonitor? I look at the thread dump and I see JobConfigHistorySaveableListener being likely at fault.


          Daniel Beck added a comment -

          FWIW JobConfigHistorySaveableListener takes forever in my experience if you enable the deduplication option ("Do not save duplicate history"). I've long disabled it for that reason.


          Ryan Campbell added a comment -

          I see Jesse's point about JobConfigHistorySaveableListener – it's certainly the root cause.

          Nevertheless, OldDataMonitor could be more robust by reducing the scope of its lock, to better handle other poorly performing SaveableListeners. So I'll change this to an Improvement if that suits you.
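One way to picture "reducing the scope of the lock" is to hold it only while copying the set of entries, and to run the slow saves afterwards without holding it. The following is a minimal sketch under that assumption, not the change in the pull request: `NarrowLockMonitor`, its `Item` interface, and all method names are hypothetical, with `Item.save()` standing in for a save that triggers slow listeners.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical monitor that locks only around access to its own map.
public class NarrowLockMonitor {
    // Hypothetical stand-in for something that can be saved (slowly).
    public interface Item {
        void save();
    }

    private final Map<Item, String> data = new HashMap<>();

    public void report(Item item, String version) {
        synchronized (data) {
            data.put(item, version);
        }
    }

    public void remove(Item item) {
        synchronized (data) {
            data.remove(item);
        }
    }

    public int size() {
        synchronized (data) {
            return data.size();
        }
    }

    // Discards every entry, holding the lock only for the snapshot
    // and for each individual removal, never during save().
    public void discardAll() {
        List<Item> snapshot;
        synchronized (data) {
            snapshot = new ArrayList<>(data.keySet());
        }
        for (Item item : snapshot) {
            item.save();  // slow work runs without the lock held
            remove(item); // brief lock to drop the entry
        }
    }
}
```

With this shape, a slow listener still makes the discard itself slow, but other threads that only need `remove` or `report` wait for a map operation rather than for the whole discard. The trade-off is that entries reported after the snapshot is taken are not discarded in the same pass.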

          Ryan Campbell made changes -
          Issue Type Original: Bug [ 1 ] New: Improvement [ 4 ]
