Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-43599

Distributed locks occasionally failing across nodes

    • Icon: Bug Bug
    • Resolution: Unresolved
    • Icon: Minor Minor

      Are pipeline locks expected to behave when used by two nodes simultaneously?

      My Jenkinsfile defines the following helper -

      Jenkinsfile
      def log(msg) {
        echo((new Date().format("yyyy-MM-dd'T'HH:mm:ss: ")) + msg)
      }
      
      def lockDB(number, callback) {
        try {
          log "${number}: Acquiring lock ..."
          lock(resource: "boom-test-database-${number}", inversePrecedence: true) {
            log "${number}: Acquired!"
            callback()
            log "${number}: Releasing..."
          }
        } finally {
          log "${number}: Released!"
        }
      }
      

      I somewhat frequently see two nodes manage to acquire the same lock simultaneously, as in the following example, where master locked a resource from 14:45:57 -> 14:52:04, and the slave locked the same resource from 14:51:24 -> 14:56:43.

      Job1
      [featureTests] Running on master in /var/jenkins_home/workspace/BoomMultiPipeline_master-YRHUUT2RENAD5FNHS4YJ5GQKTJDN2M7B3KGAU3ELBWX7MZ2AEKBA
      [featureTests] 2017-04-07T14:40:44: 1: Acquiring lock ...
      [Pipeline] [featureTests] lock
      [featureTests] Trying to acquire lock on [boom-test-database-1]
      [featureTests] Found 0 available resource(s). Waiting for correct amount: 1.
      [featureTests] [boom-test-database-1] is locked, waiting...
      [featureTests] Lock acquired on [boom-test-database-1]
      [featureTests] 2017-04-07T14:45:57: 1: Acquired!
      ... running tests
      [featureTests] Lock released on resource [boom-test-database-1]
      [featureTests] 2017-04-07T14:52:04: 1: Released!
      
      Job2
      [featureTests] Running on slave in /var/jenkins_home/workspace/eline_tb_remaining_indices2-DTE7IMSYRVCF6GO3T5IRRR52R4YQ3GSRFNEEKCRH7FGSUBQB4SFQ
      [featureTests] 2017-04-07T14:49:33: 1: Acquiring lock ...
      [Pipeline] [featureTests] lock
      [featureTests] Trying to acquire lock on [boom-test-database-1]
      [featureTests] Found 0 available resource(s). Waiting for correct amount: 1.
      [featureTests] [boom-test-database-1] is locked, waiting...
      [featureTests] Lock acquired on [boom-test-database-1]
      [featureTests] 2017-04-07T14:51:24: 1: Acquired!
      ... running tests
      [featureTests] Lock released on resource [boom-test-database-1]
      [featureTests] 2017-04-07T14:56:43: 1: Released!
      

      Is this a bug, or am I misunderstanding how locking is supposed to work in Jenkins?

          [JENKINS-43599] Distributed locks occasionally failing across nodes

          Are you sure that not just the execution of the pipeline on the master has been paused for ~30 seconds while releasing the lock?

          You have only logged the timestamp after execution has resumed, not immediately prior to actually releasing the lock. (Respectively you have removed the interesting "Releasing..." line from the log output.)

          Andreas Ringlstetter added a comment - Are you sure that not just the execution of the pipeline on the master has been paused for ~30 seconds while releasing the lock? You have only logged the timestamp after execution has resumed, not immediately prior to actually releasing the lock. (Respectively you have removed the interesting "Releasing..." line from the log output.)

          Jonathan del Strother added a comment - - edited

          Ugh, the "Releasing" line isn't present in the original logs, because the tests failed so raised an exception in "callback". You're suggesting that in my lockDB function, execution just paused for 30 seconds between releasing the lock and the "finally { log }" line? (Not saying that's impossible, just making sure I understand you)

          I'll add an additional try-catch to ensure I print the "Releasing..." line, will let you know...

          Jonathan del Strother added a comment - - edited Ugh, the "Releasing" line isn't present in the original logs, because the tests failed so raised an exception in "callback". You're suggesting that in my lockDB function, execution just paused for 30 seconds between releasing the lock and the "finally { log }" line? (Not saying that's impossible, just making sure I understand you) I'll add an additional try-catch to ensure I print the "Releasing..." line, will let you know...

          Thomas Rambrant added a comment - - edited

          I have the same problem so I'm very interested in a solution... 
           
          I have tried everything I can think of. I'm currently locking on the outmost level, so that every stage and node blocks is inside the lock. I previously had it inside a node block  but got the impression that the lock only worked for jobs on the same node. But setting it on the outmost level still fails sometimes, even for jobs executed on the same node..

          I have several jobs that is locked on the same resource. 

          1. Long running integration test that locks a database resource and runs for 3 hours
          2. 55 minutes after the start of the first job another job pipe-line job start that locks on the same database. The minute the second job starts to fiddle (drop table) with the database the first job crashes

          When I look in the log both jobs say that they have locked on the same resource

          Thomas Rambrant added a comment - - edited I have the same problem so I'm very interested in a solution...    I have tried everything I can think of. I'm currently locking on the outmost level, so that every stage and node blocks is inside the lock. I previously had it inside a node block  but got the impression that the lock only worked for jobs on the same node. But setting it on the outmost level still fails sometimes, even for jobs executed on the same node.. I have several jobs that is locked on the same resource.  Long running integration test that locks a database resource and runs for 3 hours 55 minutes after the start of the first job another job pipe-line job start that locks on the same database. The minute the second job starts to fiddle (drop table) with the database the first job crashes When I look in the log both jobs say that they have locked on the same resource

            Unassigned Unassigned
            jdelstrother Jonathan del Strother
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Created:
              Updated: