[JENKINS-17999] deadlock in 1.509.1 deleting multiple jobs with REST API

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Component: core
    • None

      While I left the priority at "Major", this problem prevents us from being able to run 1.509.1. We had to downgrade back to 1.480.3.

      High-level description:

      When deleting multiple jobs using the REST API with 1.509.1, we get a reliable deadlock that results in our having to restart Jenkins. This happens often in our environment, so we have had to downgrade to 1.480.3 for now after less than a day on 1.509.1.

      The stack trace passes through hudson.plugins.parameterizedtrigger.Plugin$RenameListener.onDeleted(Plugin.java:66) but I don't know whether that's part of the issue or whether it's just getting there first. The jobs being deleted share a view and include junit test result aggregation from an upstream job that is also being deleted. I have not determined whether the jobs have to be related for this problem to occur.

      Detailed description:

      Our Jenkins instance has thousands of jobs (over 6,000 at last check). We scale up and down in number of slaves during the day but usually peak at about 200+ slaves. Developers work on branches in git. When a developer pushes changes to a new branch, we create a compile job and a series of downstream test jobs. The compile job runs with every push to the branch, and when the compile job succeeds, it kicks off the test jobs, which run in parallel. All the jobs are in a single view. The mechanism by which the test jobs are kicked off is with a system groovy script that finds other jobs in the view. When the branch is deleted, a hook queries to get the list of jobs in the view and POSTs to /job/<job-name>/doDelete for each job to cause its deletion. Then it deletes the view.
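As a rough illustration of the cleanup hook described above, the following Java sketch builds one POST to /job/<job-name>/doDelete per job and sends them one at a time. The class name, method names, base URL, and job names are all hypothetical; the reporter's actual hook is not shown in this issue, and authentication and any CSRF crumb the server may require are left out.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a branch-cleanup hook; not the reporter's code.
class BranchCleanupSketch {
    // Builds one POST /job/<job-name>/doDelete request per job.
    // Assumption: job names are already URL-safe.
    static List<HttpRequest> buildDeleteRequests(String baseUrl, List<String> jobNames) {
        List<HttpRequest> requests = new ArrayList<>();
        for (String name : jobNames) {
            URI uri = URI.create(baseUrl + "/job/" + name + "/doDelete");
            requests.add(HttpRequest.newBuilder(uri)
                    .POST(HttpRequest.BodyPublishers.noBody())
                    .build());
        }
        return requests;
    }

    // Sends the requests one after another and collects the HTTP status codes.
    static List<Integer> deleteAll(HttpClient client, List<HttpRequest> requests)
            throws IOException, InterruptedException {
        List<Integer> statuses = new ArrayList<>();
        for (HttpRequest request : requests) {
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            statuses.add(response.statusCode());
        }
        return statuses;
    }
}
```

The issue does not say whether the real hook sends its requests serially or concurrently. The deadlock below involves several doDelete requests being handled at the same time.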

      We've been operating this way for many months, and we have run several consecutive recent LTS versions. When we upgraded to 1.509.1, it only took one occurrence of the bulk job deletion to lock Jenkins up so that it was unresponsive to HTTP requests. Other aspects of Jenkins continued to operate...tailing the log revealed that the queue was still being serviced, jobs were still finishing and archiving results, etc. However, Jenkins is clearly not usable in this state, so we had to restart to restore normal operation.

      While we were running 1.480, we occasionally saw a similar problem where Jenkins stopped servicing HTTP requests but otherwise appeared to be operating normally. I think we only saw it 2 or maybe 3 times since upgrading to 1.480. We are hoping to report an issue about it, but we still don't have anything to go on. I realize that the number of HTTP request handling threads went from 1000 to 20, which means that an operation that deadlocked 20ish threads would lock up Jenkins right away now and might have taken longer before, but a few observations make me guess that we have not seen this particular failure before. In particular, I'm sure that branch deletion was working before and is once again working now that we have downgraded back to 1.480.3.

      Here's the somewhat abbreviated deadlock section of the jstack output with the job names replaced consistently. I would have to do some sanitizing before I could post the full thread dumps, but I will save them in case it should be necessary.

      Found one Java-level deadlock:
      =============================
      "Handling POST /job/--JOB1--/doDelete : RequestHandlerThread[#246]":
        waiting to lock monitor 0x00007fda48602188 (object 0x00000005f7726ac0, a hudson.model.FreeStyleProject),
        which is held by "Handling POST /job/--JOB2--/doDelete : RequestHandlerThread[#213]"
      "Handling POST /job/--JOB2--/doDelete : RequestHandlerThread[#213]":
        waiting to lock monitor 0x00007fda48e56890 (object 0x00000005f7726790, a hudson.model.FreeStyleProject),
        which is held by "Handling POST /job/--JOB3--/doDelete : RequestHandlerThread[#210]"
      "Handling POST /job/--JOB3--/doDelete : RequestHandlerThread[#210]":
        waiting to lock monitor 0x00007fda48602188 (object 0x00000005f7726ac0, a hudson.model.FreeStyleProject),
        which is held by "Handling POST /job/--JOB2--/doDelete : RequestHandlerThread[#213]"
      
      Java stack information for the threads listed above:
      ===================================================
      "Handling POST /job/--JOB1--/doDelete : RequestHandlerThread[#246]":
      	at hudson.model.Project.getPublishersList(Project.java:114)
      	- waiting to lock <0x00000005f7726ac0> (a hudson.model.FreeStyleProject)
      	at hudson.plugins.parameterizedtrigger.Plugin$RenameListener.onDeleted(Plugin.java:66)
      	at jenkins.model.Jenkins.onDeleted(Jenkins.java:2431)
      	at jenkins.model.Jenkins.onDeleted(Jenkins.java:309)
      	at hudson.model.AbstractItem.invokeOnDeleted(AbstractItem.java:523)
      	at hudson.model.AbstractItem.delete(AbstractItem.java:510)
      	- locked <0x00000005f7722eb8> (a hudson.model.FreeStyleProject)
      	at hudson.model.Job.delete(Job.java:587)
      	- locked <0x00000005f7722eb8> (a hudson.model.FreeStyleProject)
      	at hudson.model.AbstractProject.doDoDelete(AbstractProject.java:1880)
      	at sun.reflect.GeneratedMethodAccessor1174.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:601)
              ...
      "Handling POST /job/--JOB2--/doDelete : RequestHandlerThread[#213]":
      	at hudson.model.Project.getPublishersList(Project.java:114)
      	- waiting to lock <0x00000005f7726790> (a hudson.model.FreeStyleProject)
      	at hudson.plugins.parameterizedtrigger.Plugin$RenameListener.onDeleted(Plugin.java:66)
      	at jenkins.model.Jenkins.onDeleted(Jenkins.java:2431)
      	at jenkins.model.Jenkins.onDeleted(Jenkins.java:309)
      	at hudson.model.AbstractItem.invokeOnDeleted(AbstractItem.java:523)
      	at hudson.model.AbstractItem.delete(AbstractItem.java:510)
      	- locked <0x00000005f7726ac0> (a hudson.model.FreeStyleProject)
      	at hudson.model.Job.delete(Job.java:587)
      	- locked <0x00000005f7726ac0> (a hudson.model.FreeStyleProject)
      	at hudson.model.AbstractProject.doDoDelete(AbstractProject.java:1880)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:601)
              ...
      "Handling POST /job/--JOB3--/doDelete : RequestHandlerThread[#210]":
      	at hudson.model.Project.getPublishersList(Project.java:114)
      	- waiting to lock <0x00000005f7726ac0> (a hudson.model.FreeStyleProject)
      	at hudson.plugins.parameterizedtrigger.Plugin$RenameListener.onDeleted(Plugin.java:66)
      	at jenkins.model.Jenkins.onDeleted(Jenkins.java:2431)
      	at jenkins.model.Jenkins.onDeleted(Jenkins.java:309)
      	at hudson.model.AbstractItem.invokeOnDeleted(AbstractItem.java:523)
      	at hudson.model.AbstractItem.delete(AbstractItem.java:510)
      	- locked <0x00000005f7726790> (a hudson.model.FreeStyleProject)
      	at hudson.model.Job.delete(Job.java:587)
      	- locked <0x00000005f7726790> (a hudson.model.FreeStyleProject)
      	at hudson.model.AbstractProject.doDoDelete(AbstractProject.java:1880)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:601)
              ...
      

      Please let me know if there's anything else you would like to see. I saved a full backup of Jenkins as it looked right before we reverted to 1.480.3 and also saved the Jenkins log file that includes the entire time we were running 1.509.1 (less than a day) as well as some time before and after.


          Krishnan Anantheswaran added a comment - I can reproduce this issue with just 2 jobs, one downstream of another and trying to delete them both in parallel. The thread dump is very similar to the one above.

          Krishnan Anantheswaran added a comment - So, it is pretty clear what's going on:

          • hudson.model.Job.delete is a synchronized method that gets called in parallel for 2 jobs
          • Each of these calls runs the onDeleted method of the parameterized trigger listener while still holding the lock on the job being deleted
          • The onDeleted method iterates over all projects and gets their publisher list, and getPublishersList is also a synchronized method
          • Each delete therefore ends up waiting for the lock on the job that the other delete already holds, which is the deadlock

          That said, I'm still scratching my head on a proposal for a fix. This issue is blocking us too.
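The lock cycle described in this comment can be reproduced in isolation. The sketch below is a minimal stand-in, not Jenkins code: DemoJob, publishers(), and delete() are hypothetical names that mimic the synchronized Job.delete and getPublishersList methods, and the latch only exists to force both threads to hold their own monitor before reaching for the other one. The JVM's own deadlock detector (the same one behind the "Found one Java-level deadlock" section of jstack) is then asked how many threads are stuck.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for a project; not the real hudson.model classes.
class DemoJob {
    private final String name;

    DemoJob(String name) { this.name = name; }

    // Like getPublishersList(): needs this job's monitor.
    synchronized String publishers() { return name + "-publishers"; }

    // Like Job.delete(): holds this job's monitor while a listener visits another job.
    synchronized void delete(DemoJob other, CountDownLatch bothLocked) {
        bothLocked.countDown();
        try {
            bothLocked.await(); // wait until both threads hold their own monitor
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
        other.publishers(); // blocks: the other thread holds that monitor
    }
}

class DeadlockDemo {
    // Starts two concurrent deletes and returns how many threads the JVM
    // reports as deadlocked (0 if none are found within about 5 seconds).
    static int countDeadlockedThreads() {
        DemoJob a = new DemoJob("JOB1");
        DemoJob b = new DemoJob("JOB2");
        CountDownLatch bothLocked = new CountDownLatch(2);
        startDaemon(() -> a.delete(b, bothLocked));
        startDaemon(() -> b.delete(a, bothLocked));

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (int i = 0; i < 100; i++) {
            long[] ids = mx.findDeadlockedThreads(); // null when no deadlock exists
            if (ids != null) {
                return ids.length;
            }
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return 0;
            }
        }
        return 0;
    }

    // Daemon threads so the stuck threads do not keep the JVM alive.
    private static void startDaemon(Runnable task) {
        Thread t = new Thread(task);
        t.setDaemon(true);
        t.start();
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + countDeadlockedThreads());
    }
}
```

The two stuck threads stay blocked until the JVM exits, which mirrors why the request handler threads in the report never came back.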

          Jay Berkenbilt added a comment - I wonder whether a short-term, temporary fix could be implemented by putting a timeout and retry on one or the other of the locks. Probably a dumb idea.
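For reference, a timeout-and-retry acquisition could look roughly like the sketch below. It assumes the locks were converted to java.util.concurrent.locks.ReentrantLock, because a Java synchronized method cannot time out; TimedLockSketch and withLock are hypothetical names, and nothing here reflects how Jenkins core is actually written.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical helper; assumes explicit locks instead of synchronized methods.
class TimedLockSketch {
    // Tries to take the lock up to maxAttempts times, waiting timeoutMillis
    // per attempt. Runs the action under the lock and returns its result,
    // or throws IllegalStateException if the lock was never acquired.
    static <T> T withLock(ReentrantLock lock, long timeoutMillis, int maxAttempts,
                          Supplier<T> action) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            boolean acquired;
            try {
                acquired = lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for lock", e);
            }
            if (acquired) {
                try {
                    return action.get();
                } finally {
                    lock.unlock();
                }
            }
        }
        throw new IllegalStateException("lock not acquired after " + maxAttempts + " attempts");
    }
}
```

A timeout on its own only turns the hang into an error. To break the cycle, the thread that gives up would also have to release the lock it already holds before retrying.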

          Krishnan Anantheswaran added a comment - If I'm not mistaken the only reason why getPublishersList(), getBuildersList(), getBuildWrappersList() etc. are synchronized is because the private members are lazily initialized and can be null. Given the fact that these will be populated for every project 95% of the time, it might make sense to eagerly initialize the variables on construction and load so that the synchronization is not required for these methods. Not having hacked on the Jenkins core much, I have no idea what other side-effects this strategy could have.
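The difference between the two strategies in this comment can be sketched as follows. LazyProject and EagerProject are hypothetical simplifications, not the real hudson.model.Project, and the list element type is a placeholder.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical simplification of the current pattern: the field may be null,
// so the getter takes the project's monitor to initialize it safely.
class LazyProject {
    private List<String> publishers; // null until first use

    synchronized List<String> getPublishersList() {
        if (publishers == null) {
            publishers = new CopyOnWriteArrayList<>();
        }
        return publishers;
    }
}

// Hypothetical eager variant: the field is set at construction and never
// reassigned, so the getter does not need the project's monitor.
class EagerProject {
    private final List<String> publishers = new CopyOnWriteArrayList<>();

    List<String> getPublishersList() {
        return publishers;
    }
}
```

With the eager variant, a listener reading another project's publisher list never waits on that project's monitor, so the cycle seen in the thread dump cannot form through this getter. As the comment notes, a real change would also have to populate the field when a project is loaded, not only when it is constructed.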

          Jay Berkenbilt added a comment - This bug is no longer reproducible starting with 1.509.4. It looks like the commit 5407d9fbce6b10d6902f0cc5971ee95c71619f3a, cherry picked from 7facc7733c7040536d4074a2c5805b75ee1d8f35, which was to fix JENKINS-18589 also solved this problem. I have confirmed that I can easily reproduce the problem with 1.509.1 and can no longer reproduce it with 1.509.4. It also looks like the parameterized trigger plugin, which was then at 2.17 and is currently at 2.24, has a different code path as well. As the reporter, I'm closing this issue since I believe it no longer exists.

            Assignee: Unassigned
            Reporter: Jay Berkenbilt (jberkenbilt)
            Votes: 2
            Watchers: 3
