Jenkins / JENKINS-17999

deadlock in 1.509.1 deleting multiple jobs with REST API


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Component/s: core
    • Labels: None

      While I left the priority at "Major", this problem prevents us from being able to run 1.509.1. We had to downgrade back to 1.480.3.

      High-level description:

      When deleting multiple jobs using the REST API with 1.509.1, we get a reliable deadlock that results in our having to restart Jenkins. This happens often in our environment, so we have had to downgrade to 1.480.3 for now after less than a day on 1.509.1.

      The stack trace passes through hudson.plugins.parameterizedtrigger.Plugin$RenameListener.onDeleted(Plugin.java:66) but I don't know whether that's part of the issue or whether it's just getting there first. The jobs being deleted share a view and include junit test result aggregation from an upstream job that is also being deleted. I have not determined whether the jobs have to be related for this problem to occur.

      Detailed description:

      Our Jenkins instance has thousands of jobs (over 6,000 at last check). The number of slaves scales up and down during the day but usually peaks at more than 200. Developers work on branches in git. When a developer pushes changes to a new branch, we create a compile job and a series of downstream test jobs. The compile job runs with every push to the branch, and when it succeeds, it kicks off the test jobs, which run in parallel. All the jobs are in a single view. The test jobs are kicked off by a system Groovy script that finds the other jobs in the view. When the branch is deleted, a hook queries the list of jobs in the view and POSTs to /job/<job-name>/doDelete for each job to delete it. Then it deletes the view.
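
      For illustration only, the deletion step of such a hook could be sketched as below. This is not our actual hook. The function names, the injected `post` callable, and the /view/<view-name>/doDelete URL for removing the view are assumptions made for the sketch; only the /job/<job-name>/doDelete endpoint comes from the description above. The sketch issues the requests strictly one at a time, which is a guess at a workaround rather than a confirmed fix.

```python
from typing import Callable, Iterable, List
from urllib.parse import quote


def job_delete_url(base_url: str, job_name: str) -> str:
    """Build the doDelete URL for one job, percent-encoding the job name."""
    return f"{base_url.rstrip('/')}/job/{quote(job_name, safe='')}/doDelete"


def view_delete_url(base_url: str, view_name: str) -> str:
    """Build the URL used to delete the view (assumed endpoint, see lead-in)."""
    return f"{base_url.rstrip('/')}/view/{quote(view_name, safe='')}/doDelete"


def delete_branch(
    base_url: str,
    job_names: Iterable[str],
    view_name: str,
    post: Callable[[str], object],
) -> List[str]:
    """Delete every job of a branch, then its view, one request at a time.

    `post` is a hypothetical callable that sends an HTTP POST to the given
    URL and returns only after the response has arrived, so no two
    doDelete requests from this hook are in flight at the same time.
    """
    requested: List[str] = []
    for name in job_names:
        url = job_delete_url(base_url, name)
        post(url)  # blocks until Jenkins has answered this deletion
        requested.append(url)
    url = view_delete_url(base_url, view_name)
    post(url)  # the view goes last, after all of its jobs
    requested.append(url)
    return requested
```

      In practice `post` would wrap whatever HTTP client and credentials the hook already uses. Passing it in keeps the ordering logic testable without a running Jenkins.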

      We've been operating this way for many months, and we have run several consecutive recent LTS versions. When we upgraded to 1.509.1, it only took one occurrence of the bulk job deletion to lock Jenkins up so that it was unresponsive to HTTP requests. Other aspects of Jenkins continued to operate: tailing the log revealed that the queue was still being serviced, jobs were still finishing and archiving results, etc. However, Jenkins is clearly not usable in this state, so we had to restart to restore normal operation.

      While we were running 1.480, we occasionally saw a similar problem where Jenkins stopped servicing HTTP requests but otherwise appeared to be operating normally. I think we only saw it 2 or maybe 3 times since upgrading to 1.480. We are hoping to report an issue about it, but we still don't have anything to go on. I realize that the number of HTTP request handling threads went from 1000 to 20, which means that an operation that deadlocked roughly 20 threads would lock up Jenkins right away now and might have taken longer before, but a few observations make me guess that we have not seen this particular failure before. In particular, I'm sure that branch deletion was working before and is once again working now that we have downgraded back to 1.480.3.
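
      The thread-count reasoning above can be illustrated with a small simulation. This is only a sketch using a generic Python thread pool, not Jenkins' actual request handling. The function name and the timeout are assumptions, and the "stuck" handlers here are released at the end so the example can terminate, which a real deadlock would not do.

```python
import concurrent.futures
import threading


def pool_is_starved(pool_size: int, stuck_requests: int,
                    probe_timeout: float = 0.2) -> bool:
    """Return True if a new request gets no worker while others are stuck.

    `pool_size` plays the role of the number of HTTP handler threads and
    `stuck_requests` the number of handlers blocked in a deadlock.
    """
    release = threading.Event()
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=pool_size)
    try:
        for _ in range(stuck_requests):
            # Stand-in for a deadlocked doDelete handler: blocks until released.
            pool.submit(release.wait)
        # Stand-in for any later HTTP request arriving at the server.
        probe = pool.submit(lambda: "ok")
        try:
            probe.result(timeout=probe_timeout)
            return False  # a free worker served the probe
        except concurrent.futures.TimeoutError:
            return True   # every worker is occupied by a stuck handler
    finally:
        release.set()             # unblock the stand-ins so the pool can exit
        pool.shutdown(wait=True)
```

      With a pool of 20, twenty stuck handlers are enough to starve every later request, while a much larger pool would keep answering until far more handlers were stuck.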

      Here's the somewhat abbreviated deadlock section of the jstack output with the job names replaced consistently. I would have to do some sanitizing before I could post the full thread dumps, but I will save them in case it should be necessary.

      Found one Java-level deadlock:
      =============================
      "Handling POST /job/--JOB1--/doDelete : RequestHandlerThread[#246]":
        waiting to lock monitor 0x00007fda48602188 (object 0x00000005f7726ac0, a hudson.model.FreeStyleProject),
        which is held by "Handling POST /job/--JOB2--/doDelete : RequestHandlerThread[#213]"
      "Handling POST /job/--JOB2--/doDelete : RequestHandlerThread[#213]":
        waiting to lock monitor 0x00007fda48e56890 (object 0x00000005f7726790, a hudson.model.FreeStyleProject),
        which is held by "Handling POST /job/--JOB3--/doDelete : RequestHandlerThread[#210]"
      "Handling POST /job/--JOB3--/doDelete : RequestHandlerThread[#210]":
        waiting to lock monitor 0x00007fda48602188 (object 0x00000005f7726ac0, a hudson.model.FreeStyleProject),
        which is held by "Handling POST /job/--JOB2--/doDelete : RequestHandlerThread[#213]"
      
      Java stack information for the threads listed above:
      ===================================================
      "Handling POST /job/--JOB1--/doDelete : RequestHandlerThread[#246]":
      	at hudson.model.Project.getPublishersList(Project.java:114)
      	- waiting to lock <0x00000005f7726ac0> (a hudson.model.FreeStyleProject)
      	at hudson.plugins.parameterizedtrigger.Plugin$RenameListener.onDeleted(Plugin.java:66)
      	at jenkins.model.Jenkins.onDeleted(Jenkins.java:2431)
      	at jenkins.model.Jenkins.onDeleted(Jenkins.java:309)
      	at hudson.model.AbstractItem.invokeOnDeleted(AbstractItem.java:523)
      	at hudson.model.AbstractItem.delete(AbstractItem.java:510)
      	- locked <0x00000005f7722eb8> (a hudson.model.FreeStyleProject)
      	at hudson.model.Job.delete(Job.java:587)
      	- locked <0x00000005f7722eb8> (a hudson.model.FreeStyleProject)
      	at hudson.model.AbstractProject.doDoDelete(AbstractProject.java:1880)
      	at sun.reflect.GeneratedMethodAccessor1174.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:601)
              ...
      "Handling POST /job/--JOB2--/doDelete : RequestHandlerThread[#213]":
      	at hudson.model.Project.getPublishersList(Project.java:114)
      	- waiting to lock <0x00000005f7726790> (a hudson.model.FreeStyleProject)
      	at hudson.plugins.parameterizedtrigger.Plugin$RenameListener.onDeleted(Plugin.java:66)
      	at jenkins.model.Jenkins.onDeleted(Jenkins.java:2431)
      	at jenkins.model.Jenkins.onDeleted(Jenkins.java:309)
      	at hudson.model.AbstractItem.invokeOnDeleted(AbstractItem.java:523)
      	at hudson.model.AbstractItem.delete(AbstractItem.java:510)
      	- locked <0x00000005f7726ac0> (a hudson.model.FreeStyleProject)
      	at hudson.model.Job.delete(Job.java:587)
      	- locked <0x00000005f7726ac0> (a hudson.model.FreeStyleProject)
      	at hudson.model.AbstractProject.doDoDelete(AbstractProject.java:1880)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:601)
              ...
      "Handling POST /job/--JOB3--/doDelete : RequestHandlerThread[#210]":
      	at hudson.model.Project.getPublishersList(Project.java:114)
      	- waiting to lock <0x00000005f7726ac0> (a hudson.model.FreeStyleProject)
      	at hudson.plugins.parameterizedtrigger.Plugin$RenameListener.onDeleted(Plugin.java:66)
      	at jenkins.model.Jenkins.onDeleted(Jenkins.java:2431)
      	at jenkins.model.Jenkins.onDeleted(Jenkins.java:309)
      	at hudson.model.AbstractItem.invokeOnDeleted(AbstractItem.java:523)
      	at hudson.model.AbstractItem.delete(AbstractItem.java:510)
      	- locked <0x00000005f7726790> (a hudson.model.FreeStyleProject)
      	at hudson.model.Job.delete(Job.java:587)
      	- locked <0x00000005f7726790> (a hudson.model.FreeStyleProject)
      	at hudson.model.AbstractProject.doDoDelete(AbstractProject.java:1880)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:601)
              ...
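
      As a reading aid, the "waiting to lock ... which is held by" lines above can be turned into a wait-for graph and checked for a cycle. The sketch below is a generic illustration, not a jstack parser. The thread names are copied from the dump, while the dictionary and function names are made up for the example. In each stack, the thread holds the monitor of the project it is deleting (locked in AbstractItem.delete) while waiting in getPublishersList for the monitor of another project.

```python
from typing import Dict, List, Optional

# Edges copied from the jstack summary: waiting thread -> thread that
# holds the monitor it is waiting for.
WAITS_FOR: Dict[str, str] = {
    "RequestHandlerThread[#246]": "RequestHandlerThread[#213]",
    "RequestHandlerThread[#213]": "RequestHandlerThread[#210]",
    "RequestHandlerThread[#210]": "RequestHandlerThread[#213]",
}


def find_cycle(waits_for: Dict[str, str], start: str) -> Optional[List[str]]:
    """Follow wait-for edges from `start`; return the cycle reached, or None."""
    seen: List[str] = []
    current: Optional[str] = start
    while current is not None and current not in seen:
        seen.append(current)
        current = waits_for.get(current)
    if current is None:
        return None  # the chain ended at a thread that is not waiting
    # `current` was visited before: the cycle starts at its first occurrence.
    return seen[seen.index(current):]


if __name__ == "__main__":
    print(find_cycle(WAITS_FOR, "RequestHandlerThread[#246]"))
```

      Starting from thread #246, the walk ends in the two-thread cycle between #213 and #210. Thread #246 is not part of the cycle itself but is stuck behind #213, so it can never proceed either.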
      

      Please let me know if there's anything else you would like to see. I saved a full backup of Jenkins as it looked right before we reverted to 1.480.3 and also saved the Jenkins log file that includes the entire time we were running 1.509.1 (less than a day) as well as some time before and after.

            Assignee: Unassigned
            Reporter: Jay Berkenbilt (jberkenbilt)
            Votes: 2
            Watchers: 3
