Jenkins / JENKINS-45917

[Jenkins v2.63] Build queue deadlocks
Details

    Description

      We are experiencing issues with our Jenkins server where it regularly locks the build queue such that no jobs can be scheduled and no jobs already in the queue start running. At that point we are also unable to use the /script endpoint to clear the build queue.
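      (For context, a minimal sketch of the kind of call used to clear the build queue through the script console, posted to the standard /scriptText endpoint. The host name and credentials are placeholders, and CSRF crumb handling is assumed to be dealt with by the caller.)

      {code:python}
      # Hypothetical sketch: clear the Jenkins build queue via the script console API.
      # Host, user and API token below are placeholders.
      import requests

      JENKINS_URL = "https://jenkins.example.com"   # placeholder
      AUTH = ("admin", "api-token")                 # placeholder credentials

      # The same Groovy one-liner normally run interactively at /script
      GROOVY = "Jenkins.instance.queue.clear()"

      resp = requests.post(JENKINS_URL + "/scriptText", auth=AUTH, data={"script": GROOVY})
      resp.raise_for_status()
      print(resp.text)
      {code}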

       

      We have been unable to determine the root cause of this issue, and restarting the Jenkins server brings it back up in a locked state as well. We have attempted to remove all the files we know of that would tell Jenkins to resume jobs after a restart, but have been unable to find a comprehensive list; the pipeline jobs that come back up and try to resume are the particular problem. Below is the list of files we have tried removing:

       

      {code}
      /var/lib/jenkins/queue.xml
      /var/lib/jenkins/queue.xml.bak
      /var/lib/jenkins/org.jenkinsci.plugins.workflow.flow.FlowExecutionList.xml
      /var/lib/jenkins/org.jenkinsci.plugins.workflow.flow.FlowExecutionList.bak
      /var/lib/jenkins/org.jenkinsci.plugins.workflow.support.steps.StageStep.xml
      {code}

       

      Other triage steps we have taken:

      • Increased JENKINS_HANDLER_MAX from 300 to 600 in /etc/sysconfig/jenkins
      • Started in shutdown mode with a clear build queue
      • Disabled jobs that queue frequently

      (The first three edits were made on the day the ticket was created, Tuesday, August 1, 2017.)

      edit: 

      We bumped the number of executors to 2, and we see jobs queueing and running, but only on the second executor, not the first one. 

      edit 2: 

      Added an image from the /monitoring endpoint; the active thread count is the only sign that something has gone wrong. No other metric appears to be affected in a meaningful way.

      edit 3:

      We managed to get a thread dump from the /threadDump endpoint. There are three suspect threads:

      "AtmostOneTaskExecutor[Periodic Jenkins queue maintenance] [#417]" Id=19597 Group=main WAITING on com.google.common.util.concurrent.AbstractFuture$Sync@488e43ab at sun.misc.Unsafe.park(Native Method) - waiting on com.google.common.util.concurrent.AbstractFuture$Sync@488e43ab at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:275) at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:111) at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadGroupSynchronously(CpsStepContext.java:248) at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadSynchronously(CpsStepContext.java:237) at org.jenkinsci.plugins.workflow.cps.CpsStepContext.doGet(CpsStepContext.java:294) at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:61) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getNode(ExecutorStepExecution.java:259) at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.categoriesForPipeline(ThrottleQueueTaskDispatcher.java:411) at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.canRun(ThrottleQueueTaskDispatcher.java:168) at hudson.model.Queue.isBuildBlocked(Queue.java:1184) at hudson.model.Queue.maintain(Queue.java:1505) at hudson.model.Queue$1.call(Queue.java:320) at hudson.model.Queue$1.call(Queue.java:317) at jenkins.util.AtmostOneTaskExecutor$1.call(AtmostOneTaskExecutor.java:108) at jenkins.util.AtmostOneTaskExecutor$1.call(AtmostOneTaskExecutor.java:98) at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at hudson.remoting.AtmostOneThreadExecutor$Worker.run(AtmostOneThreadExecutor.java:110) at java.lang.Thread.run(Thread.java:748) Number of locked synchronizers = 1 - java.util.concurrent.locks.ReentrantLock$NonfairSync@529a7a50
      "Computer.threadPoolForRemoting [#80]" Id=19565 Group=main BLOCKED on hudson.model.Queue@770f2152 owned by "Computer.threadPoolForRemoting [#84]" Id=19573 at com.nirima.jenkins.plugins.docker.strategy.DockerOnceRetentionStrategy$1.run(DockerOnceRetentionStrategy.java:110) - blocked on hudson.model.Queue@770f2152 at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) Number of locked synchronizers = 1 - java.util.concurrent.ThreadPoolExecutor$Worker@3677ff7e
      "Computer.threadPoolForRemoting [#81]" Id=19569 Group=main TIMED_WAITING on java.util.concurrent.SynchronousQueue$TransferStack@18c5c88 at sun.misc.Unsafe.park(Native Method) - waiting on java.util.concurrent.SynchronousQueue$TransferStack@18c5c88 at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460) at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362) at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748)
      

       

      edit 4 (August 3, 2017): We seem to have stabilized the server.

      In our setup, a pipeline build that was interrupted attempts to resume when Jenkins restarts. However, when it does so it uses the exact node name it had before, which in this case is a Docker container name with a unique hash that no longer exists. This causes the build to turn into a zombie and hang.

      The only indication on the file system that a pipeline job has not completed is that the most recent stage XML file in the workflow directory for a given build does not contain a "<result>" tag. We wrote a Python script, run on the Jenkins master server, that looks for that and deletes any build that does not have a "<result>" tag. We would appreciate any feedback on this approach; the Python script is attached below.

      The directory for a given job is: 

      /var/lib/jenkins/jobs/<JOB_NAME>/builds/<BUILD_NUMBER>/workflow
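      (For illustration, a minimal sketch of that kind of scan. It is not the attached script; the JENKINS_HOME path, the dry-run default, and the choice to delete the entire build directory are assumptions.)

      {code:python}
      # Hypothetical sketch: flag builds whose newest workflow/*.xml node file has no
      # <result> tag, i.e. builds Jenkins may try to resume, and optionally delete them.
      import glob
      import os
      import shutil

      JENKINS_HOME = "/var/lib/jenkins"   # assumption: default package-install location
      DRY_RUN = True                      # set to False to actually delete build directories

      for workflow_dir in glob.glob(os.path.join(JENKINS_HOME, "jobs", "*", "builds", "*", "workflow")):
          xml_files = glob.glob(os.path.join(workflow_dir, "*.xml"))
          if not xml_files:
              continue
          newest = max(xml_files, key=os.path.getmtime)   # most recent stage/node record
          with open(newest) as fh:
              finished = "<result>" in fh.read()
          if not finished:
              build_dir = os.path.dirname(workflow_dir)   # .../builds/<BUILD_NUMBER>
              print("incomplete build:", build_dir)
              if not DRY_RUN:
                  shutil.rmtree(build_dir)
      {code}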

       

      Attachments

        Issue Links

          Activity

            mpedreiro Marcos Pedreiro created issue -
            mpedreiro Marcos Pedreiro made changes -
            Field Original Value New Value
            Attachment heap.dump [ 39113 ]
            mpedreiro Marcos Pedreiro made changes -
            Attachment heap.dump [ 39113 ]
            mpedreiro Marcos Pedreiro made changes -
            Attachment active thead count.png [ 39114 ]
            maxfields2000 Maxfield Stewart added a comment -
            (I work with Marcos) We are starting to believe this issue is related to Pipeline resume capabilities. We can definitely produce a "build queue" hung state (jobs not queuing) after a Jenkins restart if there were pipeline jobs running at the time of the restart that weren't terminated (say a crash or unceremonious shutdown). Upon restart you don't even necessarily see the resuming jobs enter the queue, but we do end up deadlocked.

            If we clear the following files:

            • /var/lib/jenkins/queue.xml
            • /var/lib/jenkins/org.jenkinsci.plugins.workflow.flow.FlowExecutionList.xml
            • /var/lib/jenkins/org.jenkinsci.plugins.workflow.support.steps.StageStep.xml 

            It often restarts cleanly, but we still risk the problem described above, where intermittently the queue may seize 1, 2, even 12 hours later. We are having some luck then scanning for any jobs that may be trying to resume and cleaning their build history (expensive across a server with 3000+ jobs), in particular the build runs that may need to resume. I'm not clear on the details, but it seems there is a background thread in Jenkins that scans for jobs that need to resume based on their history and resumes them. This can, potentially, cause the thread deadlocks you see in our thread dump.

            If we cull the histories, things seem stable. See JENKINS-33761 (https://issues.jenkins-ci.org/browse/JENKINS-33761), where we believe that if we could disable resume capabilities we could avoid this.

            This does not appear reproducible on a small Jenkins instance with only a few jobs and only a few needing to resume. But if you have a server like ours, with a few thousand jobs and potentially 20-40 jobs interrupted upon restart, it seems to happen. I don't yet have a guaranteed repro case.

            Versions of things we are running:

             

            • Jenkins 2.63
            • Pipeline 2.5
            • Pipeline: API 2.16
            • Pipeline: Basic Steps 2.5
            • Pipeline: Build Step 2.5
            • Pipeline: Groovy 2.34
            • Pipeline: Job 2.12
            • Pipeline: Stage Step 2.2
            • Pipeline: Step API 2.10
            • Pipeline: Shared Groovy Libraries 2.8

             

             

            maxfields2000 Maxfield Stewart made changes -
            Labels build-queue deadlock jenkins threads → build-queue deadlock jenkins pipeline-hangs threads
            maxfields2000 Maxfield Stewart made changes -
            Labels build-queue deadlock jenkins pipeline-hangs threads → build-queue deadlock jenkins pipeline pipeline-hangs threads
            oleg_nenashev Oleg Nenashev added a comment - It could be specific to Docker Plugin. https://github.com/jenkinsci/docker-plugin/blob/master/docker-plugin/src/main/java/com/nirima/jenkins/plugins/docker/strategy/DockerOnceRetentionStrategy.java#L106 is definitely a bad idea.
            oleg_nenashev Oleg Nenashev added a comment - - edited The same code exists in the Yet Another Docker Plugin: https://github.com/KostyaSha/yet-another-docker-plugin/blob/master/yet-another-docker-plugin/src/main/java/com/github/kostyasha/yad/strategy/DockerOnceRetentionStrategy.java#L123 . CC integer
            oleg_nenashev Oleg Nenashev made changes -
            Link This issue is related to JENKINS-33761 [ JENKINS-33761 ]
            oleg_nenashev Oleg Nenashev added a comment -

            There is also logic from the TCB plugin

            integer Kanstantsin Shautsou added a comment - The logic comes from https://github.com/jenkinsci/durable-task-plugin/blob/master/src/main/java/org/jenkinsci/plugins/durabletask/executors/OnceRetentionStrategy.java. I guess non-pipeline jobs have no issues.

            maxfields2000 Maxfield Stewart added a comment -

            I was theorizing that the Docker/Jclouds plugins need to inspect the build queue to decide whether they need to provision a new node. I don't have heap dumps/thread dumps from our Jenkins server that doesn't use the Docker/Jclouds plugins, but we have seen the build queue hang on that server after restart as well. In fact, we first saw the behavior there. Your findings, however, are hopeful in that potentially there's a way to fix this in the future.


            maxfields2000 Maxfield Stewart added a comment -

            integer Would switching to a different retention strategy avoid this? I have to look at the Yet Another Docker Plugin to see what options exist that might emulate the behavior of provisioning and throwing away the container when done. Or do all the retention strategies take a lock on the build queue? (I know this doesn't appear specific to the Docker plugins, as the "Once Retention Strategy" is not just a Docker thing.)


            maxfields2000 Maxfield Stewart added a comment -

            oleg_nenashev We definitely use the TCB (Throttle Concurrent Builds) plugin on both our Jenkins environments (non Docker Plugin and Docker Plugin), which is a good lead into why we might see this behavior on our more traditional Jenkins setup as well.

            If a pipeline is trying to resume AND has a throttle associated with it and is queued at the same time for another run, could that cause a deadlock?


maxfields2000 Maxfield Stewart added a comment -

As more info: last night before going home, we crawled all the job folders on our server looking for any job whose build history logs contained the message "attempting to resume". If we found that log message, we purged that build's history from disk.

We then restarted the Jenkins server and have had 14 hours of consecutive uptime (our previous high was about 2 hours). This is why we believe the culprit is the interaction with the resume features of Jenkins, but we're not sure exactly what causes Jenkins to believe a pipeline job needs to be resumed. We know it's more than just purging the XML files in the Jenkins home folder that control the queue (queue.xml, org.jenkinsci.plugins.workflow.flow.FlowExecutionList.xml, org.jenkinsci.plugins.workflow.support.steps.StageStep.xml), because resumes still occur after clearing those.

After doing our build history purge we appeared to get no resumes (and no locks).
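A minimal sketch of this kind of crawl, assuming the default /var/lib/jenkins/jobs layout and that the resume message is recorded in each build's console log file (illustrative only, not the exact procedure used):

{code:python}
#!/usr/bin/env python
# Minimal sketch only (assumptions: default JENKINS_HOME layout and that the
# "attempting to resume" message is written to each build's console log file).
import os

JENKINS_HOME = "/var/lib/jenkins"
MARKER = "attempting to resume"


def builds_with_resume_marker(jobs_root):
    """Yield build directories whose console log contains the resume marker."""
    for job in os.listdir(jobs_root):
        builds_dir = os.path.join(jobs_root, job, "builds")
        if not os.path.isdir(builds_dir):
            continue
        for build in os.listdir(builds_dir):
            log_file = os.path.join(builds_dir, build, "log")
            if not os.path.isfile(log_file):
                continue
            with open(log_file, errors="ignore") as fh:
                if MARKER in fh.read().lower():
                    yield os.path.join(builds_dir, build)


if __name__ == "__main__":
    for build_dir in builds_with_resume_marker(os.path.join(JENKINS_HOME, "jobs")):
        print(build_dir)  # candidate build histories to purge from disk after review
{code}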

integer Kanstantsin Shautsou added a comment - I also see the ghprb plugin with some weird locks; maybe try disabling it.
            mpedreiro Marcos Pedreiro made changes -
            Attachment clean_zombie_jobs.py [ 39140 ]

mpedreiro Marcos Pedreiro added a comment -

We seem to have stabilized the server.

In our setup, a pipeline build that was interrupted would attempt to resume upon Jenkins restart. However, when it does so it uses the exact node name it had before, which in this case is a Docker container name with a unique hash that no longer exists. This causes the build to zombie itself and hang.

The only indication on the file system that a pipeline job has not completed is that the most recent stage XML file in the workflow directory for a given build does not contain a "<result>" tag. We wrote a Python script that we ran on the Jenkins master server to look for that and delete any build that did not have a "<result>" tag. We would appreciate any feedback on this approach; the Python script is attached to the ticket.

The directory for a given job is:

/var/lib/jenkins/jobs/<JOB_NAME>/builds/<BUILD_NUMBER>/workflow
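The attached clean_zombie_jobs.py is not reproduced in the ticket text; the following is a minimal sketch of the approach described in the comment above, assuming the default /var/lib/jenkins layout. The deletion step is destructive, so the sketch defaults to a dry run:

{code:python}
#!/usr/bin/env python
# Illustrative sketch only -- NOT the attached clean_zombie_jobs.py.
# The "<result>" heuristic and the workflow directory follow the comment above;
# the default /var/lib/jenkins layout and the deletion step are assumptions.
import os
import shutil

JENKINS_HOME = "/var/lib/jenkins"


def latest_stage_xml(workflow_dir):
    """Return the most recently modified *.xml file in the workflow directory, or None."""
    xmls = [os.path.join(workflow_dir, f)
            for f in os.listdir(workflow_dir) if f.endswith(".xml")]
    return max(xmls, key=os.path.getmtime) if xmls else None


def is_zombie(build_dir):
    """A build is a zombie candidate if its newest stage XML has no <result> tag."""
    workflow_dir = os.path.join(build_dir, "workflow")
    if not os.path.isdir(workflow_dir):
        return False  # not a pipeline build, or nothing to check
    newest = latest_stage_xml(workflow_dir)
    if newest is None:
        return False
    with open(newest, errors="ignore") as fh:
        return "<result>" not in fh.read()


def clean(dry_run=True):
    jobs_root = os.path.join(JENKINS_HOME, "jobs")
    for job in os.listdir(jobs_root):
        builds_dir = os.path.join(jobs_root, job, "builds")
        if not os.path.isdir(builds_dir):
            continue
        for build in os.listdir(builds_dir):
            build_dir = os.path.join(builds_dir, build)
            if os.path.isdir(build_dir) and is_zombie(build_dir):
                print(build_dir)
                if not dry_run:
                    shutil.rmtree(build_dir)  # purge the zombie build history from disk


if __name__ == "__main__":
    clean(dry_run=True)  # review the printed candidates before running with dry_run=False
{code}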
            mpedreiro Marcos Pedreiro made changes -
            Description Attached are the heap and thread dumps from the Jenkins master server. Link to binary heap dump: [https://drive.google.com/open?id=0BxvWFQX3J0X0Ym1Fc2g4U0oyVjQ]

             

            We are experiencing issues with our jenkins server where it will regularly lock the build queue such that no jobs can be scheduled, and no jobs in the queue start running. At this point we are also unable to use the /script endpoint to clear the build queue

             

            We have been unable to determine the root cause of this issue, and restarting the jenkins server causes it to come up in a locked state as well. We have attempted to remove all the files that we know of that would tell Jenkins to resume jobs after a restart, but have been unable to find the comprehensive list. In particular are the pipeline jobs that come up and try to resume. Below is the list of files that we have tried removing: 

             
            {code:java}
            /var/lib/jenkins/queue.xml
            /var/lib/jenkins/queue.xml.bak
            /var/lib/jenkins/org.jenkinsci.plugins.workflow.flow.FlowExecutionList.xml
            /var/lib/jenkins/org.jenkinsci.plugins.workflow.flow.FlowExecutionList.bak
            /var/lib/jenkins/org.jenkinsci.plugins.workflow.support.steps.StageStep.xml 
            {code}
             

            Other triage steps we have taken: 
             * Increased JENKINS_HANDLER_MAX from 300 to 600 in /etc/sysconfig/jenkins
             * Starting in shutdown mode with a clear build queue
             * disabling jobs that queue frequently

            (first 3 edits were on the same day the ticket was created, Tuesday August 1, 2017)

            edit: 

            We bumped the number of executors to 2, and we see jobs queueing and running, but only on the second executor, not the first one. 

            edit 2: 

            Added an image from the /monitoring endpoint, the active thread count is the only sign that something has gone wrong. No other metric appears to be affected in a meaningful way

            edit 3:

            We managed to get a thread dump from the /threadDump endpoint. There 3 different suspect thread dumps: 
            {code:java}
            "AtmostOneTaskExecutor[Periodic Jenkins queue maintenance] [#417]" Id=19597 Group=main WAITING on com.google.common.util.concurrent.AbstractFuture$Sync@488e43ab at sun.misc.Unsafe.park(Native Method) - waiting on com.google.common.util.concurrent.AbstractFuture$Sync@488e43ab at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:275) at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:111) at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadGroupSynchronously(CpsStepContext.java:248) at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadSynchronously(CpsStepContext.java:237) at org.jenkinsci.plugins.workflow.cps.CpsStepContext.doGet(CpsStepContext.java:294) at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:61) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getNode(ExecutorStepExecution.java:259) at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.categoriesForPipeline(ThrottleQueueTaskDispatcher.java:411) at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.canRun(ThrottleQueueTaskDispatcher.java:168) at hudson.model.Queue.isBuildBlocked(Queue.java:1184) at hudson.model.Queue.maintain(Queue.java:1505) at hudson.model.Queue$1.call(Queue.java:320) at hudson.model.Queue$1.call(Queue.java:317) at jenkins.util.AtmostOneTaskExecutor$1.call(AtmostOneTaskExecutor.java:108) at jenkins.util.AtmostOneTaskExecutor$1.call(AtmostOneTaskExecutor.java:98) at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at hudson.remoting.AtmostOneThreadExecutor$Worker.run(AtmostOneThreadExecutor.java:110) at java.lang.Thread.run(Thread.java:748) Number of locked synchronizers = 1 - java.util.concurrent.locks.ReentrantLock$NonfairSync@529a7a50{code}
            {code:java}
            "Computer.threadPoolForRemoting [#80]" Id=19565 Group=main BLOCKED on hudson.model.Queue@770f2152 owned by "Computer.threadPoolForRemoting [#84]" Id=19573 at com.nirima.jenkins.plugins.docker.strategy.DockerOnceRetentionStrategy$1.run(DockerOnceRetentionStrategy.java:110) - blocked on hudson.model.Queue@770f2152 at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) Number of locked synchronizers = 1 - java.util.concurrent.ThreadPoolExecutor$Worker@3677ff7e{code}
            {code:java}
            "Computer.threadPoolForRemoting [#81]" Id=19569 Group=main TIMED_WAITING on java.util.concurrent.SynchronousQueue$TransferStack@18c5c88 at sun.misc.Unsafe.park(Native Method) - waiting on java.util.concurrent.SynchronousQueue$TransferStack@18c5c88 at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460) at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362) at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748)
            {code}
             

            EDIT 4 (08-03-17|August 3): We seem to have stabilized the server.

            In our set up, a pipeline build that was interrupted would attempt to resume upon Jenkins restart. However when it does so it uses the exact node name that it had before which in this case is a docker container name with a unique hash that no longer exists. This causes the build to zombie itself and hang.

            The only indication on the file system that a pipeline job has not completed is that most recent stage xml file in the workflow directory for a given build does not contain a "<result>" tag. We wrote a python script that we ran on the jenkins master server to look for that and delete any build that did not have a "<result>" tag. We would appreciate any feed back on this approach, the python script is attached below. 

            The directory for a given job is: 

            /var/lib/jenkins/jobs/<JOB_NAME>/builds/<BUILD_NUMBER>/workflow

             
            Attached are the heap and thread dumps from the Jenkins master server. 

             

            We are experiencing issues with our jenkins server where it will regularly lock the build queue such that no jobs can be scheduled, and no jobs in the queue start running. At this point we are also unable to use the /script endpoint to clear the build queue

             

            We have been unable to determine the root cause of this issue, and restarting the jenkins server causes it to come up in a locked state as well. We have attempted to remove all the files that we know of that would tell Jenkins to resume jobs after a restart, but have been unable to find the comprehensive list. In particular are the pipeline jobs that come up and try to resume. Below is the list of files that we have tried removing: 

             
            {code:java}
            /var/lib/jenkins/queue.xml
            /var/lib/jenkins/queue.xml.bak
            /var/lib/jenkins/org.jenkinsci.plugins.workflow.flow.FlowExecutionList.xml
            /var/lib/jenkins/org.jenkinsci.plugins.workflow.flow.FlowExecutionList.bak
            /var/lib/jenkins/org.jenkinsci.plugins.workflow.support.steps.StageStep.xml 
            {code}
             

            Other triage steps we have taken: 
             * Increased JENKINS_HANDLER_MAX from 300 to 600 in /etc/sysconfig/jenkins
             * Starting in shutdown mode with a clear build queue
             * disabling jobs that queue frequently

            (first 3 edits were on the same day the ticket was created, Tuesday August 1, 2017)

            edit: 

            We bumped the number of executors to 2, and we see jobs queueing and running, but only on the second executor, not the first one. 

            edit 2: 

            Added an image from the /monitoring endpoint, the active thread count is the only sign that something has gone wrong. No other metric appears to be affected in a meaningful way

            edit 3:

            We managed to get a thread dump from the /threadDump endpoint. There 3 different suspect thread dumps: 
            {code:java}
            "AtmostOneTaskExecutor[Periodic Jenkins queue maintenance] [#417]" Id=19597 Group=main WAITING on com.google.common.util.concurrent.AbstractFuture$Sync@488e43ab at sun.misc.Unsafe.park(Native Method) - waiting on com.google.common.util.concurrent.AbstractFuture$Sync@488e43ab at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:275) at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:111) at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadGroupSynchronously(CpsStepContext.java:248) at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadSynchronously(CpsStepContext.java:237) at org.jenkinsci.plugins.workflow.cps.CpsStepContext.doGet(CpsStepContext.java:294) at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:61) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getNode(ExecutorStepExecution.java:259) at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.categoriesForPipeline(ThrottleQueueTaskDispatcher.java:411) at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.canRun(ThrottleQueueTaskDispatcher.java:168) at hudson.model.Queue.isBuildBlocked(Queue.java:1184) at hudson.model.Queue.maintain(Queue.java:1505) at hudson.model.Queue$1.call(Queue.java:320) at hudson.model.Queue$1.call(Queue.java:317) at jenkins.util.AtmostOneTaskExecutor$1.call(AtmostOneTaskExecutor.java:108) at jenkins.util.AtmostOneTaskExecutor$1.call(AtmostOneTaskExecutor.java:98) at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at hudson.remoting.AtmostOneThreadExecutor$Worker.run(AtmostOneThreadExecutor.java:110) at java.lang.Thread.run(Thread.java:748) Number of locked synchronizers = 1 - java.util.concurrent.locks.ReentrantLock$NonfairSync@529a7a50{code}
            {code:java}
            "Computer.threadPoolForRemoting [#80]" Id=19565 Group=main BLOCKED on hudson.model.Queue@770f2152 owned by "Computer.threadPoolForRemoting [#84]" Id=19573 at com.nirima.jenkins.plugins.docker.strategy.DockerOnceRetentionStrategy$1.run(DockerOnceRetentionStrategy.java:110) - blocked on hudson.model.Queue@770f2152 at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) Number of locked synchronizers = 1 - java.util.concurrent.ThreadPoolExecutor$Worker@3677ff7e{code}
            {code:java}
            "Computer.threadPoolForRemoting [#81]" Id=19569 Group=main TIMED_WAITING on java.util.concurrent.SynchronousQueue$TransferStack@18c5c88 at sun.misc.Unsafe.park(Native Method) - waiting on java.util.concurrent.SynchronousQueue$TransferStack@18c5c88 at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460) at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362) at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748)
            {code}
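
            For reference, the /threadDump output can be captured to a file with a small script like the sketch below (the URL, user, and API token are placeholders; the script simply saves the raw response of the page):
            {code:python}
# Minimal sketch: save the raw response of the Jenkins /threadDump page to a
# timestamped file. JENKINS_URL, USER, and API_TOKEN are placeholders.
import base64
import time
import urllib.request

JENKINS_URL = "http://jenkins.example.com"  # placeholder
USER = "admin"                              # placeholder
API_TOKEN = "changeme"                      # placeholder

req = urllib.request.Request(JENKINS_URL + "/threadDump")
auth = base64.b64encode(f"{USER}:{API_TOKEN}".encode()).decode()
req.add_header("Authorization", "Basic " + auth)
with urllib.request.urlopen(req) as resp:
    body = resp.read().decode("utf-8", errors="replace")

with open(f"threadDump-{int(time.time())}.txt", "w") as out:
    out.write(body)
            {code}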
             

            EDIT 4 (August 3, 2017): We seem to have stabilized the server.

            In our setup, a pipeline build that was interrupted attempts to resume when Jenkins restarts. However, when it does so it uses the exact node name it had before, which in this case is a Docker container name with a unique hash that no longer exists. This causes the build to become a zombie and hang.

            The only indication on the file system that a pipeline job has not completed is that the most recent stage XML file in the workflow directory for a given build does not contain a "<result>" tag. We wrote a Python script that we ran on the Jenkins master server to look for that and delete any build that did not have a "<result>" tag. We would appreciate any feedback on this approach; the Python script is attached below.

            The workflow directory for a given build is:

            /var/lib/jenkins/jobs/<JOB_NAME>/builds/<BUILD_NUMBER>/workflow
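
            For reference, a minimal sketch of this kind of cleanup script is shown below. This is an illustrative reconstruction, not the attached script; the jobs root, the numeric build-directory layout, and the dry-run default are assumptions based on the description above:
            {code:python}
# Illustrative sketch: find pipeline builds whose newest workflow stage XML
# has no <result> tag and (optionally) delete the whole build directory.
# The paths and the dry-run default are assumptions, not the attached script.
import glob
import os
import shutil
import sys

JENKINS_JOBS = "/var/lib/jenkins/jobs"
DRY_RUN = "--delete" not in sys.argv  # only report unless --delete is passed

def latest_stage_xml(workflow_dir):
    """Return the highest-numbered XML file in the workflow dir, if any."""
    xmls = glob.glob(os.path.join(workflow_dir, "*.xml"))
    if not xmls:
        return None
    def stage_number(path):
        name = os.path.splitext(os.path.basename(path))[0]
        return int(name) if name.isdigit() else -1
    return max(xmls, key=stage_number)

def is_incomplete(build_dir):
    """A build looks unfinished if its newest stage XML has no <result> tag."""
    newest = latest_stage_xml(os.path.join(build_dir, "workflow"))
    if newest is None:
        return False  # not a pipeline build, or nothing recorded yet
    with open(newest, errors="replace") as f:
        return "<result>" not in f.read()

def main():
    for build_dir in glob.glob(os.path.join(JENKINS_JOBS, "*", "builds", "*")):
        # Skip symlinks such as lastSuccessfulBuild; real builds are numbered.
        if not os.path.isdir(build_dir) or not os.path.basename(build_dir).isdigit():
            continue
        if is_incomplete(build_dir):
            if DRY_RUN:
                print("would delete", build_dir)
            else:
                print("deleting", build_dir)
                shutil.rmtree(build_dir)

if __name__ == "__main__":
    main()
            {code}
            Run without arguments, a script like this would only print candidate builds; whether deleting whole build directories (rather than somehow marking them complete) is safe depends on the installation.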

             
            mpedreiro Marcos Pedreiro made changes -
            Attachment thread.dump [ 39109 ]
            mpedreiro Marcos Pedreiro made changes -
            Attachment heap.dump [ 39108 ]
            mpedreiro Marcos Pedreiro made changes -
            Description Attached are the heap and thread dumps from the Jenkins master server.
            mpedreiro Marcos Pedreiro made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            mpedreiro Marcos Pedreiro made changes -
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Closed [ 6 ]
            mpedreiro Marcos Pedreiro made changes -
            Resolution Fixed [ 1 ]
            Status Closed [ 6 ] Reopened [ 4 ]
            oleg_nenashev Oleg Nenashev added a comment -

            CC abayer and svanoort ^^^

            svanoort Sam Van Oort added a comment -

            Looks like a clone of https://issues.jenkins-ci.org/browse/JENKINS-36013 to me – I have a fix almost ready to go for that one

            oleg_nenashev Oleg Nenashev made changes -
            Link This issue is related to JENKINS-36013 [ JENKINS-36013 ]
            danielbeck Daniel Beck added a comment -

            Does the fix for JENKINS-36013 address this?

            svanoort Sam Van Oort added a comment -

            danielbeck yes, it should - albeit after a 5-minute timeout before the build aborts when it can't resume on the same node (a safety measure). mpedreiro can you try installing the latest version of the durable task step plugin and let us know if that doesn't settle it out (bear in mind that there is a delay before the build is killed when it cannot resume cleanly).

            Thanks!

            svanoort Sam Van Oort added a comment -

            Closing as duplicate of the linked issue, since the fix there should cover this as well.

            svanoort Sam Van Oort made changes -
            Link This issue duplicates JENKINS-36013 [ JENKINS-36013 ]
            svanoort Sam Van Oort made changes -
            Resolution Duplicate [ 3 ]
            Status Reopened [ 4 ] Closed [ 6 ]

            People

              Assignee: Unassigned
              Reporter: mpedreiro (Marcos Pedreiro)
              Votes: 0
              Watchers: 5
