Loading...

Type: Bug
Resolution: Duplicate
Priority: Critical
Component/s: core
Labels:

We are experiencing issues with our jenkins server where it will regularly lock the build queue such that no jobs can be scheduled, and no jobs in the queue start running. At this point we are also unable to use the /script endpoint to clear the build queue

We have been unable to determine the root cause of this issue, and restarting the jenkins server causes it to come up in a locked state as well. We have attempted to remove all the files that we know of that would tell Jenkins to resume jobs after a restart, but have been unable to find the comprehensive list. In particular are the pipeline jobs that come up and try to resume. Below is the list of files that we have tried removing:

/var/lib/jenkins/queue.xml
/var/lib/jenkins/queue.xml.bak
/var/lib/jenkins/org.jenkinsci.plugins.workflow.flow.FlowExecutionList.xml
/var/lib/jenkins/org.jenkinsci.plugins.workflow.flow.FlowExecutionList.bak
/var/lib/jenkins/org.jenkinsci.plugins.workflow.support.steps.StageStep.xml

Other triage steps we have taken:

Increased JENKINS_HANDLER_MAX from 300 to 600 in /etc/sysconfig/jenkins
Starting in shutdown mode with a clear build queue
disabling jobs that queue frequently

(first 3 edits were on the same day the ticket was created, Tuesday August 1, 2017)

edit:

We bumped the number of executors to 2, and we see jobs queueing and running, but only on the second executor, not the first one.

edit 2:

Added an image from the /monitoring endpoint, the active thread count is the only sign that something has gone wrong. No other metric appears to be affected in a meaningful way

edit 3:

We managed to get a thread dump from the /threadDump endpoint. There 3 different suspect thread dumps:

"AtmostOneTaskExecutor[Periodic Jenkins queue maintenance] [#417]" Id=19597 Group=main WAITING on com.google.common.util.concurrent.AbstractFuture$Sync@488e43ab at sun.misc.Unsafe.park(Native Method) - waiting on com.google.common.util.concurrent.AbstractFuture$Sync@488e43ab at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:275) at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:111) at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadGroupSynchronously(CpsStepContext.java:248) at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadSynchronously(CpsStepContext.java:237) at org.jenkinsci.plugins.workflow.cps.CpsStepContext.doGet(CpsStepContext.java:294) at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:61) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getNode(ExecutorStepExecution.java:259) at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.categoriesForPipeline(ThrottleQueueTaskDispatcher.java:411) at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.canRun(ThrottleQueueTaskDispatcher.java:168) at hudson.model.Queue.isBuildBlocked(Queue.java:1184) at hudson.model.Queue.maintain(Queue.java:1505) at hudson.model.Queue$1.call(Queue.java:320) at hudson.model.Queue$1.call(Queue.java:317) at jenkins.util.AtmostOneTaskExecutor$1.call(AtmostOneTaskExecutor.java:108) at jenkins.util.AtmostOneTaskExecutor$1.call(AtmostOneTaskExecutor.java:98) at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at hudson.remoting.AtmostOneThreadExecutor$Worker.run(AtmostOneThreadExecutor.java:110) at java.lang.Thread.run(Thread.java:748) Number of locked synchronizers = 1 - java.util.concurrent.locks.ReentrantLock$NonfairSync@529a7a50

"Computer.threadPoolForRemoting [#80]" Id=19565 Group=main BLOCKED on hudson.model.Queue@770f2152 owned by "Computer.threadPoolForRemoting [#84]" Id=19573 at com.nirima.jenkins.plugins.docker.strategy.DockerOnceRetentionStrategy$1.run(DockerOnceRetentionStrategy.java:110) - blocked on hudson.model.Queue@770f2152 at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) Number of locked synchronizers = 1 - java.util.concurrent.ThreadPoolExecutor$Worker@3677ff7e

"Computer.threadPoolForRemoting [#81]" Id=19569 Group=main TIMED_WAITING on java.util.concurrent.SynchronousQueue$TransferStack@18c5c88 at sun.misc.Unsafe.park(Native Method) - waiting on java.util.concurrent.SynchronousQueue$TransferStack@18c5c88 at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460) at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362) at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748)

EDIT 4 (08-03-17|August 3): We seem to have stabilized the server.

In our set up, a pipeline build that was interrupted would attempt to resume upon Jenkins restart. However when it does so it uses the exact node name that it had before which in this case is a docker container name with a unique hash that no longer exists. This causes the build to zombie itself and hang.

The only indication on the file system that a pipeline job has not completed is that most recent stage xml file in the workflow directory for a given build does not contain a "<result>" tag. We wrote a python script that we ran on the jenkins master server to look for that and delete any build that did not have a "<result>" tag. We would appreciate any feed back on this approach, the python script is attached below.

The directory for a given job is:

/var/lib/jenkins/jobs/<JOB_NAME>/builds/<BUILD_NUMBER>/workflow

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

active thead count.png
61 kB
2017-08-01 22:45
clean_zombie_jobs.py
3 kB
2017-08-04 04:49

duplicates

JENKINS-36013 Automatically abort ExecutorPickle rehydration from an ephemeral node

Closed

is related to

JENKINS-33761 Ability to disable Pipeline durability and "resume" build.

Closed

JENKINS-36013 Automatically abort ExecutorPickle rehydration from an ephemeral node

Closed

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates