- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Labels: None
We periodically see the spot fleet cause the master to deadlock on scale-up events (I think this also occurs on scale-down events, but I don't have logs for that yet). The master stays up and appears functional, but the queue is locked and you can't submit new builds via the UI. I see this in the Jenkins log:
Jun 22, 2017 3:10:31 AM hudson.remoting.SynchronousCommandTransport$ReaderThread run
SEVERE: I/O error in channel i-005ce3b8f5ae8c029
java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:73)
Caused by: java.io.EOFException
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2353)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2822)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:804)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:301)
    at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:59)
Jun 22, 2017 3:11:32 AM com.amazon.jenkins.ec2fleet.EC2FleetCloud updateStatus
INFO: Found new instances from fleet (docker_ci ec2-fleet ubuntu-16.04): [i-03a1246c8e590d6eb]
Jun 22, 2017 3:11:32 AM com.amazon.jenkins.ec2fleet.IdleRetentionStrategy <init>
INFO: Idle Retention initiated
Jun 22, 2017 3:12:10 AM jenkins.metrics.api.Metrics$HealthChecker execute
WARNING: Some health checks are reporting as unhealthy: [thread-deadlock : [jenkins.util.Timer [#8] locked on java.util.concurrent.locks.ReentrantLock$NonfairSync@38b86b17 (owned by jenkins.util.Timer [#4]):
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
    at hudson.model.Queue._withLock(Queue.java:1332)
    at hudson.model.Queue.withLock(Queue.java:1211)
    at jenkins.model.Nodes.addNode(Nodes.java:133)
    at jenkins.model.Jenkins.addNode(Jenkins.java:2115)
    at com.amazon.jenkins.ec2fleet.EC2FleetCloud.addNewSlave(EC2FleetCloud.java:355)
    at com.amazon.jenkins.ec2fleet.EC2FleetCloud.updateStatus(EC2FleetCloud.java:312)
    at com.amazon.jenkins.ec2fleet.CloudNanny.doRun(CloudNanny.java:42)
    at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:50)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
, jenkins.util.Timer [#4] locked on com.amazon.jenkins.ec2fleet.EC2FleetCloud@5ba7db19 (owned by jenkins.util.Timer [#8]):
    at com.amazon.jenkins.ec2fleet.IdleRetentionStrategy.check(IdleRetentionStrategy.java:38)
    at com.amazon.jenkins.ec2fleet.IdleRetentionStrategy.check(IdleRetentionStrategy.java:15)
    at hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
    at hudson.model.Queue._withLock(Queue.java:1334)
    at hudson.model.Queue.withLock(Queue.java:1211)
    at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
    at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:50)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
]]
Jun 22, 2017 3:15:09 AM hudson.model.AsyncPeriodicWork$1 run
INFO: Started EC2 alive slaves monitor
Jun 22, 2017 3:15:09 AM hudson.model.AsyncPeriodicWork$1 run
INFO: Finished EC2 alive slaves monitor. 0 ms
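Reading the health-check report: Timer #8 holds the EC2FleetCloud@5ba7db19 monitor (updateStatus -> addNewSlave) and is waiting for the Queue's ReentrantLock inside Jenkins.addNode, while Timer #4 already holds the Queue lock (ComputerRetentionWork runs under Queue.withLock) and is waiting for that same cloud monitor in IdleRetentionStrategy.check. A minimal, self-contained sketch of that acquisition order is below; the class and lock names are illustrative stand-ins, not the plugin's actual code, and the sleeps only widen the race window so the hang reproduces reliably.

{code:java}
import java.util.concurrent.locks.ReentrantLock;

// Illustrative stand-ins for the two locks seen in the trace above.
public class DeadlockSketch {
    // Stands in for hudson.model.Queue's internal lock (Queue.withLock).
    static final ReentrantLock queueLock = new ReentrantLock();
    // Stands in for the EC2FleetCloud instance whose synchronized methods
    // serialize updateStatus() and the retention check.
    static final Object cloudMonitor = new Object();

    public static void main(String[] args) {
        // Timer #8 in the log: holds the cloud monitor, then needs the queue lock
        // (updateStatus -> addNewSlave -> Jenkins.addNode -> Queue.withLock).
        Thread scaleUp = new Thread(() -> {
            synchronized (cloudMonitor) {
                sleep(100);
                queueLock.lock();   // blocks: the retention thread owns the queue lock
                try { /* addNode(...) would run here */ } finally { queueLock.unlock(); }
            }
        }, "timer-8-updateStatus");

        // Timer #4 in the log: holds the queue lock, then needs the cloud monitor
        // (ComputerRetentionWork under Queue.withLock -> IdleRetentionStrategy.check).
        Thread retention = new Thread(() -> {
            queueLock.lock();
            try {
                sleep(100);
                synchronized (cloudMonitor) { /* check(...) would run here */ }  // blocks
            } finally {
                queueLock.unlock();
            }
        }, "timer-4-retentionCheck");

        scaleUp.start();
        retention.start();  // with the sleeps, both threads park and never make progress
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
{code}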
I'm not sure why this doesn't happen all the time. It appears that some of the slaves failed to come up; I wonder if that is the culprit. I also wonder if we can do better than the big lock we place around the master when doing scale up/down; one possible direction is sketched below. I haven't looked deeply at the code, but the other aws-ec2 plugin doesn't seem to hold such a large lock.
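One way to "do better than the big lock" (my assumption, not something the plugin does today) would be to never hold the cloud's own monitor while taking the Queue lock: snapshot whatever needs registering under the cloud lock, release it, and only then register the nodes. A generic, self-contained sketch of that pattern, with made-up names standing in for Queue.withLock, the cloud instance, and Jenkins.addNode:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Generic "split the critical sections" pattern; not the plugin's actual code.
public class SplitLockSketch {
    static final ReentrantLock queueLock = new ReentrantLock();   // stands in for Queue.withLock
    static final Object cloudMonitor = new Object();              // stands in for the cloud instance
    static final List<String> pendingInstances = new ArrayList<>();

    static void updateStatus() {
        List<String> snapshot;
        synchronized (cloudMonitor) {           // touch cloud-internal state only
            snapshot = new ArrayList<>(pendingInstances);
            pendingInstances.clear();
        }
        for (String instanceId : snapshot) {    // queue lock taken with no cloud lock held
            queueLock.lock();
            try {
                registerNode(instanceId);       // stands in for Jenkins.addNode(...)
            } finally {
                queueLock.unlock();
            }
        }
    }

    static void registerNode(String instanceId) {
        System.out.println("registered " + instanceId);
    }

    public static void main(String[] args) {
        synchronized (cloudMonitor) {
            pendingInstances.add("i-03a1246c8e590d6eb");
        }
        updateStatus();
    }
}
{code}

Because the two locks are never held at the same time, the cyclic wait seen in the health check can't form, regardless of what the retention check does while it holds the queue lock.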
- Relates to: JENKINS-37483 "Deadlock caused by synchronized methods in EC2Cloud" (Closed)