Type: Bug
Resolution: Fixed
Priority: Critical
Component/s: ec2-fleet-plugin
Labels: None

We periodically see the spot-fleet deadlock the master on scale-up events (I think this also occurs on scale-down events, but I don't have logs for that yet). The master stays up and appears functional, but the queue is locked and you can't submit new builds via the UI. I see this in the Jenkins log:

Jun 22, 2017 3:10:31 AM hudson.remoting.SynchronousCommandTransport$ReaderThread run
SEVERE: I/O error in channel i-005ce3b8f5ae8c029
java.io.IOException: Unexpected termination of the channel
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:73)
Caused by: java.io.EOFException
at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2353)
at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2822)
at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:804)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:301)
at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:59)

Jun 22, 2017 3:11:32 AM com.amazon.jenkins.ec2fleet.EC2FleetCloud updateStatus
INFO: Found new instances from fleet (docker_ci ec2-fleet ubuntu-16.04): [i-03a1246c8e590d6eb]
Jun 22, 2017 3:11:32 AM com.amazon.jenkins.ec2fleet.IdleRetentionStrategy <init>
INFO: Idle Retention initiated
Jun 22, 2017 3:12:10 AM jenkins.metrics.api.Metrics$HealthChecker execute
WARNING: Some health checks are reporting as unhealthy: [thread-deadlock : [jenkins.util.Timer [#8] locked on java.util.concurrent.locks.ReentrantLock$NonfairSync@38b86b17 (owned by jenkins.util.Timer [#4]):
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at hudson.model.Queue._withLock(Queue.java:1332)
at hudson.model.Queue.withLock(Queue.java:1211)
at jenkins.model.Nodes.addNode(Nodes.java:133)
at jenkins.model.Jenkins.addNode(Jenkins.java:2115)
at com.amazon.jenkins.ec2fleet.EC2FleetCloud.addNewSlave(EC2FleetCloud.java:355)
at com.amazon.jenkins.ec2fleet.EC2FleetCloud.updateStatus(EC2FleetCloud.java:312)
at com.amazon.jenkins.ec2fleet.CloudNanny.doRun(CloudNanny.java:42)
at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:50)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
, jenkins.util.Timer [#4] locked on com.amazon.jenkins.ec2fleet.EC2FleetCloud@5ba7db19 (owned by jenkins.util.Timer [#8]):
at com.amazon.jenkins.ec2fleet.IdleRetentionStrategy.check(IdleRetentionStrategy.java:38)
at com.amazon.jenkins.ec2fleet.IdleRetentionStrategy.check(IdleRetentionStrategy.java:15)
at hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
at hudson.model.Queue._withLock(Queue.java:1334)
at hudson.model.Queue.withLock(Queue.java:1211)
at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:50)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
]]
Jun 22, 2017 3:15:09 AM hudson.model.AsyncPeriodicWork$1 run
INFO: Started EC2 alive slaves monitor
Jun 22, 2017 3:15:09 AM hudson.model.AsyncPeriodicWork$1 run
INFO: Finished EC2 alive slaves monitor. 0 ms
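
For anyone who wants to confirm the same condition on a live JVM, here is a minimal sketch that asks the JVM for deadlocked threads through the standard java.lang.management API. DeadlockProbe and describeDeadlocks are hypothetical names used only for illustration; I am assuming (not claiming) that the thread-deadlock health check above relies on a similar query.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Jenkins or the ec2-fleet plugin.
class DeadlockProbe {
    // Returns one line per deadlocked thread, or an empty string if none are found.
    static String describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Covers both object monitors (synchronized) and ownable
        // synchronizers such as ReentrantLock; returns null when no
        // deadlock is detected.
        long[] ids = mx.findDeadlockedThreads();
        if (ids == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            sb.append(info.getThreadName())
              .append(" locked on ").append(info.getLockName())
              .append(" (owned by ").append(info.getLockOwnerName())
              .append(")\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String report = describeDeadlocks();
        System.out.println(report.isEmpty() ? "no deadlock detected" : report);
    }
}
```

In the trace above, such a query would be expected to list two threads: jenkins.util.Timer [#8] waiting on the Queue's ReentrantLock and jenkins.util.Timer [#4] waiting on the EC2FleetCloud monitor.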

I'm not sure why this doesn't happen all the time. It appears that some of the slaves failed to come up; I wonder if that is a culprit. I also wonder if we can do better than the big lock that we place around the master when scaling up or down. I haven't looked deeply at the code, but the other aws-ec2 plugin doesn't seem to hold such a large lock.
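
Reading the trace, Timer [#8] holds the EC2FleetCloud monitor (in updateStatus) and waits for the Queue lock inside addNode, while Timer [#4] holds the Queue lock (ComputerRetentionWork) and waits for the EC2FleetCloud monitor in IdleRetentionStrategy.check. One possible way to shrink the lock is sketched below: update the cloud's own state under its monitor, release it, and only then call into the Queue. This is a sketch under assumptions, not the plugin's code or its actual fix. FleetCloudSketch, QUEUE_LOCK, knownCount and the private addNode are hypothetical stand-ins for EC2FleetCloud, the Jenkins Queue lock, the retention check, and Jenkins.addNode.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for EC2FleetCloud; none of this is the plugin's API.
class FleetCloudSketch {
    // Plays the role of the lock taken by Queue.withLock().
    static final ReentrantLock QUEUE_LOCK = new ReentrantLock();

    private final List<String> known = new ArrayList<>();
    final List<String> registered = new ArrayList<>();

    // Stand-in for Jenkins.addNode(), which takes the queue lock internally.
    private void addNode(String id) {
        // Enforce the ordering rule: never ask for the queue lock while
        // still holding this object's monitor.
        if (Thread.holdsLock(this)) {
            throw new IllegalStateException("addNode called under cloud monitor");
        }
        QUEUE_LOCK.lock();
        try {
            registered.add(id);
        } finally {
            QUEUE_LOCK.unlock();
        }
    }

    // Stand-in for updateStatus(): the monitor guards only the cloud's own state.
    void updateStatus(List<String> fleetInstances) {
        List<String> fresh = new ArrayList<>();
        synchronized (this) {
            for (String id : fleetInstances) {
                if (!known.contains(id)) {
                    known.add(id);
                    fresh.add(id);
                }
            }
        }
        // Monitor released here, so a thread that holds the queue lock and
        // wants this monitor can no longer complete a cycle with us.
        for (String id : fresh) {
            addNode(id);
        }
    }

    // Stand-in for the retention check, which is called with the queue lock held.
    synchronized int knownCount() {
        return known.size();
    }

    public static void main(String[] args) throws InterruptedException {
        FleetCloudSketch cloud = new FleetCloudSketch();
        Thread retention = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                QUEUE_LOCK.lock(); // queue lock first, then the cloud monitor
                try {
                    cloud.knownCount();
                } finally {
                    QUEUE_LOCK.unlock();
                }
            }
        });
        Thread update = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                cloud.updateStatus(Arrays.asList("i-" + i));
            }
        });
        retention.start();
        update.start();
        retention.join();
        update.join();
        System.out.println("both threads finished, nodes registered: " + cloud.registered.size());
    }
}
```

The trade-off is that the instance list and the registered nodes are briefly out of step between the two phases, so the real code would need to tolerate a node that is known to the cloud but not yet added to Jenkins.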

Relates to: JENKINS-37483 Deadlock caused by synchronized methods in EC2Cloud (Closed)
