Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-45074

spot-fleet plugin deadlocks master when scaling up

    XMLWordPrintable

Details

    Description

      We see periodically that the spot-fleet causes the master to deadlock on scale up events (I think this also occurs on scale-down events too but I don't have logs for that yet). The master stays up and appears functional, but the queue is locked and you can't submit new builds via the UI. I see this in the Jenkins log:

      Jun 22, 2017 3:10:31 AM hudson.remoting.SynchronousCommandTransport$ReaderThread run
      SEVERE: I/O error in channel i-005ce3b8f5ae8c029
      java.io.IOException: Unexpected termination of the channel
      at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:73)
      Caused by: java.io.EOFException
      at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2353)
      at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2822)
      at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:804)
      at java.io.ObjectInputStream.<init>(ObjectInputStream.java:301)
      at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
      at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
      at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:59)
      
      Jun 22, 2017 3:11:32 AM com.amazon.jenkins.ec2fleet.EC2FleetCloud updateStatus
      INFO: Found new instances from fleet (docker_ci ec2-fleet ubuntu-16.04): [i-03a1246c8e590d6eb]
      Jun 22, 2017 3:11:32 AM com.amazon.jenkins.ec2fleet.IdleRetentionStrategy <init>
      INFO: Idle Retention initiated
      Jun 22, 2017 3:12:10 AM jenkins.metrics.api.Metrics$HealthChecker execute
      WARNING: Some health checks are reporting as unhealthy: [thread-deadlock : [jenkins.util.Timer [#8] locked on java.util.concurrent.locks.ReentrantLock$NonfairSync@38b86b17 (owned by jenkins.util.Timer [#4]):
      at sun.misc.Unsafe.park(Native Method)
      at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
      at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
      at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
      at hudson.model.Queue._withLock(Queue.java:1332)
      at hudson.model.Queue.withLock(Queue.java:1211)
      at jenkins.model.Nodes.addNode(Nodes.java:133)
      at jenkins.model.Jenkins.addNode(Jenkins.java:2115)
      at com.amazon.jenkins.ec2fleet.EC2FleetCloud.addNewSlave(EC2FleetCloud.java:355)
      at com.amazon.jenkins.ec2fleet.EC2FleetCloud.updateStatus(EC2FleetCloud.java:312)
      at com.amazon.jenkins.ec2fleet.CloudNanny.doRun(CloudNanny.java:42)
      at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:50)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
      , jenkins.util.Timer [#4] locked on com.amazon.jenkins.ec2fleet.EC2FleetCloud@5ba7db19 (owned by jenkins.util.Timer [#8]):
      at com.amazon.jenkins.ec2fleet.IdleRetentionStrategy.check(IdleRetentionStrategy.java:38)
      at com.amazon.jenkins.ec2fleet.IdleRetentionStrategy.check(IdleRetentionStrategy.java:15)
      at hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
      at hudson.model.Queue._withLock(Queue.java:1334)
      at hudson.model.Queue.withLock(Queue.java:1211)
      at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
      at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:50)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
      ]]
      Jun 22, 2017 3:15:09 AM hudson.model.AsyncPeriodicWork$1 run
      INFO: Started EC2 alive slaves monitor
      Jun 22, 2017 3:15:09 AM hudson.model.AsyncPeriodicWork$1 run
      INFO: Finished EC2 alive slaves monitor. 0 ms
      

       

      I'm not sure why this doesn't happen all the time. It appears that some of the slaves failed to come up, I wonder if that is a culprit. I also wonder if we can do better than the big lock that we place around the master when doing scale up/down. I haven't looked deeply at the code but the other aws-ec2 plugin doesn't seem to hold such an large lock.

      Attachments

        Issue Links

          Activity

            People

              schmutze Chad Schmutzer
              elatt Erik Lattimore
              Votes:
              3 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: