JENKINS-38115

spot instance slaves are not removed, blocks instance termination


    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Component: ec2-fleet-plugin
    • Jenkins ver. 2.7.1
      ec2-fleet-plugin ver. 1.0
      Amazon Linux AMI 2015.03
      java version "1.7.0_85"
      OpenJDK Runtime Environment (amzn-2.6.1.3.61.amzn1-x86_64 u85-b01)
      OpenJDK 64-Bit Server VM (build 24.85-b03, mixed mode)

      Scenario:
      I've created a fleet under Clouds in the Jenkins config and selected my spot request. The test returns an error, which may or may not be related. The IDs I've redacted all match the IDs I see in the EC2 management console.

      The SpotFleetRequestId(s) sfr-<stuff> do not exist. (Service: AmazonEC2; Status Code: 400; Error Code: InvalidSpotFleetRequestId.NotFound; Request ID: <other stuff> )

      To test the plugin, I set it up with a max cluster size of 3 and a scaledown timeout of 15 minutes. The spot request capacity defaults to 1.

      Expected:
      When the test job runs, two additional slaves are created, backed by new spot instances, and they are removed once the job is done and the scaledown timeout hits.

      Actual:
      Everything runs fine until the job is done and the scaledown timeout hits, at which point the slaves are not removed. One or more may show as disconnected, and the log begins to show errors:

      SEVERE: I/O error in channel i-cdcef210
      java.io.IOException: Unexpected termination of the channel
      at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
      Caused by: java.io.EOFException
      at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2332)
      at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2801)
      at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801)
      at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
      at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
      at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
      at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)

      SEVERE: Timer task hudson.slaves.ComputerRetentionWork@5654d212 failed
      java.lang.IllegalStateException: Unknown instance terminated: i-cdcef210
      at com.amazon.jenkins.ec2fleet.EC2FleetCloud.terminateInstance(EC2FleetCloud.java:266)
      at com.amazon.jenkins.ec2fleet.IdleRetentionStrategy.check(IdleRetentionStrategy.java:28)
      at com.amazon.jenkins.ec2fleet.IdleRetentionStrategy.check(IdleRetentionStrategy.java:12)
      at hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
      at hudson.model.Queue._withLock(Queue.java:1315)
      at hudson.model.Queue.withLock(Queue.java:1192)
      at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
      at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:50)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      at java.lang.Thread.run(Thread.java:745)

      The second of those errors repeats until the slave is manually removed (which sometimes results in a Jenkins stack trace in the UI with the same error), at which point the rest of the process is unblocked: the spot instances are terminated and the slaves are removed.

      The plugin is essentially nonfunctional in this state.
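
      To illustrate why the IllegalStateException above keeps repeating, here is a minimal sketch of the failure mode. It is a simplified model and not the plugin's code: the class `RetentionSketch`, the nested `FleetCloud`, and the methods `terminateStrict`, `terminateTolerant`, and `failedCycles` are all hypothetical names made up for this example. It assumes that a periodic retention check asks the cloud to terminate an idle instance the cloud does not know about. A strict terminate that throws fails on every cycle, while a tolerant one treats the unknown instance as already gone. Whether the actual fix took this form is not stated in this report.

```java
import java.util.HashSet;
import java.util.Set;

public class RetentionSketch {

    // Hypothetical stand-in for the cloud's bookkeeping of managed instances.
    static final class FleetCloud {
        private final Set<String> knownInstances = new HashSet<>();

        void register(String instanceId) {
            knownInstances.add(instanceId);
        }

        // Strict variant: throws when asked to terminate an instance it does
        // not track, modelled on the message seen in the stack trace.
        boolean terminateStrict(String instanceId) {
            if (!knownInstances.remove(instanceId)) {
                throw new IllegalStateException("Unknown instance terminated: " + instanceId);
            }
            return true;
        }

        // Tolerant variant (an assumption, not the plugin's behaviour):
        // an unknown instance is reported as "nothing to do" instead of an error.
        boolean terminateTolerant(String instanceId) {
            return knownInstances.remove(instanceId);
        }
    }

    // Simulates `cycles` runs of a periodic retention check for one idle
    // agent and returns how many of those runs failed with an exception.
    static int failedCycles(FleetCloud cloud, String instanceId, int cycles, boolean strict) {
        int failures = 0;
        for (int i = 0; i < cycles; i++) {
            try {
                if (strict) {
                    cloud.terminateStrict(instanceId);
                } else {
                    cloud.terminateTolerant(instanceId);
                }
            } catch (IllegalStateException e) {
                // Nothing changes between cycles, so the next run fails the same way.
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        FleetCloud cloud = new FleetCloud();
        // The instance is never registered, so the cloud does not know it.
        System.out.println(failedCycles(cloud, "i-cdcef210", 3, true));  // prints 3
        System.out.println(failedCycles(cloud, "i-cdcef210", 3, false)); // prints 0
    }
}
```

      In this model the strict variant fails on all three cycles because nothing removes the stale agent between runs, which matches the report that the error only stops once the slave is removed by hand.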

            cyberax Aleksei Besogonov
            jvais_soasta jude vais