
EC2 plugin too aggressive in timing in contacting new AWS instance over SSH

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component/s: ec2-plugin
    • Labels: None
    • Environment: EC2 Plugin v 1.29
      Jenkins 1.624

      About once in every 10 instance launches, the EC2 plugin opens an SSH connection to the new instance before, I believe, the AWS infrastructure has put the corresponding key in place during the launch of the instance. As a result, the node launch fails with authentication errors. The output for the launch is below.

      just before slave CentOS (i-f295e51b) gets launched ...
      executing pre-launch scripts ...
      Node CentOS (i-f295e51b)(i-f295e51b) is still pending/launching, waiting 5s
      Node CentOS (i-f295e51b)(i-f295e51b) is still pending/launching, waiting 5s
      Node CentOS (i-f295e51b)(i-f295e51b) is still pending/launching, waiting 5s
      Node CentOS (i-f295e51b)(i-f295e51b) is still pending/launching, waiting 5s
      Node CentOS (i-f295e51b)(i-f295e51b) is still pending/launching, waiting 5s
      Node CentOS (i-f295e51b)(i-f295e51b) is still pending/launching, waiting 5s
      Node CentOS (i-f295e51b)(i-f295e51b) is ready
      Connecting to 10.240.1.146 on port 22, with timeout 10000.
      Failed to connect via ssh: The kexTimeout (10000 ms) expired.
      Waiting for SSH to come up. Sleeping 5.
      Connecting to 10.240.1.146 on port 22, with timeout 10000.
      Failed to connect via ssh: The kexTimeout (10000 ms) expired.
      Waiting for SSH to come up. Sleeping 5.
      Connecting to 10.240.1.146 on port 22, with timeout 10000.
      Failed to connect via ssh: The kexTimeout (10000 ms) expired.
      Waiting for SSH to come up. Sleeping 5.
      Connecting to 10.240.1.146 on port 22, with timeout 10000.
      Connected via SSH.
      bootstrap()
      Getting keypair...
      Using key: jenkins
      dc:xx:xx:xx
      ----BEGIN RSA PRIVATE KEY----
      MIIEow<private key info>
      Authenticating as centos
      Authentication failed. Trying again...
      Authenticating as centos
      Authentication failed. Trying again...
      Authenticating as centos
      Authentication failed. Trying again...
      Authenticating as centos
      Authentication failed. Trying again...
      Authenticating as centos
      Authentication failed. Trying again...
      Authenticating as centos
      ERROR: Publickey authentication failed.
      java.io.IOException: Publickey authentication failed.
      at com.trilead.ssh2.auth.AuthenticationManager.authenticatePublicKey(AuthenticationManager.java:315)
      at com.trilead.ssh2.Connection.authenticateWithPublicKey(Connection.java:467)
      at hudson.plugins.ec2.ssh.EC2UnixLauncher.bootstrap(EC2UnixLauncher.java:260)
      at hudson.plugins.ec2.ssh.EC2UnixLauncher.launch(EC2UnixLauncher.java:91)
      at hudson.plugins.ec2.EC2ComputerLauncher.launch(EC2ComputerLauncher.java:107)
      at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:238)
      at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
      Caused by: java.io.IOException: The connection is closed.
      at com.trilead.ssh2.auth.AuthenticationManager.deQueue(AuthenticationManager.java:63)
      at com.trilead.ssh2.auth.AuthenticationManager.getNextMessage(AuthenticationManager.java:86)
      at com.trilead.ssh2.auth.AuthenticationManager.authenticatePublicKey(AuthenticationManager.java:290)
      ... 10 more
      Caused by: java.io.IOException: Peer sent DISCONNECT message (reason code 2): Too many authentication failures for centos
      at com.trilead.ssh2.transport.TransportManager.receiveLoop(TransportManager.java:766)
      at com.trilead.ssh2.transport.TransportManager$1.run(TransportManager.java:489)
      ... 1 more

          [JENKINS-30284] EC2 plugin too aggressive in timing in contacting new AWS instance over SSH

          Mike Kingsbury created issue -

          Tenyo Grozev added a comment -

          We are running into the same problem with a CentOS slave (not sure if the OS matters), and it seems to happen in approximately 1 in 10 launches. Our output looks the same as the output in the description above.
          When that happens, the affected slave just says "Ping response time is too long or timed out." and hangs around until we delete it.

          E Camden Fisher added a comment -

          Same issue here, running CentOS 7 slaves. The output looks the same as above. Please make these timings configurable! Thanks

          Node Jenkins Docker Slave (i-386b0fc1)(i-386b0fc1) is still pending/launching, waiting 5s
          Node Jenkins Docker Slave (i-386b0fc1)(i-386b0fc1) is still pending/launching, waiting 5s
          Node Jenkins Docker Slave (i-386b0fc1)(i-386b0fc1) is still pending/launching, waiting 5s
          Node Jenkins Docker Slave (i-386b0fc1)(i-386b0fc1) is still pending/launching, waiting 5s
          Node Jenkins Docker Slave (i-386b0fc1)(i-386b0fc1) is still pending/launching, waiting 5s
          Node Jenkins Docker Slave (i-386b0fc1)(i-386b0fc1) is still pending/launching, waiting 5s
          Node Jenkins Docker Slave (i-386b0fc1)(i-386b0fc1) is ready
          Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
          Failed to connect via ssh: The kexTimeout (10000 ms) expired.
          Waiting for SSH to come up. Sleeping 5.
          Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
          Failed to connect via ssh: The kexTimeout (10000 ms) expired.
          Waiting for SSH to come up. Sleeping 5.
          Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
          Failed to connect via ssh: The kexTimeout (10000 ms) expired.
          Waiting for SSH to come up. Sleeping 5.
          Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
          Failed to connect via ssh: The kexTimeout (10000 ms) expired.
          Waiting for SSH to come up. Sleeping 5.
          Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
          Failed to connect via ssh: The kexTimeout (10000 ms) expired.
          Waiting for SSH to come up. Sleeping 5.
          Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
          Failed to connect via ssh: The kexTimeout (10000 ms) expired.
          Waiting for SSH to come up. Sleeping 5.
          Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
          Failed to connect via ssh: The kexTimeout (10000 ms) expired.
          Waiting for SSH to come up. Sleeping 5.
          Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
          Failed to connect via ssh: The kexTimeout (10000 ms) expired.
          Waiting for SSH to come up. Sleeping 5.
          Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
          Failed to connect via ssh: The kexTimeout (10000 ms) expired.
          Waiting for SSH to come up. Sleeping 5.
          Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
          Failed to connect via ssh: There was a problem while connecting to ec2-52-91-128-246.compute-1.amazonaws.com:22
          Waiting for SSH to come up. Sleeping 5.
          Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
          Failed to connect via ssh: There was a problem while connecting to ec2-52-91-128-246.compute-1.amazonaws.com:22
          Waiting for SSH to come up. Sleeping 5.
          Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
          Failed to connect via ssh: There was a problem while connecting to ec2-52-91-128-246.compute-1.amazonaws.com:22
          Waiting for SSH to come up. Sleeping 5.
          Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
          Failed to connect via ssh: There was a problem while connecting to ec2-52-91-128-246.compute-1.amazonaws.com:22
          Waiting for SSH to come up. Sleeping 5.
          Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
          Failed to connect via ssh: There was a problem while connecting to ec2-52-91-128-246.compute-1.amazonaws.com:22
          Waiting for SSH to come up. Sleeping 5.
          Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
          Failed to connect via ssh: The kexTimeout (10000 ms) expired.
          Waiting for SSH to come up. Sleeping 5.
          Connecting to ec2-52-91-128-246.compute-1.amazonaws.com on port 22, with timeout 10000.
          Connected via SSH.
          bootstrap()
          Getting keypair...
          Using key: np-dev-key
          92:xx:yy:zz
          -----BEGIN RSA PRIVATE KEY-----
          MIIEoXXXX<snip>
          Authenticating as centos
          Authentication failed. Trying again...
          Authenticating as centos
          Authentication failed. Trying again...
          Authenticating as centos
          Authentication failed. Trying again...
          Authenticating as centos
          Authentication failed. Trying again...
          Authenticating as centos
          Authentication failed. Trying again...
          Authenticating as centos
          ERROR: Publickey authentication failed.
          java.io.IOException: Publickey authentication failed.
              at com.trilead.ssh2.auth.AuthenticationManager.authenticatePublicKey(AuthenticationManager.java:315)
              at com.trilead.ssh2.Connection.authenticateWithPublicKey(Connection.java:467)
              at hudson.plugins.ec2.ssh.EC2UnixLauncher.bootstrap(EC2UnixLauncher.java:260)
              at hudson.plugins.ec2.ssh.EC2UnixLauncher.launch(EC2UnixLauncher.java:91)
              at hudson.plugins.ec2.EC2ComputerLauncher.launch(EC2ComputerLauncher.java:107)
              at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:253)
              at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
          Caused by: java.io.IOException: The connection is closed.
              at com.trilead.ssh2.auth.AuthenticationManager.deQueue(AuthenticationManager.java:63)
              at com.trilead.ssh2.auth.AuthenticationManager.getNextMessage(AuthenticationManager.java:86)
              at com.trilead.ssh2.auth.AuthenticationManager.authenticatePublicKey(AuthenticationManager.java:290)
              ... 10 more
          Caused by: java.io.IOException: Peer sent DISCONNECT message (reason code 2): Too many authentication failures for centos
              at com.trilead.ssh2.transport.TransportManager.receiveLoop(TransportManager.java:766)
              at com.trilead.ssh2.transport.TransportManager$1.run(TransportManager.java:489)
              ... 1 more

          E Camden Fisher added a comment -

          This seems to be getting worse, maybe because of extra load on AWS. Even the ability to tweak these timeouts would be very helpful. I'm manually killing off 10 to 15 slaves a day now.

          Francis Upton added a comment -

          Currently the timeout for this is set at 10000 ms (10 seconds). Should the default be higher? We can certainly add a mechanism to make this configurable.
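
One way the timeout could be made configurable is through JVM system properties. This is a minimal sketch, not the plugin's actual code: the `LaunchTimeouts` class and the property names `ec2.launch.kexTimeoutMillis` and `ec2.launch.sshRetrySleepSeconds` are hypothetical, and the defaults (10000 ms and 5 s) are taken from the launch log in the description.

```java
class LaunchTimeouts {
    // Hypothetical property names, used only for illustration.
    static final String KEX_TIMEOUT_PROP = "ec2.launch.kexTimeoutMillis";
    static final String RETRY_SLEEP_PROP = "ec2.launch.sshRetrySleepSeconds";

    /** Reads a positive integer system property, falling back to the default. */
    public static int positiveIntProperty(String name, int defaultValue) {
        // Integer.getInteger returns null if the property is unset or not a number.
        Integer value = Integer.getInteger(name);
        return (value != null && value > 0) ? value : defaultValue;
    }

    public static void main(String[] args) {
        int kexTimeoutMillis = positiveIntProperty(KEX_TIMEOUT_PROP, 10000);
        int retrySleepSeconds = positiveIntProperty(RETRY_SLEEP_PROP, 5);
        System.out.println("kex timeout " + kexTimeoutMillis + " ms, retry sleep "
                + retrySleepSeconds + " s");
    }
}
```

With this shape, starting the JVM with `-Dec2.launch.kexTimeoutMillis=30000` would raise the timeout without a code change, and unset or invalid values keep the current behavior.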

          E Camden Fisher added a comment -

          Looking more closely at this, I'm not 100% sure it's related to a timeout. It may be a chicken-and-egg situation where AWS hasn't dropped the SSH key yet. Would it be possible to retry the SSH authentication some configurable number of times if it fails?

          Francis Upton added a comment - - edited

          Below is our code. It looks like the problem is that we retry too frequently: after 5 failures, CentOS closes the connection. But I'm not sure what we are waiting for to change on the slave so that authentication eventually succeeds. Ten seconds between tries seems like a pretty long time.

          Any ideas of what it's waiting for?

          private int bootstrap(Connection bootstrapConn, EC2Computer computer, PrintStream logger)
                  throws IOException, InterruptedException, AmazonClientException {
              logger.println("bootstrap()");
              boolean closeBootstrap = true; // close the connection unless authentication succeeds
              try {
                  int tries = 20;
                  boolean isAuthenticated = false;
                  logger.println("Getting keypair...");
                  KeyPair key = computer.getCloud().getKeyPair();
                  // Logs the key name, the fingerprint, and the first 160 characters of the key material.
                  logger.println("Using key: " + key.getKeyName() + "\n" + key.getKeyFingerprint() + "\n"
                          + key.getKeyMaterial().substring(0, 160));
                  // Up to 20 attempts, 10 seconds apart, all on the same SSH connection.
                  while (tries-- > 0) {
                      logger.println("Authenticating as " + computer.getRemoteAdmin());
                      isAuthenticated = bootstrapConn.authenticateWithPublicKey(computer.getRemoteAdmin(),
                              key.getKeyMaterial().toCharArray(), "");
                      if (isAuthenticated) {
                          break;
                      }
                      logger.println("Authentication failed. Trying again...");
                      Thread.sleep(10000);
                  }
                  if (!isAuthenticated) {
                      logger.println("Authentication failed");
                      return FAILED;
                  }
                  closeBootstrap = false;
                  return SAMEUSER;
              } finally {
                  if (closeBootstrap) {
                      bootstrapConn.close();
                  }
              }
          }
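
The launch logs show five failed attempts followed by "Too many authentication failures" on the sixth, which is consistent with OpenSSH's default `MaxAuthTries` of 6 being exhausted on a single connection. The following is a minimal simulation under that assumption. `FakeServer`, `FakeConnection`, and `DisconnectedException` are hypothetical stand-ins (not the Trilead API), and key readiness is modeled as an attempt count rather than elapsed time. It contrasts retrying on one connection with opening a fresh connection for each attempt.

```java
class AuthRetrySketch {
    /** Thrown when the fake server drops a connection, like sshd's DISCONNECT message. */
    static class DisconnectedException extends RuntimeException {
        DisconnectedException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for sshd on an instance that is still booting. */
    static class FakeServer {
        final int maxAuthTries;          // failures allowed per connection (OpenSSH default: 6)
        final int keyReadyAfterAttempts; // attempts that fail before the key is "installed"
        int totalAttempts = 0;

        FakeServer(int maxAuthTries, int keyReadyAfterAttempts) {
            this.maxAuthTries = maxAuthTries;
            this.keyReadyAfterAttempts = keyReadyAfterAttempts;
        }

        FakeConnection connect() {
            return new FakeConnection(this);
        }
    }

    static class FakeConnection {
        private final FakeServer server;
        private int failures = 0; // counted per connection, as sshd does

        FakeConnection(FakeServer server) {
            this.server = server;
        }

        boolean authenticate() {
            server.totalAttempts++;
            if (server.totalAttempts > server.keyReadyAfterAttempts) {
                return true;
            }
            failures++;
            if (failures >= server.maxAuthTries) {
                throw new DisconnectedException("Too many authentication failures");
            }
            return false;
        }
    }

    /** Mirrors the quoted loop: every attempt reuses one connection. */
    public static boolean sameConnection(FakeServer server, int tries) {
        FakeConnection conn = server.connect();
        while (tries-- > 0) {
            if (conn.authenticate()) {
                return true;
            }
        }
        return false;
    }

    /** Alternative: a new connection per attempt, so the per-connection limit is never reached. */
    public static boolean freshConnectionPerAttempt(FakeServer server, int tries) {
        while (tries-- > 0) {
            if (server.connect().authenticate()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        try {
            sameConnection(new FakeServer(6, 8), 20);
        } catch (DisconnectedException e) {
            System.out.println("same connection: " + e.getMessage());
        }
        System.out.println("fresh connections: "
                + freshConnectionPerAttempt(new FakeServer(6, 8), 20));
    }
}
```

In this model the single-connection loop is disconnected on its sixth attempt even though 20 tries were budgeted, while the reconnecting loop succeeds once the key appears. A real launcher would still sleep between attempts, and whether reconnecting is the right fix for the plugin is not established by this thread.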
          


          E Camden Fisher added a comment - - edited

          I'm not totally sure, but it looks like I see OpenSSH start

          Jan  7 19:39:14 ip-10-1-1-219 systemd: Started OpenSSH server daemon.
          

          and then get restarted later by cloud-init when the authorized_keys file is put in place.

          Jan  7 19:39:31 ip-10-1-1-219 systemd: Started Initial cloud-init job (metadata service crawler).
          Jan  7 19:39:31 ip-10-1-1-219 systemd: Starting Cloud-config availability.
          Jan  7 19:39:31 ip-10-1-1-219 systemd: Reached target Cloud-config availability.
          Jan  7 19:39:31 ip-10-1-1-219 systemd: Starting Apply the settings specified in cloud-config...
          Jan  7 19:39:32 ip-10-1-1-219 systemd: Stopping OpenSSH server daemon...
          Jan  7 19:39:32 ip-10-1-1-219 systemd: Started OpenSSH Server Key Generation.
          Jan  7 19:39:32 ip-10-1-1-219 systemd: Starting OpenSSH server daemon...
          Jan  7 19:39:32 ip-10-1-1-219 systemd: Started OpenSSH server daemon.
          Jan  7 19:39:32 ip-10-1-1-219 systemd: Started Apply the settings specified in cloud-config.
          Jan  7 19:39:32 ip-10-1-1-219 systemd: Starting Execute cloud user/final scripts...
          Jan  7 19:39:32 ip-10-1-1-219 ec2:
          Jan  7 19:39:32 ip-10-1-1-219 ec2: #############################################################
          Jan  7 19:39:32 ip-10-1-1-219 ec2: -----BEGIN SSH HOST KEY FINGERPRINTS-----
          Jan  7 19:39:32 ip-10-1-1-219 ec2: xxxxxxxx   (ECDSA)
          Jan  7 19:39:32 ip-10-1-1-219 ec2: xxxxxxxx   (ED25519)
          Jan  7 19:39:32 ip-10-1-1-219 ec2: xxxxxxxx   (RSA)
          Jan  7 19:39:32 ip-10-1-1-219 ec2: -----END SSH HOST KEY FINGERPRINTS-----
          Jan  7 19:39:32 ip-10-1-1-219 ec2: #############################################################
          Jan  7 19:39:33 ip-10-1-1-219 systemd: Started Execute cloud user/final scripts.
          

          Maybe you are starting your authentication attempts during that first phase of SSH being up, before the key is in place, and they eventually time out or CentOS says 'go away'.

          EDIT: the above logs are from a slave that succeeded by the way, so those times could be much different in the failure case.
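
To put a number on the window described above, the gap between the first and last "Started OpenSSH server daemon" messages can be computed from the quoted log lines. This is a small sketch only: the `SshdRestartWindow` class and its methods are illustrative, and it assumes syslog-style lines that all fall on the same day.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.Arrays;
import java.util.List;

class SshdRestartWindow {
    static final String STARTED = "systemd: Started OpenSSH server daemon.";

    /** Extracts the HH:mm:ss field from a line such as "Jan  7 19:39:14 host systemd: ...". */
    public static LocalTime timeOf(String line) {
        String[] fields = line.trim().split("\\s+"); // month, day, time, host, ...
        return LocalTime.parse(fields[2]);
    }

    /** Returns the gap between the first and last sshd "Started" messages. */
    public static Duration windowBetweenStarts(List<String> lines) {
        LocalTime first = null;
        LocalTime last = null;
        for (String line : lines) {
            if (line.trim().endsWith(STARTED)) {
                LocalTime t = timeOf(line);
                if (first == null) {
                    first = t;
                }
                last = t;
            }
        }
        if (first == null) {
            throw new IllegalArgumentException("no sshd start message found");
        }
        return Duration.between(first, last);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Jan  7 19:39:14 ip-10-1-1-219 systemd: Started OpenSSH server daemon.",
                "Jan  7 19:39:32 ip-10-1-1-219 systemd: Stopping OpenSSH server daemon...",
                "Jan  7 19:39:32 ip-10-1-1-219 systemd: Started OpenSSH Server Key Generation.",
                "Jan  7 19:39:32 ip-10-1-1-219 systemd: Started OpenSSH server daemon.");
        System.out.println(windowBetweenStarts(lines).getSeconds() + " s"); // prints "18 s"
    }
}
```

For the successful launch quoted above, the gap is 18 seconds. As the EDIT notes, the failing launches may show a different gap.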


          Francis Upton added a comment -

          Thanks, that's helpful. Is there any way to get the logs from both sides for the same failure?

          E Camden Fisher added a comment -

          Sure, I'll collect logs on the next failure. Unfortunately, it looks like the slave.log files don't have timestamps.

            Assignee: Francis Upton (francisu)
            Reporter: Mike Kingsbury (mkingsbury)
            Votes: 2
            Watchers: 7