Jenkins / JENKINS-72351

Unresponsive Agent Nodes Marked Offline But Not Reaped or Replaced


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • ec2-plugin
    • None

      Periodically, one of our agent nodes becomes unresponsive, and the Jenkins controller marks it offline.  However, the controller neither reaps the offline node nor spins up a replacement, even though a new node is required to meet the configured minimum number of instances now that the node in question is offline.  So the controller recognizes that the node is dead when it marks it offline, but behaves as if the node were still alive when deciding whether to reap or replace it.
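
To make the expected behavior concrete, here is a minimal sketch of the accounting we would expect: offline agents should not count toward the configured minimum, and each offline agent should be a candidate for reaping. This is only an illustration of the expectation described above, not the ec2-plugin's actual logic; the `Agent` class, the function names, and the second instance ID are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Agent:
    """Hypothetical view of an agent as seen by the controller."""
    instance_id: str
    online: bool


def replacements_needed(agents: List[Agent], minimum_instances: int) -> int:
    """Count how many new agents are needed, counting only online agents."""
    online = sum(1 for agent in agents if agent.online)
    return max(0, minimum_instances - online)


def agents_to_reap(agents: List[Agent]) -> List[str]:
    """Return the instance IDs of agents that are offline."""
    return [agent.instance_id for agent in agents if not agent.online]


if __name__ == "__main__":
    agents = [
        Agent("i-0123456789abcdef0", online=True),   # hypothetical healthy agent
        Agent("i-098740cb3d378c751", online=False),  # the unresponsive agent from this report
    ]
    print("replacements needed:", replacements_needed(agents, minimum_instances=2))
    print("agents to reap:", agents_to_reap(agents))
```

With a minimum of two instances and one of two agents offline, we would expect one replacement to be requested and the offline agent to be reaped; what we observe is that neither happens.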

      Below are all of the Jenkins log entries for one node that became unresponsive and had to be removed manually via the console UI.  The middle block of log messages seems to appear only for nodes that the controller fails to reap.

      Nov  6 07:03:36 ip-10-10-6-184 jenkins: 2023-11-06 07:03:36.518+0000 [id=1265345]#011INFO#011hudson.plugins.ec2.SlaveTemplate#logProvisionInfo: SlaveTemplate...
      Nov  6 07:03:36 ip-10-10-6-184 jenkins: 2023-11-06 07:03:36.518+0000 [id=1265345]#011INFO#011h.p.ec2.EC2RetentionStrategy#start: Start requested for EC2 (ec2-Raken AWS Account) - Raken Amzlinux2  (i-098740cb3d378c751)
      Nov  6 07:03:36 ip-10-10-6-184 jenkins: 2023-11-06 07:03:36.518+0000 [id=1265176]#011INFO#011hudson.plugins.ec2.EC2Cloud#log: Launching instance: i-098740cb3d378c751
      Nov  6 07:04:32 ip-10-10-6-184 jenkins: 2023-11-06 07:04:32.266+0000 [id=1265365]#011INFO#011hudson.plugins.ec2.EC2Cloud#log: The SSH key ssh-ed25519 03:14:a3:a6:a4:26:f7:0c:5a:d8:68:ee:a9:91:0d:28 has been automatically trusted for connections to EC2 (ec2-Raken AWS Account) - Raken Amzlinux2  (i-098740cb3d378c751)
      Nov  6 08:52:37 ip-10-10-6-184 jenkins: Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to EC2 (ec2-Raken AWS Account) - Raken Amzlinux2  (i-098740cb3d378c751)
      
      Nov  6 09:23:48 ip-10-10-6-184 jenkins: 2023-11-06 09:23:48.389+0000 [id=1265376]#011INFO#011hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel EC2 (ec2-Raken AWS Account) - Raken Amzlinux2  (i-098740cb3d378c751).
      Nov  6 17:22:39 ip-10-10-6-184 jenkins: 2023-11-06 17:22:39.831+0000 [id=1277434]#011WARNING#011hudson.model.Slave#reportLauncherCreateError: Issue with creating launcher for agent EC2 (ec2-Raken AWS Account) - Raken Amzlinux2  (i-098740cb3d378c751). The agent has not been fully initialized yetProbably there is a race condition with Agent reconnection or disconnection, check other log entries
      Nov  6 17:22:39 ip-10-10-6-184 jenkins: 2023-11-06 17:22:39.832+0000 [id=1277434]#011WARNING#011hudson.model.Slave#reportLauncherCreateError: Issue with creating launcher for agent EC2 (ec2-Raken AWS Account) - Raken Amzlinux2  (i-098740cb3d378c751). The agent has not been fully initialized yetProbably there is a race condition with Agent reconnection or disconnection, check other log entries
      Nov  6 17:22:42 ip-10-10-6-184 jenkins: 2023-11-06 17:22:42.225+0000 [id=1277394]#011WARNING#011hudson.model.Slave#reportLauncherCreateError: Issue with creating launcher for agent EC2 (ec2-Raken AWS Account) - Raken Amzlinux2  (i-098740cb3d378c751). The agent has not been fully initialized yetProbably there is a race condition with Agent reconnection or disconnection, check other log entries
      Nov  6 17:22:42 ip-10-10-6-184 jenkins: 2023-11-06 17:22:42.226+0000 [id=1277394]#011WARNING#011hudson.model.Slave#reportLauncherCreateError: Issue with creating launcher for agent EC2 (ec2-Raken AWS Account) - Raken Amzlinux2  (i-098740cb3d378c751). The agent has not been fully initialized yetProbably there is a race condition with Agent reconnection or disconnection, check other log entries
      Nov  6 20:13:08 ip-10-10-6-184 jenkins: 2023-11-06 20:13:08.930+0000 [id=36]#011INFO#011hudson.plugins.ec2.SlaveTemplate#logProvisionInfo: SlaveTemplate{description='Raken Amzlinux2 ', labels='default worker'}. checkInstance: i-098740cb3d378c751.. false - found existing corresponding Jenkins agent: i-098740cb3d378c751
      Nov  6 23:58:46 ip-10-10-6-184 jenkins: 2023-11-06 23:58:46.240+0000 [id=1290378]#011INFO#011hudson.plugins.ec2.EC2Cloud#log: Launching instance: i-098740cb3d378c751
      
      Nov  7 00:17:01 ip-10-10-6-184 jenkins: 2023-11-07 00:17:01.064+0000 [id=41]#011INFO#011h.p.ec2.EC2RetentionStrategy#internalCheck: Idle timeout of EC2 (ec2-Raken AWS Account) - Raken Amzlinux2  (i-098740cb3d378c751) after 16 idle minutes, instance statusRUNNING
      Nov  7 00:17:01 ip-10-10-6-184 jenkins: 2023-11-07 00:17:01.064+0000 [id=41]#011INFO#011h.plugins.ec2.EC2AbstractSlave#idleTimeout: EC2 instance idle time expired: i-098740cb3d378c751
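
For anyone trying to find the same pattern in their own logs, here is a rough sketch of a script that groups syslog lines like the ones above by EC2 instance ID. It assumes the syslog layout shown in this report, where `#011` is syslog's octal escape for a tab character separating the timestamp, level, and message. The function names and regular expressions are my own and may need adjusting for other syslog configurations.

```python
import re
from collections import defaultdict

# Syslog prefix as it appears in this report: "Nov  6 09:23:48 <host> jenkins: <body>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) jenkins: (?P<body>.*)$"
)
# EC2 instance IDs: "i-" followed by 8 or 17 hex characters.
INSTANCE_RE = re.compile(r"\bi-(?:[0-9a-f]{17}|[0-9a-f]{8})\b")


def parse_line(line):
    """Parse one syslog line; return None if it does not match the layout."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # Undo syslog's escaping of tab characters.
    body = match.group("body").replace("#011", "\t")
    fields = body.split("\t", 2)
    if len(fields) == 3:
        _jenkins_prefix, level, message = fields
    else:
        # Continuation lines (e.g. "Also: ...") carry no level field.
        level, message = None, body
    return {
        "timestamp": match.group("ts"),
        "level": level,
        "message": message,
        "instances": sorted(set(INSTANCE_RE.findall(message))),
    }


def events_by_instance(lines):
    """Group parsed log records by every instance ID they mention."""
    grouped = defaultdict(list)
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        for instance_id in record["instances"]:
            grouped[instance_id].append(record)
    return dict(grouped)


if __name__ == "__main__":
    sample = [
        "Nov  6 09:23:48 ip-10-10-6-184 jenkins: 2023-11-06 09:23:48.389+0000 "
        "[id=1265376]#011INFO#011hudson.slaves.ChannelPinger$1#onDead: Ping failed. "
        "Terminating the channel EC2 (ec2-Raken AWS Account) - Raken Amzlinux2  "
        "(i-098740cb3d378c751).",
    ]
    for instance_id, records in events_by_instance(sample).items():
        for record in records:
            print(instance_id, record["timestamp"], record["level"], record["message"])
```

Filtering the grouped records for `reportLauncherCreateError` warnings that follow a `ChannelPinger` "Ping failed" entry should pick out nodes showing the middle block of messages above.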
      

      This has been happening since this Jenkins cluster was built in June and has continued across a couple of plugin version upgrades.

       

            thoulen FABRIZIO MANFREDI
            kevin_palmer Kevin