Jenkins / JENKINS-49816

Swarm node says it connected successfully, but master has placed it offline


    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Component/s: remoting
    • Labels: None
    • Environment: Jenkins ver. 2.89.4, Swarm 3.9

      We spin up thousands of nodes with Swarm per month.

      Every month we encounter a few cases where the Swarm agent reports that it connected successfully, but the Jenkins master does not show it as connected.

      The node has these logs (note that they lack the "INFO: Connected" line that normally appears):

      Swarm Logs

      INFO: Client.main invoked with: [-name eod-us-west-2_spot_m3.xlarge-i-03918a0ef1ef6d8be -description Created by Swarm. InstanceID=i-03918a0ef1ef6d8be AmiId=ami-a030b2d8 -executors 1 -fsroot /mnt/ope/ws -labels eod-us-west-2_spot_m3.xlarge -master https://jenkins.clearcare.it/ -mode normal -retry 30 -username sre@clearcareonline.com -password nJ0yuLYBcOJE -disableSslVerification]
      Feb 28, 2018 7:49:57 PM hudson.plugins.swarm.Client run
      INFO: Discovering Jenkins master
      SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
      SLF4J: Defaulting to no-operation (NOP) logger implementation
      SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
      Feb 28, 2018 7:50:14 PM hudson.plugins.swarm.Client run
      INFO: Attempting to connect to https://jenkins.clearcare.it/ ea7ab441-78d0-4548-a571-5feaae0be121 with ID fd8127ce
      Feb 28, 2018 7:50:14 PM hudson.plugins.swarm.SwarmClient getCsrfCrumb
      SEVERE: Could not obtain CSRF crumb. Response code: 404
      Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up slave: eod-us-west-2_spot_m3.xlarge-i-03918a0ef1ef6d8be-fd8127ce
      Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among https://jenkins.foo.it/
      Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Agent discovery successful
      Agent address: jenkins.foo.it
      Agent port: 30001
      Identity: c9:5a:43:aa:0e:bc:16:0a:c5:92:09:91:03:46:f7:ec
      Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to jenkins.foo.it:30001
      Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Trying protocol: JNLP4-connect
      Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Remote identity confirmed: c9:5a:43:aa:0e:bc:16:0a:c5:92:09:91:03:46:f7:ec

      In the master logs, I see this:
      WARNING: Making eod-us-west-2_spot_m3.xlarge-i-03918a0ef1ef6d8be-fd8127ce offline because it’s not responding

      Restarting the Java process fixes it, but I would rather not do this manually.
      The Swarm jar appears to get stuck right after logging "Remote identity confirmed".
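The symptom described above (a "Remote identity confirmed" line with no "INFO: Connected" line after it) could in principle be detected automatically instead of by hand. The following is only a minimal sketch of that check, not part of Swarm or remoting: the function name `agent_looks_stuck`, the two-minute grace period, and the assumption that the log uses the default English java.util.logging timestamp format shown in the logs above are all illustrative choices.

```python
import re
from datetime import datetime, timedelta

# Matches the timestamp at the start of a java.util.logging header line, e.g.
# "Feb 28, 2018 7:50:15 PM hudson.remoting.jnlp.Main$CuiListener status"
_TS = re.compile(r"^(\w{3} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M)\b")


def agent_looks_stuck(log_lines, now, grace=timedelta(minutes=2)):
    """Return True if the log shows 'Remote identity confirmed' with no
    later 'INFO: Connected' line and more than `grace` has elapsed.

    `grace` is a hypothetical threshold; tune it to how long a healthy
    agent normally takes to finish connecting.
    """
    last_ts = None       # timestamp of the most recent header line
    confirmed_at = None  # when identity was confirmed without a later connect
    for raw in log_lines:
        line = raw.strip()
        match = _TS.match(line)
        if match:
            last_ts = datetime.strptime(match.group(1), "%b %d, %Y %I:%M:%S %p")
        elif line.startswith("INFO: Remote identity confirmed"):
            confirmed_at = last_ts
        elif line.startswith("INFO: Connected"):
            confirmed_at = None  # the connection completed normally
    # If no timestamp preceded the confirmation, we cannot judge; report False.
    return confirmed_at is not None and now - confirmed_at > grace
```

A supervisor (for example a cron job or a wrapper script around the Swarm client) could call this on the agent log with the current time and restart the Java process when it returns True. This only automates the manual workaround; it does not address whatever causes the hang.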

      Again, this is rare: out of about 1000 launches a month, it occurs maybe 2-4 times.

            Assignee: Jeff Thompson (jthompson)
            Reporter: Alex Gray (grayaii)
            Votes: 0
            Watchers: 4
