  2. JENKINS-51252

redundant implied label causes slave reconnect loop

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • implied-labels-plugin
    • None
    • Jenkins 2.121
      Swarm plugin 3.12
      Client: tested with swarm plugin 3.7 and 3.12

      (See this comment for latest summary of this issue)

      Specifying the label "windows" causes the client to go into a reconnect loop, as follows:

      INFO: Attempting to connect to ....
      May 10, 2018 1:17:30 PM hudson.plugins.swarm.SwarmClient getCsrfCrumb
      SEVERE: Could not obtain CSRF crumb. Response code: 404
      May 10, 2018 1:17:31 PM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up slave: slave-010-9533db6e
      May 10, 2018 1:17:31 PM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      May 10, 2018 1:17:31 PM hudson.remoting.Engine startEngine
      WARNING: No Working Directory. Using the legacy JAR Cache location:...
      May 10, 2018 1:17:31 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among [....]
      May 10, 2018 1:17:31 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
      INFO: Remoting server accepts the following protocols: [JNLP4-connect, CLI2-connect, JNLP-connect, Ping, CLI-connect, JNLP2-connect]
      May 10, 2018 1:17:31 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Agent discovery successful
        Agent address: .....
        Agent port: 42442
        Identity: 2b:48:59:ed:b6:88:6b:da:b5:7c:ef:8e:f4:70:a6:41
      May 10, 2018 1:17:31 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      May 10, 2018 1:17:31 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to tc-jenkins-master-001.eur.ad.sag:42442
      May 10, 2018 1:17:31 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Trying protocol: JNLP4-connect
      May 10, 2018 1:17:31 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Remote identity confirmed: 2b:48:59:ed:b6:88:6b:da:b5:7c:ef:8e:f4:70:a6:41
      May 10, 2018 1:17:31 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connected
      May 10, 2018 1:17:32 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Terminated
      May 10, 2018 1:17:32 PM hudson.plugins.swarm.Client run
      INFO: Retrying in 10 seconds
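The telltale pattern in the log above is "Connected" followed by "Terminated" about one second later. A minimal sketch of spotting that pattern in a saved client log, assuming the two-line timestamp/message layout shown above and an English locale for month names; the function name `find_quick_drops` and the 5-second threshold are illustrative choices, not part of any Jenkins tooling:

```python
import re
from datetime import datetime

# Matches the java.util.logging header line, e.g.
# "May 10, 2018 1:17:31 PM hudson.remoting.jnlp.Main$CuiListener status"
TIMESTAMP = re.compile(r"^(\w{3} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M) ")


def find_quick_drops(log_text, max_seconds=5):
    """Return (connected_at, terminated_at) pairs where the agent was
    terminated within max_seconds of connecting."""
    drops = []
    current_time = None   # timestamp of the most recent header line
    connected_at = None   # timestamp of the last "Connected" message
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        match = TIMESTAMP.match(line)
        if match:
            current_time = datetime.strptime(
                match.group(1), "%b %d, %Y %I:%M:%S %p"
            )
            continue
        if line == "INFO: Connected":
            connected_at = current_time
        elif line == "INFO: Terminated" and connected_at is not None:
            if (current_time - connected_at).total_seconds() <= max_seconds:
                drops.append((connected_at, current_time))
            connected_at = None
    return drops


sample = """\
May 10, 2018 1:17:31 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connected
May 10, 2018 1:17:32 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Terminated
"""
drops = find_quick_drops(sample)
print(len(drops))
```

A non-empty result only shows that the connection was dropped right after the handshake; it says nothing about why.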

      This occurs whether using auto-discovery or specifying -master.  It behaves the same way with client 3.7 or 3.12.  It doesn't matter whether the label is specified on the command line or in a labelsFile.  There are no errors in the server log file.

      The following command works:

      java -Xrs -jar swarm-client-3.7.jar -mode exclusive -name slave-010 -executors 1 -labels "test test2 windows64"

      The following command fails:

      java -Xrs -jar swarm-client-3.7.jar -mode exclusive -name slave-010 -executors 1 -labels "test test2 windows"
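The two invocations differ only in the last label. A minimal sketch, assuming Python 3.8+ on the agent host, that builds both argument lists so the difference is explicit; `swarm_command` is a hypothetical helper, and it uses only the flags that appear in the commands above:

```python
import shlex


def swarm_command(labels, jar="swarm-client-3.7.jar", name="slave-010"):
    """Build the swarm client argument list from this report for the
    given labels (hypothetical helper, not part of the Swarm plugin)."""
    return [
        "java", "-Xrs", "-jar", jar,
        "-mode", "exclusive",
        "-name", name,
        "-executors", "1",
        "-labels", " ".join(labels),
    ]


working = swarm_command(["test", "test2", "windows64"])
failing = swarm_command(["test", "test2", "windows"])

# Print shell-quoted forms for comparison; nothing is executed here.
print(shlex.join(working))
print(shlex.join(failing))
```

Passing such a list to `subprocess.run` would avoid shell quoting issues, but this sketch deliberately stops at printing the commands.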

       

      This problem did not occur with Jenkins 2.98 and swarm plugin 3.7.


          Oleg Nenashev added a comment -

          There is no Windows-specific code in the plugin.
          Any chance you have master logs? Maybe there is a job running on them and causing failures.


          Alexander Komarov added a comment -

          I don't at the moment.  I wound up removing the label "windows" from all slaves and adding it through the Label Implications plugin instead.  I can try to reproduce it later this week.

          Alexander Komarov added a comment - - edited

          I'm attaching the master and slave logs during an attempt to connect with "windows" in the labels list.  If I remove "windows" (but keep all others, including "windows64"), the connection succeeds. There is no other activity in the logs at this time.

          Also, my workaround of using the labels-implication plugin to assign the "windows" label to everything with the "windows64" label isn't always working: as long as that implication is configured, I get the same connect issue even for slaves with only "windows64".  So, there is something magical about "windows".

          log-slave.txt

          log-master.txt


          Alexander Komarov added a comment -

          This is probably not a problem with jenkins-swarm.  I just had the same issue with Windows Docker containers using the docker plugin: until I removed "windows" from the labels list, they could not connect (and these don't use swarm at all).

          Any advice on what component to reassign to?

          Oleg Nenashev added a comment -

          Docker Plugin, likely


          pjdarton added a comment -

          The docker-plugin code doesn't do anything special for different OSs.  It has no more special knowledge of the label "windows" than it has of the label "quirkafleeg".

          I wonder if something in the master's configuration gives the label "windows" a special meaning, triggering some functionality that causes the connection loss?


          Alexander Komarov added a comment - - edited

          pjdarton, thanks for the clarification.  I believe I have narrowed this down to the labels-implication plugin.

          My setup is:

          Implied Labels Plugin

          • Labels Implications rule: Inferred Label "windows" => Expression "windows64" 

          Docker Plugin

          • Docker template defines labels: "... windows windows64 ..." 

          This appears to conflict with the label implication, and causes the Docker plugin to go into an allocation loop.

          Jenkins-Swarm plugin

          •  -labels "... windows windows64 ..."

          This prevents the slave from connecting as long as the above label implication is in place.  

           


          I have since eliminated the duplication from my label setup and everything works.  However, I am leaving this report open because the effects of this (arguably) misconfiguration are cryptic and unexpected.
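The duplication described in this comment (a label listed explicitly that an implication rule would also add) can be checked mechanically. A minimal sketch under a deliberately simplified model: each rule maps one trigger label to the labels it implies, whereas the real Implied Labels plugin evaluates full label expressions. The names `redundant_labels` and `deduplicate` are hypothetical, and the sketch mirrors only the de-duplication step, not the plugin's behavior:

```python
def redundant_labels(explicit_labels, implications):
    """Return the explicit labels that an implication rule would add anyway.

    implications: dict mapping a trigger label to the set of labels it
    implies (simplified stand-in for the plugin's expression-based rules).
    """
    explicit = set(explicit_labels)
    implied = set()
    for trigger, inferred in implications.items():
        if trigger in explicit:
            implied |= set(inferred)
    return sorted(explicit & implied)


def deduplicate(explicit_labels, implications):
    """Drop the redundant labels while keeping the original order."""
    redundant = set(redundant_labels(explicit_labels, implications))
    return [label for label in explicit_labels if label not in redundant]


# The rule from this report: anything labelled "windows64" also gets "windows".
rules = {"windows64": {"windows"}}
labels = ["test", "test2", "windows", "windows64"]

print(redundant_labels(labels, rules))
print(deduplicate(labels, rules))
```

Running a check like this against agent and template label lists before deployment would flag the configuration that triggered the loop here, though an earlier comment notes the failure was also seen with only "windows64" set, so a clean result is not a guarantee.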


          pjdarton added a comment -

          Nice bit of detective work there.

          My guess is that "causes Docker plugin to go into allocate loop" is actually the same symptom as you see elsewhere: if the docker container that the docker-plugin creates is unable to connect to Jenkins, then the docker-plugin will end up trying to create a new one to replace it.
          I.e., this is more of a "with implied labels set this way, slave nodes will fail to connect" issue.


          Alexander Komarov added a comment - - edited

          One more note: recovery is painful and further confuses the investigation:

          1. I remove "windows" from the docker template labels in the Jenkins main config, leaving the implication configuration in place.
          2. I even stop and restart the job that is trying to allocate the container.
          3. The reconnect loop continues indefinitely anyway (filling up the cloud with orphaned containers, by the way).
          4. I remove the label implication.
          5. The container is allocated successfully.
          6. I re-add the label implication.
          7. Container allocation still works.  (???)


            Assignee: Unassigned
            Reporter: Alexander Komarov (akom)