
Jenkins starts wrong slave for job restricted to specific one

      I'm using the following setup:

      • WinXP slaves A,B,C
      • jobs jA, jB, jC, tied to slaves A,B,C respectively using "Restrict where this job can run"
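
      For reference, a minimal sketch (assumed names, not part of the original report) of the same restriction applied programmatically; the "Restrict where this job can run" option corresponds to the project's assigned label:

          // Hypothetical sketch: tie job jC to slave C by node name/label.
          // Assumes a running Jenkins instance and an existing FreeStyleProject.
          import hudson.model.FreeStyleProject;
          import jenkins.model.Jenkins;

          class RestrictJobSketch {
              static void tieToSlaveC(FreeStyleProject jC) throws java.io.IOException {
                  // Same effect as ticking "Restrict where this job can run" and entering "C"
                  jC.setAssignedLabel(Jenkins.getInstance().getLabel("C"));
              }
          }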

      Assume all slaves are disconnected and powered off, no builds are queued.
      When starting a build manually, say jC, the following will happen:

      • job jC will be scheduled and also displayed accordingly in the build queue
      • tooltip will say it's waiting because slave C is offline
      • next, slave A is powered on by Jenkins and connection is established
      • jC will not be started, Jenkins seems to honor the restriction correctly
      • after some idle time, Jenkins notices the slave is idle and shuts it down
      • then, same procedure happens with slave B
      • on occasion, the next one to be started is slave A again
      • finally (with some luck?) slave C happens to be started
      • jC is executed

      It is possible that jC is waiting for hours (indefinitely?), because the required
      slave is not powered on. I also observed this behaviour using a time-trigger
      instead of manual trigger, so I assume it is independent of the type of trigger.
      Occasionally it also happens that the correct slave is powered up right away,
      but that seems to happen by chance. The concrete pattern is not obvious to me.
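
      For context, this power-on-when-demanded / power-off-when-idle cycle matches the on-demand retention strategy in Jenkins core (hudson.slaves.RetentionStrategy.Demand). A minimal sketch, assuming the usual Slave setter and delays in minutes:

          // Sketch only (not from the report): make a slave start on demand and
          // shut down again when idle.
          import hudson.model.Slave;
          import hudson.slaves.RetentionStrategy;

          class DemandStrategySketch {
              static void makeOnDemand(Slave slave) {
                  // Power on once a matching build has waited >= 1 minute in the queue;
                  // power off again after 5 idle minutes.
                  slave.setRetentionStrategy(new RetentionStrategy.Demand(1, 5));
              }
          }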

      Note that the component selection above is just my best guess.

      Cheers, Marco

          [JENKINS-13735] Jenkins starts wrong slave for job restricted to specific one

          Jason Swager added a comment -

          I've been encountering the same problem. I thought it was in the code of the vSphere Plugin, but it turns out it's not. Jenkins is issuing connect() calls on slaves that, judging by the queued jobs I can see, have no reason to be starting up.

          Part of the problem IS the vSphere Plugin itself. Originally, when a job was fired up, any slave that was down and could run the job would be started by the vSphere Plugin, because the connect() method would get called on all those slaves; this resulted in a large number of VMs being powered on for a single job. I added code to the plugin to throttle that behavior. Unfortunately, the throttling is making this problem worse. Whereas originally jA, jB, and jC might have been started up, jC now only MIGHT get started up due to the vSphere plugin throttling the VM startups.

          Initial investigation seems to indicate that the Slave.canTake() method might not be behaving as expected. If I find anything further during my investigation, I'll post here.


          jsiirola added a comment -

          I am also seeing this problem after upgrading from 1.459 -> 1.464 (running Winstone under Linux). I do not have the vSphere plugin installed. In my case, the problem is being exacerbated by one of the build slaves being down for maintenance. This has led to jobs stacking up in the queue, which in turn has led to Jenkins starting every slave in the farm.


          Jason Swager added a comment - edited

          I believe I have a fix for this, but being new to git and even newer to Jenkins core programming, I'll just submit the patch (hopefully I did that right) as part of this comment. The patch addresses a flaw in the code logic where a slave that cannot handle a build request is started anyway. The very minor change is to add one additional check to make sure that the slave CAN handle the request before flagging it as startable.

           core/src/main/java/hudson/slaves/RetentionStrategy.java |    2 +-
           1 file changed, 1 insertion(+), 1 deletion(-)
          
          diff --git a/core/src/main/java/hudson/slaves/RetentionStrategy.java b/core/src/main/java/hudson/slaves/RetentionStrategy.java
          index 02611e5..f007ac6 100644
          --- a/core/src/main/java/hudson/slaves/RetentionStrategy.java
          +++ b/core/src/main/java/hudson/slaves/RetentionStrategy.java
          @@ -218,7 +218,7 @@ public abstract class RetentionStrategy<T extends Computer> extends AbstractDesc
                                   }
                               }
           
          -                    if (needExecutor) {
          +                    if (needExecutor && (c.getNode().canTake(item) == null)) {
                                   demandMilliseconds = System.currentTimeMillis() - item.buildableStartMilliseconds;
                                   needComputer = demandMilliseconds > inDemandDelay * 1000 * 60 /*MINS->MILLIS*/;
                                   break;
          

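          For reference, a paraphrased sketch (not the verbatim core code) of what the patched condition means: Node.canTake(item) returns a CauseOfBlockage, and null means the node can build the item, so a slave is only counted as "needed" when it could actually take the queued job.

              // Paraphrase of the patched demand check; names follow the diff above,
              // but this is a sketch, not the actual RetentionStrategy.Demand code.
              import hudson.model.Node;
              import hudson.model.Queue;
              import hudson.model.queue.CauseOfBlockage;

              class DemandCheckSketch {
                  /** True only when the slave needs an executor AND can take the item. */
                  static boolean shouldPowerOn(Node node, Queue.BuildableItem item, boolean needExecutor) {
                      CauseOfBlockage blockage = node.canTake(item); // null == "can take"
                      return needExecutor && blockage == null;
                  }
              }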

          Marco Lehnort added a comment -

          I forked, applied the change and added a pull request:
          https://github.com/jenkinsci/jenkins/pull/481.


          Jason Swager added a comment -

          Thank you! I've really got to learn how to do this myself...


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: fma1977
          Path:
          changelog.html
          core/src/main/java/hudson/slaves/RetentionStrategy.java
          http://jenkins-ci.org/commit/jenkins/71ad43a141ddeb62d6df4a13b8513c42d73c0b82
          Log:
          [FIXED JENKINS-13735]

          Added test whether the currently checked slave computer actually can take the buildable item before flagging it as needed (avoids powering up and connecting to slaves for jobs they can't build).


          dogfood added a comment -

          Integrated in jenkins_ui-changes_branch #30
          [FIXED JENKINS-13735] (Revision 71ad43a141ddeb62d6df4a13b8513c42d73c0b82)

          Result = SUCCESS
          Kohsuke Kawaguchi : 71ad43a141ddeb62d6df4a13b8513c42d73c0b82
          Files :

          • core/src/main/java/hudson/slaves/RetentionStrategy.java
          • changelog.html


          Marco Lehnort added a comment -

          I deployed Jenkins v1.470 today and tested the fix. Works like a charm!!!
          No irrelevant slaves are powered up; only the correct slave required to execute the job is started.

          Thanks to everyone for the fast responses to my problem!
          @Jason: thanks for your analysis and fix!

          Cheers, Marco.


          Marco Lehnort added a comment -

          Closing as everything seems to work as expected.


            Assignee: kohsuke (Kohsuke Kawaguchi)
            Reporter: fma1977 (Marco Lehnort)
            Votes: 2
            Watchers: 3
