
Large number of jobs triggered on Hudson restart

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical

      Hi Tom

      I haven't heard of any similar issue, so it would be great if you could
      add it to the JIRA so it can be tracked.

      Are all builds triggered by an SCM change? You can tell by looking at
      one of the build pages and checking whether it says "Started by an SCM
      change". Are the jobs built on the same machine?

      The build is started either because the workspace is invalid (or not
      present on the machine) or because there has been a commit during a
      polling period. Newer Hudson versions store the SCM polling log
      together with the build (if it was started by an SCM change), so
      hopefully we can get some info from that. To see the polling log for a
      certain build, go to the build page, click on the "Started by an SCM
      change" link, and you should see the full log (similar to
      http://ramfelt.se/job/Mockito/528/pollingLog/?)

      If you don't have a link for the SCM change, then you will have to
      manually watch the SCM polling log just after you reboot your server
      to see why the plugin triggers a new build.

      Regards
      //Erik
      =========================================================================================
      On Tue, Jan 4, 2011 at 13:57, Tom wrote:
      > Good morning,
      >
      > We have a large Hudson master/slave farm using both the TFS and Base
      > ClearCase SCM plugins. We've been trying to figure out why each time we restart
      > Hudson (usually to install a plugin), Hudson triggers a lot of builds. Not
      > all, but a large number. This morning I restarted, and looking at the
      > list, realized all of the builds it's triggering on restart are TFS
      > based. Some blow away the workspace, some do not.
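
      To make Erik's explanation concrete, here is a minimal, self-contained
      Java sketch of the trigger decision he describes; the names are
      illustrative stand-ins, not Hudson's actual API:

          import java.nio.file.Files;
          import java.nio.file.Path;

          class PollSketch {
              enum Decision { NO_CHANGES, CHANGES_FOUND, BUILD_NOW }

              // A build is triggered either because the workspace is missing or
              // invalid (polling cannot compare revisions without one), or
              // because a commit landed during the polling period.
              static Decision poll(Path workspace, boolean commitSinceLastPoll) {
                  if (workspace == null || !Files.isDirectory(workspace)) {
                      return Decision.BUILD_NOW; // build just to create a workspace
                  }
                  return commitSinceLastPoll ? Decision.CHANGES_FOUND
                                             : Decision.NO_CHANGES;
              }
          }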

          [JENKINS-8408] Large number of jobs triggered on Hudson restart

          redsolo added a comment -

          Maybe JENKINS-1348 is connected to this issue. As we don't see any output from the TFS command line, we can assume that it isn't the TFS command-line tool triggering the change.

          Are you building the jobs on different slaves? Are they available when the server is restarted?


          tdiz added a comment -

          Maybe, that's the one I kept coming up with in my searches on this problem. Not sure how to know for sure.

          Here are the first two lines of the console log:

          07:40:02 Started by an SCM change
          07:40:02 Building remotely on xxx-xxx-3

          So yes, it's building on different slaves (using labels). The slave machines and the Hudson slave processes are up when we're restarting Hudson on the master. Wondering if there's some timing issue, either:

          1. Restarting the master: while it's still establishing that the slaves are up and running, it kicks off jobs. Or

          2. Something happening as a result of the ~60 TFS jobs all attempting to fire up tf.exe to look for changes at the same time (either a delay on our server, or a delay on the TFS server).

          Those are just guesses though.


          redsolo added a comment -

          Hudson should try to build the job on the latest node that was used for building, but if it can't find that node (or can't use any other), it will need to create a new workspace to be able to determine if there is any change. I am not sure whether the polling has to run on the last node or not.

          Is Hudson building the jobs on different slaves? I.e., job A was last built on node X; after the restart, will job A be built on node X or will it use node Y?

          I know there have been some changes in the SCM API; I will look into them and see if they apply to this kind of issue.

          What Hudson version are you using?


          tdiz added a comment -

          After a restart, is it sticking to the last slave? Good question. I'll have to set up a test and check for that.

          Maybe I need to spend more time with core Hudson logging to look at what actually happens on a master restart, assuming there's a way to turn on more verbose logging.

          We're on 1.389 with the 1.11 TFS plugin.

          I don't think this is anything new, we've been seeing it since we got slaves hooked up about 6 months ago. Wasn't a big problem when we had 10 build jobs set up in the system, but we're over 100 already and rapidly growing.


          redsolo added a comment -

          Are you still seeing this problem?


          tdiz added a comment -

          Yes, but we haven't updated in a while; we're still on Hudson 1.377. We got around this by slowing polling down a bit, to 2 minutes instead of 1 (see the example schedule below). Our theory (completely unverified) is that upon restart, something hits the polling interval before everything has finished starting. But that's just a guess.

          We're in the process of picking and starting to test a Jenkins build. After that's running, I'll let you know if I still see this. We're up to 800 jobs now, by the way.
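
          For reference, that workaround is just a change to the job's "Poll
          SCM" cron schedule; assuming schedules like these (the H/2 spread
          syntax only exists in newer Jenkins versions):

              # before: poll every minute
              * * * * *

              # after: poll every two minutes
              */2 * * * *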


          Ben Dean added a comment -

          I added the mercurial and pollscm components because I believe this isn't a TFS issue. We don't use TFS at all, and we see hundreds of jobs queued when we restart Jenkins. We use Mercurial for our SCM. Here's the polling log for one build after a Jenkins restart:

          Started on Sep 6, 2013 9:46:12 AM
          Workspace is offline.
          Scheduling a new build to get a workspace. (nonexisting_workspace)
          Done. Took 69 ms
          Changes found
          

          However, there weren't actually any changes. It would be worth noting that we have a pool of about 45 build slaves, and none of our jobs are configured to build on the master. I'm sure that affects SCM polling somewhat since it has to talk to build slaves to figure out what revision the SCM is using there.

          I edited the Environment field as well to reflect our setup. I also changed the priority to Critical because when we restart Jenkins we have a frenzy to remove builds from the queue. Yes, that can be made easier with a bit of Groovy (see the sketch below), but we don't always think of that.
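
          A sketch of that bit of Groovy, for the script console (Manage
          Jenkins -> Script Console). The console runs Groovy, which also
          accepts this plain Java syntax; Queue.clear() is real core API:

              import hudson.model.Hudson;

              // Cancel every pending item in the build queue
              // (builds that are already running are not affected).
              Hudson.getInstance().getQueue().clear();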


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          changelog.html
          core/src/main/java/hudson/model/AbstractProject.java
          core/src/main/resources/hudson/model/Messages.properties
          http://jenkins-ci.org/commit/jenkins/28737ee9d1ae4ab02d650a284ec52e98e50d9f63
          Log:
          [FIXED JENKINS-8408]

          If slaves are late to come online after a Jenkins startup, we will see a huge spike of builds as Jenkins attempts to get a workspace for polling.

          Compare: https://github.com/jenkinsci/jenkins/compare/e68ec055fda2...28737ee9d1ae
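
          A hedged sketch of the idea behind the fix, using illustrative
          stand-in types rather than the actual AbstractProject code: when the
          workspace is missing, only force a build if the node it lived on is
          really gone, not merely offline while slaves reconnect:

              class RestartFixSketch {
                  enum NodeState { ONLINE, OFFLINE_BUT_CONFIGURED, REMOVED }

                  static boolean shouldTriggerForMissingWorkspace(NodeState lastBuiltOn) {
                      // A configured-but-offline node may just be slow to
                      // reconnect after the restart; postpone polling instead
                      // of scheduling a build to recreate the workspace.
                      return lastBuiltOn != NodeState.OFFLINE_BUT_CONFIGURED;
                  }
              }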


          dogfood added a comment -

          Integrated in jenkins_main_trunk #2987

          Result = SUCCESS


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Jesse Glick
          Path:
          core/src/main/java/hudson/model/AbstractProject.java
          http://jenkins-ci.org/commit/jenkins/3ddef512b21b336e2911598ad3f62def62cb0e18
          Log:
          Fix of JENKINS-8408 broke some tests of workspace-based polling; disable the fix when inside a test, for better predictability.
          (Ideally Jenkins would actually detect whether there was a plan to connect a slave of a given name.)
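
          The guard can be as simple as a flag checked before applying the new
          behavior. hudson.Main.isUnitTest is real core API; whether this
          commit uses it is an assumption, and the wrapper below is only a
          sketch:

              import hudson.Main;

              class TestGuardSketch {
                  // Tests depend on deterministic workspace-based polling, so
                  // bypass the startup fix inside the Jenkins test harness.
                  static boolean applyStartupFix() {
                      return !Main.isUnitTest;
                  }
              }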


            Assignee: redsolo
            Reporter: tdiz
            Votes: 2
            Watchers: 5
