JENKINS-8408

Large number of jobs triggered on Hudson restart

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Labels:
      None
    • Environment:
    • Similar Issues:

      Description

      Hi Tom

      I haven't heard of any similar issue, so it would be great if you could
      add it to JIRA so it can be tracked.

      Are all builds triggered by an SCM change? You can tell by looking at
      one of the build pages and checking whether it says "triggered by SCM".
      Are the jobs built on the same machine?

      A build is started either because the workspace is invalid (or not
      present on the machine) or because there has been a commit during the
      polling period. Newer Hudson versions store the SCM polling log
      together with the build (if it was started by an SCM change), so
      hopefully we can get some info from that. To see the polling log for a
      certain build, go to the build page and click on the "Started by an
      SCM change" link; you should see the full log (similar to
      http://ramfelt.se/job/Mockito/528/pollingLog/?)

      If you don't have a link for the SCM change, then you will have to
      manually watch the SCM polling log just after you reboot your server
      to see why the plugin triggers a new build.

      Regards
      //Erik
      =========================================================================================
      On Tue, Jan 4, 2011 at 13:57, Tom wrote:
      > Good morning,
      >
      > We have a large Hudson master/slave farm using both the TFS and base
      > ClearCase SCM plugins. We've been trying to figure out why each time we restart
      > Hudson (usually to install a plugin), Hudson triggers a lot of builds. Not
      > all, but a large number. This morning I restarted, and looking at the
      > list, realized all of the builds it's triggering on restart are TFS
      > based. Some blow away the workspace, some do not.

            Activity

            tdiz tdiz created issue -
            tdiz tdiz added a comment -

            Yes all of these builds show that they were started by an SCM change, and right above that, "no changes."

            Looking through the links on the SCM change text though, I see this:

            Polling Log
            View as plain text

            This page captures the polling log that triggered this build.

            Started on Jan 4, 2011 6:59:44 AM
            Workspace is offline.
            Scheduling a new build to get a workspace.
            Done. Took 15 ms
            Changes found

            What does "Workspace is offline" mean?

            Thanks again!

            redsolo redsolo added a comment -

            Maybe JENKINS-1348 is connected to this issue. As we don't see any output from the TFS command line, we can assume that it isn't the TFS command-line tool triggering the change.

            Are you building the jobs on different slaves? Are they available when the server is restarted?

            tdiz tdiz added a comment -

            Maybe, that's the one I kept coming up with in my searches on this problem. Not sure how to know for sure.

            Here are the first two lines of the console log:

            07:40:02 Started by an SCM change
            07:40:02 Building remotely on xxx-xxx-3

            So yes it's building on different slaves (using labels). The slave machines and Hudson slave process are up when we're restarting Hudson on the master. Wondering if there's some timing issue either:

            1. Restarting the master, and while it's still establishing that the slaves are up and running, it kicks off jobs. Or

            2. Something happening as a result of the ~60 TFS jobs all attempting to fire up tf.exe to look for changes at the same time (either delay on the server, or delay on the TFS server).

            Those are just guesses though.

            redsolo redsolo added a comment -

            Hudson should try to build the job on the node that was last used for building, but if it can't find that node or use any other, it will need to create a new workspace to be able to determine whether there is any change. I am not sure how the polling works, or whether it has to run on the last node.

            Is Hudson building the jobs on different slaves? I.e., job A was last built on node X; after the restart, will job A be built on node X or will it use node Y?

            I know there have been some changes in the SCM API; I will look into them and see if they apply to this kind of issue.

            What Hudson version are you using?
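
The fallback described in this comment (poll against the last workspace if it can be reached, otherwise schedule a build just to obtain one) can be sketched roughly as follows. This is a simplified model, not the Hudson source: the class `PollingDecision`, its `Outcome` values, and the three boolean inputs are hypothetical names chosen for illustration.

```java
/** Simplified model of workspace-based SCM polling; not the actual Hudson code. */
final class PollingDecision {

    /** Hypothetical outcomes of a single polling run. */
    enum Outcome { NO_CHANGES, BUILD_FOR_CHANGES, BUILD_TO_GET_WORKSPACE }

    /**
     * @param lastBuiltNodeOnline whether the node holding the last workspace is reachable
     * @param workspaceExists     whether the workspace is still present on that node
     * @param scmHasNewCommits    result of comparing the workspace against the SCM server
     */
    static Outcome decide(boolean lastBuiltNodeOnline,
                          boolean workspaceExists,
                          boolean scmHasNewCommits) {
        if (!lastBuiltNodeOnline || !workspaceExists) {
            // Nothing to compare against, so a build is scheduled only to create a workspace.
            return Outcome.BUILD_TO_GET_WORKSPACE;
        }
        return scmHasNewCommits ? Outcome.BUILD_FOR_CHANGES : Outcome.NO_CHANGES;
    }

    public static void main(String[] args) {
        // Node not yet reconnected after a restart: a build is scheduled despite no commits.
        System.out.println(decide(false, true, false));
        // Node online and workspace intact: polling reports no changes.
        System.out.println(decide(true, true, false));
    }
}
```

Under this model, every job whose last node is still offline at polling time gets a build, which would match the "Workspace is offline" entries in the polling log quoted earlier.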

            tdiz tdiz added a comment -

            After restart is it sticking to the last slave - good question. I'll have to set up a test and check for that.

            Maybe I need to spend more time w/ core Hudson logging to look at what actually happens on a master restart, assuming there's a way to turn on more verbose logging.

            We're on 1.389 with the 1.11 TFS plugin.

            I don't think this is anything new, we've been seeing it since we got slaves hooked up about 6 months ago. Wasn't a big problem when we had 10 build jobs set up in the system, but we're over 100 already and rapidly growing.

            redsolo redsolo added a comment -

            Are you still seeing this problem?

            tdiz tdiz added a comment -

            Yes, but we haven't updated in a while, we're still on Hudson 1.377. We got around this by slowing polling down a bit, to 2 minutes instead of 1. Our theory (completely unverified) is that upon restart, something is getting to the polling interval before everything has finished starting. But that's just a guess.

            We're in the process of picking and starting to test a Jenkins build. After that's running, I'll let you know if I still see this. We're up to 800 jobs now, by the way.

            b_dean Ben Dean added a comment -

            I added mercurial and pollscm components because I believe this isn't a TFS issue. We don't use TFS at all and we see hundreds of jobs queued when we restart Jenkins. We use Mercurial for our SCM. Here's the polling log for one build after a Jenkins restart:

            Started on Sep 6, 2013 9:46:12 AM
            Workspace is offline.
            Scheduling a new build to get a workspace. (nonexisting_workspace)
            Done. Took 69 ms
            Changes found
            

            However, there weren't actually any changes. It would be worth noting that we have a pool of about 45 build slaves, and none of our jobs are configured to build on the master. I'm sure that affects SCM polling somewhat since it has to talk to build slaves to figure out what revision the SCM is using there.

            I edited the environment as well to reflect our different environment. I also changed the priority to critical because when we restart Jenkins we have a frenzy to remove builds from the queue. Yes that can be made easier with a bit of Groovy, but we don't always think of that.
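
The polling logs quoted in this issue share a recognizable pattern: "Workspace is offline." followed by "Scheduling a new build to get a workspace". A small helper along these lines could separate such builds from ones caused by real commits. It is only a sketch; the class and method names are hypothetical, and the matched strings are taken from the two logs quoted in this thread, so other versions may word them differently.

```java
import java.util.Arrays;
import java.util.List;

/** Classifies a saved polling log; a sketch based on the log lines quoted in this issue. */
final class PollingLogClassifier {

    /** Returns true if the log shows a build scheduled only to obtain a workspace. */
    static boolean triggeredByMissingWorkspace(List<String> logLines) {
        for (String line : logLines) {
            String trimmed = line.trim();
            // startsWith tolerates a trailing reason such as "(nonexisting_workspace)".
            if (trimmed.startsWith("Workspace is offline")
                    || trimmed.startsWith("Scheduling a new build to get a workspace")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "Started on Sep 6, 2013 9:46:12 AM",
                "Workspace is offline.",
                "Scheduling a new build to get a workspace. (nonexisting_workspace)",
                "Done. Took 69 ms",
                "Changes found");
        System.out.println(triggeredByMissingWorkspace(log));
    }
}
```

Note that "Changes found" alone is not a reliable signal here, since both quoted logs end with it even though no commit had been made.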

            b_dean Ben Dean made changes -
            Field: Component/s
            New Value: mercurial [ 15502 ]

            Field: Component/s
            New Value: pollscm [ 17336 ]

            Field: Environment
            Original Value: Windows 2008 R2 build farm. 1 master, 7 slaves.
            New Value:
              Windows 2008 R2 build farm. 1 master, 7 slaves.

              reproduced on different environment:

              Jenkins Master OS: CentOS 6.4
              Jenkins Slave OSes: CentOS 6.4 and Windows Server 2008R2
              approx. 45-50 slaves

              Jenkins version: 1.528
              Mercurial plugin: 1.46

            Field: Priority
            Original Value: Major [ 3 ]
            New Value: Critical [ 2 ]
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Kohsuke Kawaguchi
            Path:
            changelog.html
            core/src/main/java/hudson/model/AbstractProject.java
            core/src/main/resources/hudson/model/Messages.properties
            http://jenkins-ci.org/commit/jenkins/28737ee9d1ae4ab02d650a284ec52e98e50d9f63
            Log:
            [FIXED JENKINS-8408]

            If slaves are late to come online after a Jenkins startup, we will see a huge spike of builds as Jenkins attempts to get a workspace for polling.

            Compare: https://github.com/jenkinsci/jenkins/compare/e68ec055fda2...28737ee9d1ae
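
The commit message above names the failure mode: slaves that are late to come online after startup cause builds to be scheduled just to get a workspace. One way such a guard could work is to hold off on those builds for a grace period after startup. The sketch below only illustrates that idea and does not reproduce the committed change; the class name, the method, and the ten-minute window are all assumptions.

```java
import java.time.Duration;

/** Illustrative startup guard; names and the grace period are assumptions, not the real fix. */
final class StartupGracePeriod {

    // Assumed value for illustration only; the issue text does not state how long to wait.
    static final Duration GRACE = Duration.ofMinutes(10);

    /**
     * Decides whether polling may schedule a build whose only purpose is to get a workspace.
     *
     * @param uptime           time elapsed since the master started
     * @param workspaceOffline whether the workspace needed for polling is unreachable
     */
    static boolean shouldBuildToGetWorkspace(Duration uptime, boolean workspaceOffline) {
        if (!workspaceOffline) {
            return false; // polling can run against the existing workspace
        }
        // Shortly after startup, give slaves time to reconnect instead of building.
        return uptime.compareTo(GRACE) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(shouldBuildToGetWorkspace(Duration.ofMinutes(1), true));
        System.out.println(shouldBuildToGetWorkspace(Duration.ofMinutes(30), true));
    }
}
```

With a guard like this, a polling run one minute after a restart would report no build instead of queueing one, while a workspace that is still offline long after startup would still be rebuilt.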

            scm_issue_link SCM/JIRA link daemon made changes -
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Resolved [ 5 ]
            kohsuke Kohsuke Kawaguchi made changes -
            Link This issue is duplicated by JENKINS-20227 [ JENKINS-20227 ]
            dogfood dogfood added a comment -

            Integrated in jenkins_main_trunk #2987

            Result = SUCCESS

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jesse Glick
            Path:
            core/src/main/java/hudson/model/AbstractProject.java
            http://jenkins-ci.org/commit/jenkins/3ddef512b21b336e2911598ad3f62def62cb0e18
            Log:
            Fix of JENKINS-8408 broke some tests of workspace-based polling; disable the fix when inside a test, for better predictability.
            (Ideally Jenkins would actually detect whether there was a plan to connect a slave of a given name.)

            rtyler R. Tyler Croy made changes -
            Workflow JNJira [ 138518 ] JNJira + In-Review [ 188033 ]

              People

              Assignee:
              redsolo redsolo
              Reporter:
              tdiz tdiz
              Votes:
              2
              Watchers:
              5
