Type: Improvement
Resolution: Fixed
Priority: Critical
Labels: None
Environment: Platform: All, OS: All
It seems that the usual way jobs work on Hudson is that there is one job for building a project, and then if I want a subsequent job to run all unit tests, that job needs to download the artifacts produced by the build job from Hudson. As a consequence, people usually add separate targets to their build scripts for running unit tests from Hudson, which download the build artifacts over the network.
I think it would be nice if things could work the way users would do it when running tests locally - i.e. use the usual targets for building and then simply run a target for unit tests in the same workspace. This could be achieved by providing the subsequent jobs with an exact snapshot of the workspace at the time the parent job finished running.
Links:
- is blocking: JENKINS-2655 Allow Maven jobs to share workspace (Closed)
- is duplicated by: JENKINS-3977 Better workflow (Closed)
- is duplicated by: JENKINS-398 Shared workspace between jobs or different actions in the same job? (Closed)
[JENKINS-682] Clone workspace between jobs
The problem with doing this is that then you won't be able to run the main build
and tests in parallel, so it increases the turn-around time.
What's the difference between setting up two jobs with the same workspace, as
opposed to setting up one job that does both the build and the test?
I see. I was not sure how it is implemented - i.e. whether different slaves access the same physical workspace. I thought each slave checks out its own workspace for a given job, in which case tests and a new build could still run in parallel.
I thought the difference compared to running it all as a single job would be exactly that: the possibility to start another build before the tests (and possibly other subsequent jobs such as coverage, findbugs, javadoc generation, etc.) have finished running.
Every slave uses a workspace on local file system, so different slaves get
different workspaces.
The problem I was trying to point out in sharing a workspace is that if a new
build starts while a test run is in progress, it will most likely mess up. For
example, a build might try to overwrite the file that a test is using. So in
general, I don't see how you can reliably run multiple tasks over the same
workspace in parallel.
Or maybe what you are really saying is a slightly different model, where the
execution will go like:
ws #0 <--- build #N --><-- test #N --->
ws #1 <--- build #N+1 --><--test #N+1 --->
?
Have a look at Buildbot, where each build is composed of several build steps. Steps are tied to one another, with no overlapping.
The first one is mostly a cvs checkout, and then you are free to do in further steps whatever you think is necessary. Each step has its own result and a directive for how to proceed in the WARNING and ERROR cases. This concept is nice.
Kohsuke, what I would like is to be able to configure my job in such a way that it can assume the workspace is in the same state as it was after its parent job. So, if some job is triggered by another job, I thought it would be good if there was a way for the triggered job to inherit the workspace of the triggering job, so that people would not have to do additional Hudson-specific "magic" (and maintain additional Hudson-specific targets) in their build scripts. Hudson does know that a given job is triggered by another job, and most likely it is going to use the artifacts that other job produces, so I thought it would be nice if those could be provided to the subsequent jobs automatically.
As a user who does not know much about Hudson internals, I cannot suggest how to implement it. One way (the basic one) it could be done is what you showed in your ASCII diagram. Another possible way could be a mechanism to push a snapshot of the workspace from a job to the jobs it triggers (potentially to other machines if those jobs run on other systems). I guess this would be the least limiting when it comes to parallel execution.
A note to myself: the recent introduction of the Resource interface can be used as a lock mechanism for multiple jobs to use the same workspace.
This is convenient for some use cases, like occasionally running a lengthy task (like "mvn site") where normally a quick CI build runs.
I want to mention the case where the workspace is on NFS.
In many of the jobs we lose most of the time checking out the sources, so being able to share a workspace would save a considerable amount of time. That is the case for a multiple-configuration project, or where the job can be separated into a few tasks.
I think that being able to mark arbitrary jobs as sequential would work for us.
This is a commonly requested feature, and this request is from my colleague, so bumping up the priority a bit.
See another recent discussion at
http://www.nabble.com/Efficiently-using-Hudson-tf4649823.html
Another recent discussion about this:
http://www.nabble.com/Multiple-jobs-per-project---to14341510.html
What would be sufficient for me is not necessarily the workspace of the job that triggers me, but rather the build artifacts. Even better would be if I could pass the build artifact URLs as parameters to the triggered job.
I would like to describe another scenario where having the ability to share a workspace between jobs would be useful.
Note I want to keep using the built-in Maven 2 project type and not have to drop to a 'free-style software project' and use shell scripting to get this job done.
Simply put, I need to run Maven more than once on the project to finish my build. The build needs to do the following:
- Run a full build and generate results for all projects (i.e. "mvn install -fae -Dmaven.test.failure.ignore=true") - yes, ignore all failures.
- Run some data collection mojos across the now fully built project hierarchy and then produce some reports (i.e. "mvn site -fae -Dmaven.test.failure.ignore=true site:deploy").
Firstly, a normal build is run and a number of code tools (checkstyle, pmd, clover, cobertura, findbugs, javadoc, jxr) do their thing. These are my own modified versions of the plugins, all changed to decouple their analysis functionality from their reporting functionality. The reason we have decoupled analysis from reporting is the next stage.
The second invocation of mvn runs our custom plugins again (they're bound to the site and pre-site phases), but this time in 'aggregate' mode. These 'aggregate' mojos have the job of looking through the module hierarchy and aggregating (pulling up and merging) the various result files they find (e.g. checkstyle-results.xml) up the hierarchy, producing new analysis files as they go (i.e. for the parent POM projects). The result of this is that at every level of the project hierarchy one can see the aggregated results from JXR, javadoc, checkstyle, pmd, clover, tests, etc.
Note, this is very different from the standard Maven reporting plugins' aggregate feature (i.e. setting <aggregate>true</aggregate> for javadoc): when one uses these standard plugins in aggregate mode, only the top-most project gets the aggregated report; all the other modules in the hierarchy do not generate any report at all.
Once the pre-site phase has run, the standard site plugin kicks in. This in turn runs all our reporting mojos and we get some rather lovely, fully aggregated multi-tier reports for javadoc, jxr, checkstyle, et al.
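To make the 'aggregate' idea concrete, here is a rough sketch of what such a pulling-up mojo might look like; this is a hypothetical illustration, not the poster's actual plugin code, and the XML merge itself is elided:

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.apache.maven.plugin.AbstractMojo;
import org.apache.maven.plugin.MojoExecutionException;
import org.apache.maven.project.MavenProject;

/**
 * Hypothetical "aggregate" mojo: for a parent POM project, collect the
 * per-module result files (e.g. checkstyle-results.xml) and merge them
 * into a new result file at this level of the hierarchy.
 *
 * @goal aggregate
 * @phase pre-site
 */
public class AggregateResultsMojo extends AbstractMojo {
    /** @parameter expression="${project}" */
    private MavenProject project;

    public void execute() throws MojoExecutionException {
        List<File> moduleResults = new ArrayList<File>();
        for (Object module : project.getModules()) {
            File f = new File(project.getBasedir(),
                    module + "/target/checkstyle-results.xml");
            if (f.isFile()) {
                moduleResults.add(f);
            }
        }
        getLog().info("Aggregating " + moduleResults.size() + " result files");
        // Merging the collected XML documents and writing the combined
        // file to this project's own target directory is the
        // plugin-specific part, elided here.
    }
}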
Ways of doing this kind of thing with Hudson:
1) Preferred option? - Be able to define multiple build actions, i.e. don't chain jobs but chain multiple build commands, a bit like the batch task functionality. Thus you stay in the same workspace, as it's the same job. Note this would need some kind of 'continue if failed' functionality. I can see how this stretches the Hudson job model a bit, so it may not be viable for design reasons.
2) Call a batch task on the same project when the main build finishes - this is what I'm trying to do at the moment, but unlike the 'Build other projects' post-build action, the 'Invoke batch tasks of other projects' action does not have a 'Trigger even if the build is unstable' option, which would allow us to only call the batch task if the main build was successful (i.e. if the first mvn run fails I do not want to call the batch task).
3) Use separate jobs: one that does the initial mvn build and then another, downstream from that, which shares its workspace and does the data aggregation and site building. I think this one fits most easily into the Hudson model...
Comments welcomed...
John
My understanding of this issue is that the child job would get a copy of the parent workspace (when the parent job successfully finished). This would allow parallel/overlapping execution of the jobs, as proposed e.g. in http://hudson.gotdns.com/wiki/display/JENKINS/Splitting+a+big+job+into+smaller+jobs.
I would suppose that making this copy is much faster than checking out and rebuilding the project in the child(s) again. Maybe better than copying: create an archive when the parent job successfully finishes and extract the archive when the child(s) start(s) - that way the children always get a valid version, at any time.
One additional use case for this proposal:
I have a project that checks out from a subversion repository if anything has
changed. That is the only thing the project does. If I go to the project page
and look at "Recent Changes", I can see what changes were checked in to the
repository since the last checkout.
The project then triggers a downstream project which does a build and various tests. If the build fails, I would like Hudson to email everyone who committed changes in the last checkout. This is easy to do if the build/test is done in the same project as the checkout, but does not work if the checkout and build are separate projects (naturally enough, since the build project does not know what changed).
It would be useful if the ability to share a workspace included the ability to pass changes to downstream projects.
Since it's close to hitting the 2-year mark since this issue was created, is it
safe to assume that there are no immediate plans to resolve this?
I think what many of the watchers and users concerned with this issue really want is nothing more than HOW to split a job into "build" and "test". A concrete example of how to pass things from one job to another job would be very much appreciated. All I can google up is people saying that a heavy job SHOULD be split, WITHOUT actually being clear on HOW to properly do what they're recommending.
There is a wiki page which explains how to split up a large job:
http://wiki.jenkins-ci.org/display/JENKINS/Splitting+a+big+job+into+smaller+jobs
When using only one node, you could configure the upstream projects to use a fixed workspace (the one of the base project; configure job > extended project settings > configure working directory).
A thought:
A new SCM plugin that takes another project as the SCM source - it would keep track of the last build of that other project for polling purposes, and would copy in that other project's workspace for the checkout (and inherit the upstream project's changelog too). It's not perfect - duplicating the workspace means more disk space used, for example. But it seems like it might work - thoughts?
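A rough sketch of what such an SCM might look like against Hudson's SCM extension point; the class name is hypothetical, and polling plus changelog inheritance are elided:

import java.io.File;
import java.io.IOException;

import hudson.FilePath;
import hudson.Launcher;
import hudson.model.AbstractBuild;
import hudson.model.AbstractProject;
import hudson.model.BuildListener;
import hudson.model.Hudson;
import hudson.model.TaskListener;
import hudson.scm.ChangeLogParser;
import hudson.scm.SCM;

/**
 * Hypothetical SCM: "checkout" copies the current workspace of another
 * project into this job's workspace.
 */
public class UpstreamWorkspaceSCM extends SCM {
    private final String upstreamProjectName;

    public UpstreamWorkspaceSCM(String upstreamProjectName) {
        this.upstreamProjectName = upstreamProjectName;
    }

    @Override
    public boolean checkout(AbstractBuild build, Launcher launcher,
            FilePath workspace, BuildListener listener, File changelogFile)
            throws IOException, InterruptedException {
        AbstractProject upstream = Hudson.getInstance()
                .getItemByFullName(upstreamProjectName, AbstractProject.class);
        FilePath source = (upstream == null) ? null : upstream.getSomeWorkspace();
        if (source == null) {
            listener.error("No workspace found for " + upstreamProjectName);
            return false;
        }
        // Duplicate the upstream workspace into this job's workspace.
        source.copyRecursiveTo("**/*", workspace);
        return true;
    }

    @Override
    public boolean pollChanges(AbstractProject project, Launcher launcher,
            FilePath workspace, TaskListener listener) {
        // A real implementation would compare the upstream project's last
        // build number against the one recorded at the previous checkout.
        return false;
    }

    @Override
    public ChangeLogParser createChangeLogParser() {
        return null; // changelog inheritance elided
    }
}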
The SCM approach sounds good, although it would be perfect if it were possible to force the depending jobs to execute on the same slave so they can use the exact same workspace. This appears complicated to me but can probably be achieved as a separate effort.
I mean, there could be a separate plug-in to force dependent jobs to execute on the same slave as a given job, and the custom workspace plug-in could be used to set the workspace of the desired job. So what you suggest would probably be enough.
A workspace SCM solves part of the problem, but I had assumed those asking for a shared workspace wanted a single space that handles both read and write activity. The workspace SCM solves the read problem, but doesn't address the issue of shared writes - your artifacts still get spread around in each individual job workspace.
If you execute on a single slave, then you can change the workspace. If you're executing on different slaves, you can point the SCM plug-in to the last build in the sequence (if builds run one after another).
If multiple builds should use virtually the same workspace and execute in parallel, then I'm not sure how the individual results could be merged into a single location. If the concern is artifacts, then a merge-back mask would probably be needed so the new plug-in knows which files to return to the original workspace?
Yeah, the problem with actually sharing a workspace is doing so across slaves - unless you set up a network share available to all slaves and put the workspace there (which, actually, is how we do things in my setup, though we're not using one workspace in multiple jobs), I just don't see an elegant way to let both upstream and downstream jobs write to the same workspace. Most of the use cases I'm seeing mentioned here are linear - A runs, then B runs in the same workspace as A so it can reuse the results of A's build, etc. It doesn't handle the circular use case, but, frankly, that's a bit of a crazy use case. For that fairly rare and fairly strange use case, I think the right approach would be to lock both jobs to the same slave and then use a custom workspace - sure, that creates some limitations, but I think they're reasonable.
I've still got some design kinks to work out - most notably, what to do if B tries to run but there's no workspace of A available, and what to do if B tries to run while A is already running. I'm wondering whether it might make sense to have a publisher that keeps an archive of the most recent completed build's workspace of A, and then B pulls down that archived workspace. We wouldn't have to worry about concurrency collisions with that approach.
I don't see any use in having two jobs share a workspace if only one of them can write to it. Then I can already do the same stuff with a custom workspace and the locks-and-latches plugin, or even simpler, have one job do all the work.
For me the idea of creating two jobs is that they can run at the same time, which actually means that they have two different versions of the same workspace.
You also have to be careful with holding only the most recent workspace: what happens if job A is done and updates the most recent workspace while job B is still working on the previous one? This must be supported.
IMHO keeping an archive will be more error prone. If B wants to run and there is no A workspace, then A should be executed first, with B blocked until A completes. If A tries to run during B's checkout, then it should be blocked until the checkout finishes.
The hard part here would be how to return B to the queue so A can execute. I'm not much into the code, but the SCM plug-in in B could probably:
1. start A
2. queue B
3. fail the current build but block sending notifications
4. remove the failed build
5. decrement the last build number
Or something along these lines?
Another interpretation of this would be a nice UI that does the locks-and-latches plus custom workspace setup properly for a group of jobs with a single setting, instead of managing these things on each job, and leaving open the possibility of configuration skew.
Even in the read-only case, a common use-case appears to be sharing the IO cost among jobs by having a common parent do checkout or some other IO intensive task, and then the other jobs can just read that data without having to copy it anywhere, since it lives in a shared workspace.
But this use-case is in contrast to the original description, so I think the workspace-copy idea is in line with what was originally described. We'll have to have new bugs for the other use-cases.
@akostadinov:
Where is the advantage of configuring
- either one big job
- or the use of custom workspace + locks-and-latches?
I think the main use case that needs to be supported is described on the following page:
http://wiki.jenkins-ci.org/display/JENKINS/Splitting+a+big+job+into+smaller+jobs
@akostadinov: I'm leaning towards the archiving approach because it better fits the SCM model - you're getting the controlled workspace of the previous build of job A as the workspace for a build of job B. I want to make sure this plugin can handle these use cases cleanly:
- Job B uses Job A's workspace, but both Job A and Job B have concurrent builds enabled.
- Jobs B, C, and D use Job A's workspace, and all need to get identical copies of Job A's workspace.
- Job B uses Job A's workspace, but Job B doesn't automatically run right after Job A finishes - it runs at a later point in time, when kicked off manually, and by that time, Job A's workspace has been cleaned out, or the slave Job A ran on is no longer available, etc.
Archiving Job A's workspace works in all of those use cases, and keeps the overall picture smoother. There'll be a publisher extension to turn on in Job A to archive the workspace, and an SCM extension to use in Job B, which will let you choose a parent project to use as the SCM source from a list of jobs with the publisher enabled. The SCM extension will handle polling by checking to see if there's a new archive of the parent project workspace (and that the parent project isn't in the process of writing that archive, so we don't get a partial archive), and it'll handle checkout by pulling down the archive and expanding it. I'm not yet sure exactly how the inherited changelogs will work, since this plugin won't know or care what the parent project's SCM is and the changelog is determined by the SCM, but I'll figure that out. =)
This doesn't solve every case, this isn't a perfect solution, but I think it fits the most common use case requested - run a build, and then kick off build(s) of 1..n additional projects to run tests against the results of the first build. So I'm gonna implement it - and me implementing it doesn't mean someone else can't write a plugin to fit another use case. =)
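Not the actual plugin code, but a minimal sketch of what the publisher half of this design could look like, assuming Hudson's Recorder extension point and FilePath.zip (descriptor and UI configuration omitted):

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import hudson.Launcher;
import hudson.model.AbstractBuild;
import hudson.model.BuildListener;
import hudson.tasks.BuildStepMonitor;
import hudson.tasks.Recorder;

/**
 * Hypothetical publisher: after a build completes, zip up the workspace
 * and store the archive alongside the build record, where a companion
 * SCM can later find it.
 */
public class ArchiveWorkspacePublisher extends Recorder {
    @Override
    public boolean perform(AbstractBuild<?, ?> build, Launcher launcher,
            BuildListener listener) throws InterruptedException, IOException {
        File archive = new File(build.getRootDir(), "workspace.zip");
        OutputStream os = new FileOutputStream(archive);
        try {
            // FilePath.zip streams the workspace contents (even from a
            // remote slave) back to the master as a zip archive.
            build.getWorkspace().zip(os);
        } finally {
            os.close();
        }
        listener.getLogger().println("Archived workspace to " + archive);
        return true;
    }

    public BuildStepMonitor getRequiredMonitorService() {
        return BuildStepMonitor.NONE;
    }
}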
@peter_schuetze - I'm not sure sharing of workspaces is exactly what's needed to support this use case. Actually, it seems sharing has a pretty broad meaning...
@abayer - how will your archiving be better than setting A to archive the whole workspace as artifacts and then telling B to download them? I've always seen sharing of workspaces as implying fewer Hudson IO operations and less disk utilization.
I fully agree, though, that all of the proposed solutions solve some problems, and everyone who has the knowledge and time to implement any of them is welcome.
Regards.
Given the direction this is going, I think a better issue summary is "Clone workspace between jobs"
@akostadinov - The biggest difference is that it'll be formalized - there won't be a need to do the zipping/archiving/downloading/extracting manually, and there'll be an automatically defined relationship between the parent build and the child build. Also, we'll be able to have the child build inherit the parent build's SCM changelog. You're right in that this won't do much for lowering IO/disk space usage, but it will answer most of the use cases mentioned in earlier comments here, so that's what I'm going after.
Hmm - I now remember looking at this vaguely before and stumbling across hudson.fsp.WorkspaceSnapshotSCM. But I still haven't seen anything actually using that, so I'm not sure whether to start from scratch or not.
The main thing missing in Hudson to resolve this issue (and many others) is the ability to define several "builds" in a job, as in Continuum for example: http://continuum.apache.org/docs/1.3.5/user_guides/managing_builddef/builddefProject.html
The idea is to reuse the SCM & workspace for several builds. For each build we can define triggers and builders.
In Hudson the ideal would be to be able, in one job, to have several sets of Build (Triggers + Settings + Environment + Post-build).
This could be backward compatible by proposing only one set by default.
I know this is an important change because it touches the core model.
The challenge with that model is the slave situation - since each slave has its own workspace area, how do we have several jobs using the same workspace across multiple slaves? I'm by no means averse to the kind of thing you're talking about, I'm just not sure at all how to do it within Hudson.
@aheritier
I believe what you describe is a distinct enhancement request from JENKINS-682 (this issue)
Careful reading of the original description and the early comments indicates a desire to automatically clone the entire workspace to downstream jobs. Sharing a single workspace doesn't yet have an issue filed, that I'm aware of.
In my company we split work into several task-oriented jobs such as 01-SCM, 02-Build, 03-UnitTesting, 04-Deploy, 05-RegressionTesting, etc.
Each of these jobs depends heavily on the result of the previous one (plus, generally, the SCM one), that is, on the content of the workspaces of other jobs. We currently achieve this with some shell scripting, looking at the workspace path to see whether we're running on a slave or the master, to copy files from one job to another. This means all the jobs must be assigned to the same node (or otherwise you have to use the Copy To Slave plugin).
My idea some months ago was to create some kind of $WORKSPACE[<job name>] variable which would make it easy to point to upstream/downstream job workspaces. This does not address the issue of "one commonly shared" workspace, nor the node assignment one, but I think it could be a first easy & useful solution to set up.
We could also think about a new out-of-the-box build step (or a wrapper, etc.), "Grab from other job", with a job name and includes/excludes (Ant style), as sketched below.
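A rough sketch, purely hypothetical, of what such a "Grab from other job" build step could look like against Hudson's Builder extension point (descriptor, form binding, and node-assignment concerns omitted):

import java.io.IOException;

import hudson.FilePath;
import hudson.Launcher;
import hudson.model.AbstractBuild;
import hudson.model.AbstractProject;
import hudson.model.BuildListener;
import hudson.model.Hudson;
import hudson.tasks.Builder;

/**
 * Hypothetical build step: copy an Ant-style file pattern out of another
 * job's workspace into the current build's workspace.
 */
public class GrabFromOtherJobBuilder extends Builder {
    private final String jobName;
    private final String includes; // e.g. "dist/**/*.jar"

    public GrabFromOtherJobBuilder(String jobName, String includes) {
        this.jobName = jobName;
        this.includes = includes;
    }

    @Override
    public boolean perform(AbstractBuild<?, ?> build, Launcher launcher,
            BuildListener listener) throws InterruptedException, IOException {
        AbstractProject other = Hudson.getInstance()
                .getItemByFullName(jobName, AbstractProject.class);
        FilePath source = (other == null) ? null : other.getSomeWorkspace();
        if (source == null) {
            listener.error("No workspace available for job " + jobName);
            return false;
        }
        int copied = source.copyRecursiveTo(includes, build.getWorkspace());
        listener.getLogger().println("Copied " + copied + " files from " + jobName);
        return true;
    }
}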
@mdonohue
I did mean sharing the workspace when I filed this issue. Whether it is done by ensuring these related jobs run on the same slave, or by cloning the exact state of the workspace to the other slave that runs the subsequent job, is an implementation detail.
Or maybe not - I see the difference: in the case of cloning, the workspace of the previous job would not be affected by the subsequent one, which may be desirable, but not for the use case I had in mind. When filing this issue I was mostly concerned about not being able to easily share the workspace of a job with subsequent jobs, so they could use the results of the previous job without doing additional Hudson-specific magic in my build scripts.
@mmatula - does the approach I proposed (archive the workspace - or a selected subset thereof - from one job's build and then one or more other jobs taking that archive and exploding it as the basis of their workspace) fit your request? I want to be sure before I really dive into coding this up. =)
@abayer - Yes, that looks good. Btw, re the changelog - I guess it should just inherit the parent job's changelog.
FYI, I've got this working (more or less - still a couple things to test and docs/tests to write) - it's up at http://github.com/abayer/hudson-clone-workspace-scm-plugin, though it'll need 1.350 to be released before I can release it.
I've released the clone-workspace-scm plugin, which contains both a publisher for archiving workspaces and an SCM for, well, using those archived workspaces as SCM sources. It does need Hudson 1.350 or later - it'll blow up with earlier versions. I've got a wiki page up at http://wiki.jenkins-ci.org/display/JENKINS/Clone+Workspace+SCM+Plugin, but I need to do more work on the documentation. And as I mentioned before, the actual source is at GitHub (http://github.com/abayer/hudson-clone-workspace-scm-plugin) rather than the Hudson SVN repo.
The plugin should show up in the Update Center within the next 6-12 hours.
Is this plugin compatible with matrix projects? If not, would it be possible?
Not sure what you mean - matrix projects should be able to use the SCM part of it, at the very least. I'm not 100% sure how the publisher/archiver would work with a matrix project, though.
Oops, this is an enhancement request, not a defect...