Type: Bug
Resolution: Fixed
Priority: Major
Labels: None
Environment: Windows XP, Windows 7 using MSBuild or devenv.exe to build MS Visual Studio projects
I run into errors when using a customized build system which uses Visual Studio's devenv.exe under the hood to compile Visual Studio 2005 projects (with the VC++ compiler). When starting two parallel builds with Jenkins (on different code bases), the second job will always fail with "Fatal error C1090: PDB API call failed, error code '23' : '(" in exactly the same second the first job finishes processing. Running both jobs outside Jenkins does not produce the error.
This has also been reported for builds executed by MSBuild on the Jenkins user mailing list [1].
I analysed this issue thoroughly and can track the problem down to the usage of mspdbsrv.exe. This program is automatically spawned when building a Visual Studio project. All Visual Studio instances normally share one common pdb-server, which shuts itself down after an idle period (the default is 10 minutes). "It ensures access to .pdb files is properly serialized in parallel builds when multiple instances of the compiler try to access the same .pdb file" [2].
I assume that Jenkins does a clean-up of its build environment when an automatically started job finishes (as described at http://wiki.jenkins-ci.org/display/JENKINS/Aborting+a+build). I checked mspdbsrv.exe with Process Explorer and the process indeed has a variable JENKINS_COOKIE/HUDSON_COOKIE set in its environment if started through Jenkins. Killing mspdbsrv.exe while projects are still connected will break compilation.
Jenkins must not kill mspdbsrv.exe if it is to be able to build more than one Visual Studio project at the same time.
–
[1] http://jenkins.361315.n4.nabble.com/MSBuild-fatal-errors-when-build-triggered-by-timer-td385181.html
[2] http://social.msdn.microsoft.com/Forums/en-US/vcgeneral/thread/b1d1bceb-06b6-47ef-a0ea-23ea752e0c4f/
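The cookie mechanism described in the report can be sketched in Python: Jenkins puts a cookie variable into the build's environment, and every child process spawned from it (including mspdbsrv.exe) inherits that variable, which is how the clean-up later finds processes to kill. The cookie value below is invented for illustration.

```python
import os
import subprocess
import sys

# Simulate the inheritance described above: the build's environment
# carries JENKINS_COOKIE, and any child process spawned from it (such
# as mspdbsrv.exe) inherits the variable. The cleanup scan matches on
# exactly this inherited value.
env = dict(os.environ, JENKINS_COOKIE="build-42")
child_sees = subprocess.check_output(
    [sys.executable, "-c", "import os; print(os.environ['JENKINS_COOKIE'])"],
    env=env,
).decode().strip()
print(child_sees)
```

Because mspdbsrv.exe is a grandchild of the build step, it carries the same cookie as short-lived compiler processes, even though its lifetime is meant to span builds.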
Attachments: envinject-config.png (13 kB), screenshot.JPG (29 kB)
is duplicated by:
- JENKINS-24753 MSBuild fails with error "fatal error C1090: PDB API call failed, error code '23'" (Resolved)

is related to:
- JENKINS-19156 Jenkins does not invoke ProcessKillers for Windows recursively (Open)
- JENKINS-3105 Configuration UI to disable process tree killer selectively (Resolved)
[JENKINS-9104] Visual studio builds started by Jenkins fail with "Fatal error C1090" because mspdbsrv.exe gets killed
Christoph: Have you tried this on Windows with winp doing the killing? I think it works differently there.
Sorry, no, I haven't tried this with winp (to be honest, I don't even know what winp is). I will try with a current Jenkins setup.
Winp is the library doing the recursive killing on Windows (if available), and that is what was fixed between 1.532.x and 1.554.x, so Kevin is correct that this was changed between these versions.
There are a few reported issues related to winp not working reliably; maybe one of them can be exploited as a workaround to prevent it from killing mspdbsrv.
"Aren't you able to launch that service manually instead of having it launched by the first build to come along?"
This is covered in the comments. The service times out, which can still happen mid-build. If you set the timeout long, then you risk memory leaks.
Setting BUILD_ID to "dontKillMe" still works as expected with Jenkins 1.554.1 LTS. Even though I'm not able to test the original setup (as I no longer use VS and mspdbsrv.exe), a new process spawned during the run with python subprocess.Popen() will not be killed by the process tree killer. Running without setting BUILD_ID kills the subprocess as expected.
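A minimal Python sketch of the test described above (the child command is a stand-in for any long-running process such as mspdbsrv.exe):

```python
import os
import subprocess
import sys

# Launch a child with BUILD_ID overridden to "dontKillMe". The process
# tree killer matches processes on their inherited BUILD_ID, so a child
# carrying this replacement value is skipped when the build ends. The
# child command here simply echoes what it inherited.
env = dict(os.environ, BUILD_ID="dontKillMe")
p = subprocess.Popen(
    [sys.executable, "-c", "import os; print(os.environ['BUILD_ID'])"],
    env=env, stdout=subprocess.PIPE,
)
inherited = p.communicate()[0].decode().strip()
print(inherited)
```
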
Great! In that case, this is not a defect but behaves as intended.
What would be a good location to document setting BUILD_ID to prevent process killing? Obviously, there's a need there...
It is already documented at https://wiki.jenkins-ci.org/display/JENKINS/ProcessTreeKiller
Random wiki pages aren't exactly discoverable. Unless you know it's there you wouldn't even bother searching.
Maybe add to the description of shell/batch build steps that launched processes are cleaned up after the script exits, and that this can be disabled?
I don't know if the build step is the right scope for the documentation. The descriptions of the different build steps are provided by the plug-ins, aren't they? It would be hard to have a consistent message across all plug-ins. At least Python, MSBuild, shell, and Windows batch come to mind. Maybe also Groovy, qmake, CMake and others that provide an API to spawn a process.
In order to solve this issue it would be nice to have some sort of process name white-list with processes that will never be killed by the tree killer. This could then be configured globally (master/per slave). What do you think?
Christoph: Shell and Batch are both in core and the most straightforward choices for launching programs. Specialist plugins might not even allow this much flexibility.
I'd just like to get more discoverability, and that doesn't need annotating at every conceivable location. If users know this solution from the Batch/Shell descriptions, and maybe transfer it over to similar plugin-provided builders, that's perfect.
Tree killer configuration might be helpful, but that should be filed in a new issue. AFAICT this needs to touch a lot of parts, so this would be a rather large project.
I have a few follow up questions:
1. If I understand correctly, this "process tree killer" feature was pre-existing in earlier Jenkins releases, but only in the latest update was it "changed" to add recursive killing of processes, correct?
2. That being the case, does setting "BUILD_ID=dontKillMe" disable termination of all processes or just this new "recursive" behavior? If it disables all process terminations I'd say this proposal would not be a viable workaround since it could risk leaving other rogue processes orphaned on a build machine, which has many adverse side effects (which, I'm guessing you already know since I suspect this feature was implemented to resolve these exact problems)
3. Won't setting the "BUILD_ID=dontKillMe" affect other parts of the build? The BUILD_ID env var is used as a unique identifier throughout the job after all. Changing it from the unique identifier it is meant to be, to a statically defined character string seems fragile at best.
So far, based on the recent comment threads, my admittedly superficial understanding of the root cause, and some quick Googling, it seems there are only a few viable options to resolve this issue:
1. A python script was written by an earlier commenter, which leverages the BUILD_ID env var to strategically control the lifetime of the pdbsrv process itself without affecting other parts of the build.
- This seems like a pretty harsh workaround to what is obviously a problem introduced by changes made in the latest Jenkins LTS update.
2. Roll back the version of this "process tree killer" used by Jenkins LTS to v1.16, before this new "recursive" behavior was added according to an earlier comment.
- I assume LTS releases are expected to maintain a certain level of stability and consistency in their behaviors. That being the case, this change obviously caused critical, debilitating side effects to Visual Studio users and thus should not have been included in an update release.
3. Provide some kind of workaround within the "process tree killer" or the Jenkins core libraries to compensate for this newly discovered problem.
- From what I gather from the earlier comments, this may be a non trivial task. However, if this new recursive logic in the process tree killer is absolutely required in Jenkins LTS for some reason, I think this work must be done. Anything else (scripting, documentation notes, etc.) would just be trying to hide the fact that this is an underlying architectural problem - imo.
4. Accept the fact that Visual Studio users will likely never use Jenkins versions that include this "new feature", forcing them to use versions of Jenkins that predate this change.
- Currently this is the solution that my team and I have chosen to adopt until a more reasonable solution can be found.
- Just to clarify our rationale for this decision: Using v1.532.x works just fine with Visual Studio. Upgrading to v1.554.x does not work - at all. Period. To do otherwise would require extra time (and, hence, money) on our part to workaround the problem, for little to no benefit on our part.
Aside
I probably should say that I truly believe the real root cause of this problem is an underlying architectural issue with Visual Studio and its use of this pdbsrv process in their newer compilers, but numerous forums and bug reports to Microsoft appear to fall on deaf ears (i.e. they claim it's working this way by design). Given the fact that this has been a problem in Visual Studio for several releases spread across many years, it's unlikely to change any time soon, so you may be forced to compensate for it here in your tool. To do otherwise will simply make it more difficult (and, by extension, less likely) for Visual Studio users to adopt / continue using your tool.
Does this also happen with MSBuild, or only Devenv? Can you switch to the former? What about systems without Visual Studio installed, instead using only MSBuild/Windows SDK?
(I'm not too familiar with Visual Studio projects beyond pressing an F-key to build them, so this might well be a stupid question)
From what I understand this is a problem with the compiler, which I think is the same compiler used under the hood by both msbuild and devenv, however I have not confirmed first hand the same problems arise in both situations. I'd be surprised if they didn't.
As for building our projects without Visual Studio, with just MSBuild / Windows SDK, we have as of yet been unable to do so. We have heavy dependencies on MFC which hasn't, until recently, been available outside of Visual Studio. Plus we have had numerous technical issues migrating to the newer versions of the SDK / MSBuild that do include them. Regardless, again I'd be surprised if any of this made any difference unless the compiler that ships with the SDK is fundamentally architecturally different than the one that ships with VS.
If I can spare some time to confirm a few of these details I'll let you know, even if just for curiosity's sake.
Solution 5: Don't run MsBuild projects in parallel.
Before I built the python workaround, that's what I did using a throttling plugin. Works fine. pdbsrv gets killed at the end of each build, and started afresh by the microsoft toolchain on the next job. But if you are trying to do continuous build on development branches, then this won't have the capacity to keep up.
Solution 6: Set BUILD_ID to hide pdbsrv from the processtreekiller. Live with the chance that once in a while pdbsrv might time out mid-build.
Solution 5: Don't run MsBuild projects in parallel.
That may be fine for small projects but not for larger ones. For example, our main codebase is configured with about 40 jobs per configuration to build each "tier" or "module" in our codebase more efficiently - running jobs in parallel whenever possible. Doing so reduced our "clean" build times from 14 hours to 3. Numbers like that are hard to argue against.
Are there other ways we could achieve similar results? Possibly, but they all require time and effort (aka: money) which we do not have.
Solution 6: Set BUILD_ID to hide pdbsrv from the processtreekiller. Live with the chance that once in a while pdbsrv might time out mid-build.
Could you clarify what you are referring to here? I assume you mean something other than using your python script since that was the very first "potential fix" I had mentioned above.
It has been my experience that so long as you leave Visual Studio to its own internal details to manage pdbsrv, it works reliably for extended periods, keeping the service alive when needed and terminating it safely when it isn't, even if you run multiple builds in parallel via Jenkins. In fact that is what we do now and it never causes problems with our builds. This is saying something considering the size and scale of our build farm, with hundreds of jobs spread across nearly a dozen servers, all running 24/7!
Maybe the following workaround would work: If mspdbsrv.exe runs as the user launching devenv, you could create a whole bunch of slaves all running on the same machine, but as different users, each having a single executor.
Seems a bit heavy. The extra overhead of running multiple agents alone seems like it would be significant, let alone the complexities involved with having multiple user profiles being used, all of which would need to have a consistent configuration to ensure the agents all behave the same, not to mention managing security and permissions and whatnot. Given that each of our agents currently runs with between 4 and 6 executors, that would increase our agent count by the same factor.
Also, this would make managing overall load on a given system more complex. Consider jobs that are configured to use 100% of the agents resources to prevent parallel build problems, as an example. These would need to be configured to work across agents somehow. I'm not even sure that is possible....
I looked into the difficulty of adding a "process whitelist" for processes that must not be killed. It would require some changes to winp but it's the only workable solution, besides "disable process killing for this entire task", which can, itself, cause build failures.
Unfortunately, because the necessary changes have to span two projects, it'll be a bit of a large task without cooperation from everyone involved.
> It has been my experience that so long as you leave Visual Studio to its own internal details to manage pdbsrv, it works reliably for extended periods, keeping the service alive when needed and terminating it safely when it isn't, even if you run multiple builds in parallel via Jenkins. In fact that is what we do now and it never causes problems with our builds. This is saying something considering the size and scale of our build farm, with hundreds of jobs spread across nearly a dozen servers, all running 24/7!
Unfortunately I've found this isn't the case - there seem to be situations where mspdbsrv times out mid-build and is restarted cleanly, and if that doesn't happen within a BUILD_ID replacement block, then when the restarting build finishes, Jenkins will happily kill mspdbsrv and break other builds.
I suspect "running 24/7" is why you're not seeing this - it's happening somewhat frequently on a much smaller farm of mine with much fewer jobs.
I suspect "running 24/7" is why you're not seeing this - it's happening somewhat frequently on a much smaller farm of mine with much fewer jobs.
That is totally possible. Running so many jobs in parallel so often it is probably a rare condition that no jobs are running at all on any given server on our farm, and this may be preventing the service from timing out.
Thanks for pointing that out.
I'm very new to the Jenkins world. I am running into this issue a lot. This would be a show stopper for us when it comes to adopting Jenkins for our build processes. Our builds get manually triggered by many users at random times. We could have 20 or more builds running at the same time, all running in parallel. I tried the Python script given by Steve Carter in an Execute Shell command box but I get an error that "sh" -ex was not found. What gives? I thought I was running a Python script, not Linux? Or do they both need to run on Linux?
In short, if I do not get this resolved, we will have to go back to our previous way of building.
Has anyone solved this issue yet?
Thank you,
Tony: Please address requests for assistance to the jenkinsci-users mailing list, or #jenkins IRC channel on Freenode.
I just ran into this one for the first time as far as I can tell. I did a quick look back and see no other instances and I don't recall seeing this before. For now, I'll take no action. danielbeck or anyone else, please let me know if I can provide you with any information that could help in resolving this. Build env where we saw this error: MS Win 7 x64, VS2010
I hit three more instances of this. Two yesterday and one other a week ago.
How are you starting these builds? Batch? MSBuild plugin? What exact commands? If batch, did you try setting BUILD_ID as described on https://wiki.jenkins-ci.org/display/JENKINS/ProcessTreeKiller ?
Batch. In the Jenkins project, we use "Execute Windows Batch Command" to call a batch script that automates a bunch of pre build work and ends up calling the builds via devenv.
I did not try the BUILD_ID suggestion as I saw that there were still issues mentioned in this ticket with this work-around. I was trying hang in there until the final solution was provided, but the failures seem to be picking up for us lately. I guess we'll use this work around for now.
I'm trying the BUILD_ID suggestion now, but this is a hack (right?) and not a final solution? The final solution is to have Jenkins not kill specified processes like mspdbsrv.exe. Whether that is in a whitelist managed by the user or hardcoded by Jenkins for now doesn't matter to me. Hopefully there will be a long-term fix to Jenkins for this.
Does anyone know if there is an option to stop Jenkins from killing processes completely as a global option instead of having to add the BUILD_ID to every single job? I have tried adding this as an env variable at the node level but it doesn't appear to give the desired results (same PDB errors were still occurring), maybe I'm doing something wrong or misunderstanding how this is working under the hood?
We were running CruiseControl for years and never had this problem but we did however have issues where processes were not terminating properly and builds would run forever until someone intervened. Sometimes we still get this with Jenkins so from my point of view one problem is better than two so I'd rather just have an option to tell Jenkins not to force terminate anything - ever. If this cannot be done with the current version (ours is 1.566) Can we at least add a check box that says "Do not auto-terminate processes" as an option in a future release and let the user decide?
How are you trying to run this, Del? At first, I didn't have success getting it going, but now I seem to have it working fine. The BUILD_ID does seem to be an effective solution (I do worry about the memory leak though). I'm using the simple batch solution in comment 6, not the python solution. You just have to make sure that the mspdbsrv file is in your path and it should work fine. We use a batch wrapper, which is under version control, for our builds, and I added code that says "If this is a Jenkins build, execute this block". To decide if this is a Jenkins build, I just check to see if JENKINS_URL is defined. Since I added that, we've not seen this issue return. Let me know if I can help in some way.
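The guard logic described above can be sketched in Python rather than batch (the URL and build ID values are placeholders, not real configuration):

```python
# Sketch of the wrapper guard described above: apply the BUILD_ID
# override only when running under Jenkins, detected by the presence
# of JENKINS_URL in the environment. Outside Jenkins, the environment
# passes through unchanged.
def build_env(base):
    env = dict(base)
    if "JENKINS_URL" in env:            # only inside a Jenkins build
        env["BUILD_ID"] = "dontKillMe"  # hide children from the tree killer
    return env

jenkins = build_env({"JENKINS_URL": "http://ci.example/", "BUILD_ID": "42"})
local = build_env({"BUILD_ID": "42"})
print(jenkins["BUILD_ID"], local["BUILD_ID"])
```

Keeping the guard in a version-controlled wrapper, as described above, means the override follows the build scripts rather than having to be repeated in every job configuration.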
I've added the block from above into the Jenkins command for the job at the moment but yesterday I got this error and there was only one build running so it is likely a different issue.
33>X509Helper.h(118): fatal error C1090: PDB API call failed, error code '23' : '(
I've even tried setting BUILD_ID=dontKillMe under the node configuration in Environment variables but I have been getting the original problem with that setting also. I even tried restarting the jenkins client service on the build server just in case it was needed for the env variable to be set for all child processes but it's not helping it seems. If this is working for yourself (@Shannon) I have to be doing something stupid.
It seems that putting BUILD_ID under the node settings will be overridden when the build runs, setting BUILD_ID back to the build timestamp. Which rules out having a global setting allowing me to turn this off.
One thing I felt needed to be expressed here is that the fact that this defect arose in an update to the LTS edition at all worries me. Combined with the fact that this defect has been opened and under active discussion for months now without any 'real' resolution - other than some hacks and workarounds - is even more concerning. According to the Jenkins website LTS editions should "...change(s) less often and only for important bug fixes...". This policy seems to have been completely negated here. Given the severity / impact of this change I would have expected whatever "improvement" was made that caused this problem would have been reserved for the "latest" release, or at the very least reverted from the LTS edition after this problem was discovered.
Perhaps someone with more knowledge about the cause of this error could elaborate on why neither of these approaches has been taken here.
@Del, Yes, you cannot set BUILD_ID for a slave setting. It is set by Jenkins on a per build basis. You'd either have to set it in the batch section of the job itself (we did this for our most frequently used builds) or if you call a batch script or some other script, you can put it there.
$200 is up for grabs for solving this issue at: https://freedomsponsors.org/issue/596/visual-studio-builds-started-by-jenkins-fail-with-fatal-error-c1090-because-mspdbsrvexe-gets-killed
I implemented a whitelist solution, see pull request: https://github.com/jenkinsci/jenkins/pull/1562
Nice work Daniel. Will be interesting to see whether that solves the problem.
For the good of the thread, I'm going to try to summarize this from the top down as there's a lot of talk on here that seems to miss the key points.
1) BUILD_ID is an environment variable, set by Jenkins when it starts a job.
2) Environment variables are inherited when processes start other processes, except when overwritten. For example, in bash scripts you can run
MYVAR=myvalue myscript.sh
and myscript.sh will run with MYVAR set to myvalue.
3) Therefore, all processes started by a jenkins job have the same BUILD_ID. This is recursive.
4) Jenkins, in order to catch rogue processes at job end (i.e. those that have broken ties with their parent process) scans the whole process space for those with the particular BUILD_ID in their environment, and kills them.
This is correct and good behavior by Jenkins.
5) When you start an MSBUILD job, pdbsrv is started, which catches requests from parallel compilations and serializes them to write pdb files. When started from Jenkins, that pdbsrv process inherits BUILD_ID from the job.
6) If you run two MSBUILD builds at once, then they share the same pdbsrv process.
7) When the first job ends, it kills the pdbsrv process – because its BUILD_ID matches the first job's build id. The second job then fails.
8) Solution 1: start pdbsrv with a BUILD_ID that doesn't match the build jobs. Then pdbsrv will not be killed at the end of the job.
9) Solution 2: use Daniel's whitelist feature to not kill pdbsrv at the end of the job.
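Steps 4 to 9 can be condensed into a toy model (the process table, names, and build IDs below are invented purely for illustration; this is not Jenkins' actual implementation):

```python
# Toy model of the ProcessTreeKiller scan in step 4: kill every process
# whose environment carries the finished build's BUILD_ID. The shared
# mspdbsrv.exe inherited job 1's BUILD_ID, so it dies with job 1 even
# though job 2 still needs it.
processes = {
    101: {"name": "cl.exe",       "BUILD_ID": "2014-06-01_12-00-00"},  # job 1
    102: {"name": "mspdbsrv.exe", "BUILD_ID": "2014-06-01_12-00-00"},  # shared!
    103: {"name": "cl.exe",       "BUILD_ID": "2014-06-01_12-05-00"},  # job 2
}

def kill_matching(build_id, whitelist=()):
    """Return the PIDs the tree killer would terminate for this build."""
    return [pid for pid, p in processes.items()
            if p["BUILD_ID"] == build_id and p["name"] not in whitelist]

# Job 1 finishes: the shared pdb server is killed, breaking job 2.
doomed_plain = kill_matching("2014-06-01_12-00-00")
# With a whitelist (Solution 2), mspdbsrv.exe survives the scan.
doomed_whitelisted = kill_matching("2014-06-01_12-00-00",
                                   whitelist=("mspdbsrv.exe",))
print(doomed_plain, doomed_whitelisted)
```

Solution 1 changes the BUILD_ID the server carries so the scan never matches it; Solution 2 leaves the BUILD_ID alone and exempts the process by name.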
Casual readers stop here.
=========================
10) The problem with Solutions 1 and 2 is this: pdbsrv still has a timeout, so you will get sporadic failures when the server goes away.
11) My "heavyweight" python fix is trying to deal with that. Basically wrapping pdbsrv with a proper timeout and reference counting so that pdbsrv is present exactly when needed.
12) pdbsrv's timeout doesn't get a new lease every time you use pdbsrv. I regard this as a bug in pdbsrv.
13) You can't leave pdbsrv running forever because it (allegedly) has memory leaks. I regard this as a bug in pdbsrv.
I really think to roll back Jenkins' ProcessTreeKiller is NOT a solution. The use of BUILD_ID brings the Jenkins machine under better control against rogue processes, and the workaround (for well-behaved servers) is easy, set BUILD_ID before starting the server, or use Daniel's whitelist.
14) Solution 3: start pdbsrv periodically, e.g. every day with a day-long timeout. That will mitigate against the memory leaks. If you use some concurrency control, e.g. Job Weight plugin, you can make sure this "kill and restart pdbsrv" job does not fire during a build.
=========================
Solution 0: Finally, it would be remiss of me not to mention again my python workaround, which has been happily keeping parallel builds working for 54 weeks now without trouble.
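The reference-counting idea behind the python workaround mentioned above can be sketched as follows. This is a minimal illustration of the concept only; the class and callback names are invented and do not reflect the actual script's API.

```python
import threading

# Keep the server alive while any build holds a lease; start it on the
# first acquire, stop it only when the count drops back to zero. This
# sidesteps both the idle timeout (no mid-build disappearance) and the
# "run forever" memory-leak concern.
class RefCountedServer:
    def __init__(self, start, stop):
        self._count = 0
        self._lock = threading.Lock()
        self._start, self._stop = start, stop

    def acquire(self):
        with self._lock:
            if self._count == 0:
                self._start()   # first build: launch the server
            self._count += 1

    def release(self):
        with self._lock:
            self._count -= 1
            if self._count == 0:
                self._stop()    # last build done: safe to stop

events = []
srv = RefCountedServer(lambda: events.append("start"),
                       lambda: events.append("stop"))
srv.acquire()   # build 1 starts the server
srv.acquire()   # build 2 reuses it
srv.release()   # build 1 ends; server stays up for build 2
srv.release()   # build 2 ends; server stops
print(events)
```
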
Penny drops: I've just seen how whitelisting differs from the BUILD_ID solution. Subtle, but it might just work...
Just a quick ping-back on this issue. Outstanding for like 4 years, no comments for months now, and all for a debilitating, crippling problem in the system! I did notice the pull request Daniel Webber created, which does seem to have some more recent activity on it but still no complete resolution to the issue even in the latest LTS release.
Are there plans for finishing this work any time soon? We are still stuck on an LTS version from like a year or two ago because we can not accept this bug into our production environment. If there is any way to get this fix in sooner rather than later I know I'd appreciate it and I'm sure many others would as well.
@steve carter
First, let me thank you for summarizing the earlier comment threads. That does help bring everything into focus.
4) Jenkins, in order to catch rogue processes at job end (i.e. those that have broken ties with their parent process) scans the whole process space for those with the particular BUILD_ID in their environment, and kills them. This is correct and good behavior by Jenkins.
Agreed. This is a perfectly valid and useful enhancement for the majority of cases. However, the debilitating effect it has on this specific use case, combined with the fact that the change was included in an LTS release which is expected to be kept as stable as possible, is where I take issue. I see this problem as a bug: admittedly a difficult-to-detect bug, and one really caused by some questionable behavior in the Microsoft build tools, but a bug nonetheless. Critical, production-halting bugs like this should be fixed immediately or reverted until an appropriate fix can be made. Doing otherwise reduces users' confidence in the stability of the tool. There is a reason shops like ours choose LTS editions for production work: to avoid problems like this that may be found on the latest, cutting-edge versions.
8) Solution 1: start pdbsrv with a BUILD_ID that doesn't match the build jobs. Then pdbsrv will not be killed at the end of the job.
This should be called a workaround or hack rather than a solution. That point aside, this workaround again won't work for our particular build environment. We use the BUILD_ID throughout our build processes to embed metadata in the binary files we generate. If we reset that environment variable as part of our build this metadata will essentially get corrupted. Changing our tooling to use an alternative environment variable would require significant effort as well, having to be propagated out to dozens of products across several release branches each.
9) Solution 2: use Daniel's whitelist feature to not kill pdbsrv at the end of the job.
Based on my review of his pull request, Daniel's feature has not yet been completed nor has it been included in any actual LTS release. I do believe this would be a reasonable and appropriate solution to this defect though, so hopefully this work can be completed sooner rather than later.
10) The problem with Solutions 1 and 2 are this: pdbsrv still has a timeout, so you will get sporadic failures when the server goes away.
I know some earlier posters did indicate that this was an issue for them, but I have not been able to reproduce the problem as described. When a compile begins and this process is running, it makes use of the existing process; if the process is not already running, it starts it. I have never had a compile running and seen the mspdbsrv process terminate mid-compile without any other background process or system event occurring. Also, I work with many development teams including many dozens of developers and have never once had a report of this bug outside of the reproducible use cases I've stated before.
Conversely, I have shown the problem is reproducible outside of Jenkins in very hard to detect ways which I suspect may appear to some to be an intermittent timeout. For example, if you are logged in to a system which is performing a compile in a background process which is also running under the same user profile as your local session, by simply logging out of the system the service terminates. The reason for this is the pdbsrv process is shared by the background process and your local user session and when you log out from the local session all processes in that memory space are terminated, including pdbsrv. This was a very difficult use case to isolate and not very obvious to users of the target systems and even went undiagnosed at my place of work for months under the assumption that the failure was unpredictable and intermittent.
I know that my argument doesn't prove that this particular problem couldn't ever happen but I am extremely skeptical to say the least. If someone does believe that this problem does in fact exist I would greatly appreciate a detailed description on how to reproduce the problem. Maybe we're using a slightly older or slightly newer version of the compiler that doesn't exhibit the problem or something. Either way, if these individuals were willing to compare notes maybe we can help further isolate the root of this discrepancy.
12) pdbsrv's timeout doesn't get a new lease every time you use pdbsrv. I regard this as a bug in pdbsrv.
As I've stated in earlier posts, my team manages a build farm with close to a dozen agents now, running over 1000 build jobs, and never once have I had this error occur on any of those systems, nor have any of the development teams we support reported this problem on any of their local development machines. I would have to say that if this were in fact a core issue with the Microsoft toolset we would have discovered it by now. Again, if anyone can give me a reproducible use case that proves otherwise I would be happy to hear from them. Maybe we are doing something they aren't, or vice versa.
13) You can't leave pdbsrv running forever because it (allegedly) has memory leaks. I regard this as a bug in pdbsrv.
Again, this is something we have not been able to reproduce. For example, I have watched some of our agents that are under the most considerable load with respect to build operations: machines which essentially run 24/7, compiling one or more projects in parallel nearly all the time. These systems continue to run stably day after day, week after week, without requiring any outside intervention from me or my team. The pdbsrv process is nearly always active, its memory consumption increases and decreases with the load on the machines, and it never causes any fatal errors in our build processes.
If anyone can provide specific, reproducible criteria for this problem I would be interested to hear it. If there is something we have overlooked that may be causing us grief elsewhere that we have not yet considered I would definitely want to know about it.
I really think to roll back Jenkins' ProcessTreeKiller is NOT a solution.
Agreed. I don't think 'just' rolling back this change is the best solution. I think fixing this bug is the best solution. However, in the absence of an appropriate fix for this bug, combined with the severity of its impact, I think that rolling back the change until an appropriate fix was put in place would have been a better solution rather than stranding users of your tool on an old, out-of-date release as we have been.
Just my 2 cents.
The use of BUILD_ID brings the Jenkins machine under better control against rogue processes...
Totally agree that the improvement is well worth the effort. My concern is that the change includes a relatively significant bug.
...and the workaround (for well-behaved servers) is easy, set BUILD_ID before starting the server, or use Daniel's whitelist.
Again, an 'easy' workaround is a relative term. As just mentioned, we would need to rework our build tools, roll that change out to many teams for many products, and backport those changes to many branches, after which we'd need to go through all 1000+ jobs on our farm and update them with the environment-variable hack. That is obviously significant effort in our case. Also, the whitelist solution has yet to be completed from what I can tell, so it is not a usable solution yet.
14) Solution 3: start pdbsrv periodically, e.g. every day with a day-long timeout. That will mitigate against the memory leaks. If you use some concurrency control, e.g. Job Weight plugin, you can make sure this "kill and restart pdbsrv" job does not fire during a build.
Again, just to be clear this is clearly a workaround and not a solution.
This hack may work for us in the interim until an appropriate fix can be made. I will test it out as soon as I can and report back. In our case we'll likely just set up a scheduled task that runs on boot and forces the service to start and stay running indefinitely, as we have seen no need for it ever to shut down.
However, for those individuals who claim that the service does need periodic resetting, a solution like this would likely be more complex. Assuming they too need to ensure the utmost stability of their build farm, as we do, they would need to ensure the pdbsrv service gets started before any compilation operation runs, including after reboots, power outages, crashes and the like. I don't believe there is any way to achieve this using a Jenkins operation, so an external process would be needed, like the Scheduled Task idea I just mentioned. But then the external process would run independently of the Jenkins agent, making the two even more difficult to coordinate. For example, I suspect it would be difficult at best to make sure the scheduled task restarts the service at an opportune moment, when no compilation operations are happening on the agent. Just something else for those users to keep in mind.
PS: Sorry for the rant. My team and I have been aggravated for some time now, hoping this bug would be fixed so we can move off the old version of Jenkins we're currently stuck on and thus able to pick up some new bug fixes both in the core as well as in numerous plugins which only support newer versions. Hopefully I don't come across as overly adversarial.
Maybe there is a way to shut down mspdbsrv.exe softly, so that it stops only after all active requests (by parallel builds) are done. Then it would simply restart on the next request.
Another solution would be to allow the user to give a list of process names not to kill (or maybe hardcode not to kill mspdbsrv.exe).
Stopping after a timeout period once all active requests are done, and continuing to run when it gets a new request, is how mspdbsrv runs normally when nothing goes around killing it (as Jenkins does).
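To make that lifecycle concrete, here is a tiny Python model of the behaviour described above (purely illustrative; the real mspdbsrv logic is internal to the Microsoft toolchain):

```python
import time

class PdbServerModel:
    """Toy model of mspdbsrv's lifecycle: it serves requests and only
    shuts itself down after an idle period with no active clients."""

    def __init__(self, idle_timeout_s):
        self.idle_timeout_s = idle_timeout_s
        self.active_requests = 0
        self.last_activity = time.monotonic()

    def begin_request(self):
        self.active_requests += 1
        self.last_activity = time.monotonic()

    def end_request(self):
        self.active_requests -= 1
        self.last_activity = time.monotonic()

    def should_shut_down(self, now=None):
        # Never exit while a build still holds a request open; otherwise
        # exit only once the idle timeout has elapsed.
        if self.active_requests > 0:
            return False
        now = time.monotonic() if now is None else now
        return (now - self.last_activity) >= self.idle_timeout_s

srv = PdbServerModel(idle_timeout_s=600)  # default timeout is 10 minutes
srv.begin_request()
print(srv.should_shut_down())  # False: a build is still connected
srv.end_request()
print(srv.should_shut_down(now=time.monotonic() + 601))  # True: idle long enough
```

Killing the process from outside, as the ProcessTreeKiller does, bypasses the `active_requests > 0` guard entirely, which is exactly why still-connected compilers fail with C1090.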
I believe the correct solution is a whitelist.
Update
So, it turns out that setting up some kind of background process to spawn a copy of the pdbsrv process isn't going to work as expected. From what I can tell, Windows seems able to tell when a process has been launched from a system service, and it prevents processes spawned elsewhere from using those sub-processes. The particulars of my test case are as follows:
- Setup a small Python script that launches a copy of mspdbsrv.exe when called
- Setup a scheduled task in Windows to run the python script on boot
- Reboot the agent - confirm the mspdbsrv.exe process is running
- Trigger a compilation operation via the Jenkins dashboard
- A new, secondary copy of mspdbsrv.exe is spawned to serve the Jenkins agent. This sub-process is then terminated as per usual once the Jenkins build is complete.
I have confirmed that both the service that runs the Jenkins agent and the scheduled task use the same user profile and credentials and that both environments are using the same version of mspdbsrv.exe with the same set of command line parameters (ie: -start -spawn).
Looks like I have to head back to the drawing board.
Update
As a quick sanity check I decided to throw together an ad-hoc test configuration whereby I overload the BUILD_ID environment variable for one of my compilation jobs, just to see if one of the hacks proposed earlier would potentially work. Unfortunately it looks like this is not a robust solution either. I have confirmed that the solution does work in the trivial case, as in:
- Setup a job with a single shell operation as a build step, configured as follows:
- override the BUILD_ID env var with some arbitrary value
- call into MSBuild to perform the compilation
- run a build of the given job
- upon completion, confirm that the mspdbsrv.exe process is still running - TEST SUCCESSFUL
However, I've unfortunately found another case where this solution doesn't work. Apparently, if you manually kill the build while it is running, Jenkins still somehow manages to locate the orphaned pdbsrv process and kill it, despite the changes described above. So, to put it more clearly:
- Setup a job with a single shell operation as a build step, configured as follows:
- override the BUILD_ID env var with some arbitrary value
- call into MSBuild to perform the compilation
- run a build of the given job
- While the compilation operation is running, and you have confirmed the mspdbsrv.exe process has been launched, manually force the running build to terminate (i.e. by clicking the X icon next to the running build on the Jenkins dashboard)
- FAILURE - Jenkins still terminates the pdbsrv process
I have confirmed that the pdbsrv process does correctly inherit the overloaded BUILD_ID, yet Jenkins is somehow able to locate and terminate the process in this case. I suspect what may be happening in my test environment is that at the point at which I manually kill the build, Jenkins is still running one or more Visual Studio operations which have a direct link to the mspdbsrv.exe process, and thus it detects and kills it by recursively traversing the process tree, killing all running processes/threads tied to the agent at the time.
Either way, this example shows that even this 'hack' of overriding the BUILD_ID is fragile at best. It looks like we may have no choice but to wait for the 'whitelist' solution to be completed before we can consider upgrading our Jenkins instance.
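As background for why overriding BUILD_ID works at all: the ProcessTreeKiller tags each build with a cookie environment variable and later kills any process still carrying that cookie. A rough Python model of that matching (names and values here are illustrative, not Jenkins source):

```python
def processes_to_kill(processes, build_cookie):
    """Return the processes whose environment still carries the build's
    cookie; these are the ones the cleanup would terminate."""
    return [p for p in processes
            if p["env"].get("BUILD_ID") == build_cookie]

# Snapshot of a hypothetical process table at the end of a build:
procs = [
    {"name": "cl.exe",       "env": {"BUILD_ID": "2016-03-01_12-00-00"}},
    {"name": "mspdbsrv.exe", "env": {"BUILD_ID": "2016-03-01_12-00-00"}},
    {"name": "mspdbsrv.exe", "env": {"BUILD_ID": "DoNotKillMe"}},  # overridden
]

doomed = processes_to_kill(procs, "2016-03-01_12-00-00")
print([p["name"] for p in doomed])  # the overridden instance is not matched
```

The cookie match is only one kill path, though; the abort-time failure described above suggests a recursive walk of the live process tree catches the server anyway while a compiler that links to it is still running.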
Update
While reporting the issue in my last comment, I had the idea for a slight variation of the configuration described there, and it does appear to work in both use cases. The main modification I made was to split the build into two separate build steps:
- The first is a simple Windows command-line call which overrides BUILD_ID and then launches mspdbsrv.exe. Once this first step completes, Jenkins terminates the shell session linked to the pdbsrv process, thus decoupling it from the agent. Combined with the overloaded BUILD_ID environment variable, Jenkins can no longer track the process.
- The second is just another Windows shell session that then calls into msbuild to proceed with the build.
Theoretically even this solution could fall prey to the same problem I described in my previous comment; however, the execution time of the initial build step is negligible, so the window is highly unlikely to be hit in practice (i.e. a user would need to hit the kill button at just the fraction of a second it takes Jenkins to launch mspdbsrv.exe).
I'm not sure how easy this hack will be for us to roll out into production at the scale we need, but in case others find this tidbit of information helpful I thought I'd provide it here.
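The first of the two steps amounts to spawning mspdbsrv with a child environment whose BUILD_ID no longer matches the cookie Jenkins tracks. A minimal Python sketch of that environment handling (the actual spawn is shown only as a comment, since it is Windows-specific):

```python
import os

def detached_env(cookie_override="DoNotKillMe"):
    """Copy the current environment, replacing the BUILD_ID cookie so a
    process spawned with this environment is not matched by the cleanup."""
    env = dict(os.environ)
    env["BUILD_ID"] = cookie_override
    return env

env = detached_env()
# On a Windows agent, the launch step would then be roughly:
#   import subprocess
#   subprocess.Popen(["mspdbsrv.exe", "-start", "-spawn"], env=env)
print(env["BUILD_ID"])  # DoNotKillMe
```

The copy matters: mutating `os.environ` directly would also change the cookie of the build step itself, whereas the goal is to detach only the spawned server.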
Code changed in jenkins
User: Daniel Weber
Path:
core/src/main/java/hudson/util/ProcessKillingVeto.java
core/src/main/java/hudson/util/ProcessTree.java
test/src/test/java/hudson/util/ProcessTreeKillerTest.java
http://jenkins-ci.org/commit/jenkins/a220431770cfe716e4f69fd76a4a59bbb27aa045
Log:
JENKINS-9104 Add ProcessKillingVeto extension point
This allows extensions to veto killing of certain processes.
Issue 9104 is not yet solved by this, it is only part of the solution. The
rest should be taken care of in plugins.
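Reduced to its essence, an implementation of this extension point answers one question: given a process about to be killed, should it be spared? The real API is Java; the Python caricature below only illustrates the kind of check a plugin-side veto (such as the later MsBuildKillingVeto) might perform:

```python
import ntpath

def veto_kill(command_line):
    """Return True if the process should be spared by the build cleanup,
    deciding purely by executable name (a whitelist of one entry)."""
    if not command_line:
        return False
    exe = ntpath.basename(command_line[0]).lower()
    return exe == "mspdbsrv.exe"

print(veto_kill([r"C:\VS\Common7\IDE\mspdbsrv.exe", "-start", "-spawn"]))  # True
print(veto_kill([r"C:\VS\VC\bin\cl.exe", "/c", "foo.cpp"]))                # False
```

The core change only adds the hook; the actual "spare mspdbsrv" decision is deliberately left to plugins, as the commit message says.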
Code changed in jenkins
User: Daniel Beck
Path:
core/src/main/java/hudson/util/ProcessKillingVeto.java
core/src/main/java/hudson/util/ProcessTree.java
test/src/test/java/hudson/util/ProcessTreeKillerTest.java
http://jenkins-ci.org/commit/jenkins/9a047acd4b5a4e805cee7260f3d091405dc7b930
Log:
Merge pull request #1684 from DanielWeber/JENKINS-9104
JENKINS-9104 Add extension point that allows extensions to veto killing...
Compare: https://github.com/jenkinsci/jenkins/compare/3c785d5af0ad...9a047acd4b5a
Integrated in jenkins_main_trunk #4205
JENKINS-9104 Add ProcessKillingVeto extension point (Revision a220431770cfe716e4f69fd76a4a59bbb27aa045)
Result = UNSTABLE
daniel.weber.dev : a220431770cfe716e4f69fd76a4a59bbb27aa045
Files :
- core/src/main/java/hudson/util/ProcessKillingVeto.java
- core/src/main/java/hudson/util/ProcessTree.java
- test/src/test/java/hudson/util/ProcessTreeKillerTest.java
When you use the command-line switch /Z7, the debug info is stored in the object files and no server process is needed. This should also solve the problem.
How does the /Z7 flag affect performance? My impression is that the point of mspdbsrv.exe is to keep the data around for other builds to use, thus decreasing build times for subsequent builds.
It does not affect performance, only the size of the object files. With this option the debug information is stored in each object file instead of in one PDB; at link time, the debug information is written to a PDB file.
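For MSBuild-driven VC++ projects, the equivalent project setting is the debug information format. A sketch of the vcxproj fragment (property names as used by the VC++ build targets; verify against your toolset version):

```xml
<!-- Embed debug info in each .obj (/Z7, "C7 compatible") instead of a
     compiler-managed PDB (/Zi), so no mspdbsrv is needed at compile time. -->
<ClCompile>
  <DebugInformationFormat>OldStyle</DebugInformationFormat>
</ClCompile>
```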
Just wanted to note that this also occurs on my slave nodes, and each slave node has only one executor. So at first glance, since I'm not running concurrent builds on any individual slave node, this error occurring there doesn't seem to make any sense.
Code changed in jenkins
User: Daniel Weber
Path:
pom.xml
src/main/java/hudson/plugins/msbuild/MsBuildKillingVeto.java
src/test/java/hudson/plugins/msbuild/MsBuildKillingVetoTest.java
http://jenkins-ci.org/commit/msbuild-plugin/855a84479b64f32ceb30f73433858dfe2efb5e9f
Log:
[FIXED JENKINS-9104] Veto killing mspdbsrv.exe
Making use of the newly introduced ProcessKillingVeto extension point,
we now make sure that mspdbsrv.exe survives process killing during build
cleanup.
This requires a Jenkins version >= 1.625, the new extension point was
added there. I marked the extension as optional, so that the msbuild
plugin should still work with older Jenkins releases.
Code changed in jenkins
User: Gregory Boissinot
Path:
pom.xml
src/main/java/hudson/plugins/msbuild/MsBuildKillingVeto.java
src/test/java/hudson/plugins/msbuild/MsBuildKillingVetoTest.java
http://jenkins-ci.org/commit/msbuild-plugin/48084be76d434195c9e8b2ddc66f1fb5255a78de
Log:
Merge pull request #19 from DanielWeber/master
[FIXED JENKINS-9104] Veto killing mspdbsrv.exe
Compare: https://github.com/jenkinsci/msbuild-plugin/compare/98f71956d897...48084be76d43
Code changed in jenkins
User: Gregory Boissinot
Path:
pom.xml
src/main/java/hudson/plugins/msbuild/MsBuildKillingVeto.java
src/test/java/hudson/plugins/msbuild/MsBuildKillingVetoTest.java
http://jenkins-ci.org/commit/msbuild-plugin/b9a5b02117e0ee097aaf030ab2574daa3dcd217d
Log:
Revert "[FIXED JENKINS-9104] Veto killing mspdbsrv.exe"
Code changed in jenkins
User: Gregory Boissinot
Path:
pom.xml
src/main/java/hudson/plugins/msbuild/MsBuildKillingVeto.java
src/test/java/hudson/plugins/msbuild/MsBuildKillingVetoTest.java
http://jenkins-ci.org/commit/msbuild-plugin/031a05982b16e42cba5544c4ba9511515941c62f
Log:
Merge pull request #20 from jenkinsci/revert-19-master
Revert "[FIXED JENKINS-9104] Veto killing mspdbsrv.exe"
Compare: https://github.com/jenkinsci/msbuild-plugin/compare/48084be76d43...031a05982b16
> Revert "[FIXED JENKINS-9104] Veto killing mspdbsrv.exe"
I'm confused why has the code fix been reverted?
The reason I am looking at this again is that the BUILD_ID work around is no longer working for me.
Neither is the 1.25 msbuild plugin which is meant to have the fix in.
I upgraded from 1.595 to 1.645.
damiandixon: My changes have been reverted by accident, the msbuild plugin release 1.25 does not contain the change required to fix this issue.
There is a new PR reverting the revert: https://github.com/jenkinsci/msbuild-plugin/pull/21
This is still not resolved. We need an update of the msbuild-plugin, see PR https://github.com/jenkinsci/msbuild-plugin/pull/21
danielweber This issue is filed against the core component, and that change has been included a long time ago.
Is there a plan for Visual Studio builds not started by the msbuild-plugin, please?
I'm asking because our job configurations use a "Execute Windows batch command" build step rather than "Build a Visual Studio project or solution using MSBuild" build step (and our batch process is non-trivial).
akb The proposed MSBuild Plugin change only requires the plugin to be installed to be effective (assuming mspdbsrv.exe is what you don't want killed).
That's great - thank you very much for clarifying this, and for your efforts to fix the wider issue - I'm looking forward to having more projects and configurations built automatically in a timely fashion through judicious use of parallelization
akb Forwarding the praise to my (first)namesake danielweber who did all the work
danielbeck: Well, the core stuff is done. But from a user's perspective the issue still exists.
How can I get someone to merge the pending PR and create a release of the msbuild plugin?
What's happened to this fix? It sounds like it's ready to go. How can we get a new release of the plugin?
I tried parallel builds with MSBuild plugin 1.25 on top of Jenkins 1.580.1, but unfortunately I still get this error (fatal error C1090: PDB API call failed, error code '23'). Did I miss something?
When will you publish a new version of the plugin with the fix? It's been a month since you released the version with(out) the fix...
I'm in need of a fix for this too, it's consistently failing numerous jobs for me. Is there an old version of Jenkins to revert to that avoids this particular problem? I'm willing to go that route as a workaround.
So far this has made a pretty bad first impression for a team I set up a CI build for, who had never seen Jenkins before.
I'm using VS2010 devenv.exe to build the solution files.
Hello Jaime,
I found a solution.
I think it is a workaround, but it works for me.
I set an additional String parameter for every project.
Go to the Jenkins project, enable "This build is parameterized", and add a String parameter with Name "BUILD_ID" and Default Value "DoNotKillMe".
Stumbled upon this issue immediately after trying parallel builds. Been open for 5 years now, so I guess you can simply check for 'mspdbsrv.exe' and leave it alone? Please free us of our pain.
Somebody, publish the new version please. Apparently, the fix is already in the source code on GitHub. Can someone else (other than the maintainer) release the new version?
FWIW, we implemented a workaround to this issue that doesn't involve wiping out the BUILD_ID variable (as we need to use it). Having a release with the Veto would be better, but this avoids random crashes in the meantime.
Instead of allowing the MSBuild process to start the daemon itself, you start the daemon using an environment that you choose. MSBuild then just uses the instance you started rather than starting its own.
The PowerShell we use is as follows. Use the PowerShell plugin to run this as a step before the MSBuild plugin step (it could be translated to Windows batch too if you like).
# https://wiki.jenkins-ci.org/display/JENKINS/ProcessTreeKiller
$originalBuildID = $Env:BUILD_ID
$Env:BUILD_ID = "DoNotKillMe"
try {
    Start-Process mspdbsrv -ArgumentList '-start','-spawn' -NoNewWindow
} catch {}
$Env:BUILD_ID = $originalBuildID
msbuild-1.26 should contain the fix. Can we finally resolve this, or is something missing?
*sigh*
1.26 is tagged in GitHub but no artifacts are uploaded. Looks like a failed release. Sorry about that.
Note that MSBuild Plugin is almost certainly not currently maintained, as Gregory stopped working on his plugins, so if someone here wants to take over (danielweber perhaps?) that should be possible.
As a workaround I have created a Jenkins job that executes a Windows batch command on the Jenkins node where Visual Studio is installed.
The Jenkins job triggers the batch command once a day; this has worked in my environment for several years now.
The batch command looks like this:
set MSPDBSRV_EXE=mspdbsrv.exe
set MSPDBSRV_PATH=C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\IDE
set PATH=%MSPDBSRV_PATH%;%PATH%
set ORIG_BUILD_ID=%BUILD_ID%
set BUILD_ID=DoNotKillMe
echo stop mspdbsrv.exe
%MSPDBSRV_EXE% -stop
echo wait 7 sec
%windir%\system32\ping.exe -n 7 localhost> nul
echo restart mspdbsrv.exe with a shutdowntime of 25 hours
start /b %MSPDBSRV_EXE% -start -spawn -shutdowntime 90000
set BUILD_ID=%ORIG_BUILD_ID%
set ORIG_BUILD_ID=
exit 0
What the batch command does is:
stop the mspdbsrv.exe to free up resources
start mspdbsrv.exe with BUILD_ID=DoNotKillMe and a shutdowntime of 25 hours; this leaves the mspdbsrv process running without getting killed, and it runs for 25 hours so that other build jobs can use the already running process
What you may have to do is change the path to mspdbsrv: set MSPDBSRV_PATH=C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\IDE
Updating the msbuild plugin won't work in our situation. We run into this issue, but we don't have the plugin installed. Rather, the issue comes for us from the FinalBuilder scripts we run via Jenkins that call msbuild.
set the environment variable
_MSPDBSRV_ENDPOINT_=$JENKINS_COOKIE
(The variable starts and ends with a single '_')
This will lead to a separate instance of mspdbsrv being started.
mwinter69, thanks for the pointer.
We couldn't get it working with $JENKINS_COOKIE, but managed to correct it by adding the following property via EnvInject prior to kicking off the build:
_MSPDBSRV_ENDPOINT_=$BUILD_TAG
This resulted in a separate process being initiated for each build, and no conflicts/errors.
Edit: Correction due to formatting. Refer below
It is
_MSPDBSRV_ENDPOINT_
(with underscores), not MSPDBSRV_ENDPOINT.
I just realized it myself: it's a formatting issue. If you enclose the word in underscores it gets italicised and the underscores disappear.
We recently re-encountered this on our build network and I did some investigation, here's what I found:
- On the master node, the veto from MSBuild plugin works properly, I was able to confirm the log message show it.
- On a slave node, I do not see the log message from the veto. Instead I see a message that my process is being killed recursively (I was watching the process list to get the id during the build).
It appears that the veto logic doesn't execute on the slave nodes. Is there something special that has to be done in order for it to be detected and executed there? I don't understand enough about how the remoting logic in Jenkins operates to know the answer to this.
Most of the other work-arounds for this are ones that we cannot easily deploy in our environment. If this is truly the issue, does anyone have an idea what it would take to fix it and how long that would take to carry out?
I spent some more time chasing code and I have a suspicion as to the cause of the issue. In ProcessTree.java, there are two different functions that appear to need information from the master and yet operate in different manners
- getVeto() is how the whitelist extension is accessed to block the killing of a process. This function just reads the extension list as it exists locally; it makes no attempt to ask the master for any information.
- getKillers() is used to access the list of ProcessKillers if there are any classes implementing that extension point. This function gets the channel back to the master so it can ask for the master's list of classes implementing this extension.
I think that getVeto() needs to be implemented partly like getKillers(), so that it goes to the master for the list. The accessor may also belong in ProcessTree instead, so that it caches the data and doesn't go back to the master quite as often. With that, I think the veto logic would work properly on both a master and a slave. Unfortunately, this means a change to Jenkins core, and upgrading the full instance to fix the issue, rather than just a fix to the plugin itself.
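To illustrate the suspected asymmetry, here is a toy Python model (all names invented for illustration; the real logic lives in hudson.util.ProcessTree and the remoting channel):

```python
# Invented model: on an agent JVM the locally loaded extension list is empty;
# only a call routed back to the master can see plugin-provided extensions.
MASTER_EXTENSIONS = ["MsBuildKillingVeto"]
AGENT_LOCAL_EXTENSIONS = []

def vetoes_like_getVeto():
    # Consults only the local extension list, as getVeto() appears to do:
    # empty on an agent, so no veto ever fires there.
    return list(AGENT_LOCAL_EXTENSIONS)

def killers_like_getKillers(ask_master):
    # Routes the lookup back to the master, as getKillers() does over the
    # channel; the callable stands in for that remote call.
    return ask_master()

print(vetoes_like_getVeto())                               # []
print(killers_like_getKillers(lambda: MASTER_EXTENSIONS))  # ['MsBuildKillingVeto']
```

If this model matches reality, it would explain exactly the observation above: the veto log message appears on the master but never on a slave.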
Is there any workaround for this issue? It completely breaks our usage of Jenkins.
Hi grillba, thanks a lot for your suggestion. It seems that this solved our issues.
Little side note: it might not be sufficient to just specify the _MSPDBSRV_ENDPOINT_ environment variable in order to avoid conflicts. I recommend additionally setting TMP, TEMP and TEMPDIR to an isolated folder if you plan on invoking MSBuild in parallel, as various plugins for MSBuild, as well as MSBuild itself, will place files there.
A further catch of using _MSPDBSRV_ENDPOINT_ is that serialization of parallel builds in the same working directory will now break in return, unless you have made sure that the temporary files for the different architectures are completely isolated as well (e.g. the temporary program database created alongside the individual object files, commonly named just "Debug\vc120.pdb"; notice the lack of an architecture prefix). Otherwise the different mspdbsrv instances will now collide when accessing the same file.
grillba, walteste Hi there, we've got this issue too, and we followed your suggestions to configure the master Jenkins node like this:
Configure system > Environment variables > Add new key value pair below:
KEY: _MSPDBSRV_ENDPOINT_
VALUE: $BUILD_TAG
But it had no effect; the error was still raised on the Windows slave. Could you please explain the solution in more detail? Should we set this key-value pair on the slave node? Thanks in advance.
@billhoo,
You need to do it at the Job level - Not the system level. Use envinject to add the environment variable
Have a look here for how to use envinject, https://wiki.jenkins.io/display/JENKINS/EnvInject+Plugin
Make sure you follow the "Inject variables as a build step" topic
Regards
Mark
Thanks for the timely reply. We've followed your guide and found that there were already 3 separate mspdbsrv.exe processes running in the background (for test purposes, we ran 3 jobs concurrently on one Windows slave), so it seems to have worked. Unfortunately, one of our jobs still failed due to the C1090 error.
This is the screenshot of EnvInject in each of our 3 Pipeline jobs configuration page,
I don't think there's anything wrong here, do I miss something?
Thanks,
Bill.
Just in case this helps anyone, I was able to fix all problems mentioned so far in this issue and comments by following the recommendations on this blog post:
http://blog.peter-b.co.uk/2017/02/stop-mspdbsrv-from-breaking-ci-build.html
The solution involves
1. Installing the MSBuild plugin version 1.26 or higher in Jenkins. It only needs to be installed; configuring it for use on the server is optional. This stops Jenkins from killing the mspdbsrv process automatically.
2. Using the _MSPDBSRV_ENDPOINT_ environment variable as done in the comment above.
3. Spawning a new mspdbsrv instance of the right Visual Studio version at the beginning of each job which uses it, and killing that specific instance at the end.
Powershell implementation of the Python solution in the blog (change VS140COMNTOOLS to the version of Visual Studio being used):
# Manually start mspdbsrv so a parallel job's instance isn't used; this works
# because _MSPDBSRV_ENDPOINT_ is set to a unique value (otherwise the result is
# "fatal error C1090: PDB API call failed, error code '23'" when one of the
# builds completes).
$mspdbsrv_proc = Start-Process -FilePath "${env:VS140COMNTOOLS}\..\IDE\mspdbsrv.exe" -ArgumentList ('-start','-shutdowntime','-1') -PassThru

.\{PowershellBuildScriptName}.ps1

# Manually kill mspdbsrv once the build completes, using the previously saved process id
Stop-Process $mspdbsrv_proc.Id
I had the same problem with parallel builds (e.g. running job A from trunk and job A from a branch in parallel). I tried the solution with _MSPDBSRV_ENDPOINT_ set to BUILD_TAG and it worked for almost all jobs. In one situation I still had the error, so I replaced BUILD_TAG with the JOB_NAME environment variable and suddenly it was fine; for now we are out of problems. If anyone still has the problem with the ENDPOINT solution, try changing BUILD_TAG to something else. If you do not allow parallel builds within a single job, JOB_NAME should be enough; otherwise you can try a JOB_NAME + BUILD_NUMBER combination.
Maybe the ENDPOINT value has some restrictions, but I did not have time to inspect this more deeply. What I know is that the problematic job has the longest name in my Jenkins, approx. 48 characters.
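If the endpoint value really is length-limited (that is only a guess, as above), one defensive approach is to hash the long identifiers down to a short fixed-size token before putting it in _MSPDBSRV_ENDPOINT_. A sketch, using the standard Jenkins build environment variables:

```python
import hashlib
import os

def short_endpoint(*parts, length=16):
    """Collapse arbitrarily long job identifiers into a fixed-length token
    suitable for use as a _MSPDBSRV_ENDPOINT_ value."""
    digest = hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()
    return digest[:length]

token = short_endpoint(os.environ.get("JOB_NAME", "some-job"),
                       os.environ.get("BUILD_NUMBER", "1"))
print(len(token))  # 16, regardless of how long the job name is
```

The token stays unique per job/build combination (up to hash collisions) while keeping the endpoint name short.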
Please can anyone advise me how to set _MSPDBSRV_ENDPOINT_ to the value of BUILD_TAG in a declarative pipeline script?
I don't really understand the difference between defining and injecting an environment variable. I could do:
stage('build_VisualStudio') {
    environment { _MSPDBSRV_ENDPOINT_ = "${BUILD_TAG}" }
    etc.
Would that be sufficient or must environment variable injection be done in a different way?
Code changed in jenkins
User: Daniel Beck
Path:
content/_data/changelogs/weekly.yml
http://jenkins-ci.org/commit/jenkins.io/0391fcb9b4c957e9e41fde03409de330a3de571d
Log:
Remove JENKINS-9104 fix from release to unblock it
Code changed in jenkins
User: Daniel Beck
Path:
content/_data/changelogs/weekly.yml
http://jenkins-ci.org/commit/jenkins.io/62409d42a5769cac66337cbd4b5df5754f0e2384
Log:
Merge pull request #1522 from daniel-beck/changelog-2.119-amended
Remove JENKINS-9104 fix from release to unblock it
Compare: https://github.com/jenkins-infra/jenkins.io/compare/58f029c79331...62409d42a576
I set the severity to minor because an easy workaround is available. What is the reason you can't use the "BUILD_ID=dontKillMe" environment variable? This disables the process killer for the job (or globally, if set for the whole Jenkins instance). Generally I think the process killer is a good thing, but normally it shouldn't be needed.