-
Bug
-
Resolution: Cannot Reproduce
-
Critical
-
Windows 10 x64
java version "1.8.0_71"
2.0-beta-1
-
Powered by SuggestiMate
Start Jenkins with java -jar jenkins-2.0-beta-1.jar on windows
In the console hit ctrl+c
expected behaviour
Jenkins does a graceful shutdown and this is appropriately logged
actual behaviour
it appears as though Jenkins is brutally terminated and does not do a graceful shutdown.
There are no logs to indicate a graceful termination
In 1.x I see the following logs
Mar 31, 2016 12:02:39 PM winstone.Logger logInternal INFO: JVM is terminating. Shutting down Winstone
in 2.x I do not see those entries - and there is nothing to indicate in the log that Jenkins is shutting down gracefully
- is related to
-
JENKINS-30909 queue.xml only gets persisted on successful shutdown
-
- Resolved
-
-
JENKINS-34281 Queue isn't saved on any shutdown
-
- Resolved
-
[JENKINS-33926] Jenkins no longer appears to shutdown correctly
Critical to investigate for RC. If this is only a problem for interactive runs, we could probably live with it for GA, but if this is affecting production instances running as services, that would be really nasty.
Same issue for me on OS X so does not seem limited to Windows.
If this means that e.g. service restarts don't do a clean restart, and e.g. lose the queue contents, that would be bad.
If it's just a case of Jetty 9 no longer logging this, we'll survive
While investigating, I didn't see any error show up consistently before missing the log statement. I didn't see this occur as often on 2.0 before clicking the "Install New Features". Luckily it doesn't seem to be disruptive to my Jenkins, so that's a plus. Trying to see if there's a problem after "Installing new features" or running it under heavy stress.
Would be interesting to know whether the shutdown procedure (that includes saving the queue to disk) gets executed every time, log statement or not.
Right, the investigation needs to prove/disprove if the shutdown sequence is consistently running as expected. Are there some clear signatures to this that Kristin could rely on? Examples: Some expected file update times change, a debug message that a Servlet.destroy got called, etc.
swashbuck1r queue.xml gets written (possibly only when there are items in the queue, not sure).
I think this (JENKINS-30909) is similar since it's referring to a missing queue log.
In testing where I physically go through the whole computer shutdown process, it appears that things run properly. (I find a properly timestamped queue.xml.bak file) When I just stop the process from the command line, I don't see any queue.xml file. I assume this case falls under this defect.
So far, the behavior I've seen echos this work item. I'll have to look to a different mechanism (James Nord brought up TermMilestone) to see if there's another issue.
After installing Jenkins 2.0 as a Windows service and running a heavy mock load, I'm still seeing the queue.xml file appear. Since the service is able to start up earlier than I can access my file system, I know this file is created because of queue.xml.bak which isn't removed & contains the same information.
The logs left in jenkins.out.log and jenkins.err.log don't have any messages for shutting down, but I have a feeling they're overwritten on service restart, so they're likely not useful. Checking the log on the Jenkins instance also only covers the current run.
I linked to JENKINS-30909 to highlight the behavior I'm seeing as part of restarting/destroying-recreating the service. Nothing really appears to be out of the ordinary.
From what I can tell during these tests, I don't think this is a blocker issue for 2.0 release.
When I just stop the process from the command line, I don't see any queue.xml file. I assume this case falls under this defect.
Would be interesting to know whether a Linux service behaves differently in shutting down. Not sure which signals each sends. Loss of the queue on every shutdown would be a regression.
kwhetstoneno need to restart the OS, just start stop the service either via the SCM or sc stop jenkins so you always have access to the filesystem
So I ran through a number of scenarios on Windows and Linux on 2.0 and 1.X and it appears that for Windows we never actually save the queue when we stop the service. Linux is fine; I started and stopped the service while every executor is used, and everything is saved and restarted. I think this might be a difference in the command signals sent to the service when stopping in Windows vs Linux, but that's just a theory at this point.
As to other shutdown actions taken when the system is shutting down, since I'm not able to replicate the initial condition, all of my exit conditions (TermMilestones, etc) are hit. Other than this queue conditional, I don't see anything odd with the service.
Summary of test results from Kristin:
- Does the message get printed on shutdown? (Windows: 100% appears consistently, Linux: 100% appears consistently)
- Does the shutdown sequence seem to complete?
- Marker:queue.xml gets saved (yes on Linux, yes on Windows cmdline, no on Windows service). No on windows service seems consistent with 1.x line of Jenkins.
- Marker:plugin shutdown() is being called (yes on Linux and Windows)
Decision [Daniel, Kristin]: At this time, there's no indication of a regression on beta2 vs 1.x, so this is not a stop-ship
Next steps:
- Daniel will add testcase for this for the RC and future LTSs.
- At this time, this issues appears to be not reproducible...Daniel will close.
- Kristin will open another ticket about the problem where queue.xml doesn't get saved via Windows Service shutdown.
Resolving as Cannot Reproduce for now. I'll add test cases to RC/LTS testing plans so we make sure to check this for the 2.0 RC.
While investigating, I didn't see any error show up consistently before missing the log statement. I didn't see this occur as often on 2.0 before clicking the "Install New Features". Luckily it doesn't seem to be disruptive to my Jenkins, so that's a plus.
Seems to imply it is reproduceable from the command line?
Anyway I can reproduce it so you can always assign it to me
100% reproducible with the 2.0-rc WAR running on Mac.
Test case:
- Create a pipeline job
node {
echo 'stuffs'
sleep 100
stage 'Stage 2'
}
- Create a freestyle job, set to allow concurrent builds, with the following shell step:
echo 'I do stuff'
sleep 100
- Start up several builds of the pipeline job, to consume executors.
- Queue up the freestyle project
- Shutdown jenkins via Ctrl+C
Result:
- Command line reports winstone/jetty shutting down
- queue.xml & queue.xml.bak files are created but contain no information about the queued build or builds. Pipelines that were running will restart however.
- if Jenkins is started again, build will not run from queue. Indeed it is as if the freestyle project was never queued: new builds will use its build #, no record it ever existed.
Reopening because it exists in a painful and reproducible way (see previous comments); the entire build queue is lost on shutdown.
James, I'm sending it over to you since I couldn't reproduce it on beta while Ctrl-C or while stopping the service. I'm going to see if I can reproduce it on the RC, but I must not be going through the correct steps or something.
Note: the part about losing the job queue when restarted as a Windows service looks like an older problem. I definitely think that this should be a bug if it's not already.
Sam, it sounds like your defect is something different than what the original problem covered by this work item. While the queue is saved, it's not saving everything. That might be a bug in the queue.save(). In the Windows version it's not getting saved at all during Ctrl-C. We should open a ticket for that problem so it could be fixed.
Yeah, I can't tell for sure if it's the same issue, something similar, or these are all different manifestations of the same underlying cause.
The previous issue with losing the build queue apparently had some sort of handling with nextBuildNumber that resolved it (I think?). This one seems to be some combination of shutdown logging & queue writing, which could also be two separate issues as well.
To me, the file showing up in the first place and that it's successfully read on startup means the file was created correctly (there's no missing information when being written and it's not formatted incorrectly) it might just be incomplete. Obviously I'm making some assumptions on the queue behavior, and spent some time trying to go through where the queue was written so figure out how it came into being. Apparently there are 2 types of shutdown: one that writes the queue and one that's "quick" not counting the error case, the secret not-so-fun shutdown. From what I understand the error case just completely tanks jetty, et al. If it's able to save state enough to pick back up on a restart, I think that's slightly better than not having anything at all.
I'd check out if the queue being incomplete issue exists on the latest 1.x release. That will at least give some context into whether it's something new or not. Either way it's not good, but at least it's not worse!
Oh! I'd tested on CJE 1.625.16.1 - the queue is maintained when killed by Ctrl+C. Just rechecked, and that is also true on 1.642.x - so it's clearly a regression since then.
To be clear: we're not saving state to pick back up on restart at all. The queue.xml is empty of items, this is one sample:
<?xml version='1.0' encoding='UTF-8'?> <hudson.model.Queue_-State> <counter>42</counter> <items/> </hudson.model.Queue_-State>
Cool, so that's actually a regression from beta2 where that information was saved. I think that's a completely separate defect since this jetty issue happened before then. Good find during TestFest!
Not saving the queue on windows was fixed recently on master.
I'm unassigning from myself for the moment as I'm not looking at this. If I get to look at it and it's not taken I will re-assign it.
Closing this since Kristin opened https://issues.jenkins-ci.org/browse/JENKINS-34281 which looks to be a completely separate issue.
whilst this sometimes occurs in 1.x testing (<10% of the time) during my 2.0 testing I get it > 90% of the time in 2.x implying something is worse somewhere.