Bug
Resolution: Fixed
Major
jenkins version 2.49
Blue Ocean version 1.0.0-b25
Java version 1.8
1.0, Blue Ocean 1.0-rc3, Blue Ocean - 1.1-beta-1
I noticed that during a pipeline stage the build time was reporting an impossible time. After the stage finished, the time corrected itself. Screenshots attached.
- Capture.PNG (5 kB)
- Capture2.PNG (7 kB)
- localhost.har (1.07 MB)
- 12.xml (0.7 kB)
[JENKINS-42636] During pipeline step duration reported incorrectly
jamesdumay I tried fetching the 'General SCM' step through the steps API, and it returns the correct duration. While the General SCM step is in progress, the API reports the step as running/unknown, and each fetch shows a correctly incrementing duration. Once the step has finished with success, the number no longer changes. Someone from the UI side should take a look to see whether some logic is failing to update the duration, because it doesn't happen in all cases.
It would also help to capture the steps API output at the moment the UI shows such an ever-increasing duration. I don't see this behavior in the API.
vivek can you please check again? I used https://github.com/i386/app-store-demo and was able to reproduce it on almost every second run. I would like to know for certain that this isn't a backend issue.
svanoort do you know of any conditions where steps could have really huge durations?
jamesdumay I can't get it to reproduce locally. If you could capture the network calls, it would help. Or maybe we could screen share sometime today?
Steps could get very long durations if they have been running a long time, or if time zones change (perhaps an NTP update on a freshly started system) while the build is running.
It would need the API call result to troubleshoot – that will demonstrate if the correct data is being returned. 50/50 chance on it being front-end vs backend – caching, delayed updates, or something anomalous preventing detection of a stage block could be at fault.
jamesdumay thanks for the HAR file. It shows that a very large duration is returned in one of the step API calls.
[
  {
    "_class": "io.jenkins.blueocean.rest.impl.pipeline.PipelineStepImpl",
    "_links": {
      "self": {
        "_class": "io.jenkins.blueocean.rest.hal.Link",
        "href": "/blue/rest/organizations/jenkins/pipelines/App Store/branches/master/runs/37/nodes/11/steps/7/"
      },
      "actions": {
        "_class": "io.jenkins.blueocean.rest.hal.Link",
        "href": "/blue/rest/organizations/jenkins/pipelines/App Store/branches/master/runs/37/nodes/11/steps/7/actions/"
      }
    },
    "displayName": "General SCM",
    "durationInMillis": 3813,
    "id": "7",
    "input": null,
    "result": "SUCCESS",
    "startTime": "2017-03-15T09:27:19.799+1100",
    "state": "FINISHED"
  },
  {
    "_class": "io.jenkins.blueocean.rest.impl.pipeline.PipelineStepImpl",
    "_links": {
      "self": {
        "_class": "io.jenkins.blueocean.rest.hal.Link",
        "href": "/blue/rest/organizations/jenkins/pipelines/App Store/branches/master/runs/37/nodes/11/steps/12/"
      },
      "actions": {
        "_class": "io.jenkins.blueocean.rest.hal.Link",
        "href": "/blue/rest/organizations/jenkins/pipelines/App Store/branches/master/runs/37/nodes/11/steps/12/actions/"
      }
    },
    "displayName": "Shell Script",
    "durationInMillis": 1489530443681,
    "id": "12",
    "input": null,
    "result": "UNKNOWN",
    "startTime": "1970-01-01T10:00:00.000+1000",
    "state": "RUNNING"
  }
]
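As a quick sanity check (not part of the original report), the suspicious durationInMillis of step 12 can be read as milliseconds since the Unix epoch. The class name EpochCheck is made up for this sketch; only standard java.time calls are used.

```java
import java.time.Instant;
import java.time.ZoneOffset;

// Hypothetical helper, not Blue Ocean code: reads a "duration" as an epoch timestamp.
class EpochCheck {
    static Instant asInstant(long millis) {
        return Instant.ofEpochMilli(millis);
    }

    public static void main(String[] args) {
        long reported = 1489530443681L; // durationInMillis of step 12 in the response
        Instant t = asInstant(reported);
        System.out.println(t);                                   // 2017-03-14T22:27:23.681Z
        System.out.println(t.atOffset(ZoneOffset.ofHours(11)));  // 2017-03-15T09:27:23.681+11:00
    }
}
```

The result falls 3.882 seconds after step 7's startTime of 09:27:19.799+1100, and step 12's own startTime is exactly the epoch (1970-01-01T10:00:00.000+1000). Both are consistent with the "duration" being the current time minus a start time of 0.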
svanoort For each step's timing information, I am calling StatusAndTiming.computeChunkTiming(). In this case, timingInfo.getTotalDurationMillis() is very large, 1489530443681.
The root cause is that TimingAction.getStartTime(firstNode) returns 0: in certain cases flowNode.getPersistentAction(TimingAction.class) returns null, which results in the duration being computed as System.currentTimeMillis().
I do not have a unit test, but it's happening on this step in this Jenkinsfile. The `General SCM` step runs before this step; it is an implicit step executed by the declarative plugin.
vivek Why is the first node showing up with a null start time? (Specifically, do you have the node XML file?)
The only nodes that should ever have a null start time are:
(a) FlowStartNodes and
(b) Nodes that have been freshly generated and are about to begin execution, which is to say that the start time reflects the current system time, which is what this logic is intended to handle
Anything else points to an error deeper in the stack, probably in serialization/deserialization.
> Why is the first node showing up with a null start time? (Specifically, do you have the node XML file?)
firstNode in this case is the node with id 12, so I suspect reason (b). flowNode.getPersistentAction(TimingAction.class) is returning null, so maybe it's too early in the flow node's generation: I guess the flow node's TimingAction has not been persisted yet at that moment. See the attached 12.xml.
Besides the question of why flowNode.getPersistentAction() is returning null for a StepAtomNode, there is a bug in the duration computation: if startTime is 0, then the duration should be 0 as well. What's happening here, based on what I see, is:
- endTime is set to System.currentTimeMillis()
- startTime is 0 because flowNode.getPersistentAction(TimingAction.class) returns null.
- Duration (endTime - startTime) is then System.currentTimeMillis()
This is wrong. It should be zero because this StepAtomNode has not been started yet. So maybe we fix it there, or ensure that the start time derived from flowNode.getPersistentAction() is always > 0?
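The arithmetic in the list above, and the proposed "duration is 0 when startTime is 0" fix, can be sketched as follows. This is a minimal illustration, not the actual Blue Ocean or pipeline plugin code; the class and method names (StepDuration, naive, guarded) are hypothetical.

```java
// Hypothetical sketch of the duration arithmetic; not the real plugin code.
class StepDuration {
    /**
     * Mirrors the reported behavior: endTime - startTime with no guard.
     * startMillis is 0 when no TimingAction was found on the node.
     */
    static long naive(long startMillis, long nowMillis) {
        return nowMillis - startMillis; // start == 0 yields the full epoch time
    }

    /** Proposed guard: a step with no recorded start has a duration of 0. */
    static long guarded(long startMillis, long nowMillis) {
        if (startMillis <= 0) {
            return 0L; // no TimingAction yet, so the step has not measurably started
        }
        return Math.max(0L, nowMillis - startMillis);
    }
}
```

With a start time of 0 and the current time 1489530443681, naive returns 1489530443681 (the value seen in the API response), while guarded returns 0.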
> (a) FlowStartNodes and
In this case it's a StepAtomNode.
vivek That node should not have a null start time, period, if execution has begun. We need to solve how it managed to be executed with no TimingAction (based on the XML) because that points to a significant issue in workflow-job plugin, or workflow-cps.
Is this with the latest pipeline versions, and do you have a reproducible test case that will trigger this?
svanoort this was discovered using all the latest Pipeline dependencies. The easiest way to reproduce this is to run Blue Ocean from your IDE and repeatedly build the app-store-demo project. I seem to be able to reproduce it 1 out of 3 runs, but it seems to be timing-sensitive. We've had a few reports of this out in the wild and it has annoyed people, so we see this as important to fix.
jamesdumay Does it really make sense to class it as a "major" issue, if it's only rarely reproducible, in one environment, for a brief period? People will get annoyed about anything, but this doesn't break functionality; it only shows something bogus for a brief period.
svanoort I have seen a few people bring it up, but not in a major way. It happens, I would say, "often"; it's more that people don't often see that screen while it is happening.
fwiw, I've seen this pop up a few times myself. jamesdumay Do you by any chance have the build directory for one of those runs where this is happening?
abayer I am able to make this happen using the app-store-demo project as mentioned in vivek's comment
It takes a few goes to do it. Seems to happen more frequently if I run Blue Ocean via hpi:run
jamesdumay But you don't have a zip file of a build directory when this is happening, right? =) Just wanna avoid duplicating work if I can!
Ok, I've finally reproduced it. For what it's worth, it's not even the duration of the step, let alone the stage, that the TimingAction is missing - it's <2 seconds total. I was copying the build directory aside every second. One copy, before 12.xml was written (i.e., before the docker pull started), all was fine. Next copy, one second later, 12.xml was there, 12.log was completely empty, and 12.xml didn't have a TimingAction. And one second after that, 12.xml was still the latest, 12.log had some output in it, and 12.xml did have a TimingAction. So the shell step calling docker pull was still running but the node now had a proper TimingAction. While the time-since-epoch duration seems to stick around in the UI after that, that's something in the UI not refreshing it from the API, because a reload of the page gets the right duration at that point.
I'll defer to svanoort if he's got a better sense of what might be happening here, but I'd guess it's just that there's a tiny tiny tiny window between when the FlowNode is created and first saved and when CpsFlowExecution.notifyListeners gets called, which leads to the actual adding of the TimingAction.
So frankly? I think Blue Ocean should just be a little smarter and either not display a duration at all if it's expecting a TimingAction but gets null, or it should somehow decide what an appropriate analogue for the start time would be. This tiny gap between the FlowNode being initially written and the TimingAction being added is, IMO, entirely reasonable.
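The "don't display a duration at all" option could look roughly like the sketch below. It assumes the start time is passed as a nullable Long that is null when the node has no TimingAction yet; that representation, and the names DurationDisplay, durationMillis and label, are hypothetical and not taken from Blue Ocean.

```java
import java.util.OptionalLong;

// Hypothetical sketch: omit the duration when the start time is unknown.
class DurationDisplay {
    /** startMillis is null (or 0) when no TimingAction exists on the node yet. */
    static OptionalLong durationMillis(Long startMillis, long nowMillis) {
        if (startMillis == null || startMillis <= 0) {
            return OptionalLong.empty(); // nothing sensible to report yet
        }
        return OptionalLong.of(Math.max(0L, nowMillis - startMillis));
    }

    /** Renders whole seconds, or a placeholder when the duration is unknown. */
    static String label(OptionalLong duration) {
        return duration.isPresent() ? (duration.getAsLong() / 1000) + "s" : "-";
    }
}
```

Returning an empty value rather than 0 lets the UI distinguish "not started yet" from "finished instantly", at the cost of every caller having to handle the missing case.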
fwiw, added jenkins-42636-build-dirs.zip as an attachment. It's got the three separated-by-1-second copies of the build directory I mentioned in the above comment.
PR up at https://github.com/jenkinsci/blueocean-plugin/pull/947 - not sure off the top of my head the right way to test it, since reproducing the issue is inconsistent and timing-based anyway.
Vivek and I suspect this is a backend issue. I've seen this live on blueocean.io