Jenkins / JENKINS-24752

NodeProvisioner algorithm is suboptimal for fast one-off cloud


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Component/s: core
    • Labels:
      None
    • Released As:
      Jenkins 2.185

      Description

      NodeProvisioner takes, at a minimum, time on the order of nodeProvisioningTime + nodeProvisionerRunInterval to react to a newly submitted item in the queue.

      For clouds that are fast, such as the docker plugin or JOCbC, nodeProvisioningTime can be 1 second or even less, while nodeProvisionerRunInterval is currently 10 seconds. Most of the wait is therefore the polling interval, not the provisioning itself.

      The problem does get masked if the retention strategy of the provisioned slaves keeps them around, since in that case the steady state involves no node provisioning or deprovisioning, and the NodeProvisioner becomes irrelevant.

      But if the retention strategy is one-off, like docker or JOCbC, then the effect of NodeProvisioner remains visible in the steady state.

      Jesse suggested that perhaps NodeProvisioner should be invoked out of cycle when jobs are submitted to the queue, which removes the nodeProvisionerRunInterval term from the above equation.
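
      The latency argument can be made concrete with a toy model (plain Java, not Jenkins code; the method names are illustrative). Plugging in the values from this description, nodeProvisioningTime = 1 s and nodeProvisionerRunInterval = 10 s, an item waits about 6 s on average under periodic review, versus about 1 s if the queue is reviewed on submission as suggested:

```java
// Toy model (not Jenkins code): average queue wait for a fast one-off cloud,
// comparing the periodic provisioner pass with an event-driven review.
public class ProvisionerLatency {
    // An item submitted at a random moment waits runInterval/2 on average
    // for the next provisioner pass, then provisioningTime for its node.
    static double avgWaitPeriodic(double provisioningTime, double runInterval) {
        return runInterval / 2 + provisioningTime;
    }

    // Reviewing the queue on submission removes the polling term entirely.
    static double avgWaitEventDriven(double provisioningTime) {
        return provisioningTime;
    }

    public static void main(String[] args) {
        System.out.println(avgWaitPeriodic(1.0, 10.0));  // prints 6.0
        System.out.println(avgWaitEventDriven(1.0));     // prints 1.0
    }
}
```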

      (03:48:56 PM) jglick: Nope, similar behavior. kohsuke: even in current trunk NodeProvisioner fails to provision as many one-shot cloud nodes as would be needed to handle the queue, even if you let things run for a long time with constant load. Is this really “as designed”?
      (03:49:31 PM) kohsuke: Not sure if I entirely follow, but ...
      (03:49:52 PM) kohsuke: in a steady state the length of the queue will approach zero
      (03:50:05 PM) jglick: Well you would expect that if there is a Cloud which always agrees to provision new nodes, then the queue would be close to empty.
      (03:50:10 PM) jglick: But in fact it stays full.
      (03:50:39 PM) jglick: Is there some minimum job duration below which this invariant does not hold?
      (03:50:50 PM) kohsuke: I thought the algorithm was basically try to keep "queueLength - nodesBeingProvisioned - idleExecutors" zero
      (03:51:04 PM) kohsuke: where each term is conservatively estimated between a point-in-time snapshot and moving average
      (03:51:30 PM) kohsuke: So in your steady state I'm curious which of the 3 terms is incorrectly computed
      (03:51:39 PM) jglick: For me this expression is hovering around 8, where I have 11 distinct queue items.
      (03:52:13 PM) jglick: Nothing is idle, sometimes one or two slaves are being provisioned.
      (03:52:33 PM) jglick: Queue length typically 8 or 9.
      (03:52:42 PM) kohsuke: Are you running this with debugger?
      (03:53:08 PM) jglick: Yes.
      (03:54:03 PM) jglick: My jobs are mostly short—from 10 to 60 seconds. Maybe that is the explanation?
      (03:54:05 PM) kohsuke: So we want to see 3 local variables the 'idle', 'qlen', and 'plannedCapacity'
      (03:54:20 PM) kohsuke: ... in the NodeProvisioner.update()
      (03:54:43 PM) jglick: Alright, will look into it.
      (03:55:16 PM) kohsuke: that corresponds to 3 terms above, and you see that "conservative estimate" is implemented as max/min op
      (03:55:22 PM) kohsuke: between moving average and snapshot value
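
      [Editor's note: the computation kohsuke describes above can be sketched as follows. This is one plausible reading of the algorithm, not the actual NodeProvisioner.update() source; the min/max choices are my interpretation of the "conservative estimate" he mentions.]

```java
// Toy reading of the excess-workload computation: demand is under-estimated
// (min of moving average and snapshot) and spare capacity over-estimated
// (max), so a momentary spike alone never triggers provisioning.
public class ExcessWorkload {
    static float compute(float qlenAvg, float qlenSnapshot,
                         float idleAvg, float idleSnapshot,
                         float plannedCapacity) {
        float qlen = Math.min(qlenAvg, qlenSnapshot); // under-estimate demand
        float idle = Math.max(idleAvg, idleSnapshot); // over-estimate idle executors
        return qlen - plannedCapacity - idle;         // provision if above some margin
    }

    public static void main(String[] args) {
        // Roughly the values from Jesse's log line further down:
        // Qlen=4.0, idle moving average 0.083 vs snapshot 0, planned capacity ~0
        System.out.println(compute(4.0f, 4.0f, 0.083f, 0f, 0f)); // ~3.917
    }
}
```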
      (04:00:04 PM) jglick: Ah, do not need debugger at all, just logging on NodeProvisioner, which I already have…
      (04:00:24 PM) jglick: Excess workload 4.0 detected. (planned capacity=1.0547044E-9,Qlen=4.0,idle=0.083182335&0,total=16m,=0.1000061)
      (04:00:45 PM) kohsuke: So it should be provisioning 4 more node
      (04:01:08 PM) jglick: And it does, at that time.
      (04:01:19 PM) kohsuke: OK, so we are waiting for it to hit the steady state?
      (04:02:31 PM) jglick: Well, this *is* the steady state.
      (04:03:17 PM) kohsuke: jglick: NodeProvisioner is detecting need for 4 more nodes, so I think it's doing the right computation
      (04:03:36 PM) kohsuke: The question is why is your Cloud impl not provisioning?
      (04:03:39 PM) jglick: Excess workload 9.32492 detected. (planned capacity=1.5377589E-9,Qlen=9.32492,idle=0.115431786&0,total=0m,=0.5)
      (04:04:02 PM) jglick: It did provision. But NodeProvisioner did not keep up with more stuff in the queue. Maybe it just has a limited capacity for quick builds.
      (04:04:40 PM) kohsuke: plannedCapacity is near 0, which means it thinks there's nothing being provisioned right now
      (04:05:35 PM) kohsuke: Maybe your steady state involves nodes being taken down?
      (04:05:52 PM) jglick: Yes, one-shot slaves.
      (04:05:52 PM) kohsuke: one-off slave?
      (04:05:55 PM) kohsuke: ah
      (04:06:03 PM) kohsuke: Well that explains everything
      (04:06:05 PM) jglick: That is what I am trying to test the behavior of.
      (04:06:29 PM) kohsuke: Now I understand what stephenc was talking about
      (04:07:13 PM) kohsuke: Yeah, this algorithm doesn't work well at all for one-off slaves
      (04:07:34 PM) jglick: There is also another apparent bug that if the slave claims !acceptingTasks, NP fails to treat that as a signal to create a new slave. So then the behavior is even worse. Docker plugin has this issue I think. But the retention strategy I am testing with now just terminates the slave at the end of the build so that is not the issue.
      (04:07:55 PM) jglick: I tried calling suggestReviewNow but it does not help.
      (04:08:28 PM) kohsuke: IIUC, with one-off slaves your steady state is that everyone waits in the queue in the order of nodeProvisioningTime + nodeProvisionerRunInterval
      (04:08:41 PM) jglick: Something like that.
      (04:09:04 PM) kohsuke: so if your test workload comes in frequently you expect to see on average multiple instances
      (04:09:17 PM) kohsuke: OK, that makes sense
      (04:09:57 PM) jglick: Anyway, I will push this behavior into mock-slave-plugin 1.5 (upcoming) if you want to play with it. To enable this behavior all you will need to do is set numExecutors=1 for the mock cloud: it will automatically go into one-shot mode. Then you just need to make some jobs which sleep for a while and rapidly trigger one another (I use parameterized-trigger-plugin to keep the queue busy).
      (04:10:42 PM) jglick: Testing one-shot because (a) it is desirable for a lot of people, and the behavior of the Docker plugin; (b) I want to see how durable tasks work with it (the answer currently is that they do not work at all).
      (04:10:51 PM) kohsuke: I'm not sure how to fix this, though
      (04:11:08 PM) kohsuke: It essentially requires us predicting the pace of submissions to the queue in the future
      (04:11:29 PM) jglick: Well I think if a slave was removed, you should immediately review the queue and try to satisfy excess workload.
      (04:11:43 PM) jglick: No need to predict anything.
      (04:12:08 PM) jglick: Just start provisioning as fast as possible if some slaves have been removed.
      (04:12:25 PM) kohsuke: If the steady state involves losing and provisioning nodes, the log output above that shows plannedCapacity near 0 looks odd, too
      (04:13:27 PM) jglick: FYI, I get a SocketException stack trace for each and every slave I terminate, which is pretty annoying.
      (04:13:41 PM) jglick: AbstractCloudSlave.terminate
      (04:13:43 PM) kohsuke: I think your suggestion just turns nodeProvisioningTime + nodeProvisionerRunInterval into nodeProvisioningTime
      (04:13:49 PM) jglick: SEVERE: I/O error in channel mock-slave-193
      (04:14:10 PM) kohsuke: I assume that's a bug in mock-slave
      (04:14:17 PM) jglick: Well nPT might be a lot quicker than nPRI.
      (04:14:47 PM) jglick: W.r.t. SocketException: perhaps, but if so it is not obvious, since my plugin does not appear in the stack trace.
      (04:15:12 PM) kohsuke: Should we capture this conversation in a new ticket?
      (04:15:49 PM) jglick: Sure.
      (04:16:07 PM) jglick: If you agree this is a legitimate problem, not just an artifact of test methodology.
      (04:16:20 PM) kohsuke: I agree this is a legitimate problem
      (04:16:36 PM) jglick: Shall I file it?
      (04:16:47 PM) kohsuke: Already doing it
      (04:21:04 PM) kohsuke: Actually, I think I got the characterization of the issue wrong
      (04:21:31 PM) kohsuke: It's not the one-off retention strategy that does any harm because the occupied executors do not enter into calculation anyway
      (04:22:37 PM) jglick: It is just the turnover rate?
      (04:23:00 PM) kohsuke: I think the issue is the item in the queue spends on the order of nodeProvisioningTime + nodeProvisionerRunInterval before it gets a slave
      (04:23:14 PM) kohsuke: nodeProvisionerRunInterval is 10sec
      (04:23:26 PM) kohsuke: so for cloud that's really fast (nodeProvisioningTime=1 sec)
      (04:23:44 PM) kohsuke: you feel like you are penalized badly
      (04:23:45 PM) jglick: BTW mock-slave 1.5 released & on CB UC
      (04:23:58 PM) jglick: Which would be true of, say, docker-plugin.
      (04:24:10 PM) kohsuke: Right, or JOCbC
      (04:24:41 PM) jglick: And in fact for docker-plugin it is worse because currently it just sets !acceptingTasks, but does not terminate the node until ComputerRetentionWork runs a minute later.
      (04:24:45 PM) kohsuke: This problem does get masked for benchmark workload with regular retention strategy that keeps slaves around as eventually the provisioning stops
      (04:25:02 PM) jglick: Right, this is about constant reprovisioning.
      (04:25:32 PM) kohsuke: OK, let me try to reword this
      


            Activity

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jesse Glick
            Path:
            demo/run.sh
            http://jenkins-ci.org/commit/parallel-test-executor-plugin/f11a8e4f47bfe0b1fd935d4cff913e1d70c2e55f
            Log:
            JENKINS-24752 Noting.

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jesse Glick
            Path:
            demo/run.sh
            http://jenkins-ci.org/commit/workflow-plugin/81283c06e02f128e6c5abe774e08ed0e74e5740d
            Log:
            JENKINS-24752 Speeding up cloud slave provisioning for this relatively quick-running demo.

            ndeloof Nicolas De Loof added a comment -

            For docker-slaves we decided to just bypass the Cloud API.
            A container-based environment doesn't rely on re-used executors, so using a RetentionStrategy doesn't make sense, as to wait for any delay before provisioning a container for the build. We actually listen to the Queue for the onEntered event, which is a clear signal a build do require an executor, and add a unique label to ensure this container-executor will only be used by this specific job. This has various side effects (especially when container fails to run) that we haven't addressed yet. As the container image is defined by the job I'd like the build to still run (on a fake executor?) but get result set as NOT_BUILT with the container startup log attached.

            integer Kanstantsin Shautsou added a comment -

            Sorry, who is we?

            oleg_nenashev Oleg Nenashev added a comment -

            The fix was released in Jenkins 2.185.


              People

              Assignee:
              vlatombe Vincent Latombe
              Reporter:
              kohsuke Kohsuke Kawaguchi
              Votes:
              2
              Watchers:
              11
