(03:48:56 PM) jglick: Nope, similar behavior. kohsuke: even in current trunk NodeProvisioner fails to provision as many one-shot cloud nodes as would be needed to handle the queue, even if you let things run for a long time with constant load. Is this really “as designed”?
(03:49:31 PM) kohsuke: Not sure if I entirely follow, but ...
(03:49:52 PM) kohsuke: in a steady state the length of the queue will approach zero
(03:50:05 PM) jglick: Well you would expect that if there is a Cloud which always agrees to provision new nodes, then the queue would be close to empty.
(03:50:05 PM) qma17: When running the slave from the command line with java -jar slave.jar -jnlpUrl http://host:8080/computer/rocks/slave-agent.jnlp, I get a 403 Forbidden, because I am logged in as a user on this slave. In the Jenkins UI it shows a secret key, `-secret 4bfe7cdc6a05f29cd4421305a30f788ffaf85070aed4ba1b1fc10246c72c4ce8`. How can I pull this key automatically without setting it up manually on each slave?
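(This question is not answered in the channel. One common way to script it is to note that the secret is embedded in the slave-agent.jnlp file itself, so a node can download that file with a user's API token and extract the 64-character hex value shown in the UI. A minimal sketch, assuming Java 8+ and placeholder host, node name, and credentials:)

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FetchJnlpSecret {
    public static void main(String[] args) throws Exception {
        String jenkins = "http://host:8080";   // placeholder master URL
        String node = "rocks";                 // node name from the question
        // Placeholder credentials; an API token avoids storing a real password.
        String auth = Base64.getEncoder().encodeToString(
                "user:apitoken".getBytes(StandardCharsets.UTF_8));

        URL url = new URL(jenkins + "/computer/" + node + "/slave-agent.jnlp");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Authorization", "Basic " + auth);
        try (InputStream in = conn.getInputStream();
             Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            String jnlp = s.hasNext() ? s.next() : "";
            // The JNLP secret is the 64-character hex argument shown in the UI.
            Matcher m = Pattern.compile("<argument>([a-f0-9]{64})</argument>").matcher(jnlp);
            if (m.find()) {
                System.out.println(m.group(1)); // pass to slave.jar as -secret
            }
        }
    }
}
```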
(03:50:10 PM) jglick: But in fact it stays full.
(03:50:39 PM) jglick: Is there some minimum job duration below which this invariant does not hold?
(03:50:50 PM) kohsuke: I thought the algorithm was basically try to keep "queueLength - nodesBeingProvisioned - idleExecutors" zero
(03:51:04 PM) kohsuke: where each term is conservatively estimated between a point-in-time snapshot and moving average
(03:51:30 PM) kohsuke: So in your steady state I'm curious which of the 3 terms is incorrectly computed
(03:51:39 PM) jglick: For me this expression is hovering around 8, where I have 11 distinct queue items.
(03:52:13 PM) jglick: Nothing is idle, sometimes one or two slaves are being provisioned.
(03:52:33 PM) jglick: Queue length typically 8 or 9.
(03:52:42 PM) kohsuke: Are you running this with debugger?
(03:53:08 PM) jglick: Yes.
(03:54:03 PM) jglick: My jobs are mostly short—from 10 to 60 seconds. Maybe that is the explanation?
(03:54:05 PM) kohsuke: So we want to see the 3 local variables 'idle', 'qlen', and 'plannedCapacity'
(03:54:20 PM) kohsuke: ... in the NodeProvisioner.update()
(03:54:43 PM) jglick: Alright, will look into it.
(03:55:16 PM) kohsuke: those correspond to the 3 terms above, and you can see that the "conservative estimate" is implemented as a max/min op
(03:55:22 PM) kohsuke: between moving average and snapshot value
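(A minimal sketch of the balance described above, following the simplified formula kohsuke gives (queueLength - plannedCapacity - idleExecutors) rather than the actual NodeProvisioner.update() source; the side of the max/min each term takes is an assumption here, chosen to err against over-provisioning:)

```java
// Illustrative only -- not the real NodeProvisioner code.
class ExcessWorkloadSketch {

    /** Simple exponential moving average update, the kind of smoothing referred to above. */
    static float ema(float previous, float snapshot, float alpha) {
        return alpha * snapshot + (1f - alpha) * previous;
    }

    /**
     * Excess workload ~= queueLength - plannedCapacity - idleExecutors.
     * A positive result is roughly how many more executors the clouds
     * should be asked to provision.
     */
    static float excessWorkload(float qlenSnapshot, float qlenEma,
                                float plannedSnapshot, float plannedEma,
                                float idleSnapshot, float idleEma) {
        // Assumed "conservative" directions: understate demand, overstate supply.
        float qlen = Math.min(qlenSnapshot, qlenEma);
        float plannedCapacity = Math.max(plannedSnapshot, plannedEma);
        float idle = Math.max(idleSnapshot, idleEma);
        return qlen - plannedCapacity - idle;
    }
}
```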
(04:00:03 PM) shmiti: Hey, I have a question about building a plugin: if I want to add several step options, do I simply create multiple descriptors?
(04:00:04 PM) jglick: Ah, do not need debugger at all, just logging on NodeProvisioner, which I already have…
(04:00:24 PM) jglick: Excess workload 4.0 detected. (planned capacity=1.0547044E-9,Qlen=4.0,idle=0.083182335&0,total=16m,=0.1000061)
(04:00:45 PM) kohsuke: So it should be provisioning 4 more nodes
(04:01:08 PM) jglick: And it does, at that time.
(04:01:19 PM) kohsuke: OK, so we are waiting for it to hit the steady state?
(04:02:31 PM) jglick: Well, this *is* the steady state.
(04:03:17 PM) kohsuke: jglick: NodeProvisioner is detecting need for 4 more nodes, so I think it's doing the right computation
(04:03:36 PM) kohsuke: The question is why is your Cloud impl not provisioning?
(04:03:39 PM) jglick: Excess workload 9.32492 detected. (planned capacity=1.5377589E-9,Qlen=9.32492,idle=0.115431786&0,total=0m,=0.5)
(04:04:02 PM) jglick: It did provision. But NodeProvisioner did not keep up with more stuff in the queue. Maybe it just has a limited capacity for quick builds.
(04:04:40 PM) kohsuke: plannedCapacity is near 0, which means it thinks there's nothing being provisioned right now
(04:05:35 PM) kohsuke: Maybe your steady state involves nodes being taken down?
(04:05:52 PM) jglick: Yes, one-shot slaves.
(04:05:52 PM) kohsuke: one-off slave?
(04:05:55 PM) kohsuke: ah
(04:06:03 PM) kohsuke: Well that explains everything
(04:06:05 PM) jglick: That is what I am trying to test the behavior of.
(04:06:29 PM) kohsuke: Now I understand what stephenc was talking about
(04:07:13 PM) kohsuke: Yeah, this algorithm doesn't work well at all for one-off slaves
(04:07:34 PM) jglick: There is also another apparent bug: if the slave claims !acceptingTasks, NodeProvisioner fails to treat that as a signal to create a new slave, so then the behavior is even worse. The Docker plugin has this issue, I think. But the retention strategy I am testing with now just terminates the slave at the end of the build, so that is not the issue here.
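(For reference, a sketch of the kind of one-shot retention strategy being tested here: terminate the cloud slave as soon as its single build completes. It uses the core RetentionStrategy, ExecutorListener, and AbstractCloudSlave APIs, but the class itself is illustrative, not the actual mock-slave-plugin code:)

```java
import hudson.model.Executor;
import hudson.model.ExecutorListener;
import hudson.model.Queue;
import hudson.slaves.AbstractCloudComputer;
import hudson.slaves.AbstractCloudSlave;
import hudson.slaves.RetentionStrategy;

import java.util.logging.Level;
import java.util.logging.Logger;

// A real plugin would also register a Descriptor via @Extension.
public class OneShotRetentionStrategy extends RetentionStrategy<AbstractCloudComputer<?>>
        implements ExecutorListener {

    private static final Logger LOGGER = Logger.getLogger(OneShotRetentionStrategy.class.getName());

    @Override
    public long check(AbstractCloudComputer<?> c) {
        return 1; // recheck every minute; actual termination is driven by taskCompleted
    }

    @Override
    public void taskAccepted(Executor executor, Queue.Task task) {}

    @Override
    public void taskCompleted(Executor executor, Queue.Task task, long durationMS) {
        terminate(executor);
    }

    @Override
    public void taskCompletedWithProblems(Executor executor, Queue.Task task, long durationMS, Throwable problems) {
        terminate(executor);
    }

    private void terminate(Executor executor) {
        AbstractCloudComputer<?> c = (AbstractCloudComputer<?>) executor.getOwner();
        AbstractCloudSlave node = c.getNode();
        if (node != null) {
            try {
                node.terminate(); // one build per slave: throw it away immediately
            } catch (Exception x) {
                LOGGER.log(Level.WARNING, "could not terminate " + c.getName(), x);
            }
        }
    }
}
```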
(04:07:55 PM) jglick: I tried calling suggestReviewNow but it does not help.
(04:08:28 PM) kohsuke: IIUC, with one-off slaves your steady state is that everyone waits in the queue for something on the order of nodeProvisioningTime + nodeProvisionerRunInterval
(04:08:41 PM) jglick: Something like that.
(04:09:04 PM) kohsuke: so if your test workload comes in frequently you expect to see on average multiple instances
(04:09:17 PM) kohsuke: OK, that makes sense
(04:09:57 PM) jglick: Anyway, I will push this behavior into mock-slave-plugin 1.5 (upcoming) if you want to play with it. To enable it, all you need to do is set numExecutors=1 for the mock cloud: it will automatically go into one-shot mode. Then you just need to make some jobs which sleep for a while and rapidly trigger one another (I use parameterized-trigger-plugin to keep the queue busy).
(04:10:42 PM) jglick: Testing one-shot because (a) it is desirable for a lot of people, and it is the behavior of the Docker plugin; (b) I want to see how durable tasks work with it (the answer currently is that they do not work at all).
(04:10:51 PM) kohsuke: I'm not sure how to fix this, though
(04:11:08 PM) kohsuke: It essentially requires us predicting the pace of submissions to the queue in the future
(04:11:29 PM) jglick: Well I think if a slave was removed, you should immediately review the queue and try to satisfy excess workload.
(04:11:43 PM) jglick: No need to predict anything.
(04:12:08 PM) jglick: Just start provisioning as fast as possible if some slaves have been removed.
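(A sketch of where that nudge could live today, on the cloud/retention side rather than in core: right after terminating a one-shot slave, ask the provisioner covering its label to re-plan via suggestReviewNow(), which, as noted above, did not help on its own in this test. The method name and placement are illustrative:)

```java
import hudson.model.Label;
import jenkins.model.Jenkins;

public class ProvisionerNudge {
    /**
     * Ask the provisioner responsible for this slave's label to re-plan
     * immediately instead of waiting for the next periodic run.
     */
    static void reviewAfterTermination(Label label) {
        if (label != null) {
            label.nodeProvisioner.suggestReviewNow();
        } else {
            // Slaves without labels are handled by the unlabeled provisioner.
            Jenkins.getInstance().unlabeledNodeProvisioner.suggestReviewNow();
        }
    }
}
```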
(04:12:25 PM) kohsuke: If the steady state involves losing and provisioning nodes, then the plannedCapacity near 0 in the log output above looks odd, too
(04:13:27 PM) jglick: FYI, I get a SocketException stack trace for each and every slave I terminate, which is pretty annoying.
(04:13:41 PM) jglick: AbstractCloudSlave.terminate
(04:13:43 PM) kohsuke: I think your suggestion just turns nodeProvisioningTime + nodeProvisionerRunInterval into nodeProvisioningTime
(04:13:49 PM) jglick: SEVERE: I/O error in channel mock-slave-193
(04:14:10 PM) kohsuke: I assume that's a bug in mock-slave
(04:14:17 PM) jglick: Well, nodeProvisioningTime might be a lot quicker than nodeProvisionerRunInterval.
(04:14:47 PM) jglick: W.r.t. SocketException: perhaps, but if so it is not obvious, since my plugin does not appear in the stack trace.
(04:15:12 PM) kohsuke: Should we capture this conversation in a new ticket?
(04:15:49 PM) jglick: Sure.
(04:16:07 PM) jglick: If you agree this is a legitimate problem, not just an artifact of test methodology.
(04:16:20 PM) kohsuke: I agree this is a legitimate problem
(04:16:36 PM) jglick: Shall I file it?
(04:16:47 PM) kohsuke: Already doing it
(04:21:04 PM) kohsuke: Actually, I think I got the characterization of the issue wrong
(04:21:31 PM) kohsuke: It's not the one-off retention strategy itself that does any harm, because the occupied executors do not enter into the calculation anyway
(04:22:37 PM) jglick: It is just the turnover rate?
(04:23:00 PM) kohsuke: I think the issue is that an item in the queue spends something on the order of nodeProvisioningTime + nodeProvisionerRunInterval waiting before it gets a slave
(04:23:14 PM) kohsuke: nodeProvisionerRunInterval is 10sec
(04:23:26 PM) kohsuke: so for a cloud that's really fast (nodeProvisioningTime = 1 sec)
(04:23:44 PM) kohsuke: you feel like you are penalized badly
(04:23:45 PM) jglick: BTW mock-slave 1.5 released & on CB UC
(04:23:58 PM) jglick: Which would be true of, say, docker-plugin.
(04:24:10 PM) kohsuke: Right, or JOCbC
(04:24:41 PM) jglick: And in fact for docker-plugin it is worse because currently it just sets !acceptingTasks, but does not terminate the node until ComputerRetentionWork runs a minute later.
(04:24:45 PM) kohsuke: This problem gets masked for a benchmark workload with a regular retention strategy that keeps slaves around, since eventually the provisioning stops
(04:25:02 PM) jglick: Right, this is about constant reprovisioning.
(04:25:32 PM) kohsuke: OK, let me try to reword this