I've been looking at the support bundles.
The thread dumps confirm what people have been saying: they are mostly filled with metrics threads.
The Jenkins log has a lot of failures to provision ACI agents:
WARNING c.m.j.c.aci.AciCloud$1#call: AciCloud: Provision agent aci-maven-4mwg0 failed: Status code 409, {"error":{"code":"DeploymentQuotaExceeded","message":"Creating the deployment 'aci-maven-h6n6bs6x' would exceed the quota of '800'. The current deployment count is '800', please delete some deployments before creating a new one. Please see https://aka.ms/arm-debug for usage details."}}
And
2019-10-18 10:51:29.415+0000 [id=1021880] WARNING c.m.j.c.aci.AciCloud#canProvision: Cannot provision: template for label maven-11-windows is not available now, because it failed to provision last time.
What does a deployment count of 800 mean? That we had 800 instances running!?
I also see a bunch of
2019-10-18 10:44:12.601+0000 [id=1018868] INFO c.m.j.c.aci.AciCleanTask#cleanDeployments: AzureAciCleanUpTask: cleanDeployments: Checking deployment aci-maven-11-zhdwvcqv
2019-10-18 10:44:12.769+0000 [id=1018868] INFO c.m.j.c.aci.AciCleanTask#cleanDeployments: AzureAciCleanUpTask: cleanDeployments: Deployment not found, skipping
Are we leaking deployments in the Azure cloud plugin?
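To put numbers on this, here is a rough Python sketch that tallies the message types quoted above from a Jenkins log. The patterns are copied from those log lines; the category names, the `tally` helper and the `SAMPLE` list are made up for illustration, and the quota line is abbreviated.

```python
import re
from collections import Counter

# Patterns derived from the log lines quoted in this comment.
# The category names on the left are hypothetical labels for this sketch.
PATTERNS = {
    "quota_exceeded": re.compile(
        r"Provision agent (\S+) failed: Status code 409.*DeploymentQuotaExceeded"
    ),
    "template_unavailable": re.compile(
        r"Cannot provision: template for label (\S+) is not available"
    ),
    "cleanup_checked": re.compile(r"cleanDeployments: Checking deployment (\S+)"),
    "cleanup_not_found": re.compile(r"cleanDeployments: Deployment not found, skipping"),
}


def tally(lines):
    """Count how many lines fall into each category (first matching pattern wins)."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break
    return counts


# Lines taken from the log excerpts above (the JSON payload is shortened).
SAMPLE = [
    'WARNING c.m.j.c.aci.AciCloud$1#call: AciCloud: Provision agent aci-maven-4mwg0 '
    'failed: Status code 409, {"error":{"code":"DeploymentQuotaExceeded"}}',
    "2019-10-18 10:51:29.415+0000 [id=1021880] WARNING c.m.j.c.aci.AciCloud#canProvision: "
    "Cannot provision: template for label maven-11-windows is not available now, "
    "because it failed to provision last time.",
    "2019-10-18 10:44:12.601+0000 [id=1018868] INFO c.m.j.c.aci.AciCleanTask#cleanDeployments: "
    "AzureAciCleanUpTask: cleanDeployments: Checking deployment aci-maven-11-zhdwvcqv",
    "2019-10-18 10:44:12.769+0000 [id=1018868] INFO c.m.j.c.aci.AciCleanTask#cleanDeployments: "
    "AzureAciCleanUpTask: cleanDeployments: Deployment not found, skipping",
]

if __name__ == "__main__":
    print(dict(tally(SAMPLE)))
```

Run against the full log (for example `tally(open(path))`, where `path` points at the log file from the support bundle), a `cleanup_not_found` count close to `cleanup_checked` would suggest the clean-up task is finding almost nothing it can delete, which would fit the leak theory. It would not prove it, though.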
I’ve been monitoring https://ci.jenkins.io for the past two days since Arnaud restarted it, and besides a bunch of agents that can’t be connected to, it seems to have been handling itself fine. There hasn’t been much going on in the instance, though: a few plugin PR builds and a few core PR builds.
Do we have any historical data on how long the instance has survived between restarts in the past?
I'm wondering if there is another leak besides metrics and, if so, how long it will take to manifest again. As of now it looks OK, though the load hasn't been that high these last two days.
Just copying my suggestion from IRC here:
WDYT Baptiste Mathus Olblak?