  1. Infrastructure
  2. INFRA-2308

Fix instabilities of ci.jenkins.io

    Details

    • Epic Name:
      ci.jenkins.io instabilities

      Description

      There have been many reports lately of instability on https://ci.jenkins.io/.
      We need to systematically investigate how the instance is configured and exactly which issues it is facing.

      Acceptance criteria

      • Infra should be able to build Core and plugins without failing randomly all the time
      • [[Do we want to define any more specific SLAs??]]

      Some resources

            Activity

            batmat Baptiste Mathus added a comment -

            First ongoing action: we have gathered support bundles, and submitted them for analysis to the CloudBees Support team.
            First outcome after GC analysis is https://github.com/jenkins-infra/jenkins-infra/pull/1375

            This is NOT considered the fix to rule them all, but it is definitely a first step toward fixing various suboptimal or wrong configurations so that we can progressively narrow down the problems.

            rsandell rsandell added a comment -

            I've been looking at the support bundles.

            The thread dumps confirm what people have been saying: they are mostly filled with metrics threads.

            The Jenkins log has many failures to provision ACI agents:

            WARNING c.m.j.c.aci.AciCloud$1#call: AciCloud: Provision agent aci-maven-4mwg0 failed: Status code 409, {"error":{"code":"DeploymentQuotaExceeded","message":"Creating the deployment 'aci-maven-h6n6bs6x' would exceed the quota of '800'. The current deployment count is '800', please delete some deployments before creating a new one. Please see https://aka.ms/arm-debug for usage details."}}

            And

            2019-10-18 10:51:29.415+0000 [id=1021880] WARNING c.m.j.c.aci.AciCloud#canProvision: Cannot provision: template for label maven-11-windows is not available now, because it failed to provision last time.

            What does a deployment count of 800 mean? That we had 800 instances running!?
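To tally these failures across a whole support bundle, the quota warning can be parsed mechanically. The following is only a rough sketch: the function name and regular expressions are illustrative assumptions, not part of the Azure plugin, and the sample line is copied from the log excerpt above. Note that the error counts deployments, which is not necessarily the same thing as running instances.

```python
import json
import re

# Sample line copied from the Jenkins log quoted above.
LOG_LINE = (
    "WARNING c.m.j.c.aci.AciCloud$1#call: AciCloud: Provision agent aci-maven-4mwg0 failed: "
    'Status code 409, {"error":{"code":"DeploymentQuotaExceeded","message":"Creating the '
    "deployment 'aci-maven-h6n6bs6x' would exceed the quota of '800'. The current deployment "
    "count is '800', please delete some deployments before creating a new one. Please see "
    'https://aka.ms/arm-debug for usage details."}}'
)

# Hypothetical patterns, written against the log format shown in this issue only.
FAILURE_RE = re.compile(r"Provision agent (\S+) failed: Status code (\d+), (\{.*\})\s*$")
COUNTS_RE = re.compile(r"quota of '(\d+)'.*current deployment count is '(\d+)'")


def parse_provision_failure(line):
    """Return details of an ACI provisioning failure, or None if the line does not match."""
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    agent, status, payload = match.groups()
    error = json.loads(payload)["error"]  # the trailing part of the line is a JSON object
    result = {"agent": agent, "status": int(status), "code": error["code"]}
    counts = COUNTS_RE.search(error["message"])
    if counts:
        result["quota"] = int(counts.group(1))
        result["current"] = int(counts.group(2))
    return result


print(parse_provision_failure(LOG_LINE))
```

Running this over every line of the log and counting results by `code` would show how much of the provisioning trouble is due to the quota alone.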

             

            I also see a bunch of

            2019-10-18 10:44:12.601+0000 [id=1018868] INFO c.m.j.c.aci.AciCleanTask#cleanDeployments: AzureAciCleanUpTask: cleanDeployments: Checking deployment aci-maven-11-zhdwvcqv
            2019-10-18 10:44:12.769+0000 [id=1018868] INFO c.m.j.c.aci.AciCleanTask#cleanDeployments: AzureAciCleanUpTask: cleanDeployments: Deployment not found, skipping

            Are we leaking deployments in the Azure cloud plugin?
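One way to approach that question is to list which deployments the clean-up task looked for and could not find. The sketch below assumes that a "Deployment not found, skipping" line refers to the deployment most recently checked on the same thread id; that pairing is an assumption drawn from the two log lines above, not something confirmed from the plugin source. The function name is hypothetical.

```python
import re

# The two log lines quoted above.
LOG = """\
2019-10-18 10:44:12.601+0000 [id=1018868] INFO c.m.j.c.aci.AciCleanTask#cleanDeployments: AzureAciCleanUpTask: cleanDeployments: Checking deployment aci-maven-11-zhdwvcqv
2019-10-18 10:44:12.769+0000 [id=1018868] INFO c.m.j.c.aci.AciCleanTask#cleanDeployments: AzureAciCleanUpTask: cleanDeployments: Deployment not found, skipping
"""

CHECK_RE = re.compile(r"\[id=(\d+)\].*cleanDeployments: Checking deployment (\S+)")
SKIP_RE = re.compile(r"\[id=(\d+)\].*cleanDeployments: Deployment not found, skipping")


def skipped_deployments(log_text):
    """Return names of deployments that were checked and then reported as not found."""
    pending = {}  # thread id -> deployment most recently checked on that thread
    skipped = []
    for line in log_text.splitlines():
        match = CHECK_RE.search(line)
        if match:
            pending[match.group(1)] = match.group(2)
            continue
        match = SKIP_RE.search(line)
        if match and match.group(1) in pending:
            skipped.append(pending.pop(match.group(1)))
    return skipped


print(skipped_deployments(LOG))
```

Comparing that list against the deployments that actually exist in Azure would help tell a leak apart from stale bookkeeping on the Jenkins side.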

             

            I’ve been monitoring https://ci.jenkins.io for the past two days since Arnaud restarted it, and apart from a number of agents that can’t connect, it seems to have been holding up fine. There hasn’t been much going on in the instance, though: a few plugin PR builds and a few core PR builds.

             

            Do we have any historical data on how long the instance has survived between restarts in the past?

            I'm wondering if there is another leak besides metrics and, if so, how long it will take to manifest again. As of now it looks OK, though the load hasn't been that high these last two days.

            danielbeck Daniel Beck added a comment -

            What does a deployment count of 800 mean? That we had 800 instances running!?

            IIRC this happened after a sequence of several restarts in one afternoon, which indicates that obsolete instances do not get cleaned up properly, or in a timely manner, after a Jenkins restart. Suppose each power cycle uses up another 150 or so instances…
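As a back-of-the-envelope check of that estimate: with the quota of 800 from the error message and roughly 150 leftover deployments per restart, the quota is exhausted after about six restarts. The sketch below assumes that nothing at all is cleaned up between restarts, which is the pessimistic case; the function name is made up for illustration.

```python
QUOTA = 800               # from the DeploymentQuotaExceeded message
LEAKED_PER_RESTART = 150  # rough figure ("150 or so") from the comment above


def restarts_until_quota(quota, leaked_per_restart, already_used=0):
    """Number of restarts after which the quota is reached, assuming no clean-up in between."""
    remaining = quota - already_used
    if remaining <= 0:
        return 0
    # Ceiling division without floating point.
    return -(-remaining // leaked_per_restart)


print(restarts_until_quota(QUOTA, LEAKED_PER_RESTART))  # 6
```

After five restarts about 750 deployments would be in use, leaving headroom for only around 50 more agents before provisioning starts to fail.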

            danielbeck Daniel Beck added a comment -

            I just idly checked the /threadDump on ci.j.io

            SupportPlugin periodic bundle generator: writing support_2019-11-01_10.21.58.zip since Fri Nov 01 10:21:58 UTC 2019

            As of…

            Page generated: Nov 1, 2019, 10:46:47 AM

            Collecting months-old log files can easily be considered a bug in support-core, as they're probably useless anyway.
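For reference, the two timestamps above put the bundle generator at roughly 25 minutes of running time when the thread dump was taken. A small sketch of that arithmetic follows; it assumes the "Page generated" time is also in UTC, which the page does not state.

```python
from datetime import datetime

BUNDLE_STARTED = "Fri Nov 01 10:21:58 UTC 2019"  # from the thread name
PAGE_GENERATED = "Nov 1, 2019, 10:46:47 AM"      # assumed to be UTC as well


def bundle_running_time(started, generated):
    """Return how long the bundle generator had been running when the page was rendered."""
    start = datetime.strptime(started, "%a %b %d %H:%M:%S UTC %Y")
    end = datetime.strptime(generated, "%b %d, %Y, %I:%M:%S %p")
    return end - start


elapsed = bundle_running_time(BUNDLE_STARTED, PAGE_GENERATED)
print(elapsed)  # 0:24:49
```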

            danielbeck Daniel Beck added a comment -

            Moved old GC logs into the subdirectory gc/ in JENKINS_HOME. Baptiste Mathus wanted the plugin re-enabled, so he gets to fix the Java args to place GC logs there directly.

            oleg_nenashev Oleg Nenashev added a comment -

            Just copying my suggestion from IRC here:

            • We keep this EPIC for the immediate ci.jenkins.io stabilization activities after the last massive outage in late October.
            • We use the "ci.jenkins.io" component and the "maintenability", "stability", and "ux" labels to braindump ideas about improving ci.jenkins.io in the longer term, so that we spend less time on firefighting. If there is consensus about it, I will create dashboards to track it.

            WDYT Baptiste Mathus Olblak?

             


              People

              Assignee:
              batmat Baptiste Mathus
              Reporter:
              batmat Baptiste Mathus
              Votes:
              1
              Watchers:
              8
