JENKINS-56981

VM failed to provision when use Pool Retention Strategy


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Component/s: _unsorted
    • Labels:
      None
    • Environment:

      Description

      When the pool retention strategy is used, VMs fail to provision most of the time.

      With the same template settings and only the retention strategy changed to the Idle retention strategy, all VMs are provisioned successfully (100%).

      Retention time in hour: 24

      Pool Size: 1

      Only occasionally is a VM provisioned successfully with the pool retention strategy.

      Node ProvisioningActivity for Azure-Test/jenkins-…

      java.lang.Exception: Node ProvisioningActivity for Azure-Test/jenkins-test/null (229415501) has lost. Mark as failure
          at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.cleanCloudStatistics(AzureVMAgentCleanUpTask.java:604)
          at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.clean(AzureVMAgentCleanUpTask.java:622)
          at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.access$200(AzureVMAgentCleanUpTask.java:73)
          at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask$3.call(AzureVMAgentCleanUpTask.java:632)
          at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask$3.call(AzureVMAgentCleanUpTask.java:629)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at java.lang.Thread.run(Thread.java:748)

        Attachments

        1. pic1.png (492 kB)
        2. pic2.png (238 kB)
        3. pic3.png (314 kB)
        4. pic4.png (316 kB)
        5. pic5.png (243 kB)
        6. pic6.png (155 kB)
        7. pic7.png (77 kB)
        8. pic8.png (241 kB)
        9. Screen Shot 2019-04-30 at 4.20.06 PM.png (259 kB)
        10. Screen Shot 2019-04-30 at 4.24.13 PM.png (194 kB)
        11. Screen Shot 2019-04-30 at 4.31.45 PM.png (208 kB)
        12. Screen Shot 2019-04-30 at 4.50.40 PM.png (291 kB)

          Activity

          diwang di wang added a comment -

          With more testing, I configured the max VM number as 4 and the pool size as 1.

          I submitted a batch of requests to trigger it. Because it considered test-vm407940 to have failed, it provisioned another 4 VMs. The total is now 5, more than the max VM number I configured.

          diwang di wang added a comment - - edited

          Jie Shen, do we have a fix for this?

          jieshe Jie Shen added a comment -

          di wang, I have tested with a vnet configuration (deleting and provisioning several times), but I still have had no luck reproducing your error. I am using the latest version on the dev branch, since the configuration-as-code work was done on that branch; it should be a stable version with some fixes. My configuration is below:

          - azureVM:
              azureCredentialsId: "imds"
              cloudName: "jsdev"
              configurationStatus: "pass"
              deploymentTimeout: 1200
              existingResourceGroupName: "Testing"
              maxVirtualMachinesLimit: 10
              newResourceGroupName: "jieshe-dev-agents"
              resourceGroupReferenceType: "new"
              vmTemplates:
              - agentLaunchMethod: "SSH"
                availabilityType:
                  availabilitySet: "avset"
                builtInImage: "Windows Server 2016"
                credentialsId: "agent_admin_account"
                diskType: "managed"
                doNotUseMachineIfInitFails: true
                enableMSI: false
                executeInitScriptAsRoot: true
                existingStorageAccountName: "fsxvergehbetgzsvfg"
                imageReference:
                  offer: "UbuntuServer"
                  publisher: "Canonical"
                  sku: "16.04-LTS"
                  version: "latest"
                imageTopLevelType: "advanced"
                initScript: "sudo add-apt-repository ppa:openjdk-r/ppa -y\nsudo apt-get -y\
                  \ update\nsudo apt-get install openjdk-8-jre openjdk-8-jre-headless openjdk-8-jdk\
                  \ -y"
                installDocker: false
                installGit: false
                installMaven: false
                labels: "advanced"
                location: "East US"
                newStorageAccountName: "asdfwefwef"
                noOfParallelJobs: 1
                osDiskSize: 0
                osType: "Linux"
                preInstallSsh: true
                retentionStrategy:
                  azureVMCloudPool:
                    poolSize: 1
                    retentionInHours: 24
                shutdownOnIdle: false
                storageAccountNameReferenceType: "new"
                storageAccountType: "Standard_LRS"
                subnetName: "jenkins"
                templateDisabled: false
                templateName: "advanced"
                usageMode: "Use this node as much as possible"
                usePrivateIP: true
                virtualMachineSize: "Standard_F2s"
                virtualNetworkName: "jenkins-vnet"
                virtualNetworkResourceGroupName: "jieshe-jenkins"
          

          Am I missing anything in the configuration that would prevent me from reproducing the error on my side?
           

          jieshe Jie Shen added a comment -

          Provisioning more nodes than the limit is an existing issue, JENKINS-56972, caused by bad concurrency control. I will try to fix this later.
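
For illustration only, here is a minimal sketch of one way a VM limit can be enforced when several provisioning requests arrive at the same time. It is not the plugin's code and not the fix for JENKINS-56972: the class VmCapacityLimiter and its methods are hypothetical names invented for this example. The idea is to reserve slots atomically before starting any deployment, so that concurrent requests cannot each see free capacity and together exceed the limit (for example, ending up with 5 VMs when the limit is 4).

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical helper (not part of the azure-vm-agents plugin):
 * tracks how many VM slots are reserved against a fixed limit.
 */
final class VmCapacityLimiter {
    private final int maxVirtualMachines;
    private final AtomicInteger reserved = new AtomicInteger(0);

    VmCapacityLimiter(int maxVirtualMachines) {
        if (maxVirtualMachines < 0) {
            throw new IllegalArgumentException("limit must not be negative");
        }
        this.maxVirtualMachines = maxVirtualMachines;
    }

    /**
     * Atomically reserves up to {@code requested} slots and returns the
     * number actually granted, which may be fewer than requested or zero.
     */
    int tryReserve(int requested) {
        if (requested <= 0) {
            return 0;
        }
        while (true) {
            int current = reserved.get();
            int granted = Math.min(requested, maxVirtualMachines - current);
            if (granted <= 0) {
                return 0; // limit already reached
            }
            // The compare-and-set only succeeds if no other thread changed
            // the count in between; otherwise loop and recompute.
            if (reserved.compareAndSet(current, current + granted)) {
                return granted;
            }
        }
    }

    /** Releases slots, e.g. after a VM is deleted or its provisioning fails. */
    void release(int count) {
        if (count <= 0) {
            return;
        }
        reserved.updateAndGet(current -> Math.max(0, current - count));
    }

    int reservedCount() {
        return reserved.get();
    }
}
```

With a limit of 4 and one pool VM already reserved, a request for 4 more would be granted only 3 slots under this scheme. A failed provisioning attempt would need to call release before any replacement is requested; otherwise the failed VM's slot stays counted.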

          timja Tim Jacomb added a comment -

          All issues have been transferred to GitHub.

          See https://github.com/jenkinsci/azure-vm-agents-plugin/issues

          Search the issue title to find it.

          (This is a bulk comment and can't link to the specific issue)


            People

            Assignee:
            jieshe Jie Shen
            Reporter:
            diwang di wang
            Votes:
            2
            Watchers:
            3

              Dates

              Created:
              Updated:
              Resolved: