Type: Bug
Resolution: Duplicate
Priority: Major
Environment: Jenkins ver. 2.138.3 with Azure VM Agents Plugin ver. 0.7.4
We are currently using the Azure VM Agents plugin for our CI/CD environment, with custom images.
Our agents use one of the following images:
- Image 1
- Linux (With SSH launch)
- Idle retention time (10 min)
- Shutdown Only enabled
- Image 2
- Windows (With JNLP launch)
- Idle retention time (20 min)
- Shutdown Only enabled
Our issue is that an agent gets provisioned, launches successfully, and is usable. After the idle timeout the node shuts down correctly and is marked (offline) and (suspended). Once another build comes along and attempts to use it, the plugin correctly starts the previously shut-down VM, which boots and connects correctly, but it keeps the (suspended) flag, making it unusable. I can fix it manually by issuing the following command in the Jenkins script console:
Jenkins.instance.getNode('azure-nodeXXXXXX').toComputer().setAcceptingTasks(true);
The build then goes to the node and runs successfully; however, if that command is not issued, the build gets stuck waiting for the node to come online.
This issue happens with both types of images.
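As a stopgap, the manual workaround above can be generalised a little so it does not have to be run per node. The following is a minimal script console sketch, assuming the affected agents' names start with the "azure-" prefix used in the example above (adjust the prefix to match your template); it only touches agents that are connected but still flagged as not accepting tasks:

import jenkins.model.Jenkins

Jenkins.instance.nodes.each { node ->
    def computer = node.toComputer()
    // Only touch Azure agents that are online but still show the (suspended) flag.
    if (node.nodeName.startsWith('azure-') && computer != null
            && computer.isOnline() && !computer.isAcceptingTasks()) {
        println "Re-enabling task acceptance on ${node.nodeName}"
        computer.setAcceptingTasks(true)
    }
}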
[JENKINS-54776] Node stays suspended after node is started
Hi jieshe,
first of all, thanks for the work on the Azure VM Agents Plugin; we are enjoying using it in our build environment.
We are also seeing the same issue during our CI/CD builds.
The Cloud configuration we are using looks essentially like:
new AzureVMCloudBuilder()
    .withCloudName("our Cloud Name")
    .withMaxVirtualMachinesLimit("3")
    .withDeploymentTimeout("7200")
    .addNewTemplate()
        .withName("our-name")
        .withDescription("Our Description")
        .withLabels("our-label")
        .withVirtualMachineSize("Standard_D4s_v3")
        .withStorageAccountType("Premium_LRS")
        .withExistingStorageAccount("uniquestorageaccountname")
        .withRetentionStrategy(new AzureVMCloudRetensionStrategy(30))
        .withShutdownOnIdle(true)
        .withWorkspace("d:\\")
        .addNewAdvancedImage()
            .withReferenceImage("MicrosoftWindowsDesktop", "Windows-10", "RS4-ProN", "latest")
            .withOsType("Windows")
            .withLaunchMethod("SSH")
            .withPreInstallSsh(true)
            .withRunScriptAsRoot(false)
            .withNumberOfExecutors("1")
            .withInitScript("Write-Host 'custom init script'")
            .withDoNotUseMachineIfInitFails(true)
            .withVirtualNetworkName("our-network")
            .withVirtualNetworkResourceGroupName("network-resource-group")
            .withSubnetName("subnet-name")
            .withUsePrivateIP(true)
            .withNetworkSecurityGroupName("")
            .withJvmOptions("")
            .withDisableTemplate(false)
        .endAdvancedImage()
        .withUsageMode(Node.Mode.EXCLUSIVE.getDescription())
        .withLocation("West Europe")
        .withDiskType(Constants.DISK_MANAGED)
        .withAdminCredential("cred-id")
    .endTemplate()
    .withAzureCredentialsId("azure-cred-id")
    .withExistingResourceGroupName("resource-group")
    .build();
But, as reported, I guess the crucial parameter for this issue is:
.withShutdownOnIdle(true)
niels_oke Thanks for providing your configuration. I was unable to reproduce this issue in my environment before; I will try your configuration and see whether I have any luck.
Hello, sorry to bother you about this, but I also have the same issue. I hope you are close to resolving it and raising a PR.
Looking at the logs and the code, I have noticed it goes into AzureVMManagementServiceDelegate : startVirtualMachine : <image>.
In the code this is called around line 699 of AzureVMCloud, with azureComputer.setAcceptingTasks(true); being called shortly after (line 713), yet the node stays in the suspended state as mentioned above.
Jan 10, 2019 11:59:57 AM FINE com.microsoft.azure.vmagent.AzureVMCloud AzureVMCloud: getAzureAgentTemplate: Retrieving agent template with label azureFresh
Jan 10, 2019 11:59:57 AM FINE com.microsoft.azure.vmagent.AzureVMCloud AzureVMCloud: getAzureAgentTemplate: Found agent template vimbuild
Jan 10, 2019 11:59:57 AM FINE com.microsoft.azure.vmagent.AzureVMCloud AzureVMCloud: getAzureAgentTemplate: vimbuild matches!
Jan 10, 2019 11:59:57 AM INFO com.microsoft.azure.vmagent.AzureVMCloud provision AzureVMCloud: provision: start for label azureFresh workLoad 1
Jan 10, 2019 11:59:57 AM FINE com.microsoft.azure.vmagent.AzureVMCloud AzureVMCloud: getAzureAgentTemplate: Retrieving agent template with label azureFresh
Jan 10, 2019 11:59:57 AM FINE com.microsoft.azure.vmagent.AzureVMCloud AzureVMCloud: getAzureAgentTemplate: Found agent template vimbuild
Jan 10, 2019 11:59:57 AM FINE com.microsoft.azure.vmagent.AzureVMCloud AzureVMCloud: getAzureAgentTemplate: vimbuild matches!
Jan 10, 2019 11:59:57 AM FINE com.microsoft.azure.vmagent.AzureVMCloudVerificationTask AzureVMCloudVerificationTask: verify: verifying cloud vimBuild-0.7.6-snapshot
Jan 10, 2019 11:59:57 AM FINE com.microsoft.azure.vmagent.AzureVMCloudVerificationTask AzureVMCloudVerificationTask: verifyConfiguration: start
Jan 10, 2019 11:59:58 AM FINE com.microsoft.azure.vmagent.AzureVMCloudVerificationTask AzureVMCloudVerificationTask: validate: vimBuild-0.7.6-snapshot verified pass
Jan 10, 2019 11:59:58 AM FINE com.microsoft.azure.vmagent.AzureVMCloudVerificationTask AzureVMCloudVerificationTask: getVirtualMachineCount: start
Jan 10, 2019 11:59:58 AM FINE com.microsoft.azure.vmagent.AzureVMCloudVerificationTask AzureVMCloudVerificationTask: getVirtualMachineCount: end, cloud vimBuild-0.7.6-snapshot has currently 1 vms
Jan 10, 2019 11:59:59 AM FINE com.microsoft.azure.vmagent.AzureVMCloudVerificationTask AzureVMCloudVerificationTask: verify: vimbuild verified successfully
Jan 10, 2019 11:59:59 AM INFO com.microsoft.azure.vmagent.AzureVMCloud provision AzureVMCloud: provision: checking for node reuse options
Jan 10, 2019 11:59:59 AM INFO com.microsoft.azure.vmagent.AzureVMCloud provision AzureVMCloud: provision: agent computer eligible for reuse vimbuild6a1390
Jan 10, 2019 11:59:59 AM INFO com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate virtualMachineExists AzureVMManagementServiceDelegate: virtualMachineExists: check for vimbuild6a1390
Jan 10, 2019 11:59:59 AM INFO com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate virtualMachineExists AzureVMManagementServiceDelegate: virtualMachineExists: vimbuild6a1390 exists
Jan 10, 2019 11:59:59 AM INFO com.microsoft.azure.vmagent.AzureVMCloud$2 call Found existing node, starting VM vimbuild6a1390
Jan 10, 2019 11:59:59 AM INFO com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate startVirtualMachine AzureVMManagementServiceDelegate: startVirtualMachine: vimbuild6a1390
Jan 10, 2019 11:59:59 AM INFO com.microsoft.azure.vmagent.AzureVMCloud provision AzureVMCloud: provision: asynchronous provision finished, returning 1 planned node(s)
Jan 10, 2019 12:02:02 PM FINE com.microsoft.azure.vmagent.AzureVMMaintainPoolTask Started Azure VM Maintainer Pool Size
Jan 10, 2019 12:02:02 PM FINE com.microsoft.azure.vmagent.AzureVMMaintainPoolTask Finished Azure VM Maintainer Pool Size. 0 ms
Jan 10, 2019 12:02:07 PM FINE com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask Started Azure VM Agents Clean Task
Jan 10, 2019 12:02:07 PM FINE com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask AzureVMAgentCleanUpTask: execute: start
Jan 10, 2019 12:02:07 PM FINE com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask AzureVMAgentCleanUpTask: execute: Running clean with 5 minute timeout
Jan 10, 2019 12:02:07 PM FINE com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask AzureVMAgentCleanUpTask: cleanVMs: node vimbuild6a1390 blocked to cleanup
Jan 10, 2019 12:02:07 PM FINE com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask AzureVMAgentCleanUpTask: cleanDeployments: Cleaning deployments
Jan 10, 2019 12:02:07 PM FINE com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask AzureVMAgentCleanUpTask: cleanDeployments: Checking deployment vimbuild0110113128980
Jan 10, 2019 12:02:07 PM FINE com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask AzureVMAgentCleanUpTask: cleanDeployments: Deployment created on 1/10/19 11:36 AM
Jan 10, 2019 12:02:07 PM FINE com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask AzureVMAgentCleanUpTask: cleanDeployments: Deployment newer than timeout, keeping
Jan 10, 2019 12:02:07 PM FINE com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask AzureVMAgentCleanUpTask: cleanDeployments: Done cleaning deployments
Jan 10, 2019 12:02:07 PM FINE com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask AzureVMAgentCleanUpTask: execute: end
Jan 10, 2019 12:02:07 PM FINE com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask Finished Azure VM Agents Clean Task. 219 ms
Jan 10, 2019 12:02:09 PM INFO com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate setVirtualMachineDetails The Azure agent doesn't have a public IP. Will use the private IP
Jan 10, 2019 12:02:09 PM INFO com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate setVirtualMachineDetails Azure agent details: nodeNamevimbuild6a1390 adminUserName=localadmin shutdownOnIdle=true retentionTimeInMin=0 labels=azureFresh
Jan 10, 2019 12:02:09 PM INFO com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher launch AzureVMAgentSSHLauncher: launch: launch method called for agent vimbuild6a1390
Jan 10, 2019 12:02:09 PM INFO com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate isVMAliveOrHealthy AzureVMManagementServiceDelegate: isVMAliveOrHealthy: status PowerState/running
Jan 10, 2019 12:02:09 PM INFO com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher connectToSsh AzureVMAgentSSHLauncher: connectToSsh: start
Jan 10, 2019 12:02:09 PM INFO com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher getRemoteSession AzureVMAgentSSHLauncher: getRemoteSession: getting remote session for user localadmin to host 10.228.4.26:22
Jan 10, 2019 12:02:10 PM INFO com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher getRemoteSession AzureVMAgentSSHLauncher: getRemoteSession: Got remote session for user localadmin to host 10.228.4.26:22
Jan 10, 2019 12:02:10 PM INFO com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher connectToSsh AzureVMAgentSSHLauncher: connectToSsh: Got remote connection
Jan 10, 2019 12:02:10 PM INFO com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher executeRemoteCommand AzureVMAgentSSHLauncher: executeRemoteCommand: starting dir C:\.azure-agent-init
Jan 10, 2019 12:02:10 PM INFO com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher executeRemoteCommand AzureVMAgentSSHLauncher: executeRemoteCommand: executed successfully
Jan 10, 2019 12:02:10 PM INFO com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher launch AzureVMAgentSSHLauncher: launch: checking for java runtime
Jan 10, 2019 12:02:10 PM INFO com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher executeRemoteCommand AzureVMAgentSSHLauncher: executeRemoteCommand: starting java -fullversion
Jan 10, 2019 12:02:10 PM INFO com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher executeRemoteCommand AzureVMAgentSSHLauncher: executeRemoteCommand: executed successfully
Jan 10, 2019 12:02:10 PM INFO com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher launch AzureVMAgentSSHLauncher: launch: java runtime present, copying slave.jar to remote
Jan 10, 2019 12:02:10 PM INFO com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher copyFileToRemote AzureVMAgentSSHLauncher: copyFileToRemote: Initiating file transfer to slave.jar
Jan 10, 2019 12:02:20 PM INFO com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher copyFileToRemote AzureVMAgentSSHLauncher: copyFileToRemote: copied file Successfully to slave.jar
Jan 10, 2019 12:02:20 PM INFO com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher launch AzureVMAgentSSHLauncher: launch: launching agent: java -jar slave.jar
Jan 10, 2019 12:02:20 PM INFO com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher launch AzureVMAgentSSHLauncher: launch: Connected successfully
Hi balaal, you are right. This issue is caused by the code: https://github.com/jenkinsci/azure-vm-agents-plugin/blob/ea8dc0b19184c381172afac0a41cac46435eaaf4/src/main/java/com/microsoft/azure/vmagent/AzureVMCloud.java#L707 may throw exceptions in some cases and then skip azureComputer.setAcceptingTasks(true). I made a change that simply catches and ignores the exceptions, but that does not seem like a proper way to fix it. I need to find out why the exceptions are thrown and how to avoid them.
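To make the control flow concrete, here is a small self-contained Groovy sketch of the pattern being described; startReusedVm is a stand-in closure for whatever the call at AzureVMCloud.java#L707 actually does, and the boolean stands in for the computer's accepting-tasks flag, so none of this is the plugin's real code:

// Stand-in for the call around AzureVMCloud.java#L707 that can throw in some cases.
def startReusedVm = { -> throw new RuntimeException('simulated failure while starting the reused VM') }
boolean acceptingTasks = false   // plays the role of azureComputer.isAcceptingTasks()

try {
    startReusedVm()
} catch (Exception e) {
    // Without this catch the exception propagates, the line below never runs,
    // and the node stays (suspended) until someone calls setAcceptingTasks(true) by hand.
    println "start failed but continuing: ${e.message}"
}
acceptingTasks = true            // corresponds to azureComputer.setAcceptingTasks(true)
println "acceptingTasks = ${acceptingTasks}"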
We are experiencing similar issues, but unlike the OP, executing:
Jenkins.instance.getNode('azure-nodeXXXXXX').toComputer().setAcceptingTasks(true);
does not solve anything.
When the node gets shut down and deallocated, it is started again when a build with its label starts. The build, however, does not acquire the node. When we execute the statement above, we can see that the node gets acquired, but nothing more happens; the build does not continue.
Any help would be greatly appreciated! We are keeping the nodes online for now (idle timeout set to basically infinite) and not shutting them down, which makes this a rather expensive solution for the moment...
EDIT: if any extra logging/debugging is needed, let me know!
I ended up switching to JNLP on the latest release with the try/catch implemented. I wanted SSH, but this gets me going for now.
Hi, I am having the same issues with the Jenkins and Azure integration. Is there any upcoming update to resolve this issue? In my environment the plugins involved are:
Azure VM Agents, SSH Slaves, Branch API, and there could be more.
Thanks,
Hi all, I can provide a preview build which will force a reconnect and may mitigate the issue. I cannot be sure it will work in every case, since this problem is hard to reproduce.
Currently, "BRANCH API" is causing this issue after upgrading "Azure VM" plugin
I have released a preview version at https://github.com/jenkinsci/azure-vm-agents-plugin/releases/tag/v0.9.1-PREVIEW.
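For anyone who cannot move to the preview build yet, a rough approximation of "force reconnect" can be tried by hand from the script console. This is only a sketch using core Jenkins API and the placeholder node name from earlier in this thread; connect(true) cancels any connect attempt already in progress and starts a fresh one:

def computer = Jenkins.instance.getNode('azure-nodeXXXXXX')?.toComputer()
if (computer != null) {
    computer.setAcceptingTasks(true)   // clear the (suspended) flag first
    computer.connect(true).get()       // true = cancel an in-progress connect attempt and retry
}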
We have a similar issue to this. In our log there are a number of "agent XXX is always shut down" messages, however the agent is still running.
It seems the omsagent will not allow the plugin to shut down the VM during updates/patching.
The plugin then marks the VM as eligibleForReuse even though it is not shut down.
Please consider adding a try/catch here to check whether the VM actually shuts down.
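The shape of that check might look something like the sketch below. This is only an illustration of the suggestion, not the plugin's actual code: it assumes the Azure "fluent" management SDK (com.microsoft.azure.management.*), and the caller decides what to do with the returned flag (e.g. only mark the VM eligible for reuse when it is true):

import com.microsoft.azure.management.Azure
import com.microsoft.azure.management.compute.PowerState
import com.microsoft.azure.management.compute.VirtualMachine

// Returns true only if the VM really reached the deallocated state.
boolean shutDownCompletely(Azure azure, String resourceGroup, String vmName) {
    try {
        VirtualMachine vm = azure.virtualMachines().getByResourceGroup(resourceGroup, vmName)
        vm.deallocate()                       // can fail, e.g. while the omsagent is being updated/patched
        return vm.refresh().powerState() == PowerState.DEALLOCATED
    } catch (Exception e) {
        println "shutdown of ${vmName} failed: ${e.message}"
        return false                          // do not treat the VM as shut down or reusable
    }
}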
I have the same issue: my Linux SSH node goes online but keeps the suspended state forever.
The following code in the Jenkins script console does not work:
Jenkins.instance.getNode('XXXXXX').toComputer().setAcceptingTasks(true);
I am even unable to delete the node; I have to restart the whole Jenkins server to get back to a sane state.
From the thread dumps, I can see that the threads are kept in a BLOCKED state.
If I read the thread dump right, AzureVMCloud.java:693 and WorkspaceLocatorImpl.java:534 are both trying to acquire a lock on the agent (see the thread dumps below).
I guess the problem is in the following code:
synchronized (agentNode) {
    ...
    azureComputer.connect(false).get();
    ...
}
I think the get() call should be outside of the synchronized block, because WorkspaceLocatorImpl needs to run for the connect call to finish but also tries to acquire the lock, while AzureVMCloud will only release the lock once the connect call has finished.
But this is just a guess, let me know if more info or testing is needed.
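For illustration, the reordering described above could look roughly like the following sketch. It is not the plugin's actual code (the bookkeeping inside the synchronized block is summarised by a comment), but it shows the idea: kick off the connection while holding the agent monitor, and only wait for the Future after releasing it, so that listeners such as WorkspaceLocatorImpl.onOnline can take the same lock while the connection completes:

import hudson.model.Computer
import hudson.model.Node
import java.util.concurrent.Future

void connectWithoutHoldingAgentLock(Node agentNode, Computer azureComputer) throws Exception {
    Future<?> connectFuture = null
    synchronized (agentNode) {
        // start the VM, setAcceptingTasks(true), and any other bookkeeping that needs the lock...
        connectFuture = azureComputer.connect(false)   // only start the connection here
    }
    // The monitor on agentNode is released at this point, so onOnline listeners can proceed.
    connectFuture.get()
}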
Channel reader thread: azure-build0132c0

"Channel reader thread: azure-build0132c0" Id=2973 Group=main TIMED_WAITING on com.jcraft.jsch.Channel$MyPipedInputStream@2dda23a2
    at java.lang.Object.wait(Native Method)
    - waiting on com.jcraft.jsch.Channel$MyPipedInputStream@2dda23a2
    at java.io.PipedInputStream.read(PipedInputStream.java:326)
    at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:91)
    at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
    at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
    at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)

Computer.threadPoolForRemoting [#694]

"Computer.threadPoolForRemoting [#694]" Id=2965 Group=main BLOCKED on com.microsoft.azure.vmagent.AzureVMAgent@54a8c39d owned by "Computer.threadPoolForRemoting [#695]" Id=2966
    at jenkins.branch.WorkspaceLocatorImpl$Collector.onOnline(WorkspaceLocatorImpl.java:534)
    - blocked on com.microsoft.azure.vmagent.AzureVMAgent@54a8c39d
    at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:701)
    at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:435)
    at com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher.launch(AzureVMAgentSSHLauncher.java:250)
    at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:297)
    at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
    at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

    Number of locked synchronizers = 1
    - java.util.concurrent.ThreadPoolExecutor$Worker@1bfc5e6f

Computer.threadPoolForRemoting [#695]

"Computer.threadPoolForRemoting [#695]" Id=2966 Group=main WAITING on java.util.concurrent.FutureTask@5d09196c
    at sun.misc.Unsafe.park(Native Method)
    - waiting on java.util.concurrent.FutureTask@5d09196c
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
    at java.util.concurrent.FutureTask.get(FutureTask.java:191)
    at com.microsoft.azure.vmagent.AzureVMCloud.lambda$provision$1(AzureVMCloud.java:693)
    - locked com.microsoft.azure.vmagent.AzureVMAgent@54a8c39d
    at com.microsoft.azure.vmagent.AzureVMCloud$$Lambda$365/1217654884.call(Unknown Source)
    at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
    at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

    Number of locked synchronizers = 1
    - java.util.concurrent.ThreadPoolExecutor$Worker@26516d0
All issues have been transferred to GitHub.
See https://github.com/jenkinsci/azure-vm-agents-plugin/issues
Search the issue title to find it.
(This is a bulk comment and can't link to the specific issue)
Thanks for your report, I will investigate this issue.