• Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Environment: Jenkins ver. 2.138.3 with Azure VM Agents Plugin ver. 0.7.4

      We are currently using the Azure VM Agents plugin for our CI/CD environment with custom images.

      Our images are one of:

      • Image 1
        • Linux (With SSH launch)
        • Idle retention time (10 min)
        • Shutdown Only enabled
      • Image 2
        • Windows (With JNLP launch)
        • Idle retention time (20 min)
        • Shutdown Only enabled

      Our issue: an agent gets provisioned and launched successfully and is usable. After the idle timeout the node shuts down successfully and the (offline) and (suspended) flags are correctly applied to it. When another build comes along and attempts to use it, the plugin correctly launches the previously shut-down VM; it starts and connects correctly, but it keeps the (suspended) flag, making it unusable. I can fix it manually by issuing the following command in the Jenkins script console:

      Jenkins.instance.getNode('azure-nodeXXXXXX').toComputer().setAcceptingTasks(true);

      The build then goes to the node and builds successfully; if that command is not issued, however, the build gets stuck waiting for the node to come online.
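
      For reference, a bulk version of that workaround can be pasted into the script console. The following is only a sketch (Groovy; AzureVMAgent is the plugin's agent class): it re-enables task acceptance on every online Azure agent that is still refusing tasks.

      import jenkins.model.Jenkins
      import com.microsoft.azure.vmagent.AzureVMAgent

      // Sketch only: clear the lingering (suspended) state on every Azure
      // agent that came back online but is still not accepting tasks.
      for (node in Jenkins.instance.nodes) {
          if (node instanceof AzureVMAgent) {
              def c = node.toComputer()
              if (c != null && c.online && !c.isAcceptingTasks()) {
                  c.setAcceptingTasks(true)
                  println "Re-enabled task acceptance on ${node.nodeName}"
              }
          }
      }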

      This issue happens with both types of images.

          [JENKINS-54776] Node stays suspended after node is started

          Sander Bel added a comment - edited

          We are experiencing similar issues, but unlike the OP, executing:

          Jenkins.instance.getNode('azure-nodeXXXXXX').toComputer().setAcceptingTasks(true);
          

          does not solve anything.

          When the node gets shut down and deallocated, it is started again when a build with its label starts. The build, however, does not acquire the node. When we execute the statement above, we can see that the node gets acquired, but nothing more happens; the build does not continue.

          Any help would be greatly appreciated! We are keeping the nodes online for now (idle timeout set to basically infinite) instead of shutting them down, which makes this a rather expensive solution for the moment...

          EDIT: if any extra logging/debugging is needed, let me know!


          Balaal Ashraf added a comment -

          I ended up switching to JNLP on the latest release with the try/catch implemented. I wanted SSH, but this gets me going for now.


          Adrian .. added a comment -

          Hi, I am having the same issues with the Jenkins and Azure integration. Is there any upcoming update to resolve this issue? In my environment, the plugins involved are:

          Azure VM Agents, SSH Slaves, Branch API, and there could be more.

          Thanks,


          Łukasz Umbras-Nichnerowicz added a comment -

          Any update on this?

          Jie Shen added a comment -

          Hi all, I can provide a preview build that forces a reconnect and may mitigate the issue. I cannot guarantee it will work in every case, since this problem is hard to reproduce.


          Adrian .. added a comment -

          Currently, "BRANCH API" is causing this issue after upgrading "Azure VM" plugin


          Jie Shen added a comment -

          I have released a preview version at https://github.com/jenkinsci/azure-vm-agents-plugin/releases/tag/v0.9.1-PREVIEW.

          nicholas robinson added a comment -

          We have a similar issue to this. In our log there are a number of "agent XXX is always shut down" messages; however, the agent is still running.

          It seems the omsagent will not allow the plugin to shut down the VM during updates/patching.

          The plugin then marks the VM as eligibleForReuse even though it is not shut down.

          Please consider adding a try/catch here:

          https://github.com/jenkinsci/azure-vm-agents-plugin/blob/46658bae2f22df3c8aaeb73d707816d766ca9dbc/src/main/java/com/microsoft/azure/vmagent/AzureVMAgent.java#L492

          to check whether the VM actually shuts down.
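
          A sketch of what such a guard might look like (illustrative only: shutdownVirtualMachine() and LOGGER are stand-in names, not necessarily the plugin's real API; the eligible-for-reuse flag is the one mentioned above):

          // Sketch only: mark the VM reusable only if the shutdown call
          // actually succeeded (the omsagent can block deallocation during
          // updates/patching).
          try {
              shutdownVirtualMachine();      // stand-in for the real shutdown call
              setEligibleForReuse(true);
          } catch (Exception e) {
              LOGGER.log(Level.WARNING, "Failed to shut down agent "
                      + getNodeName() + "; not marking it eligible for reuse", e);
              setEligibleForReuse(false);
          }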


          Nils Rudolph added a comment -

          I have the same issue: my Linux SSH node goes online but keeps the suspended state forever.

          The following code in the Jenkins script console does not work:

          Jenkins.instance.getNode('XXXXXX').toComputer().setAcceptingTasks(true);
          

          I am even unable to delete the node; I have to restart the whole Jenkins server to get back to a sane state.

          From the thread dumps, I can see that the threads are kept in a BLOCKED state.

          If I read the thread dumps right, AzureVMCloud.java:693 and WorkspaceLocatorImpl.java:534 are both trying to acquire a lock on the agent (see the thread dumps below).

          I guess the problem is in the following code:

          synchronized (agentNode) {
              ...
              azureComputer.connect(false).get();
              ...
          }
          

          I think the get() call should be outside of the synchronized block: WorkspaceLocatorImpl needs to run for the connect call to finish, but it also tries to acquire the lock, and AzureVMCloud only releases the lock once the connect call has finished.

          But this is just a guess; let me know if more info or testing is needed.
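
          A sketch of that restructuring (following the fragment above; the rest of the provision() code is omitted):

          // Sketch only: hold the agent lock just long enough to start the
          // connection, then wait outside the lock so onOnline listeners such
          // as WorkspaceLocatorImpl, which also synchronize on the agent, can
          // run and let the connect future complete.
          Future<?> connectFuture;
          synchronized (agentNode) {
              // ... work that genuinely needs the lock ...
              connectFuture = azureComputer.connect(false);
          }
          connectFuture.get();   // may still block, but no longer while holding the lock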

          Channel reader thread: azure-build0132c0
          "Channel reader thread: azure-build0132c0" Id=2973 Group=main TIMED_WAITING on com.jcraft.jsch.Channel$MyPipedInputStream@2dda23a2
          	at java.lang.Object.wait(Native Method)
          	-  waiting on com.jcraft.jsch.Channel$MyPipedInputStream@2dda23a2
          	at java.io.PipedInputStream.read(PipedInputStream.java:326)
          	at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:91)
          	at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
          	at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
          	at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
          	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
          
          Computer.threadPoolForRemoting [#694]
          
          "Computer.threadPoolForRemoting [#694]" Id=2965 Group=main BLOCKED on com.microsoft.azure.vmagent.AzureVMAgent@54a8c39d owned by "Computer.threadPoolForRemoting [#695]" Id=2966
          	at jenkins.branch.WorkspaceLocatorImpl$Collector.onOnline(WorkspaceLocatorImpl.java:534)
          	-  blocked on com.microsoft.azure.vmagent.AzureVMAgent@54a8c39d
          	at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:701)
          	at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:435)
          	at com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher.launch(AzureVMAgentSSHLauncher.java:250)
          	at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:297)
          	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
          	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          	Number of locked synchronizers = 1
          	- java.util.concurrent.ThreadPoolExecutor$Worker@1bfc5e6f
          
          Computer.threadPoolForRemoting [#695]
          
          "Computer.threadPoolForRemoting [#695]" Id=2966 Group=main WAITING on java.util.concurrent.FutureTask@5d09196c
          	at sun.misc.Unsafe.park(Native Method)
          	-  waiting on java.util.concurrent.FutureTask@5d09196c
          	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
          	at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
          	at java.util.concurrent.FutureTask.get(FutureTask.java:191)
          	at com.microsoft.azure.vmagent.AzureVMCloud.lambda$provision$1(AzureVMCloud.java:693)
          	-  locked com.microsoft.azure.vmagent.AzureVMAgent@54a8c39d
          	at com.microsoft.azure.vmagent.AzureVMCloud$$Lambda$365/1217654884.call(Unknown Source)
          	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
          	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
          	Number of locked synchronizers = 1
          	- java.util.concurrent.ThreadPoolExecutor$Worker@26516d0
          


          Tim Jacomb added a comment -

          All issues have been transferred to GitHub.

          See https://github.com/jenkinsci/azure-vm-agents-plugin/issues

          Search the issue title to find it.

          (This is a bulk comment and can't link to the specific issue)


            Assignee: Jie Shen (jieshe)
            Reporter: Daniel McAssey (glokon)
            Votes: 9
            Watchers: 15