I have the same issue: my Linux SSH node comes online but stays in the suspended state forever.
The following code in the Jenkins script console does not work:
Jenkins.instance.getNode('XXXXXX').toComputer().setAcceptingTasks(true);
I cannot even delete the node; I have to restart the whole Jenkins server to get back to a sane state.
From the thread dumps, I can see that a thread is stuck in the BLOCKED state.
If I read the thread dump right, AzureVMCloud.java:693 holds the lock on the agent while waiting on a future, and WorkspaceLocatorImpl.java:534 is blocked trying to acquire the same lock. (See the thread dumps below.)
I guess the problem is in the following code:
synchronized (agentNode) {
    ...
    azureComputer.connect(false).get();
    ...
}
I think the get() call should be outside the synchronized block: WorkspaceLocatorImpl needs to run for the connect call to finish, but it also tries to acquire the lock, and AzureVMCloud only releases the lock once the connect call has finished.
But this is just a guess; let me know if more information or testing is needed.
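To make the suggestion concrete, here is a minimal, self-contained sketch of the pattern I have in mind. It is not the plugin's actual code: LockOrderSketch, launchAndNotifyListeners and provision are hypothetical stand-ins, and a plain Object plays the role of the agent node. The only point is that the future is obtained while holding the lock, but get() is called after the lock is released.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in classes and names; not the azure-vm-agents plugin code.
class LockOrderSketch {
    // Plays the role of the agent node that both threads synchronize on.
    private static final Object agentNode = new Object();

    // Plays the role of the launch thread: it needs the agent lock to finish,
    // like the onOnline listener in the thread dump.
    static String launchAndNotifyListeners() {
        synchronized (agentNode) {
            return "online";
        }
    }

    static String provision(long timeoutMillis) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<String> connectTask = LockOrderSketch::launchAndNotifyListeners;
            Future<String> connecting;
            synchronized (agentNode) {
                // Only start the connect while holding the lock.
                connecting = pool.submit(connectTask);
            }
            // Wait after the lock is released, so the launch thread can take it.
            return connecting.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(provision(5000));
    }
}
```

If the get() call is moved back inside the synchronized block in this sketch, the submitted task can never acquire agentNode and the call ends in a TimeoutException, which mirrors the hang in the thread dumps. The timeout on get() is only there to keep the sketch from hanging; whether the real code should use one is a separate question.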
Channel reader thread: azure-build0132c0
"Channel reader thread: azure-build0132c0" Id=2973 Group=main TIMED_WAITING on com.jcraft.jsch.Channel$MyPipedInputStream@2dda23a2
at java.lang.Object.wait(Native Method)
- waiting on com.jcraft.jsch.Channel$MyPipedInputStream@2dda23a2
at java.io.PipedInputStream.read(PipedInputStream.java:326)
at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:91)
at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
Computer.threadPoolForRemoting [#694]
"Computer.threadPoolForRemoting [#694]" Id=2965 Group=main BLOCKED on com.microsoft.azure.vmagent.AzureVMAgent@54a8c39d owned by "Computer.threadPoolForRemoting [#695]" Id=2966
at jenkins.branch.WorkspaceLocatorImpl$Collector.onOnline(WorkspaceLocatorImpl.java:534)
- blocked on com.microsoft.azure.vmagent.AzureVMAgent@54a8c39d
at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:701)
at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:435)
at com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher.launch(AzureVMAgentSSHLauncher.java:250)
at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:297)
at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker@1bfc5e6f
Computer.threadPoolForRemoting [#695]
"Computer.threadPoolForRemoting [#695]" Id=2966 Group=main WAITING on java.util.concurrent.FutureTask@5d09196c
at sun.misc.Unsafe.park(Native Method)
- waiting on java.util.concurrent.FutureTask@5d09196c
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at com.microsoft.azure.vmagent.AzureVMCloud.lambda$provision$1(AzureVMCloud.java:693)
- locked com.microsoft.azure.vmagent.AzureVMAgent@54a8c39d
at com.microsoft.azure.vmagent.AzureVMCloud$$Lambda$365/1217654884.call(Unknown Source)
at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker@26516d0
We are experiencing similar issues, but unlike the OP, executing:
does not solve anything.
When the node is shut down and deallocated, it is started again when a build with its label starts. The build, however, does not acquire the node. When we execute the statement above, we can see that the node gets acquired, but nothing more happens and the build does not continue.
Any help would be greatly appreciated! For now we are keeping the nodes online (idle timeout set to basically infinite) instead of shutting them down, which makes it a rather expensive workaround for the moment...
EDIT: if any extra logging/debugging is needed, let me know!