-
Bug
-
Resolution: Fixed
-
Major
-
None
-
Jenkins 2.332.3, OpenJDK 11.0.15, running on Ubuntu 20.04
SSH Slaves Plugin 1.814.vc82988f54b_10 (tested with 1.33.0 as well)
Anka Build Plugin 2.7.0
-
1.821.vd834f8a_c390e
The observed error is that agents simply hang while starting. This happens for about 5% of the VMs started in this manner.
The Anka Build plugin is used, and the VM it spins up is 100% functional.
Investigating the thread dump shows a deadlock between the launch and
tearDownConnection methods in SSHLauncher.
I have attached the stack traces of both threads as files.
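As a cross-check on the thread dump, the JVM can be asked directly whether it sees a lock cycle. The following is a minimal sketch using the standard `java.lang.management` API; the class and method names (`DeadlockCheck`, `describeDeadlocks`) are hypothetical and not part of Jenkins or the plugin. Note that this only reports cycles of monitors and ownable synchronizers, so a thread that is merely waiting for a reply from the remote side will not be reported.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Jenkins: asks the JVM for lock cycles.
public class DeadlockCheck {

    /** Returns a description of deadlocked threads, or a fixed message if none are found. */
    public static String describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Returns null when no monitor/synchronizer deadlock is detected.
        long[] ids = mx.findDeadlockedThreads();
        if (ids == null) {
            return "no deadlock detected";
        }
        StringBuilder sb = new StringBuilder();
        // Include locked monitors and synchronizers so lock owners are visible.
        for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
            if (info != null) {
                sb.append(info);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describeDeadlocks());
    }
}
```

If this reports nothing while agents are hung, the hang is probably not a pure JVM lock cycle, and the waiting threads need to be read by hand.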
The launch method seems to be hanging while executing this:
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(java.base@11.0.15/Native Method)
- waiting on <no object reference available>
at hudson.remoting.Request.call(Request.java:177) - waiting to re-lock in wait() <0x00000005f9721350> (a hudson.remoting.UserRequest)
at hudson.remoting.Channel.call(Channel.java:999)
at hudson.FilePath.act(FilePath.java:1194)
at hudson.FilePath.act(FilePath.java:1183)
at hudson.FilePath.exists(FilePath.java:1748)
at jenkins.branch.WorkspaceLocatorImpl.load(WorkspaceLocatorImpl.java:254)
at jenkins.branch.WorkspaceLocatorImpl.access$500(WorkspaceLocatorImpl.java:86)
at jenkins.branch.WorkspaceLocatorImpl$Collector.onOnline(WorkspaceLocatorImpl.java:601) - locked <0x00000005f97214e0> (a java.lang.String)
at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:727)
at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:437)
at hudson.plugins.sshslaves.SSHLauncher.startAgent(SSHLauncher.java:645)
at hudson.plugins.sshslaves.SSHLauncher.lambda$launch$0(SSHLauncher.java:458)
at hudson.plugins.sshslaves.SSHLauncher$$Lambda$393/0x0000000840c2c040.call(Unknown Source)
at java.util.concurrent.FutureTask.run(java.base@11.0.15/FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.15/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.15/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base@11.0.15/Thread.java:829)
[JENKINS-68656] SSH Slaves Plugin Deadlock while spinning up a new agent
We just found out that executing sudo kill -9 <pid> on the SSHD process for that specific connection on a VM results in a channel failure; Jenkins then recognizes that the channel is broken and cleans everything up.
Agent log (does not look exactly the same every time):
[06/01/22 11:06:51] [SSH] Checking java version of /Library/Java/JavaVirtualMachines/temurin-11.jdk/Contents/Home//bin/java
[06/01/22 11:06:51] [SSH] /Library/Java/JavaVirtualMachines/temurin-11.jdk/Contents/Home//bin/java -version returned 11.0.14.
[06/01/22 11:06:51] [SSH] Starting sftp client.
[06/01/22 11:06:51] [SSH] Copying latest remoting.jar...
[06/01/22 11:06:52] [SSH] Copied 1,524,115 bytes.
Expanded the channel window size to 4MB
[06/01/22 11:06:52] [SSH] Starting agent process: cd "/usr/local/mobile/mnt/workspaces" && /Library/Java/JavaVirtualMachines/temurin-11.jdk/Contents/Home//bin/java -jar remoting.jar -workDir /usr/local/mobile/mnt/workspaces -jar-cache /usr/local/mobile/mnt/workspaces/remoting/jarCache
Jun 01, 2022 11:06:52 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using /usr/local/mobile/mnt/workspaces/remoting as a remoting work directory
Jun 01, 2022 11:06:53 AM org.jenkinsci.remoting.engine.WorkDirManager setupLogging
INFO: Both error and output logs will be printed to /usr/local/mobile/mnt/workspaces/remoting
<===[JENKINS REMOTING CAPACITY]===>channel started
Remoting version: 4.13
This is a Unix agent
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by jenkins.slaves.StandardOutputSwapper$ChannelSwapper to constructor java.io.FileDescriptor(int)
WARNING: Please consider reporting this to the maintainers of jenkins.slaves.StandardOutputSwapper$ChannelSwapper
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Evacuated stdout
Jun 01, 2022 11:15:54 AM hudson.slaves.ChannelPinger$1 onDead
INFO: Ping failed. Terminating the channel channel.
java.util.concurrent.TimeoutException: Ping started at 1654081914259 hasn't completed by 1654082154266
at hudson.remoting.PingThread.ping(PingThread.java:132)
at hudson.remoting.PingThread.run(PingThread.java:88)
Also, from Jenkins system log:
WARNING c.c.j.s.i.AboutJenkins$NodesContent#printTo: Could not get agent.jar version for AnkaOB-ephemeral-macos-12.2-xcode13.3-special-test-fAuJ3
java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:204)
at com.cloudbees.jenkins.support.util.CallAsyncWrapper.callAsync(CallAsyncWrapper.java:24)
Caused: java.io.IOException
at com.cloudbees.jenkins.support.util.CallAsyncWrapper.callAsync(CallAsyncWrapper.java:29)
at com.cloudbees.jenkins.support.AsyncResultCache.get(AsyncResultCache.java:59)
at com.cloudbees.jenkins.support.AsyncResultCache.get(AsyncResultCache.java:33)
at com.cloudbees.jenkins.support.impl.AboutJenkins$NodesContent.printTo(AboutJenkins.java:679)
at com.cloudbees.jenkins.support.api.PrefilteredPrintedContent.writeTo(PrefilteredPrintedContent.java:63)
at com.cloudbees.jenkins.support.api.PrefilteredPrintedContent.writeTo(PrefilteredPrintedContent.java:56)
at com.cloudbees.jenkins.support.SupportPlugin.writeBundle(SupportPlugin.java:377)
at com.cloudbees.jenkins.support.SupportPlugin.writeBundle(SupportPlugin.java:316)
at com.cloudbees.jenkins.support.SupportAction.prepareBundle(SupportAction.java:357)
at com.cloudbees.jenkins.support.SupportAction.doGenerateAllBundles(SupportAction.java:307)
at java.base/java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:710)
at org.kohsuke.stapler.Function$MethodFunction.invoke(Function.java:398)
at org.kohsuke.stapler.Function$InstanceFunction.invoke(Function.java:410)
at org.kohsuke.stapler.interceptor.RequirePOST$Processor.invoke(RequirePOST.java:78)
at org.kohsuke.stapler.PreInvokeInterceptedFunction.invoke(PreInvokeInterceptedFunction.java:26)
at org.kohsuke.stapler.Function.bindAndInvoke(Function.java:208)
at org.kohsuke.stapler.Function.bindAndInvokeAndServeResponse(Function.java:141)
at org.kohsuke.stapler.MetaClass$11.doDispatch(MetaClass.java:558)
at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:59)
at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:766)
at org.kohsuke.stapler.Stapler.invoke(Stapler.java:898)
at org.kohsuke.stapler.MetaClass$9.dispatch(MetaClass.java:475)
at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:766)
at org.kohsuke.stapler.Stapler.invoke(Stapler.java:898)
at org.kohsuke.stapler.Stapler.invoke(Stapler.java:694)
at org.kohsuke.stapler.Stapler.service(Stapler.java:240)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1626)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:157)
at jenkins.security.ResourceDomainFilter.doFilter(ResourceDomainFilter.java:81)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:154)
at jenkins.telemetry.impl.UserLanguages$AcceptLanguageFilter.doFilter(UserLanguages.java:129)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:154)
at com.cloudbees.jenkins.support.slowrequest.SlowRequestFilter.doFilter(SlowRequestFilter.java:37)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:154)
at hudson.plugins.greenballs.GreenBallFilter.doFilter(GreenBallFilter.java:59)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:154)
at net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:239)
at net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:215)
at net.bull.javamelody.PluginMonitoringFilter.doFilter(PluginMonitoringFilter.java:88)
at org.jvnet.hudson.plugins.monitoring.HudsonMonitoringFilter.doFilter(HudsonMonitoringFilter.java:114)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:154)
at jenkins.metrics.impl.MetricsFilter.doFilter(MetricsFilter.java:125)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:154)
at hudson.util.PluginServletFilter.doFilter(PluginServletFilter.java:160)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at hudson.security.csrf.CrumbFilter.doFilter(CrumbFilter.java:154)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:94)
at jenkins.security.AcegiSecurityExceptionFilter.doFilter(AcegiSecurityExceptionFilter.java:52)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:99)
at hudson.security.UnwrapSecurityExceptionFilter.doFilter(UnwrapSecurityExceptionFilter.java:54)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:99)
at org.springframework.security.web.access.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:122)
at org.springframework.security.web.access.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:116)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:99)
at org.springframework.security.web.authentication.AnonymousAuthenticationFilter.doFilter(AnonymousAuthenticationFilter.java:109)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:99)
at org.springframework.security.web.authentication.rememberme.RememberMeAuthenticationFilter.doFilter(RememberMeAuthenticationFilter.java:102)
at org.springframework.security.web.authentication.rememberme.RememberMeAuthenticationFilter.doFilter(RememberMeAuthenticationFilter.java:93)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:99)
at org.springframework.security.web.authentication.AbstractAuthenticationProcessingFilter.doFilter(AbstractAuthenticationProcessingFilter.java:219)
at org.springframework.security.web.authentication.AbstractAuthenticationProcessingFilter.doFilter(AbstractAuthenticationProcessingFilter.java:213)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:99)
at jenkins.security.BasicHeaderProcessor.doFilter(BasicHeaderProcessor.java:97)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:99)
at org.springframework.security.web.context.SecurityContextPersistenceFilter.doFilter(SecurityContextPersistenceFilter.java:110)
at org.springframework.security.web.context.SecurityContextPersistenceFilter.doFilter(SecurityContextPersistenceFilter.java:80)
at hudson.security.HttpSessionContextIntegrationFilter2.doFilter(HttpSessionContextIntegrationFilter2.java:63)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:99)
at hudson.security.ChainedServletFilter.doFilter(ChainedServletFilter.java:111)
at hudson.security.HudsonFilter.doFilter(HudsonFilter.java:172)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at org.kohsuke.stapler.compression.CompressionFilter.doFilter(CompressionFilter.java:53)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at hudson.util.CharacterEncodingFilter.doFilter(CharacterEncodingFilter.java:86)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at org.kohsuke.stapler.DiagnosticThreadNameFilter.doFilter(DiagnosticThreadNameFilter.java:30)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at jenkins.security.SuspiciousRequestFilter.doFilter(SuspiciousRequestFilter.java:38)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:578)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1434)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1349)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.server.Server.handle(Server.java:516)
at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388)
at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:633)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:380)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:386)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
at java.base/java.lang.Thread.run(Thread.java:829)
Similar stack traces also exist for:
- WARNING c.c.j.s.i.EnvironmentVariables$2#printTo: Could not record environment of node ...
- WARNING c.c.j.s.i.AboutJenkins$NodesContent#printTo: Could not get agent.jar version for...
- WARNING c.c.j.s.i.AboutJenkins$NodesContent#printTo: Could not get Java info for...
- WARNING c.c.j.s.i.AboutJenkins$NodeChecksumsContent#printTo: Could not compute checksums on agent ...
>We just found out that executing sudo kill -9 <pid> on the SSHD process for that specific connection on a VM results in a channel failure; Jenkins then recognizes that the channel is broken and cleans everything up.
So the agent is waiting for something, and if you kill the SSHD process everything is cleaned up correctly. The connection timeout (210 seconds) should do the same thing; you can customize that timeout.
I see you are using Temurin JDK 11 on the agents, and that they run macOS. Which JDK do you use on the Jenkins controller? Do you see any correlation between the JDK versions or OS versions on the agents that fail to start?
I think it is not related to the Jenkins plugins; it looks like a JDK version/flavor or OS version issue.
The master is on Ubuntu 20.04, using OpenJDK 11.0.15.
We are seeing agents being unresponsive for 12+ minutes, so the timeout mechanism is failing somewhere :/
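For context on why a timeout may not rescue a hung launch, here is a minimal sketch of a bounded wait built on `Future.get(timeout)`. It is a generic illustration, not the plugin's actual timeout code; `BoundedLaunch` and `finishesWithin` are hypothetical names. The important caveat is that `cancel(true)` only interrupts the worker thread, and on JDK 11 a thread blocked in a classic blocking socket write generally does not react to an interrupt, so the caller gives up but the worker can stay stuck.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration of a bounded wait; not the plugin's implementation.
public class BoundedLaunch {

    /** Returns true if the task completed (normally or with an exception) within the timeout. */
    public static boolean finishesWithin(Callable<?> task, long timeout, TimeUnit unit)
            throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<?> future = pool.submit(task);
        try {
            future.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            // Only interrupts the worker; a thread stuck in native I/O may ignore this.
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            // The task ended by throwing, which still counts as "finished".
            return true;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        boolean done = finishesWithin(() -> {
            Thread.sleep(10_000); // stands in for a launch that never returns
            return null;
        }, 200, TimeUnit.MILLISECONDS);
        System.out.println("finished in time: " + done);
    }
}
```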
Do all the agents get stuck after the connection is established, or before? This message means the agent is connected and the channel has started:
<===[JENKINS REMOTING CAPACITY]===>channel started
Remoting version: 4.13
This is a Unix agent
Yes, all agents were stuck at the same stage.
This issue is evolving actually...
We introduced a delay in the Anka Build Plugin that waits for the launcher to finish before calling afterDisconnect, to avoid the above-mentioned deadlock.
We are seeing other deadlocks, but they all result in the same behavior: agents hang and do not respond.
We have a lot of data, thread dumps, logs, etc., but it is still hard to understand what is going on.
Now we are also seeing this happen in the middle of a pipeline run, i.e. after the agent started successfully.
The one thing in common we see in all cases is this stack trace:
"org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep 577: checking /usr/local/mobile/mnt/workspaces/workspace/testing-hanging-agents/test-macos-12.2-xcode13.3-pipeline(2) on AnkaOB-ephemeral-macos-12.2-xcode13.3-special-test-5Pko0 / waiting for AnkaOB-ephemeral-macos-12.2-xcode13.3-special-test-5Pko0 id=12376" id=12950 (0x3296) state=BLOCKED cpu=0%
- waiting to lock <0x6b111501> (a hudson.remoting.Channel)
owned by "Computer.threadPoolForRemoting 2368 for AnkaOB-ephemeral-macos-12.2-xcode13.3-special-test-5Pko0 id=284" id=12941 (0x328d)
at hudson.remoting.Request.call(Request.java:208)
at hudson.remoting.Channel.call(Channel.java:999)
at hudson.FilePath.act(FilePath.java:1194)
at hudson.FilePath.act(FilePath.java:1183)
at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.exitStatus(FileMonitoringTask.java:417)
at org.jenkinsci.plugins.durabletask.BourneShellScript$ShellController.exitStatus(BourneShellScript.java:301)
at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.exitStatus(FileMonitoringTask.java:409)
at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:598)
at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:549)
at java.base@11.0.15/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base@11.0.15/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base@11.0.15/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base@11.0.15/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base@11.0.15/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base@11.0.15/java.lang.Thread.run(Thread.java:829)
This is accompanied by messages in the Jenkins system log every 5 seconds:
o.j.p.w.s.concurrent.Timeout#lambda$ping$0: org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep 577: checking /usr/local/mobile/mnt/workspaces/workspace/testing-hanging-agents/test-macos-12.2-xcode13.3-pipeline(2) on AnkaOB-ephemeral-macos-12.2-xcode13.3-special-test-5Pko0 / waiting for AnkaOB-ephemeral-macos-12.2-xcode13.3-special-test-5Pko0 id=12376 unresponsive for 1 min 50 sec
This reaches up to 21 minutes of unresponsiveness in some cases.
Any idea where to look?
>Now, we are also seeing this happen even in the middle of pipeline run, so after agent started successfully.
Once the agent opens the SSH connection and the channel, the SSH agents plugin does not do anything else until the agent is disconnected. At this stage, the only things responsible for keeping the controller and agent talking are remoting and the network layer. The deadlock is a symptom of something else.
The durable task stack trace points to a method that checks the exit status in the workspace, getting the result files of the job. Do you have metrics for the agents? It looks like there is a bottleneck somewhere (CPU, network, disk I/O).
Metrics on the agents are fine.
The VMs the agent is running on are completely accessible while they hang and show no issues with networking whatsoever (both incoming and outgoing).
Another thread shows up in all 3 cases where we have hung VMs and proper logging information:
"Computer.threadPoolForRemoting 2368 for AnkaOB-ephemeral-macos-12.2-xcode13.3-special-test-5Pko0 id=284" id=12941 (0x328d) state=RUNNABLE cpu=76% (running in native)
at java.base@11.0.15/java.net.SocketOutputStream.socketWrite0(Native Method)
at java.base@11.0.15/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
at java.base@11.0.15/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
at com.trilead.ssh2.crypto.cipher.CipherOutputStream.internal_write(CipherOutputStream.java:52)
at com.trilead.ssh2.crypto.cipher.CipherOutputStream.writeBlock(CipherOutputStream.java:101)
at com.trilead.ssh2.crypto.cipher.CipherOutputStream.write(CipherOutputStream.java:118)
at com.trilead.ssh2.transport.TransportConnection.sendMessage(TransportConnection.java:179)
at com.trilead.ssh2.transport.TransportConnection.sendMessage(TransportConnection.java:107)
at com.trilead.ssh2.transport.TransportManager.sendMessage(TransportManager.java:690)
at com.trilead.ssh2.channel.ChannelManager.sendData(ChannelManager.java:431)
- locked java.lang.Object@2b433a98
at com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:63)
at hudson.remoting.ChunkedOutputStream.sendFrame(ChunkedOutputStream.java:94)
at hudson.remoting.ChunkedOutputStream.drain(ChunkedOutputStream.java:89)
at hudson.remoting.ChunkedOutputStream.write(ChunkedOutputStream.java:58)
at java.base@11.0.15/java.io.OutputStream.write(OutputStream.java:122)
at hudson.remoting.ChunkedCommandTransport.writeBlock(ChunkedCommandTransport.java:45)
at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.write(AbstractSynchronousByteArrayCommandTransport.java:46)
at hudson.remoting.Channel.send(Channel.java:765) - locked hudson.remoting.Channel@6b111501
at hudson.remoting.ProxyOutputStream.write(ProxyOutputStream.java:146) - locked hudson.remoting.ProxyOutputStream@6bff54a
at hudson.remoting.RemoteOutputStream.write(RemoteOutputStream.java:112)
at hudson.remoting.Util.copy(Util.java:58)
at hudson.remoting.JarLoaderImpl.writeJarTo(JarLoaderImpl.java:57)
at jdk.internal.reflect.GeneratedMethodAccessor435.invoke(Unknown Source)
at java.base@11.0.15/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base@11.0.15/java.lang.reflect.Method.invoke(Method.java:566)
at hudson.remoting.RemoteInvocationHandler$RPCRequest.perform(RemoteInvocationHandler.java:924)
at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:902)
at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:853)
at hudson.remoting.UserRequest.perform(UserRequest.java:211)
at hudson.remoting.UserRequest.perform(UserRequest.java:54)
at hudson.remoting.Request$2.run(Request.java:376)
at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
at hudson.remoting.InterceptingExecutorService$$Lambda$591/0x0000000840e43c40.call(Unknown Source)
at org.jenkinsci.remoting.CallableDecorator.call(CallableDecorator.java:18)
at hudson.remoting.CallableDecoratorList.lambda$applyDecorator$0(CallableDecoratorList.java:19)
at hudson.remoting.CallableDecoratorList$$Lambda$592/0x0000000840e43040.call(Unknown Source)
at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
at java.base@11.0.15/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base@11.0.15/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base@11.0.15/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base@11.0.15/java.lang.Thread.run(Thread.java:829)
"Computer.threadPoolForRemoting 307 for AnkaOB-ephemeral-macos-12.2-xcode13.3-special-test-kQIqN id=282" id=2895 (0xb4f) state=RUNNABLE cpu=77% (running in native)
at java.base@11.0.15/java.net.SocketOutputStream.socketWrite0(Native Method)
at java.base@11.0.15/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
at java.base@11.0.15/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
at com.trilead.ssh2.crypto.cipher.CipherOutputStream.internal_write(CipherOutputStream.java:52)
at com.trilead.ssh2.crypto.cipher.CipherOutputStream.writeBlock(CipherOutputStream.java:101)
at com.trilead.ssh2.crypto.cipher.CipherOutputStream.write(CipherOutputStream.java:118)
at com.trilead.ssh2.transport.TransportConnection.sendMessage(TransportConnection.java:179)
at com.trilead.ssh2.transport.TransportConnection.sendMessage(TransportConnection.java:107)
at com.trilead.ssh2.transport.TransportManager.sendMessage(TransportManager.java:690)
at com.trilead.ssh2.channel.ChannelManager.sendData(ChannelManager.java:431)
- locked java.lang.Object@180aed5d
at com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:63)
at hudson.remoting.ChunkedOutputStream.sendFrame(ChunkedOutputStream.java:94)
at hudson.remoting.ChunkedOutputStream.drain(ChunkedOutputStream.java:89)
at hudson.remoting.ChunkedOutputStream.write(ChunkedOutputStream.java:58)
at java.base@11.0.15/java.io.OutputStream.write(OutputStream.java:122)
at hudson.remoting.ChunkedCommandTransport.writeBlock(ChunkedCommandTransport.java:45)
at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.write(AbstractSynchronousByteArrayCommandTransport.java:46)
at hudson.remoting.Channel.send(Channel.java:765) - locked hudson.remoting.Channel@5a0edacf
at hudson.remoting.ProxyOutputStream.write(ProxyOutputStream.java:146) - locked hudson.remoting.ProxyOutputStream@24248033
at hudson.remoting.RemoteOutputStream.write(RemoteOutputStream.java:112)
at hudson.remoting.Util.copy(Util.java:58)
at hudson.remoting.JarLoaderImpl.writeJarTo(JarLoaderImpl.java:57)
at jdk.internal.reflect.GeneratedMethodAccessor301.invoke(Unknown Source)
at java.base@11.0.15/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base@11.0.15/java.lang.reflect.Method.invoke(Method.java:566)
at hudson.remoting.RemoteInvocationHandler$RPCRequest.perform(RemoteInvocationHandler.java:924)
at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:902)
at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:853)
at hudson.remoting.UserRequest.perform(UserRequest.java:211)
at hudson.remoting.UserRequest.perform(UserRequest.java:54)
at hudson.remoting.Request$2.run(Request.java:376)
at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
at hudson.remoting.InterceptingExecutorService$$Lambda$547/0x0000000840d79440.call(Unknown Source)
at org.jenkinsci.remoting.CallableDecorator.call(CallableDecorator.java:18)
at hudson.remoting.CallableDecoratorList.lambda$applyDecorator$0(CallableDecoratorList.java:19)
at hudson.remoting.CallableDecoratorList$$Lambda$548/0x0000000840d79840.call(Unknown Source)
at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
at java.base@11.0.15/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base@11.0.15/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base@11.0.15/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base@11.0.15/java.lang.Thread.run(Thread.java:829)
"Computer.threadPoolForRemoting 139 for AnkaOB-ephemeral-macos-12.2-xcode13.3-special-test-fAuJ3 id=97" id=657 (0x291) state=RUNNABLE cpu=85% (running in native)
at java.base@11.0.15/java.net.SocketOutputStream.socketWrite0(Native Method)
at java.base@11.0.15/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
at java.base@11.0.15/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
at com.trilead.ssh2.crypto.cipher.CipherOutputStream.internal_write(CipherOutputStream.java:52)
at com.trilead.ssh2.crypto.cipher.CipherOutputStream.writeBlock(CipherOutputStream.java:101)
at com.trilead.ssh2.crypto.cipher.CipherOutputStream.write(CipherOutputStream.java:118)
at com.trilead.ssh2.transport.TransportConnection.sendMessage(TransportConnection.java:179)
at com.trilead.ssh2.transport.TransportConnection.sendMessage(TransportConnection.java:107)
at com.trilead.ssh2.transport.TransportManager.sendMessage(TransportManager.java:690)
at com.trilead.ssh2.channel.ChannelManager.sendData(ChannelManager.java:431)
- locked java.lang.Object@59b29645
at com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:63)
at hudson.remoting.ChunkedOutputStream.sendFrame(ChunkedOutputStream.java:94)
at hudson.remoting.ChunkedOutputStream.drain(ChunkedOutputStream.java:89)
at hudson.remoting.ChunkedOutputStream.write(ChunkedOutputStream.java:58)
at java.base@11.0.15/java.io.OutputStream.write(OutputStream.java:122)
at hudson.remoting.ChunkedCommandTransport.writeBlock(ChunkedCommandTransport.java:45)
at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.write(AbstractSynchronousByteArrayCommandTransport.java:46)
at hudson.remoting.Channel.send(Channel.java:765) - locked hudson.remoting.Channel@2256db80
at hudson.remoting.ProxyOutputStream.write(ProxyOutputStream.java:146) - locked hudson.remoting.ProxyOutputStream@2abbfaf6
at hudson.remoting.RemoteOutputStream.write(RemoteOutputStream.java:112)
at hudson.remoting.Util.copy(Util.java:58)
at hudson.remoting.JarLoaderImpl.writeJarTo(JarLoaderImpl.java:57)
at jdk.internal.reflect.GeneratedMethodAccessor411.invoke(Unknown Source)
at java.base@11.0.15/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base@11.0.15/java.lang.reflect.Method.invoke(Method.java:566)
at hudson.remoting.RemoteInvocationHandler$RPCRequest.perform(RemoteInvocationHandler.java:924)
at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:902)
at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:853)
at hudson.remoting.UserRequest.perform(UserRequest.java:211)
at hudson.remoting.UserRequest.perform(UserRequest.java:54)
at hudson.remoting.Request$2.run(Request.java:376)
at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
at hudson.remoting.InterceptingExecutorService$$Lambda$574/0x0000000840d80c40.call(Unknown Source)
at org.jenkinsci.remoting.CallableDecorator.call(CallableDecorator.java:18)
at hudson.remoting.CallableDecoratorList.lambda$applyDecorator$0(CallableDecoratorList.java:19)
at hudson.remoting.CallableDecoratorList$$Lambda$575/0x0000000840da8040.call(Unknown Source)
at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
at java.base@11.0.15/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base@11.0.15/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base@11.0.15/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base@11.0.15/java.lang.Thread.run(Thread.java:829)
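The relationship between the BLOCKED DurableTaskStep thread and the RUNNABLE threads above can be reproduced in isolation. The sketch below is a simulation under stated assumptions: `channelLock` stands in for the `hudson.remoting.Channel` monitor, and a latch stands in for a socket write that does not return. All names are hypothetical and nothing here uses the Jenkins API.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical simulation: one thread holds a monitor while "writing",
// a second thread that needs the same monitor ends up BLOCKED.
public class ChannelLockDemo {

    // Stand-in for the hudson.remoting.Channel monitor seen in the traces.
    private static final Object channelLock = new Object();

    /** Returns the state of the "checker" thread while the "writer" holds the lock. */
    public static Thread.State observeCheckerState() throws InterruptedException {
        CountDownLatch writerHoldsLock = new CountDownLatch(1);
        CountDownLatch releaseWriter = new CountDownLatch(1);

        Thread writer = new Thread(() -> {
            synchronized (channelLock) {
                writerHoldsLock.countDown();
                try {
                    releaseWriter.await(); // simulates a write that does not return
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "writer");

        Thread checker = new Thread(() -> {
            synchronized (channelLock) {
                // A real caller would send its request here.
            }
        }, "checker");

        writer.start();
        writerHoldsLock.await();
        checker.start();

        // Wait (up to 5 seconds) for the checker to reach the contended monitor.
        long start = System.nanoTime();
        while (checker.getState() != Thread.State.BLOCKED
                && System.nanoTime() - start < 5_000_000_000L) {
            Thread.sleep(10);
        }
        Thread.State observed = checker.getState();

        releaseWriter.countDown(); // let the simulated write finish
        writer.join();
        checker.join();
        return observed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("checker state: " + observeCheckerState());
    }
}
```

In this reading, the BLOCKED thread is a consequence rather than a cause: it recovers as soon as the write holding the lock completes or fails.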
Also, it seems that this is only happening in Pipelines and does not occur in freestyle jobs... :/
Also, we have 3 output sets of the "Support Core" plugin when this issue is happening. Lots of info there. I can attach if you think that will help.
I do not remember whether the Support plugin anonymizes all the sensitive info about your instance, so better not to attach it here.
I do not like to mix things in the same issue, but I think both have the same root cause. We are talking about two issues: one is that agents get stuck when they start, and the other is that agents get stuck in the middle of a pipeline execution (I am not sure whether it is at the end or at a random point).
In the stack traces you pasted before, the three threads are using more than 70% of the CPU and are stuck reading from the disk and sending that data to the Jenkins controller. The times I saw this in the past, it was related to poor I/O performance on the agent, usually caused by using NFS (or another network filesystem) for the agents' workspace. Please check the following links:
https://support.cloudbees.com/hc/en-us/articles/115003461772-IO-Troubleshooting-on-Linux
https://support.cloudbees.com/hc/en-us/articles/115003442371-Required-Data-IO-issues-on-Linux
I may be misunderstanding, but I am seeing all 3 stack traces stuck on "SocketWrite", so why are you saying it's reading? And why do you say it's reading from disk?
IIRC this is the part that grabs the classes from the Jenkins controller and stores them in the agent's local cache to run them locally, so it has a network import that is saved to disk.
at hudson.remoting.Util.copy(Util.java:58)
at hudson.remoting.JarLoaderImpl.writeJarTo(JarLoaderImpl.java:57)
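To make the read/write question concrete, a stream copy of this kind usually has the shape sketched below. This is a generic loop written for illustration, assuming a conventional buffer-based copy; it is not the actual source of `hudson.remoting.Util.copy`, and `CopySketch` is a hypothetical name. The loop alternates between reading from the source and writing to the destination, so when the destination is a remote stream that stops accepting data, the thread sits in `write`, which matches the `socketWrite0` frames at the top of the traces.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Generic copy loop for illustration; not the Jenkins remoting implementation.
public class CopySketch {

    /** Copies all bytes from in to out and returns the number of bytes copied. */
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            // If 'out' is backed by a stalled connection, the thread blocks here.
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8)), sink);
        System.out.println(copied + " bytes copied");
    }
}
```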
I just remembered that I used the Anka provider by MacStadium about 2 years ago; at least at that time, its performance was really poor.
Yeah, since major version 2 it is much better.
In any case, this is relevant https://github.com/jenkinsci/ssh-slaves-plugin/pull/304
The PR will kill the connection in the best case, but the issue of the connection being stuck in a native I/O operation will still be there, so the plugin will try to reconnect again; it could work or not.
In the case of your pipelines stuck on an I/O operation, the fix in the PR will not apply if the channel is not broken.
Does it happen with SSH agents not launched with the Anka plugin?
Do you have the logs of one of those agents, to see at which stage of the connection it is failing?