JENKINS-68656

SSH Slaves Plugin Deadlock while spinning up a new agent

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • None
    • Jenkins 2.332.3, OpenJDK 11.0.15, running on Ubuntu 20.04
      SSH Slaves Plugin 1.814.vc82988f54b_10 (tested with 1.33.0 as well)
      Anka Build Plugin 2.7.0
    • 1.821.vd834f8a_c390e

      The error observed is that agents simply hang while starting. This happens for about 5% of the VMs started in this manner.

      The Anka Build plugin is used, and the VM it spins up is 100% functional.

      Investigating the thread dump shows a deadlock between the launch and teardownConnection methods in SSHLauncher.

      I have attached the stack traces of both threads as files.

       

      The launch method seems to be hanging while executing this:
      java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(java.base@11.0.15/Native Method)
        - waiting on <no object reference available>
        at hudson.remoting.Request.call(Request.java:177)
        - waiting to re-lock in wait() <0x00000005f9721350> (a hudson.remoting.UserRequest)
        at hudson.remoting.Channel.call(Channel.java:999)
        at hudson.FilePath.act(FilePath.java:1194)
        at hudson.FilePath.act(FilePath.java:1183)
        at hudson.FilePath.exists(FilePath.java:1748)
        at jenkins.branch.WorkspaceLocatorImpl.load(WorkspaceLocatorImpl.java:254)
        at jenkins.branch.WorkspaceLocatorImpl.access$500(WorkspaceLocatorImpl.java:86)
        at jenkins.branch.WorkspaceLocatorImpl$Collector.onOnline(WorkspaceLocatorImpl.java:601)
        - locked <0x00000005f97214e0> (a java.lang.String)
        at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:727)
        at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:437)
        at hudson.plugins.sshslaves.SSHLauncher.startAgent(SSHLauncher.java:645)
        at hudson.plugins.sshslaves.SSHLauncher.lambda$launch$0(SSHLauncher.java:458)
        at hudson.plugins.sshslaves.SSHLauncher$$Lambda$393/0x0000000840c2c040.call(Unknown Source)
        at java.util.concurrent.FutureTask.run(java.base@11.0.15/FutureTask.java:264)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.15/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.15/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(java.base@11.0.15/Thread.java:829)
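
The trace shows the launch thread parked inside a remote call (hudson.remoting.Request.call) while it still holds a monitor. The general hazard can be sketched as follows. This is a generic illustration, not the plugin's code: the class name, the lock, and the latch standing in for the remote reply are all hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: a "launch" thread blocks on a reply while holding a lock
// that a "teardown" caller also needs.
final class LaunchTeardownSketch {

    /** Returns true if teardown managed to take the lock while launch was still blocked. */
    static boolean teardownAcquiresWhileLaunchBlocked() throws InterruptedException {
        ReentrantLock launcherLock = new ReentrantLock();        // stands in for a shared monitor
        CountDownLatch remoteReply = new CountDownLatch(1);      // stands in for the remote call's reply
        CountDownLatch launchHoldsLock = new CountDownLatch(1);  // lets the demo wait until the lock is held

        Thread launch = new Thread(() -> {
            launcherLock.lock();
            try {
                launchHoldsLock.countDown();
                remoteReply.await(); // blocks here while still holding the lock
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                launcherLock.unlock();
            }
        });
        launch.start();
        launchHoldsLock.await();

        // Teardown side: needs the same lock, so it cannot make progress.
        boolean acquired = launcherLock.tryLock(200, TimeUnit.MILLISECONDS);
        if (acquired) {
            launcherLock.unlock();
        }

        // Deliver the "reply" so the demo terminates; a real hang would never get this.
        remoteReply.countDown();
        launch.join();
        return acquired;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("teardown acquired lock: " + teardownAcquiresWhileLaunchBlocked());
    }
}
```

If the reply can only arrive (or fail) once teardown has run, neither side can proceed, which is the shape of a deadlock like the one reported here.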


          Ivan Fernandez Calvo added a comment - - edited

          Does it happen with SSH agents that are not launched with the Anka plugin?
          Do you have the logs of one of those agents, so we can see at which stage the connection is failing?


          niv keidan added a comment -

          We just found out that running sudo kill -9 <pid> against the SSHD process for that specific connection on the VM causes a channel failure; Jenkins then recognizes that the channel is broken and cleans everything up.

          Agent log (it does not look 100% the same every time):

          [06/01/22 11:06:51] [SSH] Checking java version of /Library/Java/JavaVirtualMachines/temurin-11.jdk/Contents/Home//bin/java
          [06/01/22 11:06:51] [SSH] /Library/Java/JavaVirtualMachines/temurin-11.jdk/Contents/Home//bin/java -version returned 11.0.14.
          [06/01/22 11:06:51] [SSH] Starting sftp client.
          [06/01/22 11:06:51] [SSH] Copying latest remoting.jar...
          [06/01/22 11:06:52] [SSH] Copied 1,524,115 bytes.
          Expanded the channel window size to 4MB
          [06/01/22 11:06:52] [SSH] Starting agent process: cd "/usr/local/mobile/mnt/workspaces" && /Library/Java/JavaVirtualMachines/temurin-11.jdk/Contents/Home//bin/java -jar remoting.jar -workDir /usr/local/mobile/mnt/workspaces -jar-cache /usr/local/mobile/mnt/workspaces/remoting/jarCache
          Jun 01, 2022 11:06:52 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
          INFO: Using /usr/local/mobile/mnt/workspaces/remoting as a remoting work directory
          Jun 01, 2022 11:06:53 AM org.jenkinsci.remoting.engine.WorkDirManager setupLogging
          INFO: Both error and output logs will be printed to /usr/local/mobile/mnt/workspaces/remoting
          <===[JENKINS REMOTING CAPACITY]===>channel started
          Remoting version: 4.13
          This is a Unix agent
          WARNING: An illegal reflective access operation has occurred
          WARNING: Illegal reflective access by jenkins.slaves.StandardOutputSwapper$ChannelSwapper to constructor java.io.FileDescriptor(int)
          WARNING: Please consider reporting this to the maintainers of jenkins.slaves.StandardOutputSwapper$ChannelSwapper
          WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
          WARNING: All illegal access operations will be denied in a future release
          Evacuated stdout
          Jun 01, 2022 11:15:54 AM hudson.slaves.ChannelPinger$1 onDead
          INFO: Ping failed. Terminating the channel channel.
          java.util.concurrent.TimeoutException: Ping started at 1654081914259 hasn't completed by 1654082154266
          at hudson.remoting.PingThread.ping(PingThread.java:132)
          at hudson.remoting.PingThread.run(PingThread.java:88)
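
The two epoch-millisecond values in the TimeoutException can be subtracted to see how long the ping was allowed to run. A small sketch of that arithmetic (the class and method names are hypothetical; only the two timestamps come from the log):

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: computes the gap between the two timestamps in the log line.
final class PingTimeoutMath {

    static Duration elapsed(long startedAtMillis, long deadlineMillis) {
        return Duration.between(
                Instant.ofEpochMilli(startedAtMillis),
                Instant.ofEpochMilli(deadlineMillis));
    }

    public static void main(String[] args) {
        // Values copied from "Ping started at ... hasn't completed by ..."
        Duration gap = elapsed(1654081914259L, 1654082154266L);
        System.out.println(gap.toMillis() + " ms (" + gap.getSeconds() + " s)");
    }
}
```

The gap is 240,007 ms, i.e. the ping waited about four minutes before the channel was terminated.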


          niv keidan added a comment -

          Also, from Jenkins system log:

          WARNING c.c.j.s.i.AboutJenkins$NodesContent#printTo: Could not get agent.jar version for AnkaOB-ephemeral-macos-12.2-xcode13.3-special-test-fAuJ3
          java.util.concurrent.TimeoutException
          at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:204)
          at com.cloudbees.jenkins.support.util.CallAsyncWrapper.callAsync(CallAsyncWrapper.java:24)
          Caused: java.io.IOException
          at com.cloudbees.jenkins.support.util.CallAsyncWrapper.callAsync(CallAsyncWrapper.java:29)
          at com.cloudbees.jenkins.support.AsyncResultCache.get(AsyncResultCache.java:59)
          at com.cloudbees.jenkins.support.AsyncResultCache.get(AsyncResultCache.java:33)
          at com.cloudbees.jenkins.support.impl.AboutJenkins$NodesContent.printTo(AboutJenkins.java:679)
          at com.cloudbees.jenkins.support.api.PrefilteredPrintedContent.writeTo(PrefilteredPrintedContent.java:63)
          at com.cloudbees.jenkins.support.api.PrefilteredPrintedContent.writeTo(PrefilteredPrintedContent.java:56)
          at com.cloudbees.jenkins.support.SupportPlugin.writeBundle(SupportPlugin.java:377)
          at com.cloudbees.jenkins.support.SupportPlugin.writeBundle(SupportPlugin.java:316)
          at com.cloudbees.jenkins.support.SupportAction.prepareBundle(SupportAction.java:357)
          at com.cloudbees.jenkins.support.SupportAction.doGenerateAllBundles(SupportAction.java:307)
          at java.base/java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:710)
          at org.kohsuke.stapler.Function$MethodFunction.invoke(Function.java:398)
          at org.kohsuke.stapler.Function$InstanceFunction.invoke(Function.java:410)
          at org.kohsuke.stapler.interceptor.RequirePOST$Processor.invoke(RequirePOST.java:78)
          at org.kohsuke.stapler.PreInvokeInterceptedFunction.invoke(PreInvokeInterceptedFunction.java:26)
          at org.kohsuke.stapler.Function.bindAndInvoke(Function.java:208)
          at org.kohsuke.stapler.Function.bindAndInvokeAndServeResponse(Function.java:141)
          at org.kohsuke.stapler.MetaClass$11.doDispatch(MetaClass.java:558)
          at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:59)
          at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:766)
          at org.kohsuke.stapler.Stapler.invoke(Stapler.java:898)
          at org.kohsuke.stapler.MetaClass$9.dispatch(MetaClass.java:475)
          at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:766)
          at org.kohsuke.stapler.Stapler.invoke(Stapler.java:898)
          at org.kohsuke.stapler.Stapler.invoke(Stapler.java:694)
          at org.kohsuke.stapler.Stapler.service(Stapler.java:240)
          at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
          at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
          at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1626)
          at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:157)
          at jenkins.security.ResourceDomainFilter.doFilter(ResourceDomainFilter.java:81)
          at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:154)
          at jenkins.telemetry.impl.UserLanguages$AcceptLanguageFilter.doFilter(UserLanguages.java:129)
          at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:154)
          at com.cloudbees.jenkins.support.slowrequest.SlowRequestFilter.doFilter(SlowRequestFilter.java:37)
          at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:154)
          at hudson.plugins.greenballs.GreenBallFilter.doFilter(GreenBallFilter.java:59)
          at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:154)
          at net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:239)
          at net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:215)
          at net.bull.javamelody.PluginMonitoringFilter.doFilter(PluginMonitoringFilter.java:88)
          at org.jvnet.hudson.plugins.monitoring.HudsonMonitoringFilter.doFilter(HudsonMonitoringFilter.java:114)
          at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:154)
          at jenkins.metrics.impl.MetricsFilter.doFilter(MetricsFilter.java:125)
          at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:154)
          at hudson.util.PluginServletFilter.doFilter(PluginServletFilter.java:160)
          at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
          at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
          at hudson.security.csrf.CrumbFilter.doFilter(CrumbFilter.java:154)
          at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
          at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
          at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:94)
          at jenkins.security.AcegiSecurityExceptionFilter.doFilter(AcegiSecurityExceptionFilter.java:52)
          at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:99)
          at hudson.security.UnwrapSecurityExceptionFilter.doFilter(UnwrapSecurityExceptionFilter.java:54)
          at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:99)
          at org.springframework.security.web.access.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:122)
          at org.springframework.security.web.access.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:116)
          at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:99)
          at org.springframework.security.web.authentication.AnonymousAuthenticationFilter.doFilter(AnonymousAuthenticationFilter.java:109)
          at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:99)
          at org.springframework.security.web.authentication.rememberme.RememberMeAuthenticationFilter.doFilter(RememberMeAuthenticationFilter.java:102)
          at org.springframework.security.web.authentication.rememberme.RememberMeAuthenticationFilter.doFilter(RememberMeAuthenticationFilter.java:93)
          at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:99)
          at org.springframework.security.web.authentication.AbstractAuthenticationProcessingFilter.doFilter(AbstractAuthenticationProcessingFilter.java:219)
          at org.springframework.security.web.authentication.AbstractAuthenticationProcessingFilter.doFilter(AbstractAuthenticationProcessingFilter.java:213)
          at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:99)
          at jenkins.security.BasicHeaderProcessor.doFilter(BasicHeaderProcessor.java:97)
          at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:99)
          at org.springframework.security.web.context.SecurityContextPersistenceFilter.doFilter(SecurityContextPersistenceFilter.java:110)
          at org.springframework.security.web.context.SecurityContextPersistenceFilter.doFilter(SecurityContextPersistenceFilter.java:80)
          at hudson.security.HttpSessionContextIntegrationFilter2.doFilter(HttpSessionContextIntegrationFilter2.java:63)
          at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:99)
          at hudson.security.ChainedServletFilter.doFilter(ChainedServletFilter.java:111)
          at hudson.security.HudsonFilter.doFilter(HudsonFilter.java:172)
          at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
          at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
          at org.kohsuke.stapler.compression.CompressionFilter.doFilter(CompressionFilter.java:53)
          at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
          at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
          at hudson.util.CharacterEncodingFilter.doFilter(CharacterEncodingFilter.java:86)
          at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
          at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
          at org.kohsuke.stapler.DiagnosticThreadNameFilter.doFilter(DiagnosticThreadNameFilter.java:30)
          at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
          at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
          at jenkins.security.SuspiciousRequestFilter.doFilter(SuspiciousRequestFilter.java:38)
          at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
          at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
          at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548)
          at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
          at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:578)
          at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
          at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
          at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
          at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
          at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1434)
          at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
          at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
          at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
          at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
          at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1349)
          at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
          at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
          at org.eclipse.jetty.server.Server.handle(Server.java:516)
          at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388)
          at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:633)
          at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:380)
          at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
          at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
          at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
          at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
          at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
          at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
          at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
          at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
          at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:386)
          at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
          at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
          at java.base/java.lang.Thread.run(Thread.java:829)


          niv keidan added a comment -

          Similar stack traces also exist for:

          • WARNING c.c.j.s.i.EnvironmentVariables$2#printTo: Could not record environment of node ...
          • WARNING c.c.j.s.i.AboutJenkins$NodesContent#printTo: Could not get agent.jar version for...
          • WARNING c.c.j.s.i.AboutJenkins$NodesContent#printTo: Could not get Java info for...
          • WARNING c.c.j.s.i.AboutJenkins$NodeChecksumsContent#printTo: Could not compute checksums on agent ...


          Ivan Fernandez Calvo added a comment -

          >We just found out that executing sudo kill -9 <pid> for SSHD process for that specific connection on a VM, will result in channel failure Jenkins will recognize that channel is broken and clean everything up.

          So the agent is waiting for something, and if you kill the SSHD service everything is cleaned up correctly. The connection timeout (210 seconds) should do the same thing; you can customize that timeout.

          I see you are using Temurin JDK 11 on the agents, and that they run macOS. Which JDK do you use on the Jenkins controller? Do you see any correlation between JDK versions or OS versions on the agents that fail to start?

          I don't think it is related to the Jenkins plugins; it looks like a JDK version/flavor or OS version issue.

          niv keidan added a comment -

          master is on Ubuntu 20.04, using openjdk 11.0.15

          We are seeing errors reporting the agent as unresponsive for 12+ minutes, so the timeout mechanism is failing somewhere :/


          Ivan Fernandez Calvo added a comment -

          Do all the agents get stuck after they finish connecting, or before? This message means the agent is connected and the channel has started:

          <===[JENKINS REMOTING CAPACITY]===>channel started
          Remoting version: 4.13
          This is a Unix agent

          niv keidan added a comment - - edited

          Yes, all agents were stuck at the same stage.

          This issue is evolving actually...

          We introduced a delay through the Anka Build Plugin to wait for the launcher to finish before calling afterDisconnect, to avoid the deadlock mentioned above.
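
As an illustration of the "wait for the launcher to finish before calling afterDisconnect" idea, here is a minimal sketch that uses a latch with a timeout instead of a fixed delay. `LaunchGate`, `markLaunchFinished` and `awaitLaunch` are hypothetical names invented for this example; this is not the Anka Build Plugin's or the SSH agents plugin's actual code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of any Jenkins plugin): lets teardown wait for launch to finish. */
class LaunchGate {
    private final CountDownLatch launchDone = new CountDownLatch(1);

    /** Call from the launching thread, ideally in a finally block, on success or failure. */
    void markLaunchFinished() {
        launchDone.countDown();
    }

    /**
     * Call before starting teardown.
     * Returns true if launch finished within the timeout, false otherwise.
     */
    boolean awaitLaunch(long timeout, TimeUnit unit) {
        try {
            return launchDone.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }
}
```

The bounded wait matters: if launch itself hangs, teardown still gets a chance to run once the timeout expires instead of waiting forever.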

          We are seeing other deadlocks, but they all result in the same behavior: agents hang and stop responding.
          We have a lot of data (thread dumps, logs, etc.), but it is still hard to understand what is going on.
          Now we are also seeing this happen in the middle of a pipeline run, that is, after the agent has started successfully.

          The one thing in common we see in all cases is this stack trace:

          "org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep 577: checking /usr/local/mobile/mnt/workspaces/workspace/testing-hanging-agents/test-macos-12.2-xcode13.3-pipeline(2) on AnkaOB-ephemeral-macos-12.2-xcode13.3-special-test-5Pko0 / waiting for AnkaOB-ephemeral-macos-12.2-xcode13.3-special-test-5Pko0 id=12376" id=12950 (0x3296) state=BLOCKED cpu=0%

          • waiting to lock <0x6b111501> (a hudson.remoting.Channel)
            owned by "Computer.threadPoolForRemoting 2368 for AnkaOB-ephemeral-macos-12.2-xcode13.3-special-test-5Pko0 id=284" id=12941 (0x328d)
            at hudson.remoting.Request.call(Request.java:208)
            at hudson.remoting.Channel.call(Channel.java:999)
            at hudson.FilePath.act(FilePath.java:1194)
            at hudson.FilePath.act(FilePath.java:1183)
            at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.exitStatus(FileMonitoringTask.java:417)
            at org.jenkinsci.plugins.durabletask.BourneShellScript$ShellController.exitStatus(BourneShellScript.java:301)
            at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.exitStatus(FileMonitoringTask.java:409)
            at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:598)
            at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:549)
            at java.base@11.0.15/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
            at java.base@11.0.15/java.util.concurrent.FutureTask.run(FutureTask.java:264)
            at java.base@11.0.15/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
            at java.base@11.0.15/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
            at java.base@11.0.15/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
            at java.base@11.0.15/java.lang.Thread.run(Thread.java:829)
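
The "waiting to lock ... owned by ..." line in this trace describes plain monitor contention: one thread holds the `hudson.remoting.Channel` monitor, and every other thread that needs it is BLOCKED until the owner lets go. A minimal sketch of that situation, using a plain `Object` as a stand-in for the channel and hypothetical thread names (`writer`, `checker`), and reading the lock owner through the standard `ThreadMXBean` API:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

/** Hypothetical demo: "writer" holds a monitor while stuck; "checker" blocks on the same monitor. */
class ChannelLockDemo {
    /** Returns the name of the thread owning the monitor that "checker" is blocked on. */
    static String ownerOfLockBlockingChecker() {
        final Object channel = new Object();             // stand-in for the Channel monitor
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        Thread writer = new Thread(() -> {
            synchronized (channel) {
                lockHeld.countDown();
                try {
                    release.await();                     // stand-in for a write that does not return
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "writer");
        Thread checker = new Thread(() -> {
            synchronized (channel) { /* would send a request over the channel here */ }
        }, "checker");

        try {
            writer.start();
            lockHeld.await();                            // make sure writer owns the monitor first
            checker.start();

            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            String owner = null;
            for (int i = 0; i < 500 && owner == null; i++) {
                ThreadInfo info = mx.getThreadInfo(checker.getId());
                if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                    owner = info.getLockOwnerName();     // same data as "owned by" in a thread dump
                } else {
                    Thread.sleep(10);
                }
            }
            release.countDown();                         // let writer finish so checker can proceed
            writer.join();
            checker.join();
            return owner;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

In the demo the checker only proceeds once the writer releases the monitor, so the interesting thread to investigate in a dump is the owner, not the BLOCKED one.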

          This is accompanied by messages in the Jenkins system log every 5 seconds:
          o.j.p.w.s.concurrent.Timeout#lambda$ping$0: org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep 577: checking /usr/local/mobile/mnt/workspaces/workspace/testing-hanging-agents/test-macos-12.2-xcode13.3-pipeline(2) on AnkaOB-ephemeral-macos-12.2-xcode13.3-special-test-5Pko0 / waiting for AnkaOB-ephemeral-macos-12.2-xcode13.3-special-test-5Pko0 id=12376 unresponsive for 1 min 50 sec

          In some cases this reaches up to 21 minutes of unresponsiveness.

          Any idea where to look?


          Ivan Fernandez Calvo added a comment - - edited

          >Now, we are also seeing this happen even in the middle of pipeline run, so after agent started successfully.

          Once the agent has opened the SSH connection and the channel, the SSH agents plugin does nothing else until the agent is disconnected. At this stage, the only things responsible for keeping the controller and agent talking are remoting and the network layer. The deadlock is a symptom of something else.

          The durable task stack trace points to a method that handles the exit of the task, getting the job's result files from the workspace. Do you have metrics for the agents? It looks like there is a bottleneck somewhere (CPU, network, disk I/O).

          https://github.com/jenkinsci/durable-task-plugin/blob/master/src/main/java/org/jenkinsci/plugins/durabletask/FileMonitoringTask.java#L409-L411


          niv keidan added a comment -

          Metrics on the agents are fine.
          The VMs the agent is running on are completely accessible while they hang and show no issues with networking whatsoever (both incoming and outgoing).

          Another thread that shows up in all 3 cases where we have hung VMs and proper logging information:

          "Computer.threadPoolForRemoting 2368 for AnkaOB-ephemeral-macos-12.2-xcode13.3-special-test-5Pko0 id=284" id=12941 (0x328d) state=RUNNABLE cpu=76% (running in native)
          at java.base@11.0.15/java.net.SocketOutputStream.socketWrite0(Native Method)
          at java.base@11.0.15/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
          at java.base@11.0.15/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
          at com.trilead.ssh2.crypto.cipher.CipherOutputStream.internal_write(CipherOutputStream.java:52)
          at com.trilead.ssh2.crypto.cipher.CipherOutputStream.writeBlock(CipherOutputStream.java:101)
          at com.trilead.ssh2.crypto.cipher.CipherOutputStream.write(CipherOutputStream.java:118)
          at com.trilead.ssh2.transport.TransportConnection.sendMessage(TransportConnection.java:179)
          at com.trilead.ssh2.transport.TransportConnection.sendMessage(TransportConnection.java:107)
          at com.trilead.ssh2.transport.TransportManager.sendMessage(TransportManager.java:690)
          at com.trilead.ssh2.channel.ChannelManager.sendData(ChannelManager.java:431)

          • locked java.lang.Object@2b433a98
            at com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:63)
            at hudson.remoting.ChunkedOutputStream.sendFrame(ChunkedOutputStream.java:94)
            at hudson.remoting.ChunkedOutputStream.drain(ChunkedOutputStream.java:89)
            at hudson.remoting.ChunkedOutputStream.write(ChunkedOutputStream.java:58)
            at java.base@11.0.15/java.io.OutputStream.write(OutputStream.java:122)
            at hudson.remoting.ChunkedCommandTransport.writeBlock(ChunkedCommandTransport.java:45)
            at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.write(AbstractSynchronousByteArrayCommandTransport.java:46)
            at hudson.remoting.Channel.send(Channel.java:765)
          • locked hudson.remoting.Channel@6b111501
            at hudson.remoting.ProxyOutputStream.write(ProxyOutputStream.java:146)
          • locked hudson.remoting.ProxyOutputStream@6bff54a
            at hudson.remoting.RemoteOutputStream.write(RemoteOutputStream.java:112)
            at hudson.remoting.Util.copy(Util.java:58)
            at hudson.remoting.JarLoaderImpl.writeJarTo(JarLoaderImpl.java:57)
            at jdk.internal.reflect.GeneratedMethodAccessor435.invoke(Unknown Source)
            at java.base@11.0.15/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.base@11.0.15/java.lang.reflect.Method.invoke(Method.java:566)
            at hudson.remoting.RemoteInvocationHandler$RPCRequest.perform(RemoteInvocationHandler.java:924)
            at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:902)
            at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:853)
            at hudson.remoting.UserRequest.perform(UserRequest.java:211)
            at hudson.remoting.UserRequest.perform(UserRequest.java:54)
            at hudson.remoting.Request$2.run(Request.java:376)
            at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
            at hudson.remoting.InterceptingExecutorService$$Lambda$591/0x0000000840e43c40.call(Unknown Source)
            at org.jenkinsci.remoting.CallableDecorator.call(CallableDecorator.java:18)
            at hudson.remoting.CallableDecoratorList.lambda$applyDecorator$0(CallableDecoratorList.java:19)
            at hudson.remoting.CallableDecoratorList$$Lambda$592/0x0000000840e43040.call(Unknown Source)
            at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
            at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
            at java.base@11.0.15/java.util.concurrent.FutureTask.run(FutureTask.java:264)
            at java.base@11.0.15/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
            at java.base@11.0.15/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
            at java.base@11.0.15/java.lang.Thread.run(Thread.java:829)

          "Computer.threadPoolForRemoting 307 for AnkaOB-ephemeral-macos-12.2-xcode13.3-special-test-kQIqN id=282" id=2895 (0xb4f) state=RUNNABLE cpu=77% (running in native)
          at java.base@11.0.15/java.net.SocketOutputStream.socketWrite0(Native Method)
          at java.base@11.0.15/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
          at java.base@11.0.15/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
          at com.trilead.ssh2.crypto.cipher.CipherOutputStream.internal_write(CipherOutputStream.java:52)
          at com.trilead.ssh2.crypto.cipher.CipherOutputStream.writeBlock(CipherOutputStream.java:101)
          at com.trilead.ssh2.crypto.cipher.CipherOutputStream.write(CipherOutputStream.java:118)
          at com.trilead.ssh2.transport.TransportConnection.sendMessage(TransportConnection.java:179)
          at com.trilead.ssh2.transport.TransportConnection.sendMessage(TransportConnection.java:107)
          at com.trilead.ssh2.transport.TransportManager.sendMessage(TransportManager.java:690)
          at com.trilead.ssh2.channel.ChannelManager.sendData(ChannelManager.java:431)

          • locked java.lang.Object@180aed5d
            at com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:63)
            at hudson.remoting.ChunkedOutputStream.sendFrame(ChunkedOutputStream.java:94)
            at hudson.remoting.ChunkedOutputStream.drain(ChunkedOutputStream.java:89)
            at hudson.remoting.ChunkedOutputStream.write(ChunkedOutputStream.java:58)
            at java.base@11.0.15/java.io.OutputStream.write(OutputStream.java:122)
            at hudson.remoting.ChunkedCommandTransport.writeBlock(ChunkedCommandTransport.java:45)
            at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.write(AbstractSynchronousByteArrayCommandTransport.java:46)
            at hudson.remoting.Channel.send(Channel.java:765)
          • locked hudson.remoting.Channel@5a0edacf
            at hudson.remoting.ProxyOutputStream.write(ProxyOutputStream.java:146)
          • locked hudson.remoting.ProxyOutputStream@24248033
            at hudson.remoting.RemoteOutputStream.write(RemoteOutputStream.java:112)
            at hudson.remoting.Util.copy(Util.java:58)
            at hudson.remoting.JarLoaderImpl.writeJarTo(JarLoaderImpl.java:57)
            at jdk.internal.reflect.GeneratedMethodAccessor301.invoke(Unknown Source)
            at java.base@11.0.15/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.base@11.0.15/java.lang.reflect.Method.invoke(Method.java:566)
            at hudson.remoting.RemoteInvocationHandler$RPCRequest.perform(RemoteInvocationHandler.java:924)
            at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:902)
            at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:853)
            at hudson.remoting.UserRequest.perform(UserRequest.java:211)
            at hudson.remoting.UserRequest.perform(UserRequest.java:54)
            at hudson.remoting.Request$2.run(Request.java:376)
            at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
            at hudson.remoting.InterceptingExecutorService$$Lambda$547/0x0000000840d79440.call(Unknown Source)
            at org.jenkinsci.remoting.CallableDecorator.call(CallableDecorator.java:18)
            at hudson.remoting.CallableDecoratorList.lambda$applyDecorator$0(CallableDecoratorList.java:19)
            at hudson.remoting.CallableDecoratorList$$Lambda$548/0x0000000840d79840.call(Unknown Source)
            at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
            at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
            at java.base@11.0.15/java.util.concurrent.FutureTask.run(FutureTask.java:264)
            at java.base@11.0.15/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
            at java.base@11.0.15/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
            at java.base@11.0.15/java.lang.Thread.run(Thread.java:829)

          "Computer.threadPoolForRemoting 139 for AnkaOB-ephemeral-macos-12.2-xcode13.3-special-test-fAuJ3 id=97" id=657 (0x291) state=RUNNABLE cpu=85% (running in native)
          at java.base@11.0.15/java.net.SocketOutputStream.socketWrite0(Native Method)
          at java.base@11.0.15/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
          at java.base@11.0.15/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
          at com.trilead.ssh2.crypto.cipher.CipherOutputStream.internal_write(CipherOutputStream.java:52)
          at com.trilead.ssh2.crypto.cipher.CipherOutputStream.writeBlock(CipherOutputStream.java:101)
          at com.trilead.ssh2.crypto.cipher.CipherOutputStream.write(CipherOutputStream.java:118)
          at com.trilead.ssh2.transport.TransportConnection.sendMessage(TransportConnection.java:179)
          at com.trilead.ssh2.transport.TransportConnection.sendMessage(TransportConnection.java:107)
          at com.trilead.ssh2.transport.TransportManager.sendMessage(TransportManager.java:690)
          at com.trilead.ssh2.channel.ChannelManager.sendData(ChannelManager.java:431)

          • locked java.lang.Object@59b29645
            at com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:63)
            at hudson.remoting.ChunkedOutputStream.sendFrame(ChunkedOutputStream.java:94)
            at hudson.remoting.ChunkedOutputStream.drain(ChunkedOutputStream.java:89)
            at hudson.remoting.ChunkedOutputStream.write(ChunkedOutputStream.java:58)
            at java.base@11.0.15/java.io.OutputStream.write(OutputStream.java:122)
            at hudson.remoting.ChunkedCommandTransport.writeBlock(ChunkedCommandTransport.java:45)
            at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.write(AbstractSynchronousByteArrayCommandTransport.java:46)
            at hudson.remoting.Channel.send(Channel.java:765)
          • locked hudson.remoting.Channel@2256db80
            at hudson.remoting.ProxyOutputStream.write(ProxyOutputStream.java:146)
          • locked hudson.remoting.ProxyOutputStream@2abbfaf6
            at hudson.remoting.RemoteOutputStream.write(RemoteOutputStream.java:112)
            at hudson.remoting.Util.copy(Util.java:58)
            at hudson.remoting.JarLoaderImpl.writeJarTo(JarLoaderImpl.java:57)
            at jdk.internal.reflect.GeneratedMethodAccessor411.invoke(Unknown Source)
            at java.base@11.0.15/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.base@11.0.15/java.lang.reflect.Method.invoke(Method.java:566)
            at hudson.remoting.RemoteInvocationHandler$RPCRequest.perform(RemoteInvocationHandler.java:924)
            at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:902)
            at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:853)
            at hudson.remoting.UserRequest.perform(UserRequest.java:211)
            at hudson.remoting.UserRequest.perform(UserRequest.java:54)
            at hudson.remoting.Request$2.run(Request.java:376)
            at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
            at hudson.remoting.InterceptingExecutorService$$Lambda$574/0x0000000840d80c40.call(Unknown Source)
            at org.jenkinsci.remoting.CallableDecorator.call(CallableDecorator.java:18)
            at hudson.remoting.CallableDecoratorList.lambda$applyDecorator$0(CallableDecoratorList.java:19)
            at hudson.remoting.CallableDecoratorList$$Lambda$575/0x0000000840da8040.call(Unknown Source)
            at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
            at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
            at java.base@11.0.15/java.util.concurrent.FutureTask.run(FutureTask.java:264)
            at java.base@11.0.15/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
            at java.base@11.0.15/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
            at java.base@11.0.15/java.lang.Thread.run(Thread.java:829)
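
All three threads above sit in `SocketOutputStream.socketWrite0` while holding the `Channel` monitor. One general way a blocking socket write can fail to return is a peer that stops reading, so the send buffers fill up. Whether that is what happens in this issue is not established in this thread; the sketch below only demonstrates the general mechanism on a loopback socket. `StalledWriteDemo` and `writerStalls` are hypothetical names for this example.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

/** Hypothetical demo: a blocking write stalls when the peer never reads from the connection. */
class StalledWriteDemo {
    /** Returns true if the writer thread is still stuck in write() after waitMillis. */
    static boolean writerStalls(long waitMillis) {
        // The server socket accepts the TCP handshake via its backlog but never reads any data.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            Thread writer = new Thread(() -> {
                byte[] chunk = new byte[64 * 1024];
                try {
                    OutputStream out = client.getOutputStream();
                    for (int i = 0; i < 4096; i++) {   // up to 256 MiB, far more than socket buffers hold
                        out.write(chunk);
                    }
                } catch (IOException expected) {
                    // Raised once the socket is closed underneath the blocked write.
                }
            }, "writer");
            writer.setDaemon(true);
            writer.start();

            writer.join(waitMillis);
            boolean stalled = writer.isAlive();         // still inside write(): nothing drains the buffers

            client.close();                             // breaking the connection makes the write fail
            writer.join(5000);
            return stalled;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Closing the socket in the demo is what unblocks the writer, which is comparable in effect to the earlier observation that killing the SSHD process for the connection lets Jenkins notice the broken channel and clean up.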


          niv keidan added a comment -

          Also, it seems that this is only happening in Pipelines and does not occur in freestyle jobs... :/


          niv keidan added a comment -

          Also, we have 3 output sets of the "Support Core" plugin when this issue is happening. Lots of info there. I can attach if you think that will help.


          Ivan Fernandez Calvo added a comment -

          I do not remember whether the Support plugin anonymizes all the sensitive info about your instance, so better not to attach it here.

          I do not like to mix things in the same issue, but I think both problems have the same root cause. We are talking about two issues: one is that agents get stuck when they start, and the other is that agents get stuck in the middle of a pipeline execution (I am not sure whether it is at the end or at a random point).

          In the stack trace you pasted earlier, the three threads are each using more than 70% of the CPU and are stuck reading from the disk and sending that data to the Jenkins controller. When I have seen this in the past, it was related to poor IO performance on the agent, usually caused by using NFS (or another network filesystem) for the agents' workspace. Please check the following links:

          https://support.cloudbees.com/hc/en-us/articles/115003461772-IO-Troubleshooting-on-Linux
          https://support.cloudbees.com/hc/en-us/articles/115003442371-Required-Data-IO-issues-on-Linux

          niv keidan added a comment -

          I may be misunderstanding, but I see all 3 stack traces stuck on "SocketWrite", so why do you say it's reading? And why do you say it's reading from disk?

          Ivan Fernandez Calvo added a comment -

          IIRC this is the part that grabs the classes from the Jenkins controller and stores them in the agent's local cache so they can run locally, so it is a network transfer whose result is saved to disk.

          at hudson.remoting.Util.copy(Util.java:58)
          at hudson.remoting.JarLoaderImpl.writeJarTo(JarLoaderImpl.java:57)

          https://github.com/daniel-beck/jenkins-remoting/blob/master/src/main/java/hudson/remoting/JarLoaderImpl.java#L31-L39
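To make the read-versus-write question above concrete: a stream copy loop alternates between a read from its source and a write to its sink, so a thread dump can catch it in either call. The following is a minimal Java sketch of such a loop. It is not the `hudson.remoting.Util.copy` source; the class name `CopyLoopSketch` and the buffer size are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical illustration of a stream copy loop; not the hudson.remoting code. */
final class CopyLoopSketch {

    /** Copies everything from in to out and returns the number of bytes copied. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192]; // buffer size is an arbitrary assumption
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            // If the sink is a socket whose peer drains slowly, the thread
            // blocks in this call, and a thread dump shows a write frame on top.
            out.write(buf, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] jar = new byte[20_000]; // stand-in for JAR contents
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(jar), sink);
        System.out.println("copied " + copied + " bytes");
    }
}
```

With in-memory streams neither call blocks. With a file on a slow filesystem as the source, the loop would mostly be seen in `read`; with a congested socket as the sink, mostly in `write`.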

          Ivan Fernandez Calvo added a comment -

          I just remembered that I used the Anka provider by MacStadium about 2 years ago; at least at that time, their performance was really poor.

          niv keidan added a comment -

          Yeah, since major version 2 it is much better.

          In any case, this is relevant: https://github.com/jenkinsci/ssh-slaves-plugin/pull/304

          Ivan Fernandez Calvo added a comment - edited

          In the best case the PR will kill the connection, but the underlying problem of the connection being stuck in a native IO operation will remain, so the plugin will try to reconnect, which may or may not work.
          In the case of your pipelines stuck on an IO operation, the fix in the PR will not apply if the channel is not broken.
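As a generic illustration of "killing" a call that may never return, the Java sketch below bounds a task with a timeout and abandons it when the deadline passes. This is not taken from the linked PR, whose contents are not shown here; `BoundedCallSketch` and `runWithTimeout` are hypothetical names, and the single-thread executor is an assumption.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch of a timeout-bounded call; not the ssh-slaves plugin code. */
final class BoundedCallSketch {

    /** Returns true if the task finished within the timeout, false if it was abandoned. */
    static boolean runWithTimeout(Runnable task, long timeout, TimeUnit unit)
            throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-call");
            t.setDaemon(true); // a stuck task must not keep the JVM alive
            return t;
        });
        Future<?> future = pool.submit(task);
        try {
            future.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            // Interrupts the worker thread. A thread blocked in a classic
            // java.net socket write may not react to the interrupt.
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The caller regains control after the timeout either way, but a worker stuck in a native socket write may keep running until the socket itself is closed, which matches the caveat above that the stuck IO operation remains.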

          Ivan Fernandez Calvo added a comment -

          Did the new version fix the deadlock at start time?

          Nathan added a comment -

          Hi Ivan, yep! tomekjarosik is unblocked using the new code.


            ifernandezcalvo Ivan Fernandez Calvo
            niv_keidan_veertu niv keidan