JENKINS-65534

Jenkins agent connection timeout results in builds locking up

      We are having problems with agents randomly "locking up", which leaves pipeline builds stuck. When this happens, the build has to be aborted. The agent then needs to be disconnected and reconnected, or else other builds will get stuck on it. The lockup always seems to occur in one of two places:
      (1) after the "commit message" log when checking out a project using git
      (2) when using any sh directive (even something simple like echo "hello world")

      After increasing the log level, I found that the lockups correlate with log entries indicating that the agent hit some kind of timeout. It could be a network timeout, or something could be locking up and causing the timeout. In any case, once that occurs the agent becomes unusable, because every build locks up on one of the two steps above.
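      For reference, raising the level of a single logger can be sketched with java.util.logging, which Jenkins uses for its logs. The logger name below is the one quoted in this report; the class and method names are hypothetical, and on a real controller this is normally done through a log recorder in the UI rather than in code. This is a minimal sketch, not the configuration actually used here.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper class, for illustration only.
class TimeoutLogLevel {
    // Logger name taken from this report. Holding the logger in a static
    // field keeps a strong reference, so the level setting is not lost
    // if the logger would otherwise be garbage-collected.
    static final Logger TIMEOUT_LOGGER =
            Logger.getLogger("org.jenkinsci.plugins.workflow.support.concurrent.Timeout");

    // Sets the logger to FINE and returns the level now in effect on it.
    static Level enableFine() {
        TIMEOUT_LOGGER.setLevel(Level.FINE);
        return TIMEOUT_LOGGER.getLevel();
    }

    public static void main(String[] args) {
        System.out.println(enableFine());
    }
}
```

      Note that setting the logger level alone only makes the records pass the logger; a handler that accepts FINE records is still needed for them to appear anywhere.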

      With the logger org.jenkinsci.plugins.workflow.support.concurrent.Timeout set to FINE, the controller continuously logs the following, starting around the time the issue shows up. The agent, however, still shows as connected. There does not appear to be a network issue: I can ssh into the agent from the controller without problems.

      Interrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#6102]: checking /var/appl/jenkins/workspace/my-job on jknapprw17 / waiting for jknapprw17 id=425464 after 20 SECONDS
      java.lang.Throwable
      	at java.base@11.0.10/jdk.internal.misc.Unsafe.park(Native Method)
      	at java.base@11.0.10/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
      	at java.base@11.0.10/java.util.concurrent.FutureTask.awaitDone(FutureTask.java:447)
      	at java.base@11.0.10/java.util.concurrent.FutureTask.get(FutureTask.java:190)
      	at hudson.remoting.Request.call(Request.java:184)
      	at hudson.remoting.Channel.call(Channel.java:1000)
      	at hudson.FilePath.act(FilePath.java:1155)
      	at hudson.FilePath.act(FilePath.java:1144)
      	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:227)
      	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:592)
      	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:549)
      	at java.base@11.0.10/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      	at java.base@11.0.10/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base@11.0.10/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
      	at java.base@11.0.10/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	at java.base@11.0.10/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base@11.0.10/java.lang.Thread.run(Thread.java:834)
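
      To illustrate what the trace above appears to show (a thread parked in FutureTask.get waiting for a remote reply, then interrupted after a time limit), here is a minimal, self-contained sketch of that pattern. It is not the Jenkins implementation: the class and method names are hypothetical, and a CompletableFuture that never completes stands in for the agent that never answers.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of "block on a reply, interrupt after a limit".
class InterruptWatchdog {
    // Blocks on reply.get(). A watchdog interrupts the calling thread once the
    // limit elapses. Returns true if the reply arrived, false if interrupted.
    static boolean awaitWithWatchdog(Future<?> reply, long limit, TimeUnit unit)
            throws ExecutionException {
        ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
        Thread caller = Thread.currentThread();
        Runnable interruptCaller = caller::interrupt;
        ScheduledFuture<?> pending = watchdog.schedule(interruptCaller, limit, unit);
        try {
            reply.get();            // parks here, like FutureTask.get in the trace
            return true;
        } catch (InterruptedException e) {
            return false;           // the watchdog fired before a reply arrived
        } finally {
            pending.cancel(false);  // no interrupt needed if the reply arrived first
            watchdog.shutdownNow();
            Thread.interrupted();   // best-effort: clear an interrupt that landed late
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for an agent that never answers.
        Future<String> neverAnswers = new CompletableFuture<>();
        System.out.println(awaitWithWatchdog(neverAnswers, 200, TimeUnit.MILLISECONDS));

        // Stand-in for a healthy agent that has already answered.
        Future<String> answered = CompletableFuture.completedFuture("ok");
        System.out.println(awaitWithWatchdog(answered, 200, TimeUnit.MILLISECONDS));
    }
}
```

      In this sketch the interrupt only ends the wait on the calling side; it does nothing to recover whatever is stuck on the other end, which would be consistent with the agent staying unusable until it is reconnected.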
      

      The agent remoting log has this:

      2021-04-29 15:50:45.549-0400 INFO hudson.Launcher$RemoteLaunchCallable$1 join: Failed to synchronize IO streams on the channel hudson.remoting.Channel@6973bf95:channel
      java.lang.InterruptedException
      	at java.base/java.lang.Object.wait(Native Method)
      	at hudson.remoting.Request.call(Request.java:176)
      	at hudson.remoting.Channel.call(Channel.java:1000)
      	at hudson.remoting.Channel.syncIO(Channel.java:1740)
      	at hudson.Launcher$RemoteLaunchCallable$1.join(Launcher.java:1402)
      	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
      	at hudson.remoting.RemoteInvocationHandler$RPCRequest.perform(RemoteInvocationHandler.java:936)
      	at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:909)
      	at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:860)
      	at hudson.remoting.UserRequest.perform(UserRequest.java:211)
      	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
      	at hudson.remoting.Request$2.run(Request.java:375)
      	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:73)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base/java.lang.Thread.run(Thread.java:834)
      

      I have not found a consistent way to reproduce this, but with 100+ agents I see it every week. If I find one, I'll add it to the notes.

            jthompson Jeff Thompson
            mrichar2 Mark R