JENKINS-65534

Jenkins agent connection timeout results in builds locking up

Details

    Description

      We are having problems with agents randomly "locking up", which leaves pipeline builds stuck. When this happens the build has to be aborted. The agent then needs to be disconnected and reconnected, or else other builds will get stuck on the agent. The lockup always seems to occur in one of two locations:
      (1) after the "commit message" log when checking out a project using git
      (2) when using any sh directive (even something simple like echo "hello world")

      After increasing the debug logging, I found that the lockups correlate with log entries indicating the agent had some kind of timeout. It could be a network timeout, or it could be something locking up that results in a timeout; the lockup may be caused by the agent timeout itself. In any case, once that occurs the agent becomes unusable, because it always locks up a build on the above two steps.

      With the logger org.jenkinsci.plugins.workflow.support.concurrent.Timeout set to FINE, the entry below is logged continuously, starting around the time the issue shows up. The agent, however, still shows up as connected. There is no network issue: I can ssh into the agent from the controller without problems.

      Interrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#6102]: checking /var/appl/jenkins/workspace/my-job on jknapprw17 / waiting for jknapprw17 id=425464 after 20 SECONDS
      java.lang.Throwable
      	at java.base@11.0.10/jdk.internal.misc.Unsafe.park(Native Method)
      	at java.base@11.0.10/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
      	at java.base@11.0.10/java.util.concurrent.FutureTask.awaitDone(FutureTask.java:447)
      	at java.base@11.0.10/java.util.concurrent.FutureTask.get(FutureTask.java:190)
      	at hudson.remoting.Request.call(Request.java:184)
      	at hudson.remoting.Channel.call(Channel.java:1000)
      	at hudson.FilePath.act(FilePath.java:1155)
      	at hudson.FilePath.act(FilePath.java:1144)
      	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:227)
      	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:592)
      	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:549)
      	at java.base@11.0.10/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      	at java.base@11.0.10/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base@11.0.10/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
      	at java.base@11.0.10/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	at java.base@11.0.10/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base@11.0.10/java.lang.Thread.run(Thread.java:834)
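
For readers who want to capture the same entries: the description above names the logger and the FINE level. In Jenkins this is normally configured through a log recorder under Manage Jenkins > System Log. The following is only a minimal sketch of the equivalent java.util.logging calls, written as a standalone program; the class name TimeoutLoggerSketch and the method enableFine are hypothetical and not part of Jenkins.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class TimeoutLoggerSketch {
    // Logger name taken from the issue description.
    static final String LOGGER_NAME =
            "org.jenkinsci.plugins.workflow.support.concurrent.Timeout";

    // Keep a strong reference: java.util.logging holds loggers weakly,
    // so a level set on an otherwise unreferenced logger can be lost.
    static final Logger TIMEOUT_LOGGER = Logger.getLogger(LOGGER_NAME);

    // Hypothetical helper: lowers the logger threshold to FINE.
    static Logger enableFine() {
        TIMEOUT_LOGGER.setLevel(Level.FINE);
        // Handlers have their own threshold (ConsoleHandler defaults to INFO),
        // so attach one that also lets FINE records through.
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.FINE);
        TIMEOUT_LOGGER.addHandler(handler);
        return TIMEOUT_LOGGER;
    }

    public static void main(String[] args) {
        Logger logger = enableFine();
        logger.fine("FINE records from this logger are now published");
    }
}
```

Note that both the logger and the handler need their level lowered; setting only one of them leaves FINE records filtered out.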
      

      The agent remoting log has this:

      2021-04-29 15:50:45.549-0400 INFO hudson.Launcher$RemoteLaunchCallable$1 join: Failed to synchronize IO streams on the channel hudson.remoting.Channel@6973bf95:channel
      java.lang.InterruptedException
      	at java.base/java.lang.Object.wait(Native Method)
      	at hudson.remoting.Request.call(Request.java:176)
      	at hudson.remoting.Channel.call(Channel.java:1000)
      	at hudson.remoting.Channel.syncIO(Channel.java:1740)
      	at hudson.Launcher$RemoteLaunchCallable$1.join(Launcher.java:1402)
      	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
      	at hudson.remoting.RemoteInvocationHandler$RPCRequest.perform(RemoteInvocationHandler.java:936)
      	at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:909)
      	at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:860)
      	at hudson.remoting.UserRequest.perform(UserRequest.java:211)
      	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
      	at hudson.remoting.Request$2.run(Request.java:375)
      	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:73)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base/java.lang.Thread.run(Thread.java:834)
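
Both traces above show a thread that was blocked waiting on a remote call (FutureTask.get on the controller, Object.wait on the agent) and was then interrupted. The following is a generic sketch of that pattern, a watchdog interrupting a caller that is waiting for a reply that never arrives. It is not Jenkins or Remoting code; the class InterruptOnTimeoutSketch and the method waitWithWatchdog are hypothetical names used only for illustration.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class InterruptOnTimeoutSketch {

    /**
     * Blocks on a reply that never arrives.
     * Returns true if the watchdog interrupted the wait.
     */
    static boolean waitWithWatchdog(long timeout, TimeUnit unit) {
        // Stand-in for a remote request: the task is never run, so get() never returns.
        FutureTask<String> neverCompleted = new FutureTask<>(() -> "reply");
        ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
        Thread caller = Thread.currentThread();
        // Schedule an interrupt of the calling thread once the timeout elapses.
        ScheduledFuture<?> pending = watchdog.schedule(caller::interrupt, timeout, unit);
        try {
            neverCompleted.get(); // parks here until interrupted
            return false;
        } catch (InterruptedException e) {
            // The watchdog fired while we were still waiting.
            return true;
        } catch (java.util.concurrent.ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pending.cancel(false);
            watchdog.shutdownNow();
        }
    }

    public static void main(String[] args) {
        boolean interrupted = waitWithWatchdog(200, TimeUnit.MILLISECONDS);
        System.out.println("interrupted by watchdog: " + interrupted);
    }
}
```

The sketch only shows why an interrupt surfaces as an exception in the waiting thread; it does not explain why the reply was missing in the first place, which is the open question in this report.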
      

      I have not found a consistent way to reproduce this, but with 100+ agents I see it every week. If I find a way, I'll add it to the notes.

        Activity

          jthompson Jeff Thompson added a comment -

          The Remoting library depends on a consistent, reliable network connection. It doesn't have the features to handle interrupts during processing. I've looked into it a few times, but haven't found a straightforward solution. If we identify specific behaviors those can sometimes be improved. Often the issue occurs because of system or networking misbehaviors that cause the connection to get interrupted or otherwise break.

          You could try one of the other agent mechanisms. They may not be as susceptible to this. The WebSockets implementation may work better. Or use an SSH agent.

          Sometimes errors like this result from bad behavior or interactions with specific plugins and commands.

          Good luck on isolating and reproducing the problem.

          mrichar2 Mark R added a comment - - edited

          We updated to 2.277.3 with remoting 4.6 and after monitoring it I no longer see the issue. Note that we are using websockets and have been the whole time.

          Closing this ticket. I will reopen it and investigate more if we experience more problems.


          People

            jthompson Jeff Thompson
            mrichar2 Mark R