
Add Built-in Request timeout support in Remoting

      Filing at oleg_nenashev's request.

      VirtualChannel.call has no timeout parameter, so it is assumed to block the calling thread until the call completes, fails, or the thread is interrupted externally. In practice the programmer often knows that it is unreasonable for a given call to take more than a certain amount of time, and would prefer to pass a timeout, akin to Future.get(long timeout, TimeUnit unit) throwing TimeoutException.
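      Pending a built-in overload, something like this can be emulated on top of the existing VirtualChannel.callAsync, whose Future already supports a timed get. A minimal sketch, assuming only the published Remoting API (the TimedCalls class and callWithTimeout helper are hypothetical names, not part of Remoting):

          import hudson.remoting.Callable;
          import hudson.remoting.VirtualChannel;
          import java.io.IOException;
          import java.util.concurrent.ExecutionException;
          import java.util.concurrent.Future;
          import java.util.concurrent.TimeUnit;
          import java.util.concurrent.TimeoutException;

          public class TimedCalls {
              /** Hypothetical helper: run a remote call, but give up after the given timeout. */
              public static <V, T extends Throwable> V callWithTimeout(
                      VirtualChannel channel, Callable<V, T> callable, long timeout, TimeUnit unit)
                      throws IOException, InterruptedException, TimeoutException {
                  Future<V> f = channel.callAsync(callable); // returns immediately, unlike call
                  try {
                      return f.get(timeout, unit);
                  } catch (TimeoutException x) {
                      f.cancel(true); // best effort; does not necessarily abort the remote side
                      throw x;
                  } catch (ExecutionException x) {
                      Throwable cause = x.getCause();
                      if (cause instanceof IOException) {
                          throw (IOException) cause;
                      }
                      throw new IOException("remote call failed", cause);
                  }
              }
          }

      Note that cancelling the Future abandons the caller's wait but does not necessarily stop work already running remotely, which is part of why first-class support in Remoting would be preferable.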

      Similarly, various calls in Jenkins core which are implemented using Remoting ought to have overloads which take a timeout, or even (arguably) ought to default to using a fixed timeout. For example, FilePath.isDirectory has been observed to hang indefinitely when an agent connection is broken. It is almost never desirable to just wait until it completes; if the agent cannot respond in a reasonable amount of time (a few minutes at most), it would be better to just fail.
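      For instance, an isDirectory variant with a deadline could be layered over FilePath.actAsync. A sketch only; the TimedFileChecks class, the helper signature, and the wrapping of remote failures in IOException are all illustrative choices, not an existing core API:

          import hudson.FilePath;
          import hudson.remoting.VirtualChannel;
          import java.io.File;
          import java.io.IOException;
          import java.util.concurrent.ExecutionException;
          import java.util.concurrent.Future;
          import java.util.concurrent.TimeUnit;
          import java.util.concurrent.TimeoutException;
          import jenkins.MasterToSlaveFileCallable;

          public class TimedFileChecks {
              /** Hypothetical overload: FilePath.isDirectory that fails rather than hanging forever. */
              public static boolean isDirectory(FilePath path, long timeout, TimeUnit unit)
                      throws IOException, InterruptedException, TimeoutException {
                  Future<Boolean> f = path.actAsync(new MasterToSlaveFileCallable<Boolean>() {
                      @Override public Boolean invoke(File dir, VirtualChannel channel) {
                          return dir.isDirectory(); // runs on the agent side
                      }
                  });
                  try {
                      return f.get(timeout, unit); // fail fast if the agent cannot answer in time
                  } catch (ExecutionException x) {
                      throw new IOException("isDirectory failed remotely", x.getCause());
                  }
              }
          }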

          [JENKINS-44785] Add Built-in Request timeout support in Remoting

          Jesse Glick added a comment -

          A Timeout utility class was introduced into Pipeline to solve some issues associated with JENKINS-32986. It was then used in the context of Remoting (specifically isDirectory) in JENKINS-37719.
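          For reference, the pattern looks roughly like this, assuming the Timeout class from the workflow-support plugin (org.jenkinsci.plugins.workflow.support.concurrent.Timeout) and its limit factory; the wrapper name and the one-minute limit are illustrative. Timeout interrupts the current thread when the block overruns, and Remoting's blocking wait responds to that interrupt:

              import hudson.FilePath;
              import java.io.IOException;
              import java.util.concurrent.TimeUnit;
              import org.jenkinsci.plugins.workflow.support.concurrent.Timeout;

              public class BoundedChecks {
                  /** Hypothetical wrapper: interrupt ourselves if the remote call takes too long. */
                  static boolean isDirectoryBounded(FilePath path) throws IOException, InterruptedException {
                      try (Timeout t = Timeout.limit(1, TimeUnit.MINUTES)) {
                          return path.isDirectory(); // waits at most a minute before interruption
                      }
                  }
              }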


          Jesse Glick added a comment -

          Some diagnostic improvements to Request.call, which already sets the Thread.name to the Channel.name:

          • make sure that Channel.name is easily aligned with a SlaveComputer.nodeName; currently we get things like JNLP4-connect connection from some.host.name/1.2.3.4:12345, which bears no obvious relationship to the agent names used elsewhere in thread dumps and the like
          • while it waits in 30s increments, it does not indicate in the Thread.name how long it has been waiting so far, so it is hard to tell whether a given thread dump excerpt shows a normal delay of a couple of seconds on a busy machine or something gone haywire that has been hanging for hours (see the sketch after this list)
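
          A hypothetical sketch of the second point: the existing 30s wait loop could fold the cumulative wait time into the thread name so it shows up directly in thread dumps. All names here are illustrative, not the actual Request internals:

              import java.util.concurrent.TimeUnit;

              class WaitDiagnostics {
                  private final Object lock = new Object();
                  private volatile boolean completed; // stands in for the real "response arrived" check

                  void waitForResponse(String channelName) throws InterruptedException {
                      Thread t = Thread.currentThread();
                      String originalName = t.getName();
                      long started = System.nanoTime();
                      try {
                          synchronized (lock) {
                              while (!completed) {
                                  long waited = TimeUnit.NANOSECONDS.toSeconds(System.nanoTime() - started);
                                  // Make the cumulative wait visible in thread dumps:
                                  t.setName(originalName + ": waiting " + waited + "s on " + channelName);
                                  lock.wait(30_000); // the same 30s increments Request.call already uses
                              }
                          }
                      } finally {
                          t.setName(originalName); // restore so the name does not leak past this call
                      }
                  }
              }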


          Oleg Nenashev added a comment -

          I do not anticipate getting Remoting maintenance time in the next several months.
          This task is stalled indefinitely.


          Oleg Nenashev added a comment -

          FTR https://github.com/jenkinsci/remoting/pull/174 is my draft PR. Everybody is welcome to take it over.


          Alex Raber added a comment -

          Hello, not having a timeout parameter on the JNLP agent connection to the master might be causing further issues when (on Kubernetes) the jenkins-master container is moved, inducing 2-3 minutes of master downtime while the jenkins-master pod boots on a different Kubernetes node during resource scaling and pod orchestration.

          This appears to result in job pods (live agents/slaves) failing because they have no connection to the master. In Kubernetes this results in a job failure, rather than the job pausing and waiting for the master to become available again.

          This in turn results in job pods not being cleaned up, since the pods are left in `Error` state. JENKINS-54540

          Hoping to bridge the gap if there is one; this probably affects everybody who runs on kubernetes/gke/aks/azurek8s.


          Jesse Glick added a comment -

          alexhraber what you are talking about is a different issue. If the Jenkins master pod is deleted, the Remoting connection on the agent pod’s side should be closed, at which point it will attempt to reconnect at intervals. If that does not happen, it is a bug but please do not discuss it here. This issue is about a timeout on an individual RPC call, mainly in the master → agent direction, for example if the agent JVM ran out of memory and is unresponsive (but the socket remains open).


            Assignee: Unassigned
            Reporter: Jesse Glick
            Votes: 4
            Watchers: 7