Jenkins / JENKINS-44785

Add Built-in Request timeout support in Remoting


    Details


      Description

      Filing at Oleg Nenashev's request.

      VirtualChannel.call has no timeout parameter, so it is assumed to block the calling thread until the call completes, fails, or the thread is interrupted externally. In practice the programmer often knows that it is unreasonable for a given call to take more than a certain amount of time, and would prefer to pass a timeout, akin to Future.get(long timeout, TimeUnit unit) throwing TimeoutException.

      Similarly, various calls in Jenkins core which are implemented using Remoting ought to have overloads which take a timeout, or even (arguably) ought to default to using a fixed timeout. For example, FilePath.isDirectory has been observed to hang indefinitely when an agent connection is broken. It is almost never desirable to just wait until it completes; if the agent cannot respond in a reasonable amount of time (a few minutes at most), it would be better to just fail.
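      One way to get the requested semantics is to layer a deadline on top of an asynchronous call: Remoting already has VirtualChannel.callAsync, which returns a Future, so a bounded variant could wrap Future.get(long, TimeUnit). The sketch below shows the pattern using plain java.util.concurrent so it stands alone; the helper name callWithTimeout and the ExecutorService standing in for the channel are hypothetical, not the Remoting API.

```java
import java.util.concurrent.*;

public class TimeoutCallDemo {

    // Hypothetical helper: bound a (stand-in for a remote) call with a deadline,
    // cancelling the underlying task if the deadline expires.
    static <V> V callWithTimeout(ExecutorService channel, Callable<V> body,
                                 long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException, TimeoutException {
        Future<V> f = channel.submit(body);
        try {
            return f.get(timeout, unit); // the API shape this issue asks for
        } catch (TimeoutException x) {
            f.cancel(true); // interrupt the hung call rather than waiting forever
            throw x;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService channel = Executors.newSingleThreadExecutor();
        // A fast call completes normally.
        System.out.println(callWithTimeout(channel, () -> "ok", 1, TimeUnit.SECONDS));
        // A hung call fails fast with TimeoutException instead of blocking the caller.
        try {
            callWithTimeout(channel, () -> { Thread.sleep(60_000); return "never"; },
                            100, TimeUnit.MILLISECONDS);
        } catch (TimeoutException expected) {
            System.out.println("timed out");
        }
        channel.shutdownNow();
    }
}
```

      The same shape would apply to a FilePath.isDirectory overload: rather than blocking until the broken agent connection is noticed, the caller gets a TimeoutException after a bounded wait and can fail the step cleanly.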

            Activity

            Jesse Glick added a comment -

            A Timeout utility class was introduced into Pipeline to solve some issues associated with JENKINS-32986. Used in the context of Remoting (specifically isDirectory) in JENKINS-37719.
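            For context, that utility wraps a body of code so that the executing thread is interrupted once the limit expires. A minimal stdlib-only sketch of the same idea follows; the class name InterruptTimeout and its limit method are illustrative, not the actual Pipeline API.

```java
import java.util.concurrent.*;

// Minimal sketch of an interrupt-on-deadline guard, in the spirit of the
// Pipeline Timeout utility (names here are illustrative, not the real API).
public class InterruptTimeout implements AutoCloseable {
    private static final ScheduledExecutorService TIMER =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "timeout-timer");
                t.setDaemon(true);
                return t;
            });
    private final ScheduledFuture<?> task;

    private InterruptTimeout(long timeout, TimeUnit unit) {
        Thread caller = Thread.currentThread();
        task = TIMER.schedule(caller::interrupt, timeout, unit);
    }

    public static InterruptTimeout limit(long timeout, TimeUnit unit) {
        return new InterruptTimeout(timeout, unit);
    }

    @Override public void close() {
        task.cancel(false);   // completed in time: disarm the pending interrupt
        Thread.interrupted(); // clear any interrupt flag that raced with completion
    }

    public static void main(String[] args) {
        // Guard a potentially hanging call, e.g. FilePath.isDirectory over Remoting.
        try (InterruptTimeout t = InterruptTimeout.limit(200, TimeUnit.MILLISECONDS)) {
            Thread.sleep(60_000); // stands in for a hung remote call
            System.out.println("completed");
        } catch (InterruptedException e) {
            System.out.println("interrupted by timeout");
        }
    }
}
```

            The design choice worth noting: interruption works for any blocking call that honors Thread.interrupt, which is why it could be bolted onto isDirectory without a Remoting API change, whereas a real Future.get(timeout) overload would let the caller decide per-call.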

            Jesse Glick added a comment -

            Some diagnostic improvements to Request.call, which already sets the Thread.name to the Channel.name:

            • make sure that Channel.name is easily aligned with a SlaveComputer.nodeName; currently we get names like JNLP4-connect connection from some.host.name/1.2.3.4:12345, which bear no obvious relationship to the agent names used elsewhere in thread dumps etc.
            • while it waits in 30s increments, it does not indicate in the Thread.name how long it has been waiting so far, so it is hard to tell whether a given thread dump excerpt shows a normal delay of a couple of seconds on a busy machine or something gone haywire that has been hanging for hours
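            The second point could be addressed by folding the elapsed wait into the thread name on each 30-second round. A self-contained sketch of that loop (the method names and the sample channel name are hypothetical, not code from Request.call):

```java
import java.util.concurrent.TimeUnit;

public class WaitDiagnostics {

    // Hypothetical thread-name format: keep the channel name (as Request.call
    // already does) and append the elapsed wait, so a thread dump distinguishes
    // a few seconds of normal delay from hours of hang.
    static String waitingName(String channelName, long waitedSeconds) {
        return channelName + ": waiting for response, " + waitedSeconds + "s elapsed";
    }

    // Sketch of the 30s-increment wait loop, refreshing the name each round.
    static void awaitResponse(Object lock, java.util.function.BooleanSupplier done,
                              String channelName) throws InterruptedException {
        Thread t = Thread.currentThread();
        String original = t.getName();
        long start = System.nanoTime();
        try {
            synchronized (lock) {
                while (!done.getAsBoolean()) {
                    long waited = TimeUnit.NANOSECONDS.toSeconds(System.nanoTime() - start);
                    t.setName(waitingName(channelName, waited));
                    lock.wait(TimeUnit.SECONDS.toMillis(30)); // same 30s increments as today
                }
            }
        } finally {
            t.setName(original); // restore once the response arrives
        }
    }

    public static void main(String[] args) {
        System.out.println(waitingName("agent-7", 3600));
    }
}
```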
            Oleg Nenashev added a comment -

            I do not anticipate having time for Remoting maintenance in the next several months.
            This task is stalled indefinitely.

            Oleg Nenashev added a comment -

            FTR https://github.com/jenkinsci/remoting/pull/174 is my draft PR. Everybody is welcome to take it over.

            Alex Raber added a comment -

            Hello, the lack of a timeout parameter on the JNLP agent connection to the master might be causing further issues when (on Kubernetes) the jenkins-master container is moved, inducing 2-3 minutes of master downtime while the jenkins-master pod boots on a different Kubernetes node during resource scaling and pod orchestration.

            This appears to result in job pods (live agents/slaves) failing due to having no connection to the master. In Kubernetes this results in a job failure, rather than the job being paused to wait for the master to become available again.

            This in turn results in job pods not being cleaned up, because the pods are left in `Error` state. JENKINS-54540

            Hoping to bridge the gap if there is one; this probably affects everybody who runs on kubernetes/gke/aks/azurek8s.

            Jesse Glick added a comment -

            Alex Raber what you are talking about is a different issue. If the Jenkins master pod is deleted, the Remoting connection on the agent pod’s side should be closed, at which point it will attempt to reconnect at intervals. If that does not happen, it is a bug but please do not discuss it here. This issue is about a timeout on an individual RPC call, mainly in the master → agent direction, for example if the agent JVM ran out of memory and is unresponsive (but the socket remains open).


              People

              Assignee:
              Unassigned
              Reporter:
              Jesse Glick
              Votes:
              4
              Watchers:
              7
