Jenkins / JENKINS-45219

Remoting should terminate() channel after a timeout even if it does not hear from the remote side

Details

    • Bug
    • Status: Resolved
    • Minor
    • Resolution: Duplicate
    • remoting
    • None

    Description

      Currently the channel termination logic depends on the exchange of CloseCommands between the two sides of the channel. Consider two sides, sideA and sideB:

      1) sideA requests the channel close

      2) CloseCommand goes to sideB

      3) Transport#commandReceiver() on sideB fails to invoke the task for any reason (deadlock, overload, thread death, etc.) and does not send the CloseCommand back

      4) channel.terminate(new OrderlyShutdown(createdAt)) does not get invoked on sideA

      5) If the channel is operational && there is no PingThread, channel.terminate() will never be invoked on sideA

      6) The channel on sideA never closes the Receiver, so Channel#inClosed stays null

      7) If there are pending Request#calls() operations, they may hang indefinitely in this loop:

      while(response==null && !channel.isInClosed())
        // I don't know exactly when this can happen, as pendingCalls are cleaned up by Channel,
        // but in production I've observed that in rare occasion it can block forever, even after a channel
        // is gone. So be defensive against that.
        wait(30*1000);
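
      As a rough illustration of why this loop never exits in the scenario above, and of one way to bound it, here is a minimal sketch. BoundedCallWait, awaitResponse and the overall deadline are hypothetical names introduced for this example; this is not the real hudson.remoting.Request code. It keeps the periodic 30-second re-check of the closed flag but also gives up once a caller-supplied deadline passes.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for a pending remote call; not the real hudson.remoting.Request. */
class BoundedCallWait<T> {
    private T response; // guarded by "this"

    /** Delivers the response and wakes the waiting caller. */
    synchronized void onResponse(T value) {
        response = value;
        notifyAll();
    }

    /**
     * Waits for a response until the channel reports closed or the deadline passes.
     * Returns the response, or null if the wait was abandoned.
     * The inClosed supplier plays the role of channel.isInClosed() (assumption).
     */
    synchronized T awaitResponse(BooleanSupplier inClosed, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try {
            while (response == null && !inClosed.getAsBoolean()) {
                long remainingMillis =
                        TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMillis <= 0) {
                    break; // deadline passed: stop waiting even though inClosed never flipped
                }
                // Wake at least every 30 seconds to re-check inClosed, like the original loop.
                wait(Math.min(remainingMillis, 30 * 1000L));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return response;
    }
}
```

      Without the deadline check, a caller with no response and an inClosed flag that never becomes true would simply re-enter wait() every 30 seconds, which matches the hang in step 7.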

       

      If we set a timeout for channel termination in close(), sideA could forcefully terminate the channel when sideB does not send the CloseCommand back within that timeout (e.g. 1 minute).
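
      A minimal sketch of that idea, under stated assumptions: TimedCloseChannel, onPeerCloseCommand and the string termination causes are hypothetical stand-ins, not the real hudson.remoting.Channel API. close() arms a watchdog, and whichever happens first (the peer's CloseCommand or the timeout) terminates the channel exactly once.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for a channel; not the real hudson.remoting.Channel. */
class TimedCloseChannel {
    // Shared daemon timer so the watchdog never keeps the JVM alive.
    private static final ScheduledExecutorService TIMER =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "close-timeout-watchdog");
                t.setDaemon(true);
                return t;
            });

    // null while the channel is open; set once to the termination cause.
    private final AtomicReference<String> terminationCause = new AtomicReference<>();
    private volatile ScheduledFuture<?> watchdog;

    /** sideA requests the close; sending the CloseCommand itself is omitted here. */
    void close(long timeout, TimeUnit unit) {
        if (terminationCause.get() != null) {
            return; // already terminated, nothing to guard
        }
        // Force termination if the peer never sends its CloseCommand back.
        watchdog = TIMER.schedule(() -> terminate("close timed out"), timeout, unit);
    }

    /** Called when the peer's CloseCommand arrives (the orderly path). */
    void onPeerCloseCommand() {
        terminate("orderly shutdown");
    }

    /** Idempotent: the first cause wins, later calls are no-ops. */
    void terminate(String cause) {
        if (terminationCause.compareAndSet(null, cause)) {
            ScheduledFuture<?> w = watchdog;
            if (w != null) {
                w.cancel(false); // no-op if the watchdog is the caller
            }
            synchronized (this) {
                notifyAll(); // wake anyone blocked in awaitTermination
            }
        }
    }

    /** Waits up to maxMillis for termination; returns the cause, or null if still open. */
    synchronized String awaitTermination(long maxMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxMillis);
        try {
            while (terminationCause.get() == null) {
                long remaining = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remaining <= 0) {
                    break;
                }
                wait(remaining);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return terminationCause.get();
    }
}
```

      With the 1 minute value suggested above, the call would be close(1, TimeUnit.MINUTES). The compare-and-set makes the race between a late CloseCommand and the watchdog harmless, since only the first caller records a cause.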

          Activity

            oleg_nenashev Oleg Nenashev added a comment -

            I think it's just JENKINS-44785 now. JENKINS-45294 and JENKINS-45023 should have improved the situation a lot, but we still need timeouts to break this wait() cycle.


            People

              Assignee: Unassigned
              Reporter: Oleg Nenashev (oleg_nenashev)
              Votes: 0
              Watchers: 3
