[JENKINS-25698] Archiving artifacts (or FilePath operations) should properly time out and retry the copy

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • core
    • None

      I have two related requests regarding how the artifacts are copied, assuming that no proper syncing will be implemented any time soon:

      1. The copying process when archiving artifacts should impose a timeout on network operations. If nothing has been sent for some period of time, it should break the connection. What happens now is that when the connection stalls, our builds run forever or until someone cancels them manually.

      2. When copying fails, it should retry (perhaps after waiting a moment). This plays well with the timeout because it means that when a timeout occurs, the copy fails and gets retried.

      Remember, the first fallacy of distributed computing is "The network is reliable." So you can't assume a single attempt to copy the data will always work.
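
      As a rough illustration of the two requests above, here is a minimal Java sketch of a stream copy that gives up when no bytes arrive within an idle timeout and then retries from scratch after a short pause. Everything in it (`RetryingCopy`, `StreamSource`, `StreamSink`, `copyWithIdleTimeout`, `copyWithRetry`, and all parameter names) is hypothetical and made up for this example; it is not how Jenkins core or remoting copies artifacts, and it assumes the destination can simply be rewritten on each attempt.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper for illustration; not part of Jenkins or remoting. */
final class RetryingCopy {
    /** Opens a fresh source stream, so every attempt starts from the beginning. */
    @FunctionalInterface interface StreamSource { InputStream open() throws IOException; }
    /** Opens a fresh destination stream for each attempt. */
    @FunctionalInterface interface StreamSink { OutputStream open() throws IOException; }

    private RetryingCopy() {}

    /** Copies in to out; fails if a single read yields nothing within the idle timeout. */
    static long copyWithIdleTimeout(InputStream in, OutputStream out,
                                    long idleTimeoutMillis, ExecutorService pool) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        Callable<Integer> readOnce = () -> in.read(buf);
        while (true) {
            // Run the blocking read on another thread so the wait can be bounded.
            Future<Integer> pending = pool.submit(readOnce);
            int n;
            try {
                n = pending.get(idleTimeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                pending.cancel(true); // best effort; closing the stream may also be needed
                throw new IOException("No data received for " + idleTimeoutMillis + " ms", e);
            } catch (InterruptedException e) {
                pending.cancel(true);
                Thread.currentThread().interrupt();
                throw new IOException("Interrupted while copying", e);
            } catch (ExecutionException e) {
                Throwable cause = e.getCause();
                throw cause instanceof IOException ? (IOException) cause : new IOException(cause);
            }
            if (n < 0) {
                return total; // end of stream: copy finished
            }
            out.write(buf, 0, n);
            total += n;
        }
    }

    /** Retries the whole copy up to maxAttempts times, sleeping backoffMillis between attempts. */
    static long copyWithRetry(StreamSource source, StreamSink sink, int maxAttempts,
                              long idleTimeoutMillis, long backoffMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        // Daemon threads, so a reader stuck in a stalled read cannot keep the JVM alive.
        ExecutorService pool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "copy-reader");
            t.setDaemon(true);
            return t;
        });
        try {
            IOException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try (InputStream in = source.open(); OutputStream out = sink.open()) {
                    return copyWithIdleTimeout(in, out, idleTimeoutMillis, pool);
                } catch (IOException e) {
                    last = e; // a timeout or any other I/O failure triggers a retry
                    if (attempt < maxAttempts) {
                        Thread.sleep(backoffMillis);
                    }
                }
            }
            throw last;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

      A call such as `RetryingCopy.copyWithRetry(source, sink, 3, 60_000, 5_000)` would allow three attempts, treat 60 seconds of silence as a stall, and wait 5 seconds between attempts. Restarting from the beginning is the simplest choice; resuming a partial transfer would need support from both ends.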

        1. screenshot-1.jpg (177 kB)
        2. stacks.zip (16 kB)
        3. Thread dump [Jenkins].htm (141 kB)

          trejkaz added a comment -

          Attaching my own stack dumps from a stall this morning, in case they're any different. This is before anyone had a chance to kill the build.

          ikedam added a comment -

          @d3matt What do you mean by "timed out"?
          Did it stop automatically, or did someone click the x button?

          Matthew Stoltenberg added a comment -

          It did that on its own, no intervention.

          ikedam added a comment -

          @trejkaz What's the name of the hung job?

          trejkaz added a comment - edited

          Executor #0 for U-Igglepiggle : executing trunk-compile/os=ubuntu #3556

          (We only have one job per slave, so it is easy to deduce from the dump.)

          ikedam added a comment -

          Both stack traces point here: https://github.com/jenkinsci/remoting/blob/remoting-2.48/src/main/java/hudson/remoting/FastPipedInputStream.java#L175
          There is a timeout there (default 10 secs), and there may be an infinite loop calling it.

          ikedam added a comment -

          Oh, there's an infinite loop just before that.
          It means the opposite output stream isn't closed. Who should close it?
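
          To make the pattern described above concrete, here is a small, self-contained Java sketch of a reader that waits in short timed slices for a writer. It is not the remoting code: `DeadlineBuffer`, `sliceMillis`, `maxIdleMillis`, and the method names are all invented for illustration. A loop that only re-waits in slices spins forever if the writer neither writes nor closes; the sketch adds an overall deadline so that the reader eventually fails with an `IOException`.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.concurrent.TimeUnit;

/** Hypothetical in-memory pipe for illustration; not the remoting implementation. */
final class DeadlineBuffer {
    private final Object lock = new Object();
    private final ArrayDeque<Integer> data = new ArrayDeque<>();
    private final long sliceMillis;   // length of each individual wait
    private final long maxIdleMillis; // overall limit for one read call
    private boolean writerClosed;

    DeadlineBuffer(long sliceMillis, long maxIdleMillis) {
        this.sliceMillis = sliceMillis;
        this.maxIdleMillis = maxIdleMillis;
    }

    /** Writer side: adds one byte and wakes up any waiting reader. */
    void write(int b) {
        synchronized (lock) {
            data.add(b & 0xff);
            lock.notifyAll();
        }
    }

    /** Writer side: signals end of stream. */
    void closeWriter() {
        synchronized (lock) {
            writerClosed = true;
            lock.notifyAll();
        }
    }

    /** Returns the next byte, -1 at end of stream, or throws once the deadline passes. */
    int read() throws IOException, InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxIdleMillis);
        synchronized (lock) {
            while (data.isEmpty()) {
                if (writerClosed) {
                    return -1;
                }
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    // Without this check the loop would keep waiting slice after slice.
                    throw new IOException("Writer neither wrote nor closed within "
                            + maxIdleMillis + " ms");
                }
                long remainingMillis = TimeUnit.NANOSECONDS.toMillis(remainingNanos);
                long waitMillis = Math.max(1, Math.min(sliceMillis, remainingMillis));
                lock.wait(waitMillis);
            }
            return data.poll();
        }
    }
}
```

          Whether an overall deadline is the right fix here, or whether the writer side should be made to close its stream reliably, is a separate question; the sketch only shows that a per-wait timeout alone does not bound the total wait.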

          Daniel Beck added a comment -

          Matthew: Your issue looks more like JENKINS-22914. The underlying exception causes are relevant.

          This report is EOF, while yours is a forced abort due to ping timeout.

          Daniel Beck added a comment -

          oleg_nenashev Are there changes in remoting since this was reported that make this issue obsolete?

          Simon Martineau added a comment -

          This is still a valid issue. I am using Jenkins 2.1.30 and I still have this problem.

          I run very large builds that produce many GB of artifacts. Those builds run for hours. It is very frustrating, when a build comes to the end, to see the job hang in the middle of the transfer. I know I can access the files directly on the builder, but that means I am running a build that will never be archived and can't be released.

            Assignee: Unassigned
            Reporter: trejkaz
            Votes: 2
            Watchers: 5