JENKINS-50597

Verify behavior of timeouts, interrupts, and network disconnections in S3 storage

      svanoort reminds me that we need to examine the behavior of this plugin with respect to timeouts and network failures and the like. Specifically, we can classify anomalous events as follows:

      • Network failures, typically throwing an exception from some socket call.
      • Network hangs (perhaps due to misconfigured TCP settings), whereby a socket call just blocks indefinitely (java.io versions are typically immune to interruption except by Thread.stop, alas); see the sketch after this list.
      • User-initiated interrupt: Stop button is clicked.
      • System-initiated interrupt, such as via the timeout step.
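
      As an illustration of the network-hang case above, a minimal sketch (not plugin code; host, ports, and timeout values are placeholders) of how socket-level timeouts turn an otherwise uninterruptible java.io read into a catchable failure:

{code:java}
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Illustration only: a plain java.io read does not respond to Thread.interrupt(),
// but connect and read timeouts bound how long it can block.
public class TimeoutDemo {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress("example.com", 80), 5_000); // connect timeout
            socket.setSoTimeout(10_000); // read timeout: bounds an otherwise indefinite block
            InputStream in = socket.getInputStream();
            int first = in.read(); // throws SocketTimeoutException instead of hanging forever
            System.out.println("first byte: " + first);
        } catch (SocketTimeoutException e) {
            // The hang becomes a failure that a step or cleanup task can report and recover from.
            System.err.println("network call timed out: " + e.getMessage());
        }
    }
}
{code}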

      The code which would be impacted by such events can also be classified:

      • Master-side S3 metadata calls made in the course of a build, such as for archiveArtifacts, typically inside SynchronousNonBlockingStepExecution.
      • Master-side S3 metadata calls made in the context of a build but not inside a build step:
        • artifact & stash deletion during log rotation of old builds
        • stash deletion at the end of a build
        • artifact & stash copy during checkpoint resumption
      • Master-side S3 metadata calls made completely outside the context of a build:
        • artifact browsing from classic UI
        • same but from Blue Ocean
      • Agent-side URL GET or POST calls made from a build step.

      Draft acceptance criteria:

      • Build steps may hang or fail due to network issues, but timeout or manual interrupts must be honored promptly. (retry can be used for critical builds when there is an advance expectation of problems; checkpoints can also be used for manual intervention.)
      • Operations associated with a build but outside the context of a build step must apply some reasonable timeout, and if this is exceeded, either fail or issue a warning, according to the nature of the API (a sketch follows this list).
      • Operations associated with an HTTP request thread in classic UI may block on the network, though if some reasonable timeout is exceeded an HTTP error should be returned and the thread returned to the pool.
      • Blue Ocean behavior is TBD. Ideally these REST calls would be asynchronous and not block rendering of the Artifacts tab.
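
      As a sketch of the second criterion (hypothetical helper, not the plugin's actual code), a master-side metadata call could be bounded by a Future and then either fail or merely warn depending on how critical the operation is:

{code:java}
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: bound a master-side S3 metadata call with a timeout, then fail
// (for operations a build depends on) or just warn (for best-effort cleanup such as
// artifact & stash deletion during log rotation) according to criticality.
public class BoundedMetadataCall {
    private static final ExecutorService POOL = Executors.newCachedThreadPool();

    static <T> T callWithTimeout(Callable<T> call, long timeoutSeconds, boolean critical) throws IOException {
        Future<T> future = POOL.submit(call);
        try {
            return future.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort; the underlying socket call may not be interruptible
            if (critical) {
                throw new IOException("S3 metadata call exceeded " + timeoutSeconds + "s", e);
            }
            System.err.println("WARNING: S3 metadata call timed out; continuing");
            return null;
        } catch (ExecutionException e) {
            throw new IOException("S3 metadata call failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("interrupted while waiting for S3 metadata call", e);
        }
    }
}
{code}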


          Jesse Glick created issue -

          Sam Van Oort added a comment -

          Huge thumbs-up for tracking this as something that needs to be part of the design, and I think the criteria sound reasonable.

          I'd add to the draft acceptance criteria that:

          • The non-step operations need to have reasonable default handling of direct network failures and throw an appropriately specific exception (hopefully not just a blind IOException, but some sort of subtype)

          This table of Java network exception types may be of assistance: http://vaibhavblogs.org/2012/12/common-java-networking-exceptions/
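
          For illustration, a sketch (the classify helper is hypothetical) of reacting to the specific JDK exception subtypes rather than a blanket IOException:

{code:java}
import java.io.IOException;
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

// Illustration only: the JDK already throws distinct IOException subtypes for
// common network failures, so callers can react to each case specifically.
public class NetworkFailureClassifier {
    static String classify(IOException e) {
        if (e instanceof UnknownHostException)   return "DNS lookup failed: " + e.getMessage();
        if (e instanceof ConnectException)       return "connection refused: " + e.getMessage();
        if (e instanceof NoRouteToHostException) return "no route to host: " + e.getMessage();
        if (e instanceof SocketTimeoutException) return "timed out: " + e.getMessage();
        return "generic I/O failure: " + e;
    }
}
{code}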


          Sam Van Oort added a comment -

          An alternative approach for dealing with errors, if we don't want to mandate a lot of checked exceptions, may be to accept an optional Handler object that implements its own strategy for dealing with the various kinds of failures (i.e. it can do retries, logging, timeouts) – this lets us revise the approach in the future a bit more flexibly.
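
          A minimal sketch of what such a handler could look like (the FailureHandler and RetryingHandler names are invented here for illustration; nothing like this exists in the plugin yet):

{code:java}
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical handler interface: callers hand it the operation and the handler
// decides how to deal with failures (retries, logging, timeouts, fallbacks).
interface FailureHandler {
    <T> T execute(Callable<T> operation) throws IOException;
}

// One possible strategy: a fixed number of retries with simple logging.
class RetryingHandler implements FailureHandler {
    private final int maxAttempts;

    RetryingHandler(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    @Override
    public <T> T execute(Callable<T> operation) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (IOException e) {
                last = e;
                System.err.println("attempt " + attempt + " failed: " + e.getMessage());
            } catch (Exception e) {
                throw new IOException(e); // non-I/O failures are not retried here
            }
        }
        throw last != null ? last : new IOException("no attempts were made");
    }
}
{code}

          An entry point could then accept an optional FailureHandler and fall back to a no-retry implementation when none is supplied.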


          Carlos Sanchez added a comment - edited

          Given my previous experience, I'll add that we need to make sure calls to AWS APIs are retried with backoff:

          If an API request exceeds the API request rate for its category, the request returns the RequestLimitExceeded error code. To prevent this error, ensure that your application doesn't retry API requests at a high rate. You can do this by using care when polling and by using exponential backoff retries.

          https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html
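
          For illustration, a generic exponential-backoff-with-jitter wrapper along those lines (isThrottlingError is a placeholder predicate; real code would inspect the AWS SDK's service error code rather than the message text):

{code:java}
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of retrying throttled AWS calls with exponential backoff and full jitter,
// as the AWS documentation recommends. Not taken from the plugin.
public class BackoffRetry {
    static <T> T withBackoff(Callable<T> call, int maxAttempts) throws Exception {
        long delayCapMillis = 200;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts || !isThrottlingError(e)) {
                    throw e; // give up: out of attempts, or not a throttling error at all
                }
                // full jitter: sleep a random amount up to the current cap, then double the cap
                Thread.sleep(ThreadLocalRandom.current().nextLong(delayCapMillis));
                delayCapMillis = Math.min(delayCapMillis * 2, 20_000);
            }
        }
    }

    // Placeholder: a real implementation would check the error code reported by the SDK
    // (e.g. "RequestLimitExceeded") instead of the exception message.
    private static boolean isThrottlingError(Exception e) {
        return e.getMessage() != null && e.getMessage().contains("RequestLimitExceeded");
    }
}
{code}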

          Jesse Glick made changes -
          Assignee Original: Jesse Glick [ jglick ]

          Vivek Pandey added a comment -

          +1 for retries with exponential backoff and a circuit breaker pattern.


          Jesse Glick added a comment -

          As mentioned in the issue description, I see no need for baking retries into the implementation for the steps invoked by the Pipeline script. You can already use retry (or waitUntil, which does exponential backoff) in any Pipeline for which the success of individual builds is so important that you prefer to keep going until AWS responds. Best to leave retry semantics (and timeout and checkpoints) in the hands of users, who are better placed to judge whether it is appropriate in a given context, and keep the build step itself simple and transparent.

          AWS calls made by the Jenkins master outside the build context are another matter. The user has no control over these, so they need to behave sanely on their own, including timeouts and perhaps also retries, depending on the criticality of the function.


          Sam Van Oort added a comment - edited

          So... from discussion with jglick, my main concern is that:

          1. Any failures at the network layer result in a subtype of IOException rather than indefinite hangs, and that retry be supported implicitly if desired. Edit: also, any failures need to be isolated to the request itself – they should not be able to impact other parts of the system, other builds, etc.
          2. An option to inject a custom handler for network-level failures that supports a Strategy for retry / timeout / fallback.

          Jesse Glick made changes -
          Remote Link New: This issue links to "PR 28 (Web Link)" [ 20729 ]
          Jesse Glick made changes -
          Status Original: Open [ 1 ] New: In Progress [ 3 ]

            Votes: 1
            Watchers: 6