Wrap CRT acquire timeout and other transient HTTP errors in IOException #7037

Open
zoewangg wants to merge 1 commit into master from zoewang/crt-acquire-timeout-retry
@zoewangg

Contributor

Motivation and Context

Customers using the AWS CRT HTTP client are seeing an increased rate of
software.amazon.awssdk.crt.http.HttpException: Connection Manager failed to acquire a connection within the defined timeout.
The cause is that the SDK retry layer does not retry this error.

Root cause: CrtUtils.wrapWithIoExceptionIfRetryable only wrapped errors that
the native CRT layer classified as transient via CRT.awsIsTransientError,
plus a single special case for HEALTH_CHECK_FAILURE. The acquire-timeout
error code (AWS_ERROR_HTTP_CONNECTION_MANAGER_ACQUISITION_TIMEOUT, 2093)
falls outside that allowlist, so it surfaced as raw HttpException and the
retry layer treated it as non-retryable. Several other recoverable HTTP
errors (GOAWAY_RECEIVED, RESPONSE_FIRST_BYTE_TIMEOUT,
MAX_CONCURRENT_STREAMS_EXCEEDED, etc.) had the same problem.
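
The wrapping decision described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the SDK's actual code: `FakeHttpException` is a hypothetical stand-in for `software.amazon.awssdk.crt.http.HttpException`, the `IntPredicate` parameter is a hypothetical stand-in for the native `CRT.awsIsTransientError` check, and `wrapIfRetryable` is an illustrative name. Only the acquire-timeout code (2093) is listed, because the description gives no numeric values for the other codes; the HEALTH_CHECK_FAILURE special case is omitted for the same reason.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.function.IntPredicate;

class CrtRetrySketch {

    // Hypothetical stand-in for the CRT HttpException, which carries a native error code.
    static final class FakeHttpException extends RuntimeException {
        private final int errorCode;

        FakeHttpException(int errorCode, String message) {
            super(message);
            this.errorCode = errorCode;
        }

        int getErrorCode() {
            return errorCode;
        }
    }

    // AWS_ERROR_HTTP_CONNECTION_MANAGER_ACQUISITION_TIMEOUT, value taken from the PR description.
    static final int ACQUISITION_TIMEOUT = 2093;

    // Codes the native classifier does not mark as transient but that are treated as recoverable.
    // The other codes named in the description are left out because their values are not given there.
    private static final Set<Integer> ADDITIONAL_RETRYABLE_ERROR_CODES;

    static {
        Set<Integer> codes = new HashSet<>();
        codes.add(ACQUISITION_TIMEOUT);
        ADDITIONAL_RETRYABLE_ERROR_CODES = Collections.unmodifiableSet(codes);
    }

    // Wraps the exception in an IOException when either the (assumed) native check or the
    // additional allowlist says the error is recoverable; otherwise returns it unchanged.
    static Throwable wrapIfRetryable(FakeHttpException e, IntPredicate nativeTransientCheck) {
        int code = e.getErrorCode();
        if (nativeTransientCheck.test(code) || ADDITIONAL_RETRYABLE_ERROR_CODES.contains(code)) {
            return new IOException(e.getMessage(), e);
        }
        return e;
    }

    public static void main(String[] args) {
        IntPredicate nativeSaysNotTransient = code -> false;
        Throwable wrapped = wrapIfRetryable(
            new FakeHttpException(ACQUISITION_TIMEOUT, "acquire timeout"), nativeSaysNotTransient);
        System.out.println(wrapped.getClass().getName());
    }
}
```

Keeping the original exception as the cause of the IOException preserves the native error code for diagnostics while giving the retry layer an exception type it recognizes.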

Internal ref:

Modifications

  • Add additional retryable error codes to check against in CrtUtils

Testing

  • Strengthened suite assertion exercised in apache-client, apache5-client,
    url-connection-client, aws-crt-client (sync + async): all pass.

  • netty-nio-client NettyAsyncHttpClientLongRunningRequestTest: 3 tests
    including the inherited (and overridden) acquire-timeout test.

  • SdkHttpClientLongRunningRequestTestSuite and
    SdkAsyncHttpClientLongRunningRequestTestSuite -
    executeWhenConnectionAcquireTimeoutAndPoolExhaustedFailsWithinTimeoutBound
    previously asserted only the timing bound. Strengthened to also assert the
    failure cause chain contains IOException, via a new
    LongRunningRequestTestSupport.assertFailsWithIoExceptionWithinTimeBound
    helper. Apache, Apache5, URLConnection, CRT sync, and CRT async all pass
    the strengthened assertion.

  • NettyAsyncHttpClientLongRunningRequestTest - Netty does NOT pass the
    strengthened assertion; NettyUtils.decorateException wraps acquire timeout
    in a plain Throwable, the same class of bug we're fixing for CRT here.
    The Netty fix is intentionally out of scope for this change. Override the
    test in the Netty subclass to keep the timing-bound contract while skipping
    the IOException assertion, with a // TODO explaining why.
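
As a rough illustration of the strengthened assertion, the sketch below checks both the timing bound and that the failure's cause chain contains an IOException. It is an assumption about the general shape only: the real LongRunningRequestTestSupport.assertFailsWithIoExceptionWithinTimeBound helper may have a different signature and use a test framework's assertions, and `hasCause` is a hypothetical helper name.

```java
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.Callable;

class CauseChainSketch {

    // Walks the cause chain and reports whether any throwable in it is of the given type.
    static boolean hasCause(Throwable t, Class<? extends Throwable> type) {
        for (Throwable current = t; current != null; current = current.getCause()) {
            if (type.isInstance(current)) {
                return true;
            }
        }
        return false;
    }

    // Hypothetical shape of the helper: the call must fail, must fail within the bound,
    // and the failure must have an IOException somewhere in its cause chain.
    static void assertFailsWithIoExceptionWithinTimeBound(Callable<?> call, Duration bound) {
        long start = System.nanoTime();
        Throwable failure = null;
        try {
            call.call();
        } catch (Throwable t) {
            failure = t;
        }
        Duration elapsed = Duration.ofNanos(System.nanoTime() - start);

        if (failure == null) {
            throw new AssertionError("Expected the call to fail, but it succeeded");
        }
        if (elapsed.compareTo(bound) > 0) {
            throw new AssertionError("Failed after " + elapsed + ", which exceeds the bound " + bound);
        }
        if (!hasCause(failure, IOException.class)) {
            throw new AssertionError("Expected an IOException in the cause chain", failure);
        }
    }
}
```

Walking the whole chain, instead of checking only the top-level exception, matters for async clients, where the failure is typically delivered inside a wrapper such as CompletionException.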

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

Checklist

  • I have read the CONTRIBUTING document
  • Local run of mvn install succeeds
  • My code follows the code style of this project
  • My change requires a change to the Javadoc documentation
  • I have updated the Javadoc documentation accordingly
  • I have added tests to cover my changes
  • All new and existing tests passed
  • I have added a changelog entry.
  • My change is to implement 1.11 parity feature and I have updated LaunchChangelog

License

  • I confirm that this pull request can be released under the Apache 2 license

Connection-pool acquire timeout and several transient HTTP error codes were
surfacing as raw HttpException, so the SDK retry layer treated them as
non-retryable. Wrap them in IOException to restore retry behavior, and
strengthen the shared LongRunningRequestTestSuite to enforce the contract.
zoewangg requested a review from a team as a code owner on June 15, 2026 23:15
// HTTP error codes that the native CRT classifier (CRT.awsIsTransientError) does NOT mark as transient
// but that the SDK considers recoverable. See enum aws_http_errors in aws-c-http/include/aws/http/http.h
// for symbolic names.
private static final Set<Integer> ADDITIONAL_RETRYABLE_ERROR_CODES;

Contributor Author

We will remove this list once those error codes are added to CRT.

Contributor

Where did we derive these seven errors?
