feat(providers): add aws_sts_assume_role refresh strategy (v2 only) #1576

@russellb

Problem Statement

An agent running in an OpenShell sandbox needs to access AWS services (e.g., push content to S3) using short-lived STS credentials. The gateway credential refresh system (#1306, PR #1349) already supports OAuth2 and Google service-account strategies, but has no AWS STS support. Access to the AWS Instance Metadata Service (IMDS) is blocked by a hardcoded rule in sandboxes, so ambient host credentials are unavailable. Users must either inject static long-lived IAM keys (a security liability) or run external refresh daemons, which moves lifecycle management outside OpenShell. That is exactly the situation #1306 set out to fix.

This feature adds aws_sts_assume_role as a first-class gateway-owned refresh strategy, scoped exclusively to provider v2 (providers_v2_enabled=true).

Technical Context

The credential refresh foundation (PR #1349) built a generalizable system: refresh state stored as scoped objects, a background refresh worker, strategy-specific mint functions, and credential propagation to running sandboxes via provider environment revision polling. The design explicitly left room for AWS STS as a future strategy. However, STS is structurally different from OAuth2 — it produces three coupled values (AccessKeyId, SecretAccessKey, SessionToken) from one API call, whereas existing strategies produce a single access_token. This requires extending MintedCredential to support multi-key output.

The v2 scoping is intentional: service-specific profiles (e.g., aws-s3) rely on v2's JIT profile-based policy layer composition to automatically contribute endpoint network rules. Without v2, the profile's endpoints are ignored.

Affected Components

| Component | Key Files | Role |
| --- | --- | --- |
| Proto model | proto/openshell.proto:892-899 | ProviderCredentialRefreshStrategy enum: add AWS_STS_ASSUME_ROLE = 6 |
| Refresh engine | crates/openshell-server/src/provider_refresh.rs | MintedCredential struct (line 224), apply_minted_credential (line 400), mint_credential dispatch (line 437), is_gateway_mintable_strategy (line 286) |
| Gateway RPC | crates/openshell-server/src/grpc/provider.rs:1138-1334 | handle_configure_provider_refresh: v2 gate + multi-key collision validation |
| V2 gate | crates/openshell-server/src/grpc/policy.rs:634-642 | bool_setting_enabled: needs pub(super) for reuse from provider.rs |
| Profile serde | crates/openshell-providers/src/profiles.rs:492-519 | Strategy name mapping (from_yaml/to_yaml) |
| Profile registry | crates/openshell-providers/src/profiles.rs:19-23 | BUILT_IN_PROFILE_YAMLS: add new profiles |
| Server deps | crates/openshell-server/Cargo.toml | Add aws-sdk-sts, aws-config |
| Provider profiles | providers/aws.yaml, providers/aws-s3.yaml | New profile files |
| Docs | docs/sandboxes/manage-providers.mdx | AWS STS setup documentation |

Technical Investigation

Architecture Overview

The credential refresh system has these layers:

  1. Provider profiles (providers/*.yaml) declare credential schemas, including refresh metadata (strategy, material requirements, timing).
  2. ConfigureProviderRefresh RPC (provider.rs:1138) stores refresh material as a StoredProviderCredentialRefreshState scoped object, separate from injectable Provider.credentials.
  3. Refresh worker (provider_refresh.rs) runs periodically, scans refresh states, and calls strategy-specific mint_* functions for due credentials.
  4. mint_credential dispatches to strategy implementations (OAuth2 refresh token, OAuth2 client credentials, Google SA JWT).
  5. apply_minted_credential writes the minted token into Provider.credentials via CAS update, triggering a provider_env_revision change.
  6. Sandbox propagation — the sandbox supervisor polls for revision changes and updates the ProviderCredentialState snapshot, so proxy placeholder resolution sees refreshed credentials.

Under provider v2 (providers_v2_enabled=true), profile endpoints also contribute JIT network policy layers via profile_provider_policy_layers() (policy.rs:586-632).

Code References

| Location | Description |
| --- | --- |
| proto/openshell.proto:892-899 | ProviderCredentialRefreshStrategy enum; last value is GOOGLE_SERVICE_ACCOUNT_JWT = 5 |
| provider_refresh.rs:224-229 | MintedCredential struct; currently access_token, expires_at_ms, refresh_token |
| provider_refresh.rs:400-435 | apply_minted_credential: CAS update writes a single credential_key into the provider |
| provider_refresh.rs:437-458 | mint_credential: match dispatch to strategy-specific functions |
| provider_refresh.rs:286-293 | is_gateway_mintable_strategy: allowlist of strategies the gateway can mint |
| provider_refresh.rs:273-284 | refresh_strategy_name: display name mapping |
| provider.rs:1138-1334 | handle_configure_provider_refresh: full handler including validation |
| provider.rs:1156-1166 | Strategy validation via is_gateway_mintable_strategy() |
| provider.rs:1229 | validate_provider_credential_key_available_for_attached_sandboxes: only validates the primary key |
| provider.rs:505-530 | validate_provider_update_against_attached_sandboxes: iterates sandbox attachments |
| policy.rs:634-642 | bool_setting_enabled: currently fn (private), needs pub(super) |
| policy.rs:487-488 | Existing v2 gate usage pattern |
| profiles.rs:492-506 | provider_refresh_strategy_from_yaml |
| profiles.rs:509-519 | provider_refresh_strategy_to_yaml |
| profiles.rs:19-23 | BUILT_IN_PROFILE_YAMLS array |
| settings.rs:51 | PROVIDERS_V2_ENABLED_KEY = "providers_v2_enabled" |

Current Behavior

  • MintedCredential models a single access_token output per refresh operation.
  • apply_minted_credential writes one credential key per CAS update.
  • The handle_configure_provider_refresh handler validates only the primary credential_key against attached sandbox collisions.
  • No AWS SDK dependencies exist in the workspace.
  • Three provider profiles exist: claude-code, github, nvidia.

What Would Need to Change

Multi-key MintedCredential:

  • Add additional_credentials: HashMap<String, String> to MintedCredential (line 224).
  • In apply_minted_credential (line 400): build the candidate updated provider with all keys (primary + additional) before the pre-CAS validation call; update the CAS closure to write all keys with the same expires_at_ms.
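
A minimal sketch of the multi-key shape described in the two bullets above, under stated assumptions: the `Provider` struct, its `revision` field, and the `apply_minted_credential` signature are simplified stand-ins for illustration, not the actual types in provider_refresh.rs. It shows the candidate being built with every key before the swap, so all keys share one `expires_at_ms` and land in a single revision bump.

```rust
use std::collections::HashMap;

/// Simplified, hypothetical mirror of the struct described above.
#[derive(Debug, Clone)]
#[allow(dead_code)]
struct MintedCredential {
    access_token: String,
    expires_at_ms: u64,
    refresh_token: Option<String>,
    /// Proposed addition: extra key/value pairs minted by the same call.
    additional_credentials: HashMap<String, String>,
}

/// Stand-in for the stored provider; `revision` models the CAS version.
#[derive(Debug, Clone, Default)]
struct Provider {
    revision: u64,
    credentials: HashMap<String, String>,
    expires_at_ms: HashMap<String, u64>,
}

/// Writes the primary key and every additional key in one compare-and-swap.
/// Fails without touching `stored` if the revision moved since it was read.
fn apply_minted_credential(
    stored: &mut Provider,
    expected_revision: u64,
    credential_key: &str,
    minted: &MintedCredential,
) -> Result<u64, String> {
    if stored.revision != expected_revision {
        return Err(format!(
            "revision conflict: expected {expected_revision}, found {}",
            stored.revision
        ));
    }
    // Build the full candidate first so a partial write is never visible.
    let mut candidate = stored.clone();
    candidate
        .credentials
        .insert(credential_key.to_string(), minted.access_token.clone());
    candidate
        .expires_at_ms
        .insert(credential_key.to_string(), minted.expires_at_ms);
    for (key, value) in &minted.additional_credentials {
        candidate.credentials.insert(key.clone(), value.clone());
        candidate.expires_at_ms.insert(key.clone(), minted.expires_at_ms);
    }
    candidate.revision += 1;
    *stored = candidate;
    Ok(stored.revision)
}
```

In the real handler the pre-CAS validation call would run against `candidate` before the swap; that step is omitted here.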

New mint function:

  • Add mint_aws_sts_assume_role that uses aws-sdk-sts to call AssumeRole.
  • Resolve the gateway's own AWS credentials for the call: use stored IAM keys from the refresh material if present; otherwise fall back to the AWS default credential chain (ambient credentials).
  • Return MintedCredential with access_token = AccessKeyId, additional_credentials = {AWS_SECRET_ACCESS_KEY: SecretAccessKey, AWS_SESSION_TOKEN: SessionToken}.
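
The mapping in the bullets above can be sketched without the SDK. This is an illustration only: `StsCredentials` is a hypothetical stand-in for whatever aws-sdk-sts returns from AssumeRole, and the material key names `aws_access_key_id` and `aws_secret_access_key` are assumptions, not names taken from the design doc.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the three values (plus expiry) returned by
/// STS AssumeRole; the real type would come from aws-sdk-sts.
struct StsCredentials {
    access_key_id: String,
    secret_access_key: String,
    session_token: String,
    expiration_ms: u64,
}

/// Simplified MintedCredential carrying the proposed multi-key field.
struct MintedCredential {
    access_token: String,
    expires_at_ms: u64,
    additional_credentials: HashMap<String, String>,
}

/// How the gateway would authenticate its own AssumeRole call.
#[derive(Debug, PartialEq)]
enum GatewayAwsCredentials {
    Stored {
        access_key_id: String,
        secret_access_key: String,
    },
    Ambient,
}

/// Prefer stored IAM keys; fall back to ambient if either key is missing.
/// The material key names are assumptions made for this sketch.
fn resolve_gateway_credentials(material: &HashMap<String, String>) -> GatewayAwsCredentials {
    match (
        material.get("aws_access_key_id"),
        material.get("aws_secret_access_key"),
    ) {
        (Some(id), Some(secret)) => GatewayAwsCredentials::Stored {
            access_key_id: id.clone(),
            secret_access_key: secret.clone(),
        },
        _ => GatewayAwsCredentials::Ambient,
    }
}

/// Maps one STS response onto one MintedCredential: AccessKeyId becomes
/// the primary token, the other two values ride along as additional keys.
fn minted_from_sts(creds: StsCredentials) -> MintedCredential {
    let mut additional_credentials = HashMap::new();
    additional_credentials.insert("AWS_SECRET_ACCESS_KEY".to_string(), creds.secret_access_key);
    additional_credentials.insert("AWS_SESSION_TOKEN".to_string(), creds.session_token);
    MintedCredential {
        access_token: creds.access_key_id,
        expires_at_ms: creds.expiration_ms,
        additional_credentials,
    }
}
```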

V2 gate:

  • In handle_configure_provider_refresh (after line 1166): reject aws_sts_assume_role when providers_v2_enabled=false.
  • Make bool_setting_enabled and load_global_settings in policy.rs visible to provider.rs (pub(super) suffices since both are grpc submodules).
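
As a rough sketch of the gate described above: the settings representation (a string map), the `bool_setting_enabled` signature, and the `check_v2_gate` helper are all assumptions made for illustration; only the setting key and the strategy name come from this issue.

```rust
use std::collections::HashMap;

const PROVIDERS_V2_ENABLED_KEY: &str = "providers_v2_enabled";

#[derive(Debug, Clone, Copy, PartialEq)]
enum RefreshStrategy {
    GoogleServiceAccountJwt,
    AwsStsAssumeRole,
}

/// Hypothetical simplification: global settings as a string map, where a
/// missing key counts as disabled.
fn bool_setting_enabled(settings: &HashMap<String, String>, key: &str) -> bool {
    settings
        .get(key)
        .map(|value| value.trim().eq_ignore_ascii_case("true"))
        .unwrap_or(false)
}

/// Rejects aws_sts_assume_role unless provider v2 is enabled; every other
/// strategy passes through unchanged.
fn check_v2_gate(
    strategy: RefreshStrategy,
    settings: &HashMap<String, String>,
) -> Result<(), String> {
    if strategy == RefreshStrategy::AwsStsAssumeRole
        && !bool_setting_enabled(settings, PROVIDERS_V2_ENABLED_KEY)
    {
        return Err(format!(
            "aws_sts_assume_role requires {PROVIDERS_V2_ENABLED_KEY}=true"
        ));
    }
    Ok(())
}
```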

Multi-key collision validation:

  • In handle_configure_provider_refresh (line 1229): validate all three credential keys (primary + additional) at configure-time, not just the primary. Otherwise, a collision with AWS_SECRET_ACCESS_KEY from another provider would only be caught on the first rotation.
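
One way to picture the configure-time check is a helper that tests every key the strategy will write against the keys already claimed by other providers on the same sandbox. The function name, its signature, and the provider names in the usage below are hypothetical; the real validation goes through validate_provider_credential_key_available_for_attached_sandboxes.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Returns every (provider, key) pair where a key this refresh config would
/// write is already claimed by another provider attached to the same sandbox.
/// `other_providers` maps provider name to the credential keys it injects.
fn find_key_collisions(
    primary_key: &str,
    additional_keys: &[&str],
    other_providers: &BTreeMap<String, BTreeSet<String>>,
) -> Vec<(String, String)> {
    let mut collisions = Vec::new();
    for (provider, keys) in other_providers {
        // Check the primary key and the additional keys in one pass.
        for key in std::iter::once(&primary_key).chain(additional_keys.iter()) {
            if keys.contains(*key) {
                collisions.push((provider.clone(), (*key).to_string()));
            }
        }
    }
    collisions
}
```

An empty result means the configuration can proceed; a non-empty one lets the handler name the conflicting provider and key in its error before any credential is minted.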

Profile and strategy registration:

  • Add aws_sts_assume_role to from_yaml/to_yaml, refresh_strategy_name, is_gateway_mintable_strategy, and mint_credential dispatch.
  • Create providers/aws.yaml and providers/aws-s3.yaml.
  • Add both to BUILT_IN_PROFILE_YAMLS.
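
The registration points listed above amount to a handful of small match arms. The sketch below is illustrative only: the enum is reduced to two variants, and the YAML name `google_service_account_jwt` is an assumption (only `aws_sts_assume_role` and the numeric values 5 and 6 appear in this issue).

```rust
/// Reduced stand-in for ProviderCredentialRefreshStrategy.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RefreshStrategy {
    GoogleServiceAccountJwt = 5,
    AwsStsAssumeRole = 6,
}

/// YAML name to enum; unknown names are rejected rather than defaulted.
fn provider_refresh_strategy_from_yaml(name: &str) -> Option<RefreshStrategy> {
    match name {
        "google_service_account_jwt" => Some(RefreshStrategy::GoogleServiceAccountJwt),
        "aws_sts_assume_role" => Some(RefreshStrategy::AwsStsAssumeRole),
        _ => None,
    }
}

/// Enum to YAML name; must stay the exact inverse of from_yaml.
fn provider_refresh_strategy_to_yaml(strategy: RefreshStrategy) -> &'static str {
    match strategy {
        RefreshStrategy::GoogleServiceAccountJwt => "google_service_account_jwt",
        RefreshStrategy::AwsStsAssumeRole => "aws_sts_assume_role",
    }
}

/// Allowlist of strategies the gateway can mint on its own.
fn is_gateway_mintable_strategy(strategy: RefreshStrategy) -> bool {
    matches!(
        strategy,
        RefreshStrategy::GoogleServiceAccountJwt | RefreshStrategy::AwsStsAssumeRole
    )
}
```

Using exhaustive matches without a wildcard arm in `to_yaml` means the compiler flags any registration point that is missed when a new variant is added.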

Profile category note: The design doc uses category: cloud and category: storage, neither of which exists in the proto enum. Use category: other for both, or add new category enum values (which adds proto API surface).

Alternative Approaches Considered

  1. Three separate refresh configs: one per credential key, sharing one STS call. Rejected: fragile, there is no "refresh group" concept, and the three keys could drift into a mismatched triple if they are ever written from different STS calls.
  2. Single composite credential value: a JSON blob stored as one credential and split sandbox-side. Rejected: the format coupling leaks into the sandbox, and standard AWS SDKs expect three separate env vars.
  3. External refresh only: a sidecar/cron calls STS and pushes the result via provider update. Works today with zero code changes, but moves credential lifecycle management outside OpenShell.

The chosen approach (multi-key MintedCredential, A1 from the design doc) keeps the model of "one refresh config per logical credential source" which is conceptually correct — STS AssumeRole is one credential source that produces three values.

Patterns to Follow

  • Strategy registration: Follow the GoogleServiceAccountJwt pattern across all four registration points (mint_credential dispatch, is_gateway_mintable_strategy, refresh_strategy_name, serde mapping).
  • V2 gate: Follow the pattern at policy.rs:487-488 — load global settings, call bool_setting_enabled.
  • Profile YAML: Follow github.yaml structure — id, display_name, description, category, credentials, endpoints.
  • Tests: Follow provider_refresh.rs:772+ (test_store() + wiremock for HTTP mocks). For AWS SDK testing, configure the SDK client with a custom endpoint pointing at wiremock.

Proposed Approach

Add aws_sts_assume_role as a new refresh strategy enum value. Extend MintedCredential with an additional_credentials map for multi-key output, and update apply_minted_credential to write all keys atomically in the same CAS. Add a v2 gate in ConfigureProviderRefresh that rejects STS configuration when providers_v2_enabled=false. Ship two built-in profiles: a generic aws profile (credentials only) and an aws-s3 convenience profile with S3-specific endpoint rules. Use aws-sdk-sts for the STS call, supporting both ambient gateway credentials and stored IAM keys.

Scope Assessment

  • Complexity: Medium
  • Confidence: High — clear path forward, well-understood extension points, established patterns to follow
  • Estimated files to change: 8-10 (proto, provider_refresh.rs, provider.rs, policy.rs, profiles.rs, Cargo.toml, 2 new YAML profiles, docs)
  • Issue type: feat

Risks & Open Questions

  • Multi-key collision validation at configure-time. The current handler validates only the primary credential_key at configure-time. For STS, the additional keys (AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN) must also be validated to prevent silent collisions with other providers attached to the same sandbox. This is a gap in the current code that the design doc didn't explicitly call out: the configure handler needs to know about the additional keys before the first mint.
  • Profile category enum. Design doc uses cloud/storage categories that don't exist in the proto enum. Decision needed: add new enum values (proto API surface change) or use other.
  • AWS SDK dependency size. aws-sdk-sts is modular and well-maintained (AWS-published), but adds compile-time cost. This is acceptable as long as the dependency is confined to openshell-server.
  • Role ARN validation. The role_arn material field should be validated against ^arn:aws:iam::\d{12}:role/.+ as a defense against improper input (CWE-20). The STS endpoint is hardcoded by the SDK, but the role ARN is user-provided.
  • Test isolation. The AWS SDK can be configured with a custom endpoint for tests (pointing at wiremock on localhost), which is cleaner than leaking test hooks into production material.
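
For the role ARN bullet above, a dependency-free check that approximates the stated pattern could look like the following. It is a sketch, not the proposed implementation: it is slightly stricter than the regex (ASCII digits only, and it rejects whitespace in the role path), and the function name is hypothetical.

```rust
/// Approximates ^arn:aws:iam::\d{12}:role/.+ without a regex dependency.
/// Stricter than the pattern in two ways: the account ID must be ASCII
/// digits, and the role path may not contain whitespace.
fn is_valid_role_arn(arn: &str) -> bool {
    // Fixed prefix; the empty region field is the "::" before the account ID.
    let Some(rest) = arn.strip_prefix("arn:aws:iam::") else {
        return false;
    };
    // Everything up to the next colon must be the 12-digit account ID.
    let Some((account, resource)) = rest.split_once(':') else {
        return false;
    };
    let Some(role_path) = resource.strip_prefix("role/") else {
        return false;
    };
    account.len() == 12
        && account.bytes().all(|b| b.is_ascii_digit())
        && !role_path.is_empty()
        && !role_path.chars().any(char::is_whitespace)
}
```

Note that the pattern as written accepts only the `aws` partition; ARNs in other partitions (for example `aws-cn` or `aws-us-gov`) would be rejected by both the regex and this sketch.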

Test Considerations

  • V2 gate tests: ConfigureProviderRefresh with aws_sts_assume_role must fail when providers_v2_enabled=false and succeed when true.
  • Unit tests: STS material validation in profile serde; multi-key MintedCredential application; strategy dispatch for aws_sts_assume_role; additional_credentials CAS write atomicity; role ARN validation.
  • Integration tests: ConfigureProviderRefresh with STS strategy stores correct material; RotateProviderCredential against a mock STS endpoint writes all three credential keys; refresh worker picks up STS refresh state; expired STS credentials fail closed; provider env revision changes on refresh; profile-based policy layers for aws-s3 endpoints are composed into effective sandbox policy; multi-key collision validation at configure-time.
  • CLI integration tests: provider refresh configure with STS material; provider refresh rotate for STS; provider refresh status display for STS credentials.
  • E2E tests (stretch): Running sandbox with STS-refreshed provider, fake STS endpoint, verify all three env vars appear and rotate on refresh.
  • Existing test patterns: Follow provider_refresh.rs:772+ (test_store + wiremock) and provider.rs:1443+ (test_server_state + handler functions). For AWS SDK mocking, configure the SDK with a custom endpoint pointing at wiremock.

Created by spike investigation. Design doc at architecture/plans/2026-05-26-aws-sts-refresh-strategy-design.md. Use build-from-issue to plan and implement.

Labels: state:triage-needed (opened without agent diagnostics and needs triage)
