Skip to content

feat(server): declare gRPC auth (mode + scope + role) at the handler, enforce at the router#1596

Merged
TaylorMutch merged 7 commits into
NVIDIA:mainfrom
mrunalp:feat/per-handler-rpc-auth
May 27, 2026
Merged

feat(server): declare gRPC auth (mode + scope + role) at the handler, enforce at the router#1596
TaylorMutch merged 7 commits into
NVIDIA:mainfrom
mrunalp:feat/per-handler-rpc-auth

Conversation

@mrunalp
Copy link
Copy Markdown
Collaborator

@mrunalp mrunalp commented May 27, 2026

Summary

Move gRPC auth metadata (auth mode, Bearer scope, required role) from four
hand-maintained constants into per-handler #[rpc_auth(...)] annotations
generated by a new proc macro, and close the asymmetric router enforcement
where a Principal::User could reach handlers intended for sandbox
supervisors. A descriptor-set-driven exhaustiveness test pins the surface
so a new RPC can't silently fall back to openshell:all.

Related Issue

Fixes #1586

Changes

  • New proc-macro crate openshell-server-macros exposing #[rpc_authz]
    (impl-level) and #[rpc_auth] (per-method) attributes. First proc macro
    in the workspace; intentionally small and focused on auth metadata.
  • Per-handler annotations on every RPC in OpenShellService and
    InferenceService, declaring auth = "unauthenticated" | "sandbox" | "bearer" | "dual" and (when bearer / dual) the required scope and
    role. The macro derives canonical gRPC paths from the proto service
    name + PascalCased method name, so paths cannot drift from the proto.
  • Aggregator + lookup in auth/method_authz.rs exposing
    lookup, required_scope, required_role, is_unauthenticated,
    is_sandbox_callable, and the new is_user_callable. Single source of
    truth for the four old constants.
  • Router-side AuthMode check in multiplex.rs: Principal::User is
    now rejected with PermissionDenied: this method requires a sandbox principal for methods declared sandbox (and for unauthenticated
    methods which would already short-circuit). Mirrors the existing
    is_sandbox_callable check on Principal::Sandbox. Closes the gap
    where GetSandboxProviderEnvironment, ReportPolicyStatus,
    SubmitPolicyAnalysis, PushSandboxLogs, ConnectSupervisor, and
    RelayStream were reachable by a user token because their handlers
    use ensure_sandbox_scope (which intentionally lets users through) or
    no guard at all.
  • Deleted SCOPED_METHODS and ADMIN_METHODS from auth/authz.rs,
    UNAUTHENTICATED_METHODS from auth/oidc.rs, and
    ALLOWED_SANDBOX_METHODS from auth/sandbox_methods.rs. Their public
    predicates (is_unauthenticated_method, is_sandbox_callable) now
    delegate to the aggregator; the existing unit tests keep passing.
    UNAUTHENTICATED_PREFIXES stays — prefix matching for
    /grpc.reflection.* and /grpc.health.* is structural, not per-method.
  • Compile-time enforcement. #[rpc_authz] fails compilation on:
    missing #[rpc_auth], scope/role on unauthenticated/sandbox
    methods, missing scope/role on bearer/dual methods, duplicate
    gRPC paths within a service, duplicate keys inside one #[rpc_auth],
    and invalid auth mode/role strings.
  • Descriptor-set exhaustiveness test. openshell-core/build.rs now
    calls tonic_build::configure().file_descriptor_set_path(...) and
    exposes openshell_core::FILE_DESCRIPTOR_SET. A new test in
    openshell-server enumerates every (service, method) from the
    descriptor and asserts it is covered exactly once by a MethodAuth
    entry. Catches new RPCs without annotations, stale annotations after
    renames, and duplicates across services.
  • Router regression test asserting an openshell-admin + openshell:all
    bearer user is denied on each sandbox-annotated method.

Auth model

Auth mode Principal::Sandbox accepted? Bearer accepted? Scope applies? Role applies?
unauthenticated n/a n/a no no
sandbox yes no no no
bearer no yes yes yes
dual yes yes Bearer path only Bearer path only

sandbox here refers to the per-sandbox gateway-minted JWT introduced in
#1404 — the old shared sandbox secret no longer exists. A handler
annotated sandbox authenticates as a specific Principal::Sandbox.

Backwards compatibility

Visible behavior changes for deployed gateways:

  • Six methods (GetSandboxProviderEnvironment, ReportPolicyStatus,
    SubmitPolicyAnalysis, PushSandboxLogs, ConnectSupervisor,
    RelayStream) start rejecting Bearer users at the router. Nothing in
    the CLI or any user-facing flow calls them; only sandbox supervisors
    do, via the per-sandbox JWT path.
  • A handful of provider-profile methods and ExecSandboxInteractive that
    previously fell back to openshell:all now have explicit scope/role
    declarations. openshell:all tokens still work; provider:read-only
    tokens gain access to ListProviderProfiles / GetProviderProfile.

Testing

  • mise run pre-commit passes (rust:lint, rust:format, rust:check,
    helm:lint, helm:docs:check — all green; the only mise run ci failures
    are markdown lint errors in pre-existing files outside this branch).
  • Unit tests added/updated:
    • auth::method_authz::tests — three exhaustiveness tests
      (every_proto_rpc_has_an_annotation,
      every_annotated_path_matches_a_real_rpc,
      no_duplicate_paths_across_services) plus user_callable_matches_auth_mode.
    • multiplex::tests::auth_router::user_principal_is_denied_on_sandbox_only_methods
      — proves the router rejects admin + openshell:all on all nine
      sandbox-only methods.
    • Existing tests in authz.rs, oidc.rs, sandbox_methods.rs,
      multiplex.rs continue to pass against the new aggregator.
  • E2E tests added/updated (if applicable): N/A — no new e2e harness
    added in this PR; existing e2e:kubernetes smoke run continues to
    exercise the auth path through the gateway.
  • Full workspace test suite: 2509 tests pass, 0 failures.

Manual verification

Verified end-to-end against a local OIDC + Keycloak deployment.

Method Before this PR After this PR
GetSandboxProviderEnvironment NotFound: sandbox not found (handler reached, store queried) PermissionDenied: this method requires a sandbox principal
ReportPolicyStatus InvalidArgument: sandbox_id is required PermissionDenied
SubmitPolicyAnalysis InvalidArgument: name is required PermissionDenied
PushSandboxLogs {}stream accepted, push succeeded PermissionDenied
ConnectSupervisor InvalidArgument: expected SupervisorHello (bidi opened) PermissionDenied
RelayStream InvalidArgument: first RelayFrame must be init… (bidi opened) PermissionDenied
IssueSandboxToken / RefreshSandboxToken / GetInferenceBundle PermissionDenied (handler-level guard) PermissionDenied (router-level)

Positive paths (reader token can ListSandboxes, writer token reaches
CreateSandbox validation, admin + provider:write can CreateProvider
/ ListProviders / DeleteProvider, scope-only and role-only denials
return the existing AuthzPolicy messages) are unchanged.

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@mrunalp mrunalp requested review from a team, derekwaynecarr and maxamillion as code owners May 27, 2026 16:42
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@TaylorMutch
Copy link
Copy Markdown
Collaborator

/ok to test 5de68c4

mrunalp added 4 commits May 27, 2026 10:21
Move scope, role, and auth-mode metadata to the handler definition site
via #[rpc_authz] + #[rpc_auth] proc macros. The previously hand-maintained
SCOPED_METHODS, ADMIN_METHODS, UNAUTHENTICATED_METHODS, and
ALLOWED_SANDBOX_METHODS tables are now generated from per-method
annotations on the tonic service impls, with canonical gRPC paths
derived from the service name and method name.

Adds a new openshell-server-macros proc-macro crate, an aggregator in
auth/method_authz.rs, and an exhaustiveness test that decodes the
protobuf FileDescriptorSet (now emitted by openshell-core/build.rs) and
verifies every RPC has an annotation.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
PR NVIDIA#1404 replaced the shared sandbox secret with per-sandbox
gateway-minted JWTs. A handler marked `sandbox` now authenticates as a
specific `Principal::Sandbox`, not as a holder of a shared credential.

Rename `auth = "sandbox-secret"` to `auth = "sandbox"` and
`AuthMode::SandboxSecret` to `AuthMode::Sandbox` so the name matches
the post-NVIDIA#1404 identity model.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
Addresses review feedback on the per-handler auth-annotation work.

- Router-level enforcement of #[rpc_auth] auth mode (HIGH). The previous
  router only checked is_sandbox_callable() for Principal::Sandbox; user
  principals still flowed into AuthzPolicy::check() and bypassed the
  per-handler declaration. A user with `openshell:all` could therefore
  reach `sandbox`-only handlers like GetSandboxProviderEnvironment,
  ReportPolicyStatus, PushSandboxLogs, and SubmitPolicyAnalysis even
  though their annotations said sandbox-only. Adds an
  is_user_callable() predicate and rejects User principals at the
  router for `sandbox` / `unauthenticated` methods.

- Proc macro now errors on duplicate keys in #[rpc_auth(...)] (LOW). A
  second `auth`, `scope`, or `role` previously silently overwrote the
  first value; now it fails to compile.

- Regression tests: a unit test for is_user_callable() and a router
  test that proves a user with admin role + openshell:all cannot reach
  the nine sandbox-only handlers.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
…hz doc comments

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
@mrunalp mrunalp force-pushed the feat/per-handler-rpc-auth branch from 5de68c4 to 0f54133 Compare May 27, 2026 17:22
Comment thread crates/openshell-server/src/inference.rs Outdated
Comment thread crates/openshell-server-macros/src/lib.rs Outdated
mrunalp added 2 commits May 27, 2026 11:19
The stub was a safety net that fired only when a method had
`#[rpc_auth(...)]` without an enclosing `#[rpc_authz]`. Triggering it
required `rpc_auth` to be imported, which is why both call sites carried
`#[allow(unused_imports)] use openshell_server_macros::{rpc_auth, rpc_authz};`.

Drop the stub and the unused-import workaround. A missing `#[rpc_authz]`
now surfaces as rustc's standard "cannot find attribute `rpc_auth` in
this scope" — clear enough, and one fewer import + lint exception.

Addresses review comment on PR NVIDIA#1596.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
The previous trait-derived const name turned `OpenShell` into
`OPEN_SHELL_AUTH_METADATA`, splitting the project name across an
underscore. Each impl already lives in its own module
(`crate::grpc::`, `crate::inference::`), so the module path is enough
to disambiguate between services — a fixed `AUTH_METADATA` name reads
more naturally.

Aggregator in `auth/method_authz.rs` now references
`crate::grpc::AUTH_METADATA` and `crate::inference::AUTH_METADATA`
directly.

Addresses review comment on PR NVIDIA#1596.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
@TaylorMutch
Copy link
Copy Markdown
Collaborator

/ok to test a8f611e

@TaylorMutch TaylorMutch added the test:e2e Requires end-to-end coverage label May 27, 2026
@github-actions
Copy link
Copy Markdown

Label test:e2e applied for a8f611e. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute the standard E2E suite after building the required gateway and supervisor images once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

Comment thread crates/openshell-server-macros/src/lib.rs Outdated
TaylorMutch
TaylorMutch previously approved these changes May 27, 2026
OpenShell is one word; reference name in the doc should be
OPENSHELL_AUTH_METADATA, not OPEN_SHELL_AUTH_METADATA.

Addresses review nit on PR NVIDIA#1596.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
@TaylorMutch
Copy link
Copy Markdown
Collaborator

/ok to test cd2669f

@TaylorMutch TaylorMutch merged commit 3f520dd into NVIDIA:main May 27, 2026
37 checks passed
mrunalp added a commit to mrunalp/OpenShell that referenced this pull request May 28, 2026
…ust clients

Addresses three findings from the branch review (`feat-python-sdk-bearer-auth-review.md`):

Finding 1 (HIGH): HTTPS OIDC gateways without a full mTLS bundle were
falling back to `grpc.insecure_channel`. Made `TlsConfig.ca_path`,
`cert_path`, and `key_path` all optional with the cert/key pair
required-together-or-not-at-all, so callers can express:

- Full mTLS (all three): server trusts client identity.
- CA-only (`ca_path` only): custom CA trust, no client identity.
- System roots (`TlsConfig()`): OS trust store; the right default for
  OIDC gateways behind a public CA.

`from_active_cluster` now mirrors `crates/openshell-tui/src/lib.rs`
`build_oidc_channel`: for any `https://` gateway, always build a
secure channel and pick the strongest TLS profile available
(mTLS → CA-only → system roots).

Finding 2 (MEDIUM): `from_active_cluster` snapshotted the access token
once. Replaced with `_make_cluster_bearer_provider`, a per-RPC closure
that re-reads `oidc_token.json` each call. A long-lived
`SandboxClient` now picks up rotations performed by `openshell gateway
login` without being reconstructed. Provider fails closed with
`SandboxError` (and a "re-authenticate with: openshell gateway login"
hint) when the token file is missing, malformed, or expired.

Finding 3 (MEDIUM): `from_active_cluster` was attaching bearer
metadata whenever `oidc_token.json` existed, even for gateways
registered as `mtls` or `plaintext`. Now gates on
`metadata.json.auth_mode == "oidc"`, matching
`crates/openshell-cli/src/main.rs` and the TUI.

Test coverage expands 14 → 23: HTTPS-OIDC-without-mTLS, CA-only
layout, stale-token-with-wrong-auth-mode, per-RPC reload, expired
token rejection, missing-file rejection, partial-TlsConfig validation,
and the existing channel/interceptor matrix.

Verified end-to-end against the OpenShift Keycloak deployment:
positive admin / reader paths, scope enforcement, and PR NVIDIA#1596
sandbox-principal gate all still pass via the SDK.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
mrunalp added a commit to mrunalp/OpenShell that referenced this pull request May 28, 2026
PR NVIDIA#1596 hardened the gateway side of the OIDC story; the Python SDK
was the remaining gap — it only supported plaintext or mTLS, with no
Bearer metadata anywhere. Deployments with OIDC enabled (the
recommended posture since PR NVIDIA#935 / PR NVIDIA#1404) were unreachable from
the SDK.

Adds:

- `bearer_token: str | Callable[[], str] | None` kwarg on
  `SandboxClient`. Static strings or zero-arg callables (the latter
  is invoked per RPC, so callers can drop in a refresh loop or
  token-file watcher without reconstructing the client). Composes
  with `tls` for OIDC-over-mTLS deployments.
- `_BearerAuthInterceptor` implementing all four
  `grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types.
  Appends `authorization: Bearer <token>` to outgoing metadata.
  Implemented as an interceptor (not call credentials) so it works
  on both plaintext (`disableTls=true` dev) and TLS channels without
  `grpc.composite_channel_credentials`.
- `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`,
  `key_path`) are now optional with `cert_path` / `key_path`
  required-together-or-not-at-all (enforced in `__post_init__`). This
  unlocks three transport profiles from one dataclass:
    * full mTLS (all three)
    * CA-only trust (`ca_path` only)
    * system roots (`TlsConfig()` — for OIDC gateways behind a
      public CA)
- `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs`
  `build_oidc_channel`:
    * For any `https://` gateway, always build a secure channel.
      Pick the strongest TLS profile available in `mtls/` (full
      mTLS → CA-only → system roots). No more `insecure_channel`
      fallback for HTTPS.
    * Gate OIDC bearer attachment on
      `metadata.json["auth_mode"] == "oidc"`. Matches
      `crates/openshell-cli/src/main.rs:132` and the TUI; a stale
      `oidc_token.json` next to a non-OIDC gateway no longer causes
      the SDK to attach a bearer.
    * Use `_make_cluster_bearer_provider` — a per-RPC closure that
      reads `oidc_token.json` on every invocation, returning the
      current `access_token` if fresh and raising `SandboxError`
      with a "re-authenticate with: openshell gateway login" hint
      if the token is missing, malformed, or expired (the 30 s
      grace window matches
      `openshell-bootstrap::oidc_token::is_token_expired`). A
      long-lived `SandboxClient` now picks up token rotations done
      by the CLI without being reconstructed. OAuth2 refresh itself
      stays in the CLI; the SDK only consumes what's on disk.

Tested:

- 23 SDK unit tests pass (5 existing + 18 new across the bearer
  interceptor, token provider, `TlsConfig` validation, and the
  `from_active_cluster` auth ladder). `mise run test:python` →
  31 passed total.
- `mise run python:lint` (ruff) clean.
- End-to-end against a Keycloak-protected gateway on OpenShift
  (deploy recipe at `architecture/plans/deploy-openshift.md`):
    * unauthenticated `Health` bypass works
    * admin + `openshell:all` reaches user-callable methods
    * reader (`sandbox:read`) denied on `CreateSandbox` by scope
    * admin + `openshell:all` denied on PR NVIDIA#1596 sandbox-only
      methods at the router (the new gate is honored from the SDK)
    * full provider CRUD lifecycle via the SDK
    * callable token provider rotates per RPC as expected
- Regression-probed against four pre-PR failure modes:
    * `https://` OIDC gateway without `mtls/` no longer falls back
      to `insecure_channel`
    * CA-only `mtls/ca.crt` layout no longer raises
      `FileNotFoundError`
    * plaintext gateway with stale `oidc_token.json` no longer gets
      a bearer attached
    * long-lived client picks up rotated tokens; expired tokens
      surface as `SandboxError`, not silent gateway 401s

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
mrunalp added a commit to mrunalp/OpenShell that referenced this pull request May 28, 2026
PR NVIDIA#1596 hardened the gateway side of the OIDC story; the Python SDK
was the remaining gap — it only supported plaintext or mTLS, with no
Bearer metadata anywhere. Deployments with OIDC enabled (the
recommended posture since PR NVIDIA#935 / PR NVIDIA#1404) were unreachable from
the SDK.

Adds:

- `bearer_token: str | Callable[[], str] | None` kwarg on
  `SandboxClient`. Static strings or zero-arg callables (the latter
  is invoked per RPC, so callers can drop in a refresh loop or
  token-file watcher without reconstructing the client). Composes
  with `tls` for OIDC-over-mTLS deployments.
- `_BearerAuthInterceptor` implementing all four
  `grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types.
  Appends `authorization: Bearer <token>` to outgoing metadata.
  Implemented as an interceptor (not call credentials) so it works
  on both plaintext (`disableTls=true` dev) and TLS channels without
  `grpc.composite_channel_credentials`.
- `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`,
  `key_path`) are now optional with `cert_path` / `key_path`
  required-together-or-not-at-all (enforced in `__post_init__`). This
  unlocks three transport profiles from one dataclass:
    * full mTLS (all three)
    * CA-only trust (`ca_path` only)
    * system roots (`TlsConfig()` — for OIDC gateways behind a
      public CA)
- `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs`
  `build_oidc_channel`:
    * For any `https://` gateway, always build a secure channel.
      Pick the strongest TLS profile available in `mtls/` (full
      mTLS → CA-only → system roots). No more `insecure_channel`
      fallback for HTTPS.
    * Gate OIDC bearer attachment on
      `metadata.json["auth_mode"] == "oidc"`. Matches
      `crates/openshell-cli/src/main.rs:132` and the TUI; a stale
      `oidc_token.json` next to a non-OIDC gateway no longer causes
      the SDK to attach a bearer.
- `_OidcRefresher` — thread-safe, in-process native OAuth2 refresh
  modeled on `google.oauth2.credentials.Credentials` and
  `botocore.tokens.SSOTokenProvider`. Lazily checks expiry on every
  RPC; when stale, re-reads disk first (the CLI may have rotated
  the bundle), and only then exchanges the refresh_token against
  the IdP's token endpoint discovered via OIDC discovery
  (`/.well-known/openid-configuration`, cached after first call).
  Concurrent RPCs share a single refresh via `threading.Lock` (no
  IdP stampede). Honors refresh-token rotation. Surfaces IdP
  failures as `SandboxError` with the IdP's error body included for
  diagnostics.
- `_make_cluster_bearer_provider(..., auto_refresh=True, write_back=False)`
  factory. Default is the refresher path; `auto_refresh=False` falls
  back to the read-only fail-closed behavior for callers that don't
  want the SDK to make outbound HTTP calls to the IdP.
  `write_back=True` (opt-in) atomically persists the rotated bundle
  with 0600 mode so other processes — including the Rust CLI — see
  the rotation. Off by default; treats the Rust CLI as the canonical
  writer.
- `from_active_cluster` exposes `auto_refresh` / `write_back`
  kwargs (defaults: True / False).

OAuth2 refresh refresh policy and write-back semantics deliberately
mirror what the major Python SDKs do — see
github.com/googleapis/google-auth-library-python (`Credentials`)
and github.com/boto/botocore (`SSOTokenProvider`):

| Library                       | Native refresh | Writes back |
|-------------------------------|----------------|-------------|
| google-auth Credentials       | yes            | no          |
| botocore SSOTokenProvider     | yes            | yes         |
| openshell SandboxClient (here)| yes (opt-out)  | opt-in      |

Refresh in the SDK is the production answer because:

- Long-running Python orchestrators (agent runs, data pipelines)
  outlast a Keycloak 1-hour access token. Without in-SDK refresh,
  they crash at expiry.
- Headless containers (sandbox-controller pods, GitHub Actions
  runners) may not have the Rust CLI installed but always have
  Python and a refresh_token.
- Subprocess-to-CLI per RPC would spawn `openshell` on every gRPC
  call, including hot streaming paths. Unacceptable.

The Rust CLI keeps owning interactive flows (browser/device-code,
keyring storage, the initial login). The SDK owns refresh during
script execution.

Tested:

- 32 SDK unit tests pass (5 existing + 27 new across the bearer
  interceptor, fail-closed provider, refresher behavior, `TlsConfig`
  validation, `from_active_cluster` auth ladder, and the refresher's
  concurrency / rotation / write-back / error paths).
  `mise run test:python` → 40 passed total.
- `mise run python:lint` (ruff) clean.
- End-to-end against a Keycloak-protected gateway on OpenShift:
    * unauthenticated `Health` bypass works
    * admin + `openshell:all` reaches user-callable methods
    * reader (`sandbox:read`) denied on `CreateSandbox` by scope
    * admin + `openshell:all` denied on PR NVIDIA#1596 sandbox-only
      methods at the router (the new gate is honored from the SDK)
    * full provider CRUD lifecycle via the SDK
    * callable token provider rotates per RPC as expected
- Regression-probed against the four pre-review failure modes:
    * `https://` OIDC gateway without `mtls/` no longer falls back
      to `insecure_channel`
    * CA-only `mtls/ca.crt` layout no longer raises
      `FileNotFoundError`
    * plaintext gateway with stale `oidc_token.json` no longer gets
      a bearer attached
    * long-lived client picks up rotated tokens; expired tokens
      surface as `SandboxError`, not silent gateway 401s
- Refresher unit tests cover: cached-fresh fast path, disk-rotated
  re-read before refresh, OAuth2 exchange against the discovered
  token endpoint, refresh-token rotation, atomic write-back at
  0600 mode, concurrent N-thread coordination (one refresh shared
  across 8 threads), IdP failure surfaced with error body, and the
  client_credentials / no-refresh_token error path.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
mrunalp added a commit to mrunalp/OpenShell that referenced this pull request May 28, 2026
PR NVIDIA#1596 hardened the gateway side of the OIDC story; the Python SDK
was the remaining gap — it only supported plaintext or mTLS, with no
Bearer metadata anywhere. Deployments with OIDC enabled (the
recommended posture since PR NVIDIA#935 / PR NVIDIA#1404) were unreachable from
the SDK.

Adds:

- `bearer_token: str | Callable[[], str] | None` kwarg on
  `SandboxClient`. Static strings or zero-arg callables (the latter
  is invoked per RPC, so callers can drop in a refresh loop or
  token-file watcher without reconstructing the client). Composes
  with `tls` for OIDC-over-mTLS deployments.
- `_BearerAuthInterceptor` implementing all four
  `grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types.
  Appends `authorization: Bearer <token>` to outgoing metadata.
  Implemented as an interceptor (not call credentials) so it works
  on both plaintext (`disableTls=true` dev) and TLS channels without
  `grpc.composite_channel_credentials`.
- `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`,
  `key_path`) are now optional with `cert_path` / `key_path`
  required-together-or-not-at-all (enforced in `__post_init__`). This
  unlocks three transport profiles from one dataclass:
    * full mTLS (all three)
    * CA-only trust (`ca_path` only)
    * system roots (`TlsConfig()` — for OIDC gateways behind a
      public CA)
- `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs`
  `build_oidc_channel`:
    * For any `https://` gateway, always build a secure channel.
      Pick the strongest TLS profile available in `mtls/` (full
      mTLS → CA-only → system roots). No more `insecure_channel`
      fallback for HTTPS.
    * Gate OIDC bearer attachment on
      `metadata.json["auth_mode"] == "oidc"`. Matches
      `crates/openshell-cli/src/main.rs:132` and the TUI; a stale
      `oidc_token.json` next to a non-OIDC gateway no longer causes
      the SDK to attach a bearer.
- `_OidcRefresher` — thread-safe, in-process native OAuth2 refresh
  modeled on `google.oauth2.credentials.Credentials` and
  `botocore.tokens.SSOTokenProvider`. Lazily checks expiry on every
  RPC; when stale, re-reads disk first (the CLI may have rotated
  the bundle), and only then exchanges the refresh_token against
  the IdP's token endpoint discovered via OIDC discovery
  (`/.well-known/openid-configuration`, cached after first call).
  Concurrent RPCs share a single refresh via `threading.Lock` (no
  IdP stampede). Honors refresh-token rotation. Surfaces IdP
  failures as `SandboxError` with the RFC 6749 error body included
  for diagnostics.

  Mirrors the Rust CLI's HTTP-policy posture from
  `crates/openshell-cli/src/oidc_auth.rs`:
    * `follow_redirects=False` so a 3xx during discovery can't
      steer us to an attacker-controlled token endpoint.
    * Discovery `issuer` is validated against the configured
      issuer; a discovery document claiming a different issuer is
      rejected, preventing the SDK from POSTing the refresh_token
      to a malicious endpoint.
    * `insecure: bool` flag plumbed through to httpx's
      `verify=` so self-signed-cert deployments work the same way
      they do in the Rust CLI.

  Built on `httpx` (chosen over `urllib` specifically for
  follow_redirects + verify control as kwargs). The OAuth2
  refresh-token grant itself (RFC 6749 §6) is one form-encoded
  POST — handled inline rather than via a dedicated OAuth library;
  tried `authlib`'s `OAuth2Client` first but it auto-injects an
  Authorization header on every request, which breaks the
  unauthenticated discovery GET.
- `_make_cluster_bearer_provider(..., auto_refresh=True,
  write_back=True, insecure=False)` factory. Defaults to the
  refresher path with write-back enabled; `auto_refresh=False`
  falls back to the read-only fail-closed behavior for callers that
  don't want the SDK to make outbound HTTP calls to the IdP.

  `write_back=True` is the default (changed from the first round of
  review): IdPs with refresh-token rotation (Keycloak with
  rotation, Entra in strict mode) invalidate the old refresh_token
  on each refresh, so an in-memory-only refresh would leave the
  on-disk bundle pointing at an invalidated value — any second
  process starting from disk would `invalid_grant`. With write-back
  enabled by default, the SDK keeps the shared cache consistent
  with the IdP.
- `from_active_cluster` exposes `auto_refresh`, `write_back`, and
  `insecure` kwargs (defaults: True / True / False).

OAuth2 refresh policy and write-back semantics deliberately mirror
what the major Python SDKs do — see
github.com/googleapis/google-auth-library-python (`Credentials`)
and github.com/boto/botocore (`SSOTokenProvider`):

| Library                       | Native refresh | Writes back |
|-------------------------------|----------------|-------------|
| google-auth Credentials       | yes            | no          |
| botocore SSOTokenProvider     | yes            | yes         |
| openshell SandboxClient (here)| yes (opt-out)  | yes (opt-out)|

OpenShell sits between the two; chose write-back-by-default because
the rotation invariant matters more for our deployments than the
"CLI is the only writer" assumption that fits google-auth.

Adds `httpx>=0.27` as a runtime dependency. No new OAuth2 library —
the refresh grant is a single POST.

Tested:

- 36 SDK unit tests pass (5 existing + 31 across the bearer
  interceptor, fail-closed provider, refresher behavior, TlsConfig
  validation, from_active_cluster auth ladder, and three
  regression tests for the security review findings).
  `mise run test:python` → 44 passed total.
- `mise run python:lint` (ruff) clean.
- End-to-end against a Keycloak-protected gateway on OpenShift:
    * unauthenticated `Health` bypass works
    * admin + `openshell:all` reaches user-callable methods
    * reader (`sandbox:read`) denied on `CreateSandbox` by scope
    * admin + `openshell:all` denied on PR NVIDIA#1596 sandbox-only
      methods at the router (the new gate is honored from the SDK)
    * full provider CRUD lifecycle via the SDK
    * callable token provider rotates per RPC as expected
- Regression-probed against three pre-review security findings:
    * **Discovery issuer validation** — a discovery document
      claiming a different `issuer` than the configured one is
      rejected with a clear `SandboxError` before any refresh POST
      can reach the attacker-controlled endpoint.
    * **Redirect during discovery** — `follow_redirects=False` on
      the underlying httpx client means a 3xx during discovery
      surfaces as a SandboxError rather than silently chasing the
      redirect.
    * **Cross-process rotation** — a two-process simulation shows
      process B starting from disk and successfully refreshing
      with the rotated refresh_token, because process A's
      write-back updated the shared cache.
- Refresher unit tests cover: cached-fresh fast path, disk-rotated
  re-read before refresh, OAuth2 exchange against the discovered
  token endpoint, refresh-token rotation, atomic write-back at
  0600 mode (default), default-on write_back proven by test,
  concurrent N-thread coordination (one refresh shared across 8
  threads), IdP failure surfaced with the error body, the
  client_credentials / no-refresh_token error path, issuer-
  mismatch rejection, redirect-during-discovery rejection,
  insecure flag plumbing.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
mrunalp added a commit to mrunalp/OpenShell that referenced this pull request May 28, 2026
PR NVIDIA#1596 hardened the gateway side of the OIDC story; the Python SDK
was the remaining gap — it only supported plaintext or mTLS, with no
Bearer metadata anywhere. Deployments with OIDC enabled (the
recommended posture since PR NVIDIA#935 / PR NVIDIA#1404) were unreachable from
the SDK.

Adds:

- `bearer_token: str | Callable[[], str] | None` kwarg on
  `SandboxClient`. Static strings or zero-arg callables (the latter
  is invoked per RPC, so callers can drop in a refresh loop or
  token-file watcher without reconstructing the client). Composes
  with `tls` for OIDC-over-mTLS deployments.
- `_BearerAuthInterceptor` implementing all four
  `grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types.
  Appends `authorization: Bearer <token>` to outgoing metadata.
  Implemented as an interceptor (not call credentials) so it works
  on both plaintext (`disableTls=true` dev) and TLS channels without
  `grpc.composite_channel_credentials`.
- `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`,
  `key_path`) are now optional with `cert_path` / `key_path`
  required-together-or-not-at-all (enforced in `__post_init__`). This
  unlocks three transport profiles from one dataclass:
    * full mTLS (all three)
    * CA-only trust (`ca_path` only)
    * system roots (`TlsConfig()` — for OIDC gateways behind a
      public CA)
- `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs`
  `build_oidc_channel`:
    * For any `https://` gateway, always build a secure channel.
      Pick the strongest TLS profile available in `mtls/` (full
      mTLS → CA-only → system roots). No more `insecure_channel`
      fallback for HTTPS.
    * Gate OIDC bearer attachment on
      `metadata.json["auth_mode"] == "oidc"`. Matches
      `crates/openshell-cli/src/main.rs:132` and the TUI; a stale
      `oidc_token.json` next to a non-OIDC gateway no longer causes
      the SDK to attach a bearer.
- `_OidcRefresher` — thread-safe, in-process native OAuth2 refresh
  modeled on `google.oauth2.credentials.Credentials` and
  `botocore.tokens.SSOTokenProvider`. Lazily checks expiry on every
  RPC; when stale, re-reads disk first (the CLI may have rotated
  the bundle), and only then exchanges the refresh_token against
  the IdP's token endpoint discovered via OIDC discovery
  (`/.well-known/openid-configuration`, cached after first call).
  Concurrent RPCs share a single refresh via `threading.Lock` (no
  IdP stampede). Honors refresh-token rotation. Surfaces IdP
  failures as `SandboxError` with the RFC 6749 error body included
  for diagnostics.

  Mirrors the Rust CLI's HTTP-policy posture from
  `crates/openshell-cli/src/oidc_auth.rs`:
    * `follow_redirects=False` so a 3xx during discovery can't
      steer us to an attacker-controlled token endpoint.
    * Discovery `issuer` is validated against the configured
      issuer; a discovery document claiming a different issuer is
      rejected, preventing the SDK from POSTing the refresh_token
      to a malicious endpoint.
    * `insecure: bool` flag plumbed through to httpx's
      `verify=` so self-signed-cert deployments work the same way
      they do in the Rust CLI.

  Built on `httpx` (chosen over `urllib` specifically for
  follow_redirects + verify control as kwargs). The OAuth2
  refresh-token grant itself (RFC 6749 §6) is one form-encoded
  POST — handled inline rather than via a dedicated OAuth library;
  tried `authlib`'s `OAuth2Client` first but it auto-injects an
  Authorization header on every request, which breaks the
  unauthenticated discovery GET.
- `_make_cluster_bearer_provider(..., auto_refresh=True,
  write_back=True, insecure=False)` factory. Defaults to the
  refresher path with write-back enabled; `auto_refresh=False`
  falls back to the read-only fail-closed behavior for callers that
  don't want the SDK to make outbound HTTP calls to the IdP.

  `write_back=True` is the default (changed from the first round of
  review): IdPs with refresh-token rotation (Keycloak with
  rotation, Entra in strict mode) invalidate the old refresh_token
  on each refresh, so an in-memory-only refresh would leave the
  on-disk bundle pointing at an invalidated value — any second
  process starting from disk would `invalid_grant`. With write-back
  enabled by default, the SDK keeps the shared cache consistent
  with the IdP.
- `from_active_cluster` exposes `auto_refresh`, `write_back`, and
  `insecure` kwargs (defaults: True / True / False).

OAuth2 refresh policy and write-back semantics deliberately mirror
what the major Python SDKs do — see
github.com/googleapis/google-auth-library-python (`Credentials`)
and github.com/boto/botocore (`SSOTokenProvider`):

| Library                       | Native refresh | Writes back |
|-------------------------------|----------------|-------------|
| google-auth Credentials       | yes            | no          |
| botocore SSOTokenProvider     | yes            | yes         |
| openshell SandboxClient (here)| yes (opt-out)  | yes (opt-out)|

OpenShell sits between the two; chose write-back-by-default because
the rotation invariant matters more for our deployments than the
"CLI is the only writer" assumption that fits google-auth.

Adds `httpx>=0.27` as a runtime dependency. No new OAuth2 library —
the refresh grant is a single POST.

Tested:

- 36 SDK unit tests pass (5 existing + 31 across the bearer
  interceptor, fail-closed provider, refresher behavior, TlsConfig
  validation, from_active_cluster auth ladder, and three
  regression tests for the security review findings).
  `mise run test:python` → 44 passed total.
- `mise run python:lint` (ruff) clean.
- End-to-end against a Keycloak-protected gateway on OpenShift:
    * unauthenticated `Health` bypass works
    * admin + `openshell:all` reaches user-callable methods
    * reader (`sandbox:read`) denied on `CreateSandbox` by scope
    * admin + `openshell:all` denied on PR NVIDIA#1596 sandbox-only
      methods at the router (the new gate is honored from the SDK)
    * full provider CRUD lifecycle via the SDK
    * callable token provider rotates per RPC as expected
- Regression-probed against three pre-review security findings:
    * **Discovery issuer validation** — a discovery document
      claiming a different `issuer` than the configured one is
      rejected with a clear `SandboxError` before any refresh POST
      can reach the attacker-controlled endpoint.
    * **Redirect during discovery** — `follow_redirects=False` on
      the underlying httpx client means a 3xx during discovery
      surfaces as a SandboxError rather than silently chasing the
      redirect.
    * **Cross-process rotation** — a two-process simulation shows
      process B starting from disk and successfully refreshing
      with the rotated refresh_token, because process A's
      write-back updated the shared cache.
- Refresher unit tests cover: cached-fresh fast path, disk-rotated
  re-read before refresh, OAuth2 exchange against the discovered
  token endpoint, refresh-token rotation, atomic write-back at
  0600 mode (default), default-on write_back proven by test,
  concurrent N-thread coordination (one refresh shared across 8
  threads), IdP failure surfaced with the error body, the
  client_credentials / no-refresh_token error path, issuer-
  mismatch rejection, redirect-during-discovery rejection,
  insecure flag plumbing.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
st-gr pushed a commit to st-gr/OpenShell that referenced this pull request May 28, 2026
… enforce at the router (NVIDIA#1596)

* feat(server): per-handler gRPC auth annotations

Move scope, role, and auth-mode metadata to the handler definition site
via #[rpc_authz] + #[rpc_auth] proc macros. The previously hand-maintained
SCOPED_METHODS, ADMIN_METHODS, UNAUTHENTICATED_METHODS, and
ALLOWED_SANDBOX_METHODS tables are now generated from per-method
annotations on the tonic service impls, with canonical gRPC paths
derived from the service name and method name.

Adds a new openshell-server-macros proc-macro crate, an aggregator in
auth/method_authz.rs, and an exhaustiveness test that decodes the
protobuf FileDescriptorSet (now emitted by openshell-core/build.rs) and
verifies every RPC has an annotation.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* refactor(server): rename `sandbox-secret` auth mode to `sandbox`

PR NVIDIA#1404 replaced the shared sandbox secret with per-sandbox
gateway-minted JWTs. A handler marked `sandbox` now authenticates as a
specific `Principal::Sandbox`, not as a holder of a shared credential.

Rename `auth = "sandbox-secret"` to `auth = "sandbox"` and
`AuthMode::SandboxSecret` to `AuthMode::Sandbox` so the name matches
the post-NVIDIA#1404 identity model.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* fix(server): enforce per-handler AuthMode at the router

Addresses review feedback on the per-handler auth-annotation work.

- Router-level enforcement of #[rpc_auth] auth mode (HIGH). The previous
  router only checked is_sandbox_callable() for Principal::Sandbox; user
  principals still flowed into AuthzPolicy::check() and bypassed the
  per-handler declaration. A user with `openshell:all` could therefore
  reach `sandbox`-only handlers like GetSandboxProviderEnvironment,
  ReportPolicyStatus, PushSandboxLogs, and SubmitPolicyAnalysis even
  though their annotations said sandbox-only. Adds an
  is_user_callable() predicate and rejects User principals at the
  router for `sandbox` / `unauthenticated` methods.

- Proc macro now errors on duplicate keys in #[rpc_auth(...)] (LOW). A
  second `auth`, `scope`, or `role` previously silently overwrote the
  first value; now it fails to compile.

- Regression tests: a unit test for is_user_callable() and a router
  test that proves a user with admin role + openshell:all cannot reach
  the nine sandbox-only handlers.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* docs(server): finish renaming sandbox-secret to sandbox in method_authz doc comments

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* refactor(server-macros): drop standalone `rpc_auth` stub

The stub was a safety net that fired only when a method had
`#[rpc_auth(...)]` without an enclosing `#[rpc_authz]`. Triggering it
required `rpc_auth` to be imported, which is why both call sites carried
`#[allow(unused_imports)] use openshell_server_macros::{rpc_auth, rpc_authz};`.

Drop the stub and the unused-import workaround. A missing `#[rpc_authz]`
now surfaces as rustc's standard "cannot find attribute `rpc_auth` in
this scope" — clear enough, and one fewer import + lint exception.

Addresses review comment on PR NVIDIA#1596.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* refactor(server-macros): emit fixed `AUTH_METADATA` const per service

The previous trait-derived const name turned `OpenShell` into
`OPEN_SHELL_AUTH_METADATA`, splitting the project name across an
underscore. Each impl already lives in its own module
(`crate::grpc::`, `crate::inference::`), so the module path is enough
to disambiguate between services — a fixed `AUTH_METADATA` name reads
more naturally.

Aggregator in `auth/method_authz.rs` now references
`crate::grpc::AUTH_METADATA` and `crate::inference::AUTH_METADATA`
directly.

Addresses review comment on PR NVIDIA#1596.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* docs(server-macros): fix typo in AUTH_METADATA_CONST doc comment

OpenShell is one word; reference name in the doc should be
OPENSHELL_AUTH_METADATA, not OPEN_SHELL_AUTH_METADATA.

Addresses review nit on PR NVIDIA#1596.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

---------

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:e2e Requires end-to-end coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(server): declare gRPC auth (mode + scope + role) at the handler, enforce at the router

2 participants