feat(server): declare gRPC auth (mode + scope + role) at the handler, enforce at the router by mrunalp · Pull Request #1596 · NVIDIA/OpenShell

mrunalp · 2026-05-27T16:42:58Z

Summary

Move gRPC auth metadata (auth mode, Bearer scope, required role) from four
hand-maintained constants into per-handler #[rpc_auth(...)] annotations
generated by a new proc macro, and close the asymmetric router enforcement
where a Principal::User could reach handlers intended for sandbox
supervisors. A descriptor-set-driven exhaustiveness test pins the surface
so a new RPC can't silently fall back to openshell:all.

Related Issue

Fixes #1586

Changes

New proc-macro crate openshell-server-macros exposing #[rpc_authz]
(impl-level) and #[rpc_auth] (per-method) attributes. First proc macro
in the workspace; intentionally small and focused on auth metadata.
Per-handler annotations on every RPC in OpenShellService and
InferenceService, declaring auth = "unauthenticated" | "sandbox" | "bearer" | "dual" and (when bearer / dual) the required scope and
role. The macro derives canonical gRPC paths from the proto service
name + PascalCased method name, so paths cannot drift from the proto.
Aggregator + lookup in auth/method_authz.rs exposing
lookup, required_scope, required_role, is_unauthenticated,
is_sandbox_callable, and the new is_user_callable. Single source of
truth for the four old constants.
Router-side AuthMode check in multiplex.rs: Principal::User is
now rejected with PermissionDenied: this method requires a sandbox principal for methods declared sandbox (and for unauthenticated
methods which would already short-circuit). Mirrors the existing
is_sandbox_callable check on Principal::Sandbox. Closes the gap
where GetSandboxProviderEnvironment, ReportPolicyStatus,
SubmitPolicyAnalysis, PushSandboxLogs, ConnectSupervisor, and
RelayStream were reachable by a user token because their handlers
use ensure_sandbox_scope (which intentionally lets users through) or
no guard at all.
Deleted SCOPED_METHODS and ADMIN_METHODS from auth/authz.rs,
UNAUTHENTICATED_METHODS from auth/oidc.rs, and
ALLOWED_SANDBOX_METHODS from auth/sandbox_methods.rs. Their public
predicates (is_unauthenticated_method, is_sandbox_callable) now
delegate to the aggregator; the existing unit tests keep passing.
UNAUTHENTICATED_PREFIXES stays — prefix matching for
/grpc.reflection.* and /grpc.health.* is structural, not per-method.
Compile-time enforcement. #[rpc_authz] fails compilation on:
missing #[rpc_auth], scope/role on unauthenticated/sandbox
methods, missing scope/role on bearer/dual methods, duplicate
gRPC paths within a service, duplicate keys inside one #[rpc_auth],
and invalid auth mode/role strings.
Descriptor-set exhaustiveness test. openshell-core/build.rs now
calls tonic_build::configure().file_descriptor_set_path(...) and
exposes openshell_core::FILE_DESCRIPTOR_SET. A new test in
openshell-server enumerates every (service, method) from the
descriptor and asserts it is covered exactly once by a MethodAuth
entry. Catches new RPCs without annotations, stale annotations after
renames, and duplicates across services.
Router regression test asserting an openshell-admin + openshell:all
bearer user is denied on each sandbox-annotated method.

Auth model

Auth mode	`Principal::Sandbox` accepted?	Bearer accepted?	Scope applies?	Role applies?
`unauthenticated`	n/a	n/a	no	no
`sandbox`	yes	no	no	no
`bearer`	no	yes	yes	yes
`dual`	yes	yes	Bearer path only	Bearer path only

sandbox here refers to the per-sandbox gateway-minted JWT introduced in
#1404 — the old shared sandbox secret no longer exists. A handler
annotated sandbox authenticates as a specific Principal::Sandbox.

Backwards compatibility

Visible behavior changes for deployed gateways:

Six methods (GetSandboxProviderEnvironment, ReportPolicyStatus,
SubmitPolicyAnalysis, PushSandboxLogs, ConnectSupervisor,
RelayStream) start rejecting Bearer users at the router. Nothing in
the CLI or any user-facing flow calls them; only sandbox supervisors
do, via the per-sandbox JWT path.
A handful of provider-profile methods and ExecSandboxInteractive that
previously fell back to openshell:all now have explicit scope/role
declarations. openshell:all tokens still work; provider:read-only
tokens gain access to ListProviderProfiles / GetProviderProfile.

Testing

mise run pre-commit passes (rust:lint, rust:format, rust:check,
helm:lint, helm:docs:check — all green; the only mise run ci failures
are markdown lint errors in pre-existing files outside this branch).
Unit tests added/updated:
- auth::method_authz::tests — three exhaustiveness tests
  (every_proto_rpc_has_an_annotation,
  every_annotated_path_matches_a_real_rpc,
  no_duplicate_paths_across_services) plus user_callable_matches_auth_mode.
- multiplex::tests::auth_router::user_principal_is_denied_on_sandbox_only_methods
  — proves the router rejects admin + openshell:all on all nine
  sandbox-only methods.
- Existing tests in authz.rs, oidc.rs, sandbox_methods.rs,
  multiplex.rs continue to pass against the new aggregator.
E2E tests added/updated (if applicable): N/A — no new e2e harness
added in this PR; existing e2e:kubernetes smoke run continues to
exercise the auth path through the gateway.
Full workspace test suite: 2509 tests pass, 0 failures.

Manual verification

Verified end-to-end against a local OIDC + Keycloak deployment.

Method	Before this PR	After this PR
`GetSandboxProviderEnvironment`	`NotFound: sandbox not found` (handler reached, store queried)	`PermissionDenied: this method requires a sandbox principal`
`ReportPolicyStatus`	`InvalidArgument: sandbox_id is required`	`PermissionDenied`
`SubmitPolicyAnalysis`	`InvalidArgument: name is required`	`PermissionDenied`
`PushSandboxLogs`	`{}` — stream accepted, push succeeded	`PermissionDenied`
`ConnectSupervisor`	`InvalidArgument: expected SupervisorHello` (bidi opened)	`PermissionDenied`
`RelayStream`	`InvalidArgument: first RelayFrame must be init…` (bidi opened)	`PermissionDenied`
`IssueSandboxToken` / `RefreshSandboxToken` / `GetInferenceBundle`	`PermissionDenied` (handler-level guard)	`PermissionDenied` (router-level)

Positive paths (reader token can ListSandboxes, writer token reaches
CreateSandbox validation, admin + provider:write can CreateProvider
/ ListProviders / DeleteProvider, scope-only and role-only denials
return the existing AuthzPolicy messages) are unchanged.

Checklist

Follows Conventional Commits
Commits are signed off (DCO)
Architecture docs updated (if applicable)

copy-pr-bot · 2026-05-27T16:43:02Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

TaylorMutch · 2026-05-27T16:53:05Z

/ok to test 5de68c4

Move scope, role, and auth-mode metadata to the handler definition site via #[rpc_authz] + #[rpc_auth] proc macros. The previously hand-maintained SCOPED_METHODS, ADMIN_METHODS, UNAUTHENTICATED_METHODS, and ALLOWED_SANDBOX_METHODS tables are now generated from per-method annotations on the tonic service impls, with canonical gRPC paths derived from the service name and method name. Adds a new openshell-server-macros proc-macro crate, an aggregator in auth/method_authz.rs, and an exhaustiveness test that decodes the protobuf FileDescriptorSet (now emitted by openshell-core/build.rs) and verifies every RPC has an annotation. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

PR NVIDIA#1404 replaced the shared sandbox secret with per-sandbox gateway-minted JWTs. A handler marked `sandbox` now authenticates as a specific `Principal::Sandbox`, not as a holder of a shared credential. Rename `auth = "sandbox-secret"` to `auth = "sandbox"` and `AuthMode::SandboxSecret` to `AuthMode::Sandbox` so the name matches the post-NVIDIA#1404 identity model. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

Addresses review feedback on the per-handler auth-annotation work. - Router-level enforcement of #[rpc_auth] auth mode (HIGH). The previous router only checked is_sandbox_callable() for Principal::Sandbox; user principals still flowed into AuthzPolicy::check() and bypassed the per-handler declaration. A user with `openshell:all` could therefore reach `sandbox`-only handlers like GetSandboxProviderEnvironment, ReportPolicyStatus, PushSandboxLogs, and SubmitPolicyAnalysis even though their annotations said sandbox-only. Adds an is_user_callable() predicate and rejects User principals at the router for `sandbox` / `unauthenticated` methods. - Proc macro now errors on duplicate keys in #[rpc_auth(...)] (LOW). A second `auth`, `scope`, or `role` previously silently overwrote the first value; now it fails to compile. - Regression tests: a unit test for is_user_callable() and a router test that proves a user with admin role + openshell:all cannot reach the nine sandbox-only handlers. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

…hz doc comments Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

The stub was a safety net that fired only when a method had `#[rpc_auth(...)]` without an enclosing `#[rpc_authz]`. Triggering it required `rpc_auth` to be imported, which is why both call sites carried `#[allow(unused_imports)] use openshell_server_macros::{rpc_auth, rpc_authz};`. Drop the stub and the unused-import workaround. A missing `#[rpc_authz]` now surfaces as rustc's standard "cannot find attribute `rpc_auth` in this scope" — clear enough, and one fewer import + lint exception. Addresses review comment on PR NVIDIA#1596. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

The previous trait-derived const name turned `OpenShell` into `OPEN_SHELL_AUTH_METADATA`, splitting the project name across an underscore. Each impl already lives in its own module (`crate::grpc::`, `crate::inference::`), so the module path is enough to disambiguate between services — a fixed `AUTH_METADATA` name reads more naturally. Aggregator in `auth/method_authz.rs` now references `crate::grpc::AUTH_METADATA` and `crate::inference::AUTH_METADATA` directly. Addresses review comment on PR NVIDIA#1596. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

TaylorMutch · 2026-05-27T18:31:35Z

/ok to test a8f611e

github-actions · 2026-05-27T18:32:03Z

Label test:e2e applied for a8f611e. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute the standard E2E suite after building the required gateway and supervisor images once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

OpenShell is one word; reference name in the doc should be OPENSHELL_AUTH_METADATA, not OPEN_SHELL_AUTH_METADATA. Addresses review nit on PR NVIDIA#1596. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

TaylorMutch · 2026-05-27T19:27:09Z

/ok to test cd2669f

…ust clients Addresses three findings from the branch review (`feat-python-sdk-bearer-auth-review.md`): Finding 1 (HIGH): HTTPS OIDC gateways without a full mTLS bundle were falling back to `grpc.insecure_channel`. Made `TlsConfig.ca_path`, `cert_path`, and `key_path` all optional with the cert/key pair required-together-or-not-at-all, so callers can express: - Full mTLS (all three): server trusts client identity. - CA-only (`ca_path` only): custom CA trust, no client identity. - System roots (`TlsConfig()`): OS trust store; the right default for OIDC gateways behind a public CA. `from_active_cluster` now mirrors `crates/openshell-tui/src/lib.rs` `build_oidc_channel`: for any `https://` gateway, always build a secure channel and pick the strongest TLS profile available (mTLS → CA-only → system roots). Finding 2 (MEDIUM): `from_active_cluster` snapshotted the access token once. Replaced with `_make_cluster_bearer_provider`, a per-RPC closure that re-reads `oidc_token.json` each call. A long-lived `SandboxClient` now picks up rotations performed by `openshell gateway login` without being reconstructed. Provider fails closed with `SandboxError` (and a "re-authenticate with: openshell gateway login" hint) when the token file is missing, malformed, or expired. Finding 3 (MEDIUM): `from_active_cluster` was attaching bearer metadata whenever `oidc_token.json` existed, even for gateways registered as `mtls` or `plaintext`. Now gates on `metadata.json.auth_mode == "oidc"`, matching `crates/openshell-cli/src/main.rs` and the TUI. Test coverage expands 14 → 23: HTTPS-OIDC-without-mTLS, CA-only layout, stale-token-with-wrong-auth-mode, per-RPC reload, expired token rejection, missing-file rejection, partial-TlsConfig validation, and the existing channel/interceptor matrix. Verified end-to-end against the OpenShift Keycloak deployment: positive admin / reader paths, scope enforcement, and PR NVIDIA#1596 sandbox-principal gate all still pass via the SDK. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

PR NVIDIA#1596 hardened the gateway side of the OIDC story; the Python SDK was the remaining gap — it only supported plaintext or mTLS, with no Bearer metadata anywhere. Deployments with OIDC enabled (the recommended posture since PR NVIDIA#935 / PR NVIDIA#1404) were unreachable from the SDK. Adds: - `bearer_token: str | Callable[[], str] | None` kwarg on `SandboxClient`. Static strings or zero-arg callables (the latter is invoked per RPC, so callers can drop in a refresh loop or token-file watcher without reconstructing the client). Composes with `tls` for OIDC-over-mTLS deployments. - `_BearerAuthInterceptor` implementing all four `grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types. Appends `authorization: Bearer <token>` to outgoing metadata. Implemented as an interceptor (not call credentials) so it works on both plaintext (`disableTls=true` dev) and TLS channels without `grpc.composite_channel_credentials`. - `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`, `key_path`) are now optional with `cert_path` / `key_path` required-together-or-not-at-all (enforced in `__post_init__`). This unlocks three transport profiles from one dataclass: * full mTLS (all three) * CA-only trust (`ca_path` only) * system roots (`TlsConfig()` — for OIDC gateways behind a public CA) - `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs` `build_oidc_channel`: * For any `https://` gateway, always build a secure channel. Pick the strongest TLS profile available in `mtls/` (full mTLS → CA-only → system roots). No more `insecure_channel` fallback for HTTPS. * Gate OIDC bearer attachment on `metadata.json["auth_mode"] == "oidc"`. Matches `crates/openshell-cli/src/main.rs:132` and the TUI; a stale `oidc_token.json` next to a non-OIDC gateway no longer causes the SDK to attach a bearer. * Use `_make_cluster_bearer_provider` — a per-RPC closure that reads `oidc_token.json` on every invocation, returning the current `access_token` if fresh and raising `SandboxError` with a "re-authenticate with: openshell gateway login" hint if the token is missing, malformed, or expired (the 30 s grace window matches `openshell-bootstrap::oidc_token::is_token_expired`). A long-lived `SandboxClient` now picks up token rotations done by the CLI without being reconstructed. OAuth2 refresh itself stays in the CLI; the SDK only consumes what's on disk. Tested: - 23 SDK unit tests pass (5 existing + 18 new across the bearer interceptor, token provider, `TlsConfig` validation, and the `from_active_cluster` auth ladder). `mise run test:python` → 31 passed total. - `mise run python:lint` (ruff) clean. - End-to-end against a Keycloak-protected gateway on OpenShift (deploy recipe at `architecture/plans/deploy-openshift.md`): * unauthenticated `Health` bypass works * admin + `openshell:all` reaches user-callable methods * reader (`sandbox:read`) denied on `CreateSandbox` by scope * admin + `openshell:all` denied on PR NVIDIA#1596 sandbox-only methods at the router (the new gate is honored from the SDK) * full provider CRUD lifecycle via the SDK * callable token provider rotates per RPC as expected - Regression-probed against four pre-PR failure modes: * `https://` OIDC gateway without `mtls/` no longer falls back to `insecure_channel` * CA-only `mtls/ca.crt` layout no longer raises `FileNotFoundError` * plaintext gateway with stale `oidc_token.json` no longer gets a bearer attached * long-lived client picks up rotated tokens; expired tokens surface as `SandboxError`, not silent gateway 401s Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

PR NVIDIA#1596 hardened the gateway side of the OIDC story; the Python SDK was the remaining gap — it only supported plaintext or mTLS, with no Bearer metadata anywhere. Deployments with OIDC enabled (the recommended posture since PR NVIDIA#935 / PR NVIDIA#1404) were unreachable from the SDK. Adds: - `bearer_token: str | Callable[[], str] | None` kwarg on `SandboxClient`. Static strings or zero-arg callables (the latter is invoked per RPC, so callers can drop in a refresh loop or token-file watcher without reconstructing the client). Composes with `tls` for OIDC-over-mTLS deployments. - `_BearerAuthInterceptor` implementing all four `grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types. Appends `authorization: Bearer <token>` to outgoing metadata. Implemented as an interceptor (not call credentials) so it works on both plaintext (`disableTls=true` dev) and TLS channels without `grpc.composite_channel_credentials`. - `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`, `key_path`) are now optional with `cert_path` / `key_path` required-together-or-not-at-all (enforced in `__post_init__`). This unlocks three transport profiles from one dataclass: * full mTLS (all three) * CA-only trust (`ca_path` only) * system roots (`TlsConfig()` — for OIDC gateways behind a public CA) - `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs` `build_oidc_channel`: * For any `https://` gateway, always build a secure channel. Pick the strongest TLS profile available in `mtls/` (full mTLS → CA-only → system roots). No more `insecure_channel` fallback for HTTPS. * Gate OIDC bearer attachment on `metadata.json["auth_mode"] == "oidc"`. Matches `crates/openshell-cli/src/main.rs:132` and the TUI; a stale `oidc_token.json` next to a non-OIDC gateway no longer causes the SDK to attach a bearer. - `_OidcRefresher` — thread-safe, in-process native OAuth2 refresh modeled on `google.oauth2.credentials.Credentials` and `botocore.tokens.SSOTokenProvider`. Lazily checks expiry on every RPC; when stale, re-reads disk first (the CLI may have rotated the bundle), and only then exchanges the refresh_token against the IdP's token endpoint discovered via OIDC discovery (`/.well-known/openid-configuration`, cached after first call). Concurrent RPCs share a single refresh via `threading.Lock` (no IdP stampede). Honors refresh-token rotation. Surfaces IdP failures as `SandboxError` with the IdP's error body included for diagnostics. - `_make_cluster_bearer_provider(..., auto_refresh=True, write_back=False)` factory. Default is the refresher path; `auto_refresh=False` falls back to the read-only fail-closed behavior for callers that don't want the SDK to make outbound HTTP calls to the IdP. `write_back=True` (opt-in) atomically persists the rotated bundle with 0600 mode so other processes — including the Rust CLI — see the rotation. Off by default; treats the Rust CLI as the canonical writer. - `from_active_cluster` exposes `auto_refresh` / `write_back` kwargs (defaults: True / False). OAuth2 refresh refresh policy and write-back semantics deliberately mirror what the major Python SDKs do — see github.com/googleapis/google-auth-library-python (`Credentials`) and github.com/boto/botocore (`SSOTokenProvider`): | Library | Native refresh | Writes back | |-------------------------------|----------------|-------------| | google-auth Credentials | yes | no | | botocore SSOTokenProvider | yes | yes | | openshell SandboxClient (here)| yes (opt-out) | opt-in | Refresh in the SDK is the production answer because: - Long-running Python orchestrators (agent runs, data pipelines) outlast a Keycloak 1-hour access token. Without in-SDK refresh, they crash at expiry. - Headless containers (sandbox-controller pods, GitHub Actions runners) may not have the Rust CLI installed but always have Python and a refresh_token. - Subprocess-to-CLI per RPC would spawn `openshell` on every gRPC call, including hot streaming paths. Unacceptable. The Rust CLI keeps owning interactive flows (browser/device-code, keyring storage, the initial login). The SDK owns refresh during script execution. Tested: - 32 SDK unit tests pass (5 existing + 27 new across the bearer interceptor, fail-closed provider, refresher behavior, `TlsConfig` validation, `from_active_cluster` auth ladder, and the refresher's concurrency / rotation / write-back / error paths). `mise run test:python` → 40 passed total. - `mise run python:lint` (ruff) clean. - End-to-end against a Keycloak-protected gateway on OpenShift: * unauthenticated `Health` bypass works * admin + `openshell:all` reaches user-callable methods * reader (`sandbox:read`) denied on `CreateSandbox` by scope * admin + `openshell:all` denied on PR NVIDIA#1596 sandbox-only methods at the router (the new gate is honored from the SDK) * full provider CRUD lifecycle via the SDK * callable token provider rotates per RPC as expected - Regression-probed against the four pre-review failure modes: * `https://` OIDC gateway without `mtls/` no longer falls back to `insecure_channel` * CA-only `mtls/ca.crt` layout no longer raises `FileNotFoundError` * plaintext gateway with stale `oidc_token.json` no longer gets a bearer attached * long-lived client picks up rotated tokens; expired tokens surface as `SandboxError`, not silent gateway 401s - Refresher unit tests cover: cached-fresh fast path, disk-rotated re-read before refresh, OAuth2 exchange against the discovered token endpoint, refresh-token rotation, atomic write-back at 0600 mode, concurrent N-thread coordination (one refresh shared across 8 threads), IdP failure surfaced with error body, and the client_credentials / no-refresh_token error path. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

PR NVIDIA#1596 hardened the gateway side of the OIDC story; the Python SDK was the remaining gap — it only supported plaintext or mTLS, with no Bearer metadata anywhere. Deployments with OIDC enabled (the recommended posture since PR NVIDIA#935 / PR NVIDIA#1404) were unreachable from the SDK. Adds: - `bearer_token: str | Callable[[], str] | None` kwarg on `SandboxClient`. Static strings or zero-arg callables (the latter is invoked per RPC, so callers can drop in a refresh loop or token-file watcher without reconstructing the client). Composes with `tls` for OIDC-over-mTLS deployments. - `_BearerAuthInterceptor` implementing all four `grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types. Appends `authorization: Bearer <token>` to outgoing metadata. Implemented as an interceptor (not call credentials) so it works on both plaintext (`disableTls=true` dev) and TLS channels without `grpc.composite_channel_credentials`. - `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`, `key_path`) are now optional with `cert_path` / `key_path` required-together-or-not-at-all (enforced in `__post_init__`). This unlocks three transport profiles from one dataclass: * full mTLS (all three) * CA-only trust (`ca_path` only) * system roots (`TlsConfig()` — for OIDC gateways behind a public CA) - `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs` `build_oidc_channel`: * For any `https://` gateway, always build a secure channel. Pick the strongest TLS profile available in `mtls/` (full mTLS → CA-only → system roots). No more `insecure_channel` fallback for HTTPS. * Gate OIDC bearer attachment on `metadata.json["auth_mode"] == "oidc"`. Matches `crates/openshell-cli/src/main.rs:132` and the TUI; a stale `oidc_token.json` next to a non-OIDC gateway no longer causes the SDK to attach a bearer. - `_OidcRefresher` — thread-safe, in-process native OAuth2 refresh modeled on `google.oauth2.credentials.Credentials` and `botocore.tokens.SSOTokenProvider`. Lazily checks expiry on every RPC; when stale, re-reads disk first (the CLI may have rotated the bundle), and only then exchanges the refresh_token against the IdP's token endpoint discovered via OIDC discovery (`/.well-known/openid-configuration`, cached after first call). Concurrent RPCs share a single refresh via `threading.Lock` (no IdP stampede). Honors refresh-token rotation. Surfaces IdP failures as `SandboxError` with the RFC 6749 error body included for diagnostics. Mirrors the Rust CLI's HTTP-policy posture from `crates/openshell-cli/src/oidc_auth.rs`: * `follow_redirects=False` so a 3xx during discovery can't steer us to an attacker-controlled token endpoint. * Discovery `issuer` is validated against the configured issuer; a discovery document claiming a different issuer is rejected, preventing the SDK from POSTing the refresh_token to a malicious endpoint. * `insecure: bool` flag plumbed through to httpx's `verify=` so self-signed-cert deployments work the same way they do in the Rust CLI. Built on `httpx` (chosen over `urllib` specifically for follow_redirects + verify control as kwargs). The OAuth2 refresh-token grant itself (RFC 6749 §6) is one form-encoded POST — handled inline rather than via a dedicated OAuth library; tried `authlib`'s `OAuth2Client` first but it auto-injects an Authorization header on every request, which breaks the unauthenticated discovery GET. - `_make_cluster_bearer_provider(..., auto_refresh=True, write_back=True, insecure=False)` factory. Defaults to the refresher path with write-back enabled; `auto_refresh=False` falls back to the read-only fail-closed behavior for callers that don't want the SDK to make outbound HTTP calls to the IdP. `write_back=True` is the default (changed from the first round of review): IdPs with refresh-token rotation (Keycloak with rotation, Entra in strict mode) invalidate the old refresh_token on each refresh, so an in-memory-only refresh would leave the on-disk bundle pointing at an invalidated value — any second process starting from disk would `invalid_grant`. With write-back enabled by default, the SDK keeps the shared cache consistent with the IdP. - `from_active_cluster` exposes `auto_refresh`, `write_back`, and `insecure` kwargs (defaults: True / True / False). OAuth2 refresh policy and write-back semantics deliberately mirror what the major Python SDKs do — see github.com/googleapis/google-auth-library-python (`Credentials`) and github.com/boto/botocore (`SSOTokenProvider`): | Library | Native refresh | Writes back | |-------------------------------|----------------|-------------| | google-auth Credentials | yes | no | | botocore SSOTokenProvider | yes | yes | | openshell SandboxClient (here)| yes (opt-out) | yes (opt-out)| OpenShell sits between the two; chose write-back-by-default because the rotation invariant matters more for our deployments than the "CLI is the only writer" assumption that fits google-auth. Adds `httpx>=0.27` as a runtime dependency. No new OAuth2 library — the refresh grant is a single POST. Tested: - 36 SDK unit tests pass (5 existing + 31 across the bearer interceptor, fail-closed provider, refresher behavior, TlsConfig validation, from_active_cluster auth ladder, and three regression tests for the security review findings). `mise run test:python` → 44 passed total. - `mise run python:lint` (ruff) clean. - End-to-end against a Keycloak-protected gateway on OpenShift: * unauthenticated `Health` bypass works * admin + `openshell:all` reaches user-callable methods * reader (`sandbox:read`) denied on `CreateSandbox` by scope * admin + `openshell:all` denied on PR NVIDIA#1596 sandbox-only methods at the router (the new gate is honored from the SDK) * full provider CRUD lifecycle via the SDK * callable token provider rotates per RPC as expected - Regression-probed against three pre-review security findings: * **Discovery issuer validation** — a discovery document claiming a different `issuer` than the configured one is rejected with a clear `SandboxError` before any refresh POST can reach the attacker-controlled endpoint. * **Redirect during discovery** — `follow_redirects=False` on the underlying httpx client means a 3xx during discovery surfaces as a SandboxError rather than silently chasing the redirect. * **Cross-process rotation** — a two-process simulation shows process B starting from disk and successfully refreshing with the rotated refresh_token, because process A's write-back updated the shared cache. - Refresher unit tests cover: cached-fresh fast path, disk-rotated re-read before refresh, OAuth2 exchange against the discovered token endpoint, refresh-token rotation, atomic write-back at 0600 mode (default), default-on write_back proven by test, concurrent N-thread coordination (one refresh shared across 8 threads), IdP failure surfaced with the error body, the client_credentials / no-refresh_token error path, issuer- mismatch rejection, redirect-during-discovery rejection, insecure flag plumbing. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

… enforce at the router (NVIDIA#1596) * feat(server): per-handler gRPC auth annotations Move scope, role, and auth-mode metadata to the handler definition site via #[rpc_authz] + #[rpc_auth] proc macros. The previously hand-maintained SCOPED_METHODS, ADMIN_METHODS, UNAUTHENTICATED_METHODS, and ALLOWED_SANDBOX_METHODS tables are now generated from per-method annotations on the tonic service impls, with canonical gRPC paths derived from the service name and method name. Adds a new openshell-server-macros proc-macro crate, an aggregator in auth/method_authz.rs, and an exhaustiveness test that decodes the protobuf FileDescriptorSet (now emitted by openshell-core/build.rs) and verifies every RPC has an annotation. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * refactor(server): rename `sandbox-secret` auth mode to `sandbox` PR NVIDIA#1404 replaced the shared sandbox secret with per-sandbox gateway-minted JWTs. A handler marked `sandbox` now authenticates as a specific `Principal::Sandbox`, not as a holder of a shared credential. Rename `auth = "sandbox-secret"` to `auth = "sandbox"` and `AuthMode::SandboxSecret` to `AuthMode::Sandbox` so the name matches the post-NVIDIA#1404 identity model. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * fix(server): enforce per-handler AuthMode at the router Addresses review feedback on the per-handler auth-annotation work. - Router-level enforcement of #[rpc_auth] auth mode (HIGH). The previous router only checked is_sandbox_callable() for Principal::Sandbox; user principals still flowed into AuthzPolicy::check() and bypassed the per-handler declaration. A user with `openshell:all` could therefore reach `sandbox`-only handlers like GetSandboxProviderEnvironment, ReportPolicyStatus, PushSandboxLogs, and SubmitPolicyAnalysis even though their annotations said sandbox-only. Adds an is_user_callable() predicate and rejects User principals at the router for `sandbox` / `unauthenticated` methods. - Proc macro now errors on duplicate keys in #[rpc_auth(...)] (LOW). A second `auth`, `scope`, or `role` previously silently overwrote the first value; now it fails to compile. - Regression tests: a unit test for is_user_callable() and a router test that proves a user with admin role + openshell:all cannot reach the nine sandbox-only handlers. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * docs(server): finish renaming sandbox-secret to sandbox in method_authz doc comments Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * refactor(server-macros): drop standalone `rpc_auth` stub The stub was a safety net that fired only when a method had `#[rpc_auth(...)]` without an enclosing `#[rpc_authz]`. Triggering it required `rpc_auth` to be imported, which is why both call sites carried `#[allow(unused_imports)] use openshell_server_macros::{rpc_auth, rpc_authz};`. Drop the stub and the unused-import workaround. A missing `#[rpc_authz]` now surfaces as rustc's standard "cannot find attribute `rpc_auth` in this scope" — clear enough, and one fewer import + lint exception. Addresses review comment on PR NVIDIA#1596. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * refactor(server-macros): emit fixed `AUTH_METADATA` const per service The previous trait-derived const name turned `OpenShell` into `OPEN_SHELL_AUTH_METADATA`, splitting the project name across an underscore. Each impl already lives in its own module (`crate::grpc::`, `crate::inference::`), so the module path is enough to disambiguate between services — a fixed `AUTH_METADATA` name reads more naturally. Aggregator in `auth/method_authz.rs` now references `crate::grpc::AUTH_METADATA` and `crate::inference::AUTH_METADATA` directly. Addresses review comment on PR NVIDIA#1596. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * docs(server-macros): fix typo in AUTH_METADATA_CONST doc comment OpenShell is one word; reference name in the doc should be OPENSHELL_AUTH_METADATA, not OPEN_SHELL_AUTH_METADATA. Addresses review nit on PR NVIDIA#1596. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> --------- Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

mrunalp requested review from a team, derekwaynecarr and maxamillion as code owners May 27, 2026 16:42

mrunalp added 4 commits May 27, 2026 10:21

docs(server): finish renaming sandbox-secret to sandbox in method_aut…

0f54133

…hz doc comments Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

mrunalp force-pushed the feat/per-handler-rpc-auth branch from 5de68c4 to 0f54133 Compare May 27, 2026 17:22

TaylorMutch reviewed May 27, 2026

View reviewed changes

Comment thread crates/openshell-server/src/inference.rs Outdated

TaylorMutch reviewed May 27, 2026

View reviewed changes

Comment thread crates/openshell-server-macros/src/lib.rs Outdated

mrunalp added 2 commits May 27, 2026 11:19

TaylorMutch added the test:e2e Requires end-to-end coverage label May 27, 2026

TaylorMutch reviewed May 27, 2026

View reviewed changes

Comment thread crates/openshell-server-macros/src/lib.rs Outdated

TaylorMutch previously approved these changes May 27, 2026

View reviewed changes

docs(server-macros): fix typo in AUTH_METADATA_CONST doc comment

cd2669f

OpenShell is one word; reference name in the doc should be OPENSHELL_AUTH_METADATA, not OPEN_SHELL_AUTH_METADATA. Addresses review nit on PR NVIDIA#1596. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

mrunalp dismissed TaylorMutch’s stale review via cd2669f May 27, 2026 19:25

TaylorMutch approved these changes May 27, 2026

View reviewed changes

TaylorMutch merged commit 3f520dd into NVIDIA:main May 27, 2026
37 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(server): declare gRPC auth (mode + scope + role) at the handler, enforce at the router#1596

feat(server): declare gRPC auth (mode + scope + role) at the handler, enforce at the router#1596
TaylorMutch merged 7 commits into
NVIDIA:mainfrom
mrunalp:feat/per-handler-rpc-auth

mrunalp commented May 27, 2026

Uh oh!

copy-pr-bot Bot commented May 27, 2026

Uh oh!

TaylorMutch commented May 27, 2026

Uh oh!

Uh oh!

Uh oh!

TaylorMutch commented May 27, 2026

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

Uh oh!

TaylorMutch commented May 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mrunalp commented May 27, 2026

Summary

Related Issue

Changes

Auth model

Backwards compatibility

Testing

Manual verification

Checklist

Uh oh!

copy-pr-bot Bot commented May 27, 2026

Uh oh!

TaylorMutch commented May 27, 2026

Uh oh!

Uh oh!

Uh oh!

TaylorMutch commented May 27, 2026

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

Uh oh!

TaylorMutch commented May 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants