feat(server): declare gRPC auth (mode + scope + role) at the handler, enforce at the router#1596
Merged
Merged
Conversation
Collaborator
|
/ok to test 5de68c4 |
Move scope, role, and auth-mode metadata to the handler definition site via #[rpc_authz] + #[rpc_auth] proc macros. The previously hand-maintained SCOPED_METHODS, ADMIN_METHODS, UNAUTHENTICATED_METHODS, and ALLOWED_SANDBOX_METHODS tables are now generated from per-method annotations on the tonic service impls, with canonical gRPC paths derived from the service name and method name. Adds a new openshell-server-macros proc-macro crate, an aggregator in auth/method_authz.rs, and an exhaustiveness test that decodes the protobuf FileDescriptorSet (now emitted by openshell-core/build.rs) and verifies every RPC has an annotation. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
PR NVIDIA#1404 replaced the shared sandbox secret with per-sandbox gateway-minted JWTs. A handler marked `sandbox` now authenticates as a specific `Principal::Sandbox`, not as a holder of a shared credential. Rename `auth = "sandbox-secret"` to `auth = "sandbox"` and `AuthMode::SandboxSecret` to `AuthMode::Sandbox` so the name matches the post-NVIDIA#1404 identity model. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
Addresses review feedback on the per-handler auth-annotation work. - Router-level enforcement of #[rpc_auth] auth mode (HIGH). The previous router only checked is_sandbox_callable() for Principal::Sandbox; user principals still flowed into AuthzPolicy::check() and bypassed the per-handler declaration. A user with `openshell:all` could therefore reach `sandbox`-only handlers like GetSandboxProviderEnvironment, ReportPolicyStatus, PushSandboxLogs, and SubmitPolicyAnalysis even though their annotations said sandbox-only. Adds an is_user_callable() predicate and rejects User principals at the router for `sandbox` / `unauthenticated` methods. - Proc macro now errors on duplicate keys in #[rpc_auth(...)] (LOW). A second `auth`, `scope`, or `role` previously silently overwrote the first value; now it fails to compile. - Regression tests: a unit test for is_user_callable() and a router test that proves a user with admin role + openshell:all cannot reach the nine sandbox-only handlers. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
…hz doc comments Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
5de68c4 to
0f54133
Compare
TaylorMutch
reviewed
May 27, 2026
TaylorMutch
reviewed
May 27, 2026
The stub was a safety net that fired only when a method had
`#[rpc_auth(...)]` without an enclosing `#[rpc_authz]`. Triggering it
required `rpc_auth` to be imported, which is why both call sites carried
`#[allow(unused_imports)] use openshell_server_macros::{rpc_auth, rpc_authz};`.
Drop the stub and the unused-import workaround. A missing `#[rpc_authz]`
now surfaces as rustc's standard "cannot find attribute `rpc_auth` in
this scope" — clear enough, and one fewer import + lint exception.
Addresses review comment on PR NVIDIA#1596.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
The previous trait-derived const name turned `OpenShell` into `OPEN_SHELL_AUTH_METADATA`, splitting the project name across an underscore. Each impl already lives in its own module (`crate::grpc::`, `crate::inference::`), so the module path is enough to disambiguate between services — a fixed `AUTH_METADATA` name reads more naturally. Aggregator in `auth/method_authz.rs` now references `crate::grpc::AUTH_METADATA` and `crate::inference::AUTH_METADATA` directly. Addresses review comment on PR NVIDIA#1596. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
Collaborator
|
/ok to test a8f611e |
|
Label |
TaylorMutch
reviewed
May 27, 2026
TaylorMutch
previously approved these changes
May 27, 2026
OpenShell is one word; reference name in the doc should be OPENSHELL_AUTH_METADATA, not OPEN_SHELL_AUTH_METADATA. Addresses review nit on PR NVIDIA#1596. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
TaylorMutch
approved these changes
May 27, 2026
Collaborator
|
/ok to test cd2669f |
mrunalp
added a commit
to mrunalp/OpenShell
that referenced
this pull request
May 28, 2026
…ust clients Addresses three findings from the branch review (`feat-python-sdk-bearer-auth-review.md`): Finding 1 (HIGH): HTTPS OIDC gateways without a full mTLS bundle were falling back to `grpc.insecure_channel`. Made `TlsConfig.ca_path`, `cert_path`, and `key_path` all optional with the cert/key pair required-together-or-not-at-all, so callers can express: - Full mTLS (all three): server trusts client identity. - CA-only (`ca_path` only): custom CA trust, no client identity. - System roots (`TlsConfig()`): OS trust store; the right default for OIDC gateways behind a public CA. `from_active_cluster` now mirrors `crates/openshell-tui/src/lib.rs` `build_oidc_channel`: for any `https://` gateway, always build a secure channel and pick the strongest TLS profile available (mTLS → CA-only → system roots). Finding 2 (MEDIUM): `from_active_cluster` snapshotted the access token once. Replaced with `_make_cluster_bearer_provider`, a per-RPC closure that re-reads `oidc_token.json` each call. A long-lived `SandboxClient` now picks up rotations performed by `openshell gateway login` without being reconstructed. Provider fails closed with `SandboxError` (and a "re-authenticate with: openshell gateway login" hint) when the token file is missing, malformed, or expired. Finding 3 (MEDIUM): `from_active_cluster` was attaching bearer metadata whenever `oidc_token.json` existed, even for gateways registered as `mtls` or `plaintext`. Now gates on `metadata.json.auth_mode == "oidc"`, matching `crates/openshell-cli/src/main.rs` and the TUI. Test coverage expands 14 → 23: HTTPS-OIDC-without-mTLS, CA-only layout, stale-token-with-wrong-auth-mode, per-RPC reload, expired token rejection, missing-file rejection, partial-TlsConfig validation, and the existing channel/interceptor matrix. Verified end-to-end against the OpenShift Keycloak deployment: positive admin / reader paths, scope enforcement, and PR NVIDIA#1596 sandbox-principal gate all still pass via the SDK. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
mrunalp
added a commit
to mrunalp/OpenShell
that referenced
this pull request
May 28, 2026
PR NVIDIA#1596 hardened the gateway side of the OIDC story; the Python SDK was the remaining gap — it only supported plaintext or mTLS, with no Bearer metadata anywhere. Deployments with OIDC enabled (the recommended posture since PR NVIDIA#935 / PR NVIDIA#1404) were unreachable from the SDK. Adds: - `bearer_token: str | Callable[[], str] | None` kwarg on `SandboxClient`. Static strings or zero-arg callables (the latter is invoked per RPC, so callers can drop in a refresh loop or token-file watcher without reconstructing the client). Composes with `tls` for OIDC-over-mTLS deployments. - `_BearerAuthInterceptor` implementing all four `grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types. Appends `authorization: Bearer <token>` to outgoing metadata. Implemented as an interceptor (not call credentials) so it works on both plaintext (`disableTls=true` dev) and TLS channels without `grpc.composite_channel_credentials`. - `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`, `key_path`) are now optional with `cert_path` / `key_path` required-together-or-not-at-all (enforced in `__post_init__`). This unlocks three transport profiles from one dataclass: * full mTLS (all three) * CA-only trust (`ca_path` only) * system roots (`TlsConfig()` — for OIDC gateways behind a public CA) - `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs` `build_oidc_channel`: * For any `https://` gateway, always build a secure channel. Pick the strongest TLS profile available in `mtls/` (full mTLS → CA-only → system roots). No more `insecure_channel` fallback for HTTPS. * Gate OIDC bearer attachment on `metadata.json["auth_mode"] == "oidc"`. Matches `crates/openshell-cli/src/main.rs:132` and the TUI; a stale `oidc_token.json` next to a non-OIDC gateway no longer causes the SDK to attach a bearer. * Use `_make_cluster_bearer_provider` — a per-RPC closure that reads `oidc_token.json` on every invocation, returning the current `access_token` if fresh and raising `SandboxError` with a "re-authenticate with: openshell gateway login" hint if the token is missing, malformed, or expired (the 30 s grace window matches `openshell-bootstrap::oidc_token::is_token_expired`). A long-lived `SandboxClient` now picks up token rotations done by the CLI without being reconstructed. OAuth2 refresh itself stays in the CLI; the SDK only consumes what's on disk. Tested: - 23 SDK unit tests pass (5 existing + 18 new across the bearer interceptor, token provider, `TlsConfig` validation, and the `from_active_cluster` auth ladder). `mise run test:python` → 31 passed total. - `mise run python:lint` (ruff) clean. - End-to-end against a Keycloak-protected gateway on OpenShift (deploy recipe at `architecture/plans/deploy-openshift.md`): * unauthenticated `Health` bypass works * admin + `openshell:all` reaches user-callable methods * reader (`sandbox:read`) denied on `CreateSandbox` by scope * admin + `openshell:all` denied on PR NVIDIA#1596 sandbox-only methods at the router (the new gate is honored from the SDK) * full provider CRUD lifecycle via the SDK * callable token provider rotates per RPC as expected - Regression-probed against four pre-PR failure modes: * `https://` OIDC gateway without `mtls/` no longer falls back to `insecure_channel` * CA-only `mtls/ca.crt` layout no longer raises `FileNotFoundError` * plaintext gateway with stale `oidc_token.json` no longer gets a bearer attached * long-lived client picks up rotated tokens; expired tokens surface as `SandboxError`, not silent gateway 401s Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
mrunalp
added a commit
to mrunalp/OpenShell
that referenced
this pull request
May 28, 2026
PR NVIDIA#1596 hardened the gateway side of the OIDC story; the Python SDK was the remaining gap — it only supported plaintext or mTLS, with no Bearer metadata anywhere. Deployments with OIDC enabled (the recommended posture since PR NVIDIA#935 / PR NVIDIA#1404) were unreachable from the SDK. Adds: - `bearer_token: str | Callable[[], str] | None` kwarg on `SandboxClient`. Static strings or zero-arg callables (the latter is invoked per RPC, so callers can drop in a refresh loop or token-file watcher without reconstructing the client). Composes with `tls` for OIDC-over-mTLS deployments. - `_BearerAuthInterceptor` implementing all four `grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types. Appends `authorization: Bearer <token>` to outgoing metadata. Implemented as an interceptor (not call credentials) so it works on both plaintext (`disableTls=true` dev) and TLS channels without `grpc.composite_channel_credentials`. - `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`, `key_path`) are now optional with `cert_path` / `key_path` required-together-or-not-at-all (enforced in `__post_init__`). This unlocks three transport profiles from one dataclass: * full mTLS (all three) * CA-only trust (`ca_path` only) * system roots (`TlsConfig()` — for OIDC gateways behind a public CA) - `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs` `build_oidc_channel`: * For any `https://` gateway, always build a secure channel. Pick the strongest TLS profile available in `mtls/` (full mTLS → CA-only → system roots). No more `insecure_channel` fallback for HTTPS. * Gate OIDC bearer attachment on `metadata.json["auth_mode"] == "oidc"`. Matches `crates/openshell-cli/src/main.rs:132` and the TUI; a stale `oidc_token.json` next to a non-OIDC gateway no longer causes the SDK to attach a bearer. - `_OidcRefresher` — thread-safe, in-process native OAuth2 refresh modeled on `google.oauth2.credentials.Credentials` and `botocore.tokens.SSOTokenProvider`. Lazily checks expiry on every RPC; when stale, re-reads disk first (the CLI may have rotated the bundle), and only then exchanges the refresh_token against the IdP's token endpoint discovered via OIDC discovery (`/.well-known/openid-configuration`, cached after first call). Concurrent RPCs share a single refresh via `threading.Lock` (no IdP stampede). Honors refresh-token rotation. Surfaces IdP failures as `SandboxError` with the IdP's error body included for diagnostics. - `_make_cluster_bearer_provider(..., auto_refresh=True, write_back=False)` factory. Default is the refresher path; `auto_refresh=False` falls back to the read-only fail-closed behavior for callers that don't want the SDK to make outbound HTTP calls to the IdP. `write_back=True` (opt-in) atomically persists the rotated bundle with 0600 mode so other processes — including the Rust CLI — see the rotation. Off by default; treats the Rust CLI as the canonical writer. - `from_active_cluster` exposes `auto_refresh` / `write_back` kwargs (defaults: True / False). OAuth2 refresh refresh policy and write-back semantics deliberately mirror what the major Python SDKs do — see github.com/googleapis/google-auth-library-python (`Credentials`) and github.com/boto/botocore (`SSOTokenProvider`): | Library | Native refresh | Writes back | |-------------------------------|----------------|-------------| | google-auth Credentials | yes | no | | botocore SSOTokenProvider | yes | yes | | openshell SandboxClient (here)| yes (opt-out) | opt-in | Refresh in the SDK is the production answer because: - Long-running Python orchestrators (agent runs, data pipelines) outlast a Keycloak 1-hour access token. Without in-SDK refresh, they crash at expiry. - Headless containers (sandbox-controller pods, GitHub Actions runners) may not have the Rust CLI installed but always have Python and a refresh_token. - Subprocess-to-CLI per RPC would spawn `openshell` on every gRPC call, including hot streaming paths. Unacceptable. The Rust CLI keeps owning interactive flows (browser/device-code, keyring storage, the initial login). The SDK owns refresh during script execution. Tested: - 32 SDK unit tests pass (5 existing + 27 new across the bearer interceptor, fail-closed provider, refresher behavior, `TlsConfig` validation, `from_active_cluster` auth ladder, and the refresher's concurrency / rotation / write-back / error paths). `mise run test:python` → 40 passed total. - `mise run python:lint` (ruff) clean. - End-to-end against a Keycloak-protected gateway on OpenShift: * unauthenticated `Health` bypass works * admin + `openshell:all` reaches user-callable methods * reader (`sandbox:read`) denied on `CreateSandbox` by scope * admin + `openshell:all` denied on PR NVIDIA#1596 sandbox-only methods at the router (the new gate is honored from the SDK) * full provider CRUD lifecycle via the SDK * callable token provider rotates per RPC as expected - Regression-probed against the four pre-review failure modes: * `https://` OIDC gateway without `mtls/` no longer falls back to `insecure_channel` * CA-only `mtls/ca.crt` layout no longer raises `FileNotFoundError` * plaintext gateway with stale `oidc_token.json` no longer gets a bearer attached * long-lived client picks up rotated tokens; expired tokens surface as `SandboxError`, not silent gateway 401s - Refresher unit tests cover: cached-fresh fast path, disk-rotated re-read before refresh, OAuth2 exchange against the discovered token endpoint, refresh-token rotation, atomic write-back at 0600 mode, concurrent N-thread coordination (one refresh shared across 8 threads), IdP failure surfaced with error body, and the client_credentials / no-refresh_token error path. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
mrunalp
added a commit
to mrunalp/OpenShell
that referenced
this pull request
May 28, 2026
PR NVIDIA#1596 hardened the gateway side of the OIDC story; the Python SDK was the remaining gap — it only supported plaintext or mTLS, with no Bearer metadata anywhere. Deployments with OIDC enabled (the recommended posture since PR NVIDIA#935 / PR NVIDIA#1404) were unreachable from the SDK. Adds: - `bearer_token: str | Callable[[], str] | None` kwarg on `SandboxClient`. Static strings or zero-arg callables (the latter is invoked per RPC, so callers can drop in a refresh loop or token-file watcher without reconstructing the client). Composes with `tls` for OIDC-over-mTLS deployments. - `_BearerAuthInterceptor` implementing all four `grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types. Appends `authorization: Bearer <token>` to outgoing metadata. Implemented as an interceptor (not call credentials) so it works on both plaintext (`disableTls=true` dev) and TLS channels without `grpc.composite_channel_credentials`. - `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`, `key_path`) are now optional with `cert_path` / `key_path` required-together-or-not-at-all (enforced in `__post_init__`). This unlocks three transport profiles from one dataclass: * full mTLS (all three) * CA-only trust (`ca_path` only) * system roots (`TlsConfig()` — for OIDC gateways behind a public CA) - `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs` `build_oidc_channel`: * For any `https://` gateway, always build a secure channel. Pick the strongest TLS profile available in `mtls/` (full mTLS → CA-only → system roots). No more `insecure_channel` fallback for HTTPS. * Gate OIDC bearer attachment on `metadata.json["auth_mode"] == "oidc"`. Matches `crates/openshell-cli/src/main.rs:132` and the TUI; a stale `oidc_token.json` next to a non-OIDC gateway no longer causes the SDK to attach a bearer. - `_OidcRefresher` — thread-safe, in-process native OAuth2 refresh modeled on `google.oauth2.credentials.Credentials` and `botocore.tokens.SSOTokenProvider`. Lazily checks expiry on every RPC; when stale, re-reads disk first (the CLI may have rotated the bundle), and only then exchanges the refresh_token against the IdP's token endpoint discovered via OIDC discovery (`/.well-known/openid-configuration`, cached after first call). Concurrent RPCs share a single refresh via `threading.Lock` (no IdP stampede). Honors refresh-token rotation. Surfaces IdP failures as `SandboxError` with the RFC 6749 error body included for diagnostics. Mirrors the Rust CLI's HTTP-policy posture from `crates/openshell-cli/src/oidc_auth.rs`: * `follow_redirects=False` so a 3xx during discovery can't steer us to an attacker-controlled token endpoint. * Discovery `issuer` is validated against the configured issuer; a discovery document claiming a different issuer is rejected, preventing the SDK from POSTing the refresh_token to a malicious endpoint. * `insecure: bool` flag plumbed through to httpx's `verify=` so self-signed-cert deployments work the same way they do in the Rust CLI. Built on `httpx` (chosen over `urllib` specifically for follow_redirects + verify control as kwargs). The OAuth2 refresh-token grant itself (RFC 6749 §6) is one form-encoded POST — handled inline rather than via a dedicated OAuth library; tried `authlib`'s `OAuth2Client` first but it auto-injects an Authorization header on every request, which breaks the unauthenticated discovery GET. - `_make_cluster_bearer_provider(..., auto_refresh=True, write_back=True, insecure=False)` factory. Defaults to the refresher path with write-back enabled; `auto_refresh=False` falls back to the read-only fail-closed behavior for callers that don't want the SDK to make outbound HTTP calls to the IdP. `write_back=True` is the default (changed from the first round of review): IdPs with refresh-token rotation (Keycloak with rotation, Entra in strict mode) invalidate the old refresh_token on each refresh, so an in-memory-only refresh would leave the on-disk bundle pointing at an invalidated value — any second process starting from disk would `invalid_grant`. With write-back enabled by default, the SDK keeps the shared cache consistent with the IdP. - `from_active_cluster` exposes `auto_refresh`, `write_back`, and `insecure` kwargs (defaults: True / True / False). OAuth2 refresh policy and write-back semantics deliberately mirror what the major Python SDKs do — see github.com/googleapis/google-auth-library-python (`Credentials`) and github.com/boto/botocore (`SSOTokenProvider`): | Library | Native refresh | Writes back | |-------------------------------|----------------|-------------| | google-auth Credentials | yes | no | | botocore SSOTokenProvider | yes | yes | | openshell SandboxClient (here)| yes (opt-out) | yes (opt-out)| OpenShell sits between the two; chose write-back-by-default because the rotation invariant matters more for our deployments than the "CLI is the only writer" assumption that fits google-auth. Adds `httpx>=0.27` as a runtime dependency. No new OAuth2 library — the refresh grant is a single POST. Tested: - 36 SDK unit tests pass (5 existing + 31 across the bearer interceptor, fail-closed provider, refresher behavior, TlsConfig validation, from_active_cluster auth ladder, and three regression tests for the security review findings). `mise run test:python` → 44 passed total. - `mise run python:lint` (ruff) clean. - End-to-end against a Keycloak-protected gateway on OpenShift: * unauthenticated `Health` bypass works * admin + `openshell:all` reaches user-callable methods * reader (`sandbox:read`) denied on `CreateSandbox` by scope * admin + `openshell:all` denied on PR NVIDIA#1596 sandbox-only methods at the router (the new gate is honored from the SDK) * full provider CRUD lifecycle via the SDK * callable token provider rotates per RPC as expected - Regression-probed against three pre-review security findings: * **Discovery issuer validation** — a discovery document claiming a different `issuer` than the configured one is rejected with a clear `SandboxError` before any refresh POST can reach the attacker-controlled endpoint. * **Redirect during discovery** — `follow_redirects=False` on the underlying httpx client means a 3xx during discovery surfaces as a SandboxError rather than silently chasing the redirect. * **Cross-process rotation** — a two-process simulation shows process B starting from disk and successfully refreshing with the rotated refresh_token, because process A's write-back updated the shared cache. - Refresher unit tests cover: cached-fresh fast path, disk-rotated re-read before refresh, OAuth2 exchange against the discovered token endpoint, refresh-token rotation, atomic write-back at 0600 mode (default), default-on write_back proven by test, concurrent N-thread coordination (one refresh shared across 8 threads), IdP failure surfaced with the error body, the client_credentials / no-refresh_token error path, issuer- mismatch rejection, redirect-during-discovery rejection, insecure flag plumbing. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
mrunalp
added a commit
to mrunalp/OpenShell
that referenced
this pull request
May 28, 2026
PR NVIDIA#1596 hardened the gateway side of the OIDC story; the Python SDK was the remaining gap — it only supported plaintext or mTLS, with no Bearer metadata anywhere. Deployments with OIDC enabled (the recommended posture since PR NVIDIA#935 / PR NVIDIA#1404) were unreachable from the SDK. Adds: - `bearer_token: str | Callable[[], str] | None` kwarg on `SandboxClient`. Static strings or zero-arg callables (the latter is invoked per RPC, so callers can drop in a refresh loop or token-file watcher without reconstructing the client). Composes with `tls` for OIDC-over-mTLS deployments. - `_BearerAuthInterceptor` implementing all four `grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types. Appends `authorization: Bearer <token>` to outgoing metadata. Implemented as an interceptor (not call credentials) so it works on both plaintext (`disableTls=true` dev) and TLS channels without `grpc.composite_channel_credentials`. - `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`, `key_path`) are now optional with `cert_path` / `key_path` required-together-or-not-at-all (enforced in `__post_init__`). This unlocks three transport profiles from one dataclass: * full mTLS (all three) * CA-only trust (`ca_path` only) * system roots (`TlsConfig()` — for OIDC gateways behind a public CA) - `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs` `build_oidc_channel`: * For any `https://` gateway, always build a secure channel. Pick the strongest TLS profile available in `mtls/` (full mTLS → CA-only → system roots). No more `insecure_channel` fallback for HTTPS. * Gate OIDC bearer attachment on `metadata.json["auth_mode"] == "oidc"`. Matches `crates/openshell-cli/src/main.rs:132` and the TUI; a stale `oidc_token.json` next to a non-OIDC gateway no longer causes the SDK to attach a bearer. - `_OidcRefresher` — thread-safe, in-process native OAuth2 refresh modeled on `google.oauth2.credentials.Credentials` and `botocore.tokens.SSOTokenProvider`. Lazily checks expiry on every RPC; when stale, re-reads disk first (the CLI may have rotated the bundle), and only then exchanges the refresh_token against the IdP's token endpoint discovered via OIDC discovery (`/.well-known/openid-configuration`, cached after first call). Concurrent RPCs share a single refresh via `threading.Lock` (no IdP stampede). Honors refresh-token rotation. Surfaces IdP failures as `SandboxError` with the RFC 6749 error body included for diagnostics. Mirrors the Rust CLI's HTTP-policy posture from `crates/openshell-cli/src/oidc_auth.rs`: * `follow_redirects=False` so a 3xx during discovery can't steer us to an attacker-controlled token endpoint. * Discovery `issuer` is validated against the configured issuer; a discovery document claiming a different issuer is rejected, preventing the SDK from POSTing the refresh_token to a malicious endpoint. * `insecure: bool` flag plumbed through to httpx's `verify=` so self-signed-cert deployments work the same way they do in the Rust CLI. Built on `httpx` (chosen over `urllib` specifically for follow_redirects + verify control as kwargs). The OAuth2 refresh-token grant itself (RFC 6749 §6) is one form-encoded POST — handled inline rather than via a dedicated OAuth library; tried `authlib`'s `OAuth2Client` first but it auto-injects an Authorization header on every request, which breaks the unauthenticated discovery GET. - `_make_cluster_bearer_provider(..., auto_refresh=True, write_back=True, insecure=False)` factory. Defaults to the refresher path with write-back enabled; `auto_refresh=False` falls back to the read-only fail-closed behavior for callers that don't want the SDK to make outbound HTTP calls to the IdP. `write_back=True` is the default (changed from the first round of review): IdPs with refresh-token rotation (Keycloak with rotation, Entra in strict mode) invalidate the old refresh_token on each refresh, so an in-memory-only refresh would leave the on-disk bundle pointing at an invalidated value — any second process starting from disk would `invalid_grant`. With write-back enabled by default, the SDK keeps the shared cache consistent with the IdP. - `from_active_cluster` exposes `auto_refresh`, `write_back`, and `insecure` kwargs (defaults: True / True / False). OAuth2 refresh policy and write-back semantics deliberately mirror what the major Python SDKs do — see github.com/googleapis/google-auth-library-python (`Credentials`) and github.com/boto/botocore (`SSOTokenProvider`): | Library | Native refresh | Writes back | |-------------------------------|----------------|-------------| | google-auth Credentials | yes | no | | botocore SSOTokenProvider | yes | yes | | openshell SandboxClient (here)| yes (opt-out) | yes (opt-out)| OpenShell sits between the two; chose write-back-by-default because the rotation invariant matters more for our deployments than the "CLI is the only writer" assumption that fits google-auth. Adds `httpx>=0.27` as a runtime dependency. No new OAuth2 library — the refresh grant is a single POST. Tested: - 36 SDK unit tests pass (5 existing + 31 across the bearer interceptor, fail-closed provider, refresher behavior, TlsConfig validation, from_active_cluster auth ladder, and three regression tests for the security review findings). `mise run test:python` → 44 passed total. - `mise run python:lint` (ruff) clean. - End-to-end against a Keycloak-protected gateway on OpenShift: * unauthenticated `Health` bypass works * admin + `openshell:all` reaches user-callable methods * reader (`sandbox:read`) denied on `CreateSandbox` by scope * admin + `openshell:all` denied on PR NVIDIA#1596 sandbox-only methods at the router (the new gate is honored from the SDK) * full provider CRUD lifecycle via the SDK * callable token provider rotates per RPC as expected - Regression-probed against three pre-review security findings: * **Discovery issuer validation** — a discovery document claiming a different `issuer` than the configured one is rejected with a clear `SandboxError` before any refresh POST can reach the attacker-controlled endpoint. * **Redirect during discovery** — `follow_redirects=False` on the underlying httpx client means a 3xx during discovery surfaces as a SandboxError rather than silently chasing the redirect. * **Cross-process rotation** — a two-process simulation shows process B starting from disk and successfully refreshing with the rotated refresh_token, because process A's write-back updated the shared cache. - Refresher unit tests cover: cached-fresh fast path, disk-rotated re-read before refresh, OAuth2 exchange against the discovered token endpoint, refresh-token rotation, atomic write-back at 0600 mode (default), default-on write_back proven by test, concurrent N-thread coordination (one refresh shared across 8 threads), IdP failure surfaced with the error body, the client_credentials / no-refresh_token error path, issuer- mismatch rejection, redirect-during-discovery rejection, insecure flag plumbing. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
st-gr
pushed a commit
to st-gr/OpenShell
that referenced
this pull request
May 28, 2026
… enforce at the router (NVIDIA#1596) * feat(server): per-handler gRPC auth annotations Move scope, role, and auth-mode metadata to the handler definition site via #[rpc_authz] + #[rpc_auth] proc macros. The previously hand-maintained SCOPED_METHODS, ADMIN_METHODS, UNAUTHENTICATED_METHODS, and ALLOWED_SANDBOX_METHODS tables are now generated from per-method annotations on the tonic service impls, with canonical gRPC paths derived from the service name and method name. Adds a new openshell-server-macros proc-macro crate, an aggregator in auth/method_authz.rs, and an exhaustiveness test that decodes the protobuf FileDescriptorSet (now emitted by openshell-core/build.rs) and verifies every RPC has an annotation. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * refactor(server): rename `sandbox-secret` auth mode to `sandbox` PR NVIDIA#1404 replaced the shared sandbox secret with per-sandbox gateway-minted JWTs. A handler marked `sandbox` now authenticates as a specific `Principal::Sandbox`, not as a holder of a shared credential. Rename `auth = "sandbox-secret"` to `auth = "sandbox"` and `AuthMode::SandboxSecret` to `AuthMode::Sandbox` so the name matches the post-NVIDIA#1404 identity model. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * fix(server): enforce per-handler AuthMode at the router Addresses review feedback on the per-handler auth-annotation work. - Router-level enforcement of #[rpc_auth] auth mode (HIGH). The previous router only checked is_sandbox_callable() for Principal::Sandbox; user principals still flowed into AuthzPolicy::check() and bypassed the per-handler declaration. A user with `openshell:all` could therefore reach `sandbox`-only handlers like GetSandboxProviderEnvironment, ReportPolicyStatus, PushSandboxLogs, and SubmitPolicyAnalysis even though their annotations said sandbox-only. Adds an is_user_callable() predicate and rejects User principals at the router for `sandbox` / `unauthenticated` methods. - Proc macro now errors on duplicate keys in #[rpc_auth(...)] (LOW). A second `auth`, `scope`, or `role` previously silently overwrote the first value; now it fails to compile. - Regression tests: a unit test for is_user_callable() and a router test that proves a user with admin role + openshell:all cannot reach the nine sandbox-only handlers. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * docs(server): finish renaming sandbox-secret to sandbox in method_authz doc comments Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * refactor(server-macros): drop standalone `rpc_auth` stub The stub was a safety net that fired only when a method had `#[rpc_auth(...)]` without an enclosing `#[rpc_authz]`. Triggering it required `rpc_auth` to be imported, which is why both call sites carried `#[allow(unused_imports)] use openshell_server_macros::{rpc_auth, rpc_authz};`. Drop the stub and the unused-import workaround. A missing `#[rpc_authz]` now surfaces as rustc's standard "cannot find attribute `rpc_auth` in this scope" — clear enough, and one fewer import + lint exception. Addresses review comment on PR NVIDIA#1596. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * refactor(server-macros): emit fixed `AUTH_METADATA` const per service The previous trait-derived const name turned `OpenShell` into `OPEN_SHELL_AUTH_METADATA`, splitting the project name across an underscore. Each impl already lives in its own module (`crate::grpc::`, `crate::inference::`), so the module path is enough to disambiguate between services — a fixed `AUTH_METADATA` name reads more naturally. Aggregator in `auth/method_authz.rs` now references `crate::grpc::AUTH_METADATA` and `crate::inference::AUTH_METADATA` directly. Addresses review comment on PR NVIDIA#1596. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * docs(server-macros): fix typo in AUTH_METADATA_CONST doc comment OpenShell is one word; reference name in the doc should be OPENSHELL_AUTH_METADATA, not OPEN_SHELL_AUTH_METADATA. Addresses review nit on PR NVIDIA#1596. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> --------- Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Move gRPC auth metadata (auth mode, Bearer scope, required role) from four
hand-maintained constants into per-handler
#[rpc_auth(...)]annotationsgenerated by a new proc macro, and close the asymmetric router enforcement
where a
Principal::Usercould reach handlers intended for sandboxsupervisors. A descriptor-set-driven exhaustiveness test pins the surface
so a new RPC can't silently fall back to
openshell:all.Related Issue
Fixes #1586
Changes
openshell-server-macrosexposing#[rpc_authz](impl-level) and
#[rpc_auth](per-method) attributes. First proc macroin the workspace; intentionally small and focused on auth metadata.
OpenShellServiceandInferenceService, declaringauth = "unauthenticated" | "sandbox" | "bearer" | "dual"and (whenbearer/dual) the requiredscopeandrole. The macro derives canonical gRPC paths from the proto servicename + PascalCased method name, so paths cannot drift from the proto.
auth/method_authz.rsexposinglookup,required_scope,required_role,is_unauthenticated,is_sandbox_callable, and the newis_user_callable. Single source oftruth for the four old constants.
multiplex.rs:Principal::Userisnow rejected with
PermissionDenied: this method requires a sandbox principalfor methods declaredsandbox(and forunauthenticatedmethods which would already short-circuit). Mirrors the existing
is_sandbox_callablecheck onPrincipal::Sandbox. Closes the gapwhere
GetSandboxProviderEnvironment,ReportPolicyStatus,SubmitPolicyAnalysis,PushSandboxLogs,ConnectSupervisor, andRelayStreamwere reachable by a user token because their handlersuse
ensure_sandbox_scope(which intentionally lets users through) orno guard at all.
SCOPED_METHODSandADMIN_METHODSfromauth/authz.rs,UNAUTHENTICATED_METHODSfromauth/oidc.rs, andALLOWED_SANDBOX_METHODSfromauth/sandbox_methods.rs. Their publicpredicates (
is_unauthenticated_method,is_sandbox_callable) nowdelegate to the aggregator; the existing unit tests keep passing.
UNAUTHENTICATED_PREFIXESstays — prefix matching for/grpc.reflection.*and/grpc.health.*is structural, not per-method.#[rpc_authz]fails compilation on:missing
#[rpc_auth],scope/roleonunauthenticated/sandboxmethods, missing
scope/roleonbearer/dualmethods, duplicategRPC paths within a service, duplicate keys inside one
#[rpc_auth],and invalid auth mode/role strings.
openshell-core/build.rsnowcalls
tonic_build::configure().file_descriptor_set_path(...)andexposes
openshell_core::FILE_DESCRIPTOR_SET. A new test inopenshell-serverenumerates every(service, method)from thedescriptor and asserts it is covered exactly once by a
MethodAuthentry. Catches new RPCs without annotations, stale annotations after
renames, and duplicates across services.
openshell-admin+openshell:allbearer user is denied on each
sandbox-annotated method.Auth model
Principal::Sandboxaccepted?unauthenticatedsandboxbearerdualsandboxhere refers to the per-sandbox gateway-minted JWT introduced in#1404 — the old shared sandbox secret no longer exists. A handler
annotated
sandboxauthenticates as a specificPrincipal::Sandbox.Backwards compatibility
Visible behavior changes for deployed gateways:
GetSandboxProviderEnvironment,ReportPolicyStatus,SubmitPolicyAnalysis,PushSandboxLogs,ConnectSupervisor,RelayStream) start rejecting Bearer users at the router. Nothing inthe CLI or any user-facing flow calls them; only sandbox supervisors
do, via the per-sandbox JWT path.
ExecSandboxInteractivethatpreviously fell back to
openshell:allnow have explicit scope/roledeclarations.
openshell:alltokens still work;provider:read-onlytokens gain access to
ListProviderProfiles/GetProviderProfile.Testing
mise run pre-commitpasses (rust:lint, rust:format, rust:check,helm:lint, helm:docs:check — all green; the only
mise run cifailuresare markdown lint errors in pre-existing files outside this branch).
auth::method_authz::tests— three exhaustiveness tests(
every_proto_rpc_has_an_annotation,every_annotated_path_matches_a_real_rpc,no_duplicate_paths_across_services) plususer_callable_matches_auth_mode.multiplex::tests::auth_router::user_principal_is_denied_on_sandbox_only_methods— proves the router rejects admin +
openshell:allon all ninesandbox-only methods.
authz.rs,oidc.rs,sandbox_methods.rs,multiplex.rscontinue to pass against the new aggregator.added in this PR; existing
e2e:kubernetessmoke run continues toexercise the auth path through the gateway.
Manual verification
Verified end-to-end against a local OIDC + Keycloak deployment.
GetSandboxProviderEnvironmentNotFound: sandbox not found(handler reached, store queried)PermissionDenied: this method requires a sandbox principalReportPolicyStatusInvalidArgument: sandbox_id is requiredPermissionDeniedSubmitPolicyAnalysisInvalidArgument: name is requiredPermissionDeniedPushSandboxLogs{}— stream accepted, push succeededPermissionDeniedConnectSupervisorInvalidArgument: expected SupervisorHello(bidi opened)PermissionDeniedRelayStreamInvalidArgument: first RelayFrame must be init…(bidi opened)PermissionDeniedIssueSandboxToken/RefreshSandboxToken/GetInferenceBundlePermissionDenied(handler-level guard)PermissionDenied(router-level)Positive paths (
readertoken canListSandboxes,writertoken reachesCreateSandboxvalidation,admin + provider:writecanCreateProvider/
ListProviders/DeleteProvider, scope-only and role-only denialsreturn the existing AuthzPolicy messages) are unchanged.
Checklist