Add ephemeral admin certificates with step-ca #4846

Draft
pcnudde wants to merge 12 commits into NVIDIA:main from pcnudde:feat/step-ca-ephemeral-admin-certs

Conversation

@pcnudde pcnudde commented Jun 30, 2026

Collaborator

Summary

  • add per-admin ephemeral_admin_cert provisioning for startup kits without long-lived admin keys
  • add a narrow runtime provider interface with a built-in step_ca provider
  • acquire, validate, cache, and renew short-lived admin certificate chains before using the existing mTLS login and job-signing paths
  • preserve traditional static admin startup kits and certificate behavior
  • add startup-kit inspection, package-checker handling, clone diagnostics, documentation, and focused tests

Security model

The external certificate provider authenticates the admin and issues a short-lived certificate chain rooted in the FLARE project CA. FLARE validates the chain, certificate validity, key match, identity, organization, and role before use. The server and clients continue to rely on the existing admin certificate login, authorization, and job-signature verification paths; no OIDC tokens are introduced into those components.

The built-in step-ca adapter invokes the step CLI. step-ca owns OIDC login and claim-to-certificate mapping. Valid credentials are cached per OS user until they enter the renewal window, avoiding a browser flow for every CLI command.
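The renewal-window rule described above can be sketched as follows. This is a minimal illustration, not the PR's code: the function name `needs_renewal` and the `renewal_window` parameter are hypothetical, and the actual window length is not stated in this conversation.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def needs_renewal(
    not_after: datetime,
    renewal_window: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """Return True once a cached certificate is inside its renewal window.

    A certificate that has already expired also returns True, because
    `now` is then later than `not_after - renewal_window`.
    """
    now = now or datetime.now(timezone.utc)
    return now >= not_after - renewal_window


# Hypothetical usage: a certificate expiring at 12:00 with a 5-minute window
# is reused at 11:00 and replaced at 11:56.
expiry = datetime(2026, 7, 1, 12, 0, tzinfo=timezone.utc)
window = timedelta(minutes=5)
reuse_cached = not needs_renewal(expiry, window, now=expiry - timedelta(hours=1))
```

Under this rule, only commands that run inside the window (or after expiry) would trigger a new provider call, which is what avoids a browser flow on every CLI invocation.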

Compatibility

Static admins remain unchanged and can coexist with ephemeral admin kits in the same project.yml. Ephemeral kits omit client.crt and client.key; the authenticated identity, organization, and FLARE role come from the issued certificate.

@pcnudde pcnudde force-pushed the feat/step-ca-ephemeral-admin-certs branch from 2a10ede to bbab945 on June 30, 2026 20:30
pcnudde commented Jun 30, 2026

Collaborator Author

@greptileai review

greptile-apps Bot commented Jun 30, 2026

Contributor

Greptile Summary

This PR adds ephemeral admin certificates via an external CA (step-ca) so admins can receive short-lived mTLS certs issued through OIDC instead of shipping long-lived client.crt/client.key in startup kits. It introduces a provider plugin interface, an on-disk per-user cert cache with POSIX file locking, renewal-window logic, and server-side clone guards tied to cert validity.

  • New ephemeral cert path: EphemeralAdminCertConfig in fed_admin.json triggers cert acquisition via a named provider (e.g., step_ca) during AdminAPI.__init__, with automatic renewal before each connection, login, and job submit.
  • Clone guard: SUBMITTER_CERT_VALIDITY is stored in job metadata when --ephemeral-admin-cert is set; cloning is blocked once the original cert expires, with CLONED_META_KEYS propagating the window to each successive clone.
  • Backward compatibility preserved: static admin startup kits require no changes and coexist with ephemeral kits in the same project.yml.
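The clone-guard propagation in the second bullet can be pictured with a small sketch. Everything here is an assumption for illustration: the string value of `SUBMITTER_CERT_VALIDITY`, the contents of `CLONED_META_KEYS`, the shape of the validity record, and the helper `clone_job_meta` are hypothetical and do not mirror the code in `job_cmds.py`.

```python
from datetime import datetime, timezone

# Assumed key name and key set; the real constants live in the PR.
SUBMITTER_CERT_VALIDITY = "submitter_cert_validity"
CLONED_META_KEYS = {"name", SUBMITTER_CERT_VALIDITY}


def clone_job_meta(meta: dict, now: datetime) -> dict:
    """Copy the cloneable metadata, refusing once the recorded window has passed."""
    validity = meta.get(SUBMITTER_CERT_VALIDITY)
    if validity is not None:
        if not (validity["not_before"] <= now <= validity["not_after"]):
            raise ValueError("submitter certificate expired; download and resubmit the job")
    # Because the validity record is itself a cloned key, a clone of a clone
    # is still bound to the window of the original signing certificate.
    return {k: v for k, v in meta.items() if k in CLONED_META_KEYS}


original = {
    "name": "demo_job",
    "status": "FINISHED",
    SUBMITTER_CERT_VALIDITY: {
        "not_before": datetime(2026, 7, 1, 0, 0, tzinfo=timezone.utc),
        "not_after": datetime(2026, 7, 1, 1, 0, tzinfo=timezone.utc),
    },
}
inside_window = datetime(2026, 7, 1, 0, 30, tzinfo=timezone.utc)
second_clone = clone_job_meta(clone_job_meta(original, inside_window), inside_window)
```

A job whose metadata has no validity record (a static-admin submission in this sketch) passes through unchecked, which matches the compatibility goal stated in the third bullet.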

Confidence Score: 5/5

Safe to merge. The ephemeral cert acquisition, caching, validation, and renewal paths are all correctly implemented; the static admin path is unchanged.

The new code handles all the tricky edge cases well: atomic cache writes via temp-dir rename, POSIX file locking to serialize provider calls, chain-to-rootCA validation before accepting any issued cert, correct ordering of cell reset and re-authentication after renewal, and defensive parsing of cert validity throughout the clone-guard path. No logic bugs were found in any of the changed files.
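As a rough picture of the two cache techniques named above (an exclusive POSIX lock around the provider call and store, plus a write-to-temp-dir-then-rename store), here is a minimal sketch. The helper names `exclusive_lock` and `store_credentials`, the `.lock` file name, and the directory layout are hypothetical; the PR's own implementation may differ in every detail.

```python
import fcntl
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path: str):
    """Hold an exclusive POSIX advisory lock (flock) for the duration of the block."""
    os.makedirs(os.path.dirname(lock_path), exist_ok=True)
    with open(lock_path, "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def store_credentials(cache_root: str, entry_name: str, files: dict) -> str:
    """Write every file into a private temp dir, then rename the dir into place."""
    final_dir = os.path.join(cache_root, entry_name)
    with exclusive_lock(os.path.join(cache_root, ".lock")):
        # mkdtemp creates the directory with mode 0o700 on POSIX.
        tmp_dir = tempfile.mkdtemp(prefix=".tmp-", dir=cache_root)
        try:
            for name, data in files.items():
                fd = os.open(
                    os.path.join(tmp_dir, name),
                    os.O_WRONLY | os.O_CREAT | os.O_EXCL,
                    0o600,  # key material should not be group/world readable
                )
                with os.fdopen(fd, "wb") as f:
                    f.write(data)
            # Replacing an existing entry is done while the lock is held, so
            # readers that take the same lock never see a half-written entry.
            if os.path.isdir(final_dir):
                shutil.rmtree(final_dir)
            os.rename(tmp_dir, final_dir)
        except BaseException:
            shutil.rmtree(tmp_dir, ignore_errors=True)
            raise
    return final_dir
```

The rename itself is atomic on a single filesystem, but the remove-then-rename replacement is not, which is why this sketch relies on readers taking the same lock.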

No files require special attention. The two observations (hardcoded RSA 2048 key parameters and unfiltered subprocess stderr) are minor design-level points that do not affect correctness or security.

Important Files Changed

| Filename | Overview |
| --- | --- |
| nvflare/fuel/sec/ephemeral_admin_cert.py | Core cert acquisition, caching, and validation logic. POSIX file locking, atomic store via temp-dir rename, stale-entry pruning, and provider plugin loading are all well-handled. |
| nvflare/fuel/sec/step_ca_admin_cert.py | step-ca provider: builds the step ca certificate command, validates the ca_url (https-only except localhost), and wraps subprocess.run with timeout/exit-code error handling. RSA 2048 key type is hardcoded with no config knob. |
| nvflare/fuel/hci/client/api.py | AdminAPI now calls ensure_client_cert_valid() at init, connect(), cert_login(), and file transfer. Renewal correctly resets the cell and re-establishes the mTLS connection and HCI session in the right order. |
| nvflare/private/fed/server/job_cmds.py | Adds _submitter_cert_validity to extract cert validity from the uploaded zip and _clone_signature_error to block cloning of expired-cert jobs. Both handle edge cases (bad zip, unparseable cert chain, missing cert file) cleanly. |
| nvflare/fuel/hci/client/file_transfer.py | Added cert-refresh logic before job signing and appends --ephemeral-admin-cert to the submit command when applicable. Reconnect/re-login after renewal follows the correct order. |
| nvflare/lighter/entity.py | Participant validation updated: ephemeral admins may omit org/role (sourced from the issued cert); name check falls back to admin_kit pattern for non-email names. Consistent with the consuming-parser idiom. |
| nvflare/tool/kit/kit_config.py | classify_startup_kit and inspect_startup_kit_metadata now detect ephemeral admin kits by presence of ephemeral_admin_cert in fed_admin.json; skips cert expiry check and reports runtime_issued status correctly. |
| nvflare/tool/package_checker/nvflare_console_package_checker.py | Overrides check_dry_run to validate step_ca config, check rootCA.pem presence, and verify the step binary is on PATH before skipping the interactive login step. |
| nvflare/lighter/impl/cert.py | CertBuilder correctly skips client cert/key generation for ephemeral admins and only writes rootCA.pem to their kit directory. |
| nvflare/lighter/impl/static_file.py | StaticFileBuilder sets uid_source=cert and embeds ephemeral_admin_cert config in fed_admin.json; strips client_key/client_cert fields for ephemeral kits via _modify_fed_admin_config callback. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Admin as Admin CLI
    participant Cache as Cert Cache (~/.nvflare)
    participant StepCA as step-ca (OIDC)
    participant FLARE as FLARE Server

    Admin->>Admin: AdminAPI.__init__() ensure_client_cert_valid()
    Admin->>Cache: acquire POSIX flock (LOCK_EX)
    Cache-->>Admin: no valid cached cert
    Admin->>StepCA: step ca certificate (OIDC browser flow)
    StepCA-->>Admin: client.crt + client.key
    Admin->>Admin: validate chain to rootCA.pem, verify key match CN org role
    Admin->>Cache: store under ~/.nvflare/ephemeral_admin_certs/hash/ns/
    Admin->>Cache: release flock

    Admin->>FLARE: connect() mTLS with new cert
    FLARE-->>Admin: mTLS OK
    Admin->>FLARE: CERT_LOGIN (HCI)
    FLARE-->>Admin: session token

    Note over Admin,FLARE: Submit job
    Admin->>Admin: ensure_client_cert_valid() renewal window check
    Admin->>Admin: sign_folders embed client.crt as .__nvfl_submitter.crt
    Admin->>FLARE: submit_job --ephemeral-admin-cert zip
    FLARE->>FLARE: _submitter_cert_validity(zip) extract not_before/not_after
    FLARE->>FLARE: store SUBMITTER_CERT_VALIDITY in job meta

    Note over Admin,FLARE: Clone job (later)
    Admin->>FLARE: clone_job
    FLARE->>FLARE: _clone_signature_error() check now vs not_before/not_after
    alt cert still valid
        FLARE-->>Admin: clone created SUBMITTER_CERT_VALIDITY propagated
    else cert expired
        FLARE-->>Admin: error download and resubmit
    end

Reviews (10): Last reviewed commit: "Avoid duplicate step-ca timeout validati..."

Comment thread nvflare/fuel/sec/ephemeral_admin_cert.py
Comment thread nvflare/tool/package_checker/nvflare_console_package_checker.py Outdated
Comment thread nvflare/fuel/sec/ephemeral_admin_cert.py
Comment thread nvflare/lighter/entity.py
@pcnudde pcnudde force-pushed the feat/step-ca-ephemeral-admin-certs branch from bbab945 to 86e49e4 on June 30, 2026 21:34
pcnudde commented Jun 30, 2026

Collaborator Author

@greptileai review

codecov-commenter commented Jul 1, 2026


Codecov Report

❌ Patch coverage is 88.99083% with 60 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.77%. Comparing base (64aeae5) to head (4599ccf).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| nvflare/fuel/sec/ephemeral_admin_cert.py | 89.85% | 21 Missing ⚠️ |
| nvflare/private/fed/server/job_cmds.py | 78.84% | 11 Missing ⚠️ |
| nvflare/fuel/sec/step_ca_admin_cert.py | 91.66% | 6 Missing ⚠️ |
| nvflare/tool/package_checker/check_rule.py | 50.00% | 6 Missing ⚠️ |
| nvflare/fuel/hci/client/api.py | 92.95% | 5 Missing ⚠️ |
| nvflare/fuel/hci/client/file_transfer.py | 68.75% | 5 Missing ⚠️ |
| nvflare/lighter/entity.py | 88.00% | 3 Missing ⚠️ |
| nvflare/tool/kit/kit_config.py | 91.66% | 2 Missing ⚠️ |
| ...package_checker/nvflare_console_package_checker.py | 96.66% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4846      +/-   ##
==========================================
+ Coverage   56.53%   56.77%   +0.24%     
==========================================
  Files         969      972       +3     
  Lines       92261    92764     +503     
==========================================
+ Hits        52161    52669     +508     
+ Misses      40100    40095       -5     

| Flag | Coverage Δ |
| --- | --- |
| unit-tests | 56.77% <88.99%> (+0.24%) ⬆️ |

@pcnudde pcnudde marked this pull request as ready for review July 1, 2026 16:54
@pcnudde pcnudde marked this pull request as draft July 1, 2026 21:09
@pcnudde pcnudde marked this pull request as ready for review July 1, 2026 21:58

pcnudde commented Jul 1, 2026

Collaborator Author

Follow-up on the updated Greptile summary:

  • Fixed the server_running analyzer ambiguity in 7f12c44d8 by making the final supported-scheme branch exhaustive after the unsupported-scheme guard.
  • No change to _submitter_cert_validity returning {} when a submitter certificate is present but unreadable. That sentinel is intentional fail-closed behavior: None means there is no inspectable submitter certificate metadata to enforce, while {} makes _clone_signature_error reject cloning instead of silently bypassing certificate-validity enforcement.

Validation: tests/unit_test/tool/package_checker/ephemeral_admin_test.py (7 passed) and ./runtest.sh -s.

pcnudde commented Jul 2, 2026

Collaborator Author

Addressed the updated Greptile compatibility finding in 793b31eb5.

  • Ephemeral admin clients now mark job submissions with an internal --ephemeral-admin-cert flag.
  • The server records and enforces submitter-certificate validity only for those marked ephemeral submissions.
  • Static-admin jobs retain the historical clone behavior, while ephemeral clones still fail early after the signing certificate expires.

Validation: tests/unit_test/fuel/hci/client/test_push_folder_key_guard.py plus tests/unit_test/private/fed/server/job_cmds_test.py (103 passed), ./runtest.sh -s, and git diff --check.

pcnudde commented Jul 2, 2026

Collaborator Author

Addressed the latest Greptile summary nits in 0918d3082.

  • Missing step_ca provider_config.ca_url now reports a focused required-field error instead of an HTTPS-scheme error.
  • The server submitter-validity path now reuses the existing cert_time helper instead of duplicating it.
  • Added regression coverage for the missing-ca_url validation.

Validation: tests/unit_test/fuel/sec/ephemeral_admin_cert_test.py plus tests/unit_test/private/fed/server/job_cmds_test.py (108 passed), ./runtest.sh -s, and git diff --check.

pcnudde commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator Author

Addressed the latest Greptile summary finding in de7d0b72f.

  • The step-ca URL validator now accepts bracketed IPv6 loopback (http://[::1]:...) alongside IPv4 localhost for local development.
  • Added regression coverage for the IPv6 loopback URL.
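A URL check with the behavior described in this thread (a focused error for a missing `ca_url`, HTTPS required, plain HTTP tolerated only for loopback hosts including bracketed IPv6) might look roughly like this. The function name, the exact set of accepted loopback hosts, and the error wording are assumptions; the validator in `step_ca_admin_cert.py` may be stricter or looser.

```python
from typing import Optional
from urllib.parse import urlsplit

# Assumed loopback set; urlsplit().hostname strips the brackets from "[::1]".
_LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def validate_ca_url(ca_url: Optional[str]) -> str:
    """Return the URL unchanged if acceptable, else raise ValueError."""
    if not ca_url:
        # Report the missing field directly rather than a misleading scheme error.
        raise ValueError("provider_config.ca_url is required")
    parts = urlsplit(ca_url)
    host = parts.hostname
    if not host:
        raise ValueError(f"ca_url must include a host: {ca_url!r}")
    if parts.scheme == "https":
        return ca_url
    if parts.scheme == "http" and host in _LOOPBACK_HOSTS:
        return ca_url  # plain HTTP is tolerated only for local development
    raise ValueError(f"ca_url must use https (http is allowed only for loopback): {ca_url!r}")


# Hypothetical usage with a local development CA on an arbitrary port.
local_dev_url = validate_ca_url("http://[::1]:9000")
```

Checking for the missing value first is what keeps the two error cases distinct; without that ordering an empty string would fall through to the scheme error.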

Validation: tests/unit_test/fuel/sec/ephemeral_admin_cert_test.py (16 passed), ./runtest.sh -s, and git diff --check.

pcnudde commented Jul 2, 2026

Collaborator Author

Addressed the latest Greptile summary cleanup in a93a3592a.

  • Removed the second _command_timeout validation after _build_step_ca_command already validated the step-ca config; acquisition now converts the already-validated timeout value directly for the subprocess call.
  • No change to the RSA-2048 request policy, since making key algorithms configurable would expand the provider/security configuration surface.
  • No change to malformed submitter-certificate handling; returning {} is the intentional fail-closed sentinel that prevents an invalid ephemeral submission from bypassing clone expiry enforcement.

Validation: tests/unit_test/fuel/sec/ephemeral_admin_cert_test.py (16 passed), ./runtest.sh -s, and git diff --check.

@pcnudde pcnudde marked this pull request as draft July 2, 2026 19:31