Skip to content

Write Client API config files with owner-only permissions#4855

Merged
pcnudde merged 4 commits into
NVIDIA:mainfrom
YuanTingHsieh:yuantingh/client-api-ep1-config-perms
Jul 2, 2026
Merged

Write Client API config files with owner-only permissions#4855
pcnudde merged 4 commits into
NVIDIA:mainfrom
YuanTingHsieh:yuantingh/client-api-ep1-config-perms

Conversation

@YuanTingHsieh

Copy link
Copy Markdown
Collaborator

What

Write the Client API config (client_api_config.json) owner-only on POSIX. It embeds live AUTH_TOKEN/AUTH_TOKEN_SIGNATURE and was written with the default umask — world-readable on most systems.

ClientConfig.to_json (the single choke point for every writer: write_config_to_file, ClientAPILauncherExecutor.prepare_config_for_launch, ExternalConfigurator) now:

  • creates the file with O_NOFOLLOW (rejects a planted symlink at the config path);
  • fchmods the open descriptor to 0600 — applies whether the file is newly created or pre-existing (an O_CREAT mode only takes effect on creation);
  • fails closed: if the mode cannot be set (e.g. a pre-existing file owned by another user), it raises rather than writing credentials into an exposed file;
  • on Windows, POSIX modes don't map to NTFS ACLs, so fchmod is skipped and this is documented honestly — protection there relies on directory ACLs.

Program context

Client API Execution Modes (2.9) — EP-1, Wave 0 of the plan.
Design: docs/design/client_api_execution_modes.md § "Appendix B — Bootstrap config protection"
Plan: docs/design/client_api_execution_modes_plan.md (PR #4853)
Depends on: none · Unblocks: TE-2 (bootstrap config writer reuses this). Fixes a live exposure on today's external_process path independent of the rest of the program.

Behavior note (release-relevant)

Deployments where an externally started trainer runs as a different OS user than the FL client must now explicitly re-permission the config (or run same-user) — by design per the bootstrap-config protection contract. The 3rd-party integration doc is updated with this note.

Testing

New config_test.py: fresh write is 0600; pre-existing 0644 file tightened to 0600; fail-closed when fchmod is denied (token not written); symlink target rejected and left untouched. POSIX-only assertions guarded with skipif. 7 new + 377 regression pass; style clean.

🤖 Generated with Claude Code

EP-1 (Wave 0) of the Client API Execution Modes program (see
docs/design/client_api_execution_modes_plan.md on PR NVIDIA#4853).

client_api_config.json embeds live AUTH_TOKEN/AUTH_TOKEN_SIGNATURE and
was written with the default umask - world-readable on most systems.
ClientConfig.to_json (the single choke point for every writer:
write_config_to_file, ClientAPILauncherExecutor.prepare_config_for_launch,
ExternalConfigurator) now creates config files 0600 via os.open and
tightens pre-existing files with chmod before rewrite. chmod failures
degrade to a warning so limited-permission platforms (Windows) never
crash.

Behavior note: deployments that relied on a different OS user reading
client_api_config.json will need explicit permission management - by
design per the bootstrap-config protection contract (design doc
Appendix B).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings July 2, 2026 16:40

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens writing of the Client API config (client_api_config.json) to better protect embedded live auth material (e.g., AUTH_TOKEN/AUTH_TOKEN_SIGNATURE) by enforcing owner-only file permissions on POSIX, adding symlink protection, and documenting operational implications for 3rd-party integrations.

Changes:

  • Enforce POSIX owner-only permissions (0600) when persisting Client API config via ClientConfig.to_json, and reject symlink paths using O_NOFOLLOW when available.
  • Add unit tests covering fresh writes, tightening pre-existing permissions, fail-closed behavior when permissions cannot be set, and symlink rejection.
  • Document the same-OS-user requirement (or explicit re-permissioning) for externally started trainers on POSIX.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
nvflare/client/config.py Implements secure config writing (owner-only on POSIX + O_NOFOLLOW) as the central write path via ClientConfig.to_json.
tests/unit_test/client/config_test.py Adds focused unit tests validating permission behavior, fail-closed semantics, and symlink rejection.
docs/programming_guide/execution_api_type/3rd_party_integration.rst Documents the operational impact of owner-only config permissions for external trainers.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread nvflare/client/config.py Outdated
@greptile-apps

greptile-apps Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes a live credential exposure by rewriting ClientConfig.to_json to write the config file atomically via a sibling temp file (mkstemp) that is secured to 0600 with fchmod before any content is written, then renamed into place with os.replace. The approach achieves fail-closed behavior (original file is never truncated on failure), permission tightening of pre-existing world-readable files, and symlink neutralization — all in a single code path.

  • nvflare/client/config.py: to_json replaced with atomic temp-file + os.replace write; CONFIG_FILE_PERMISSION = 0o600 exported as a module constant; Windows explicitly documented as relying on directory ACLs.
  • tests/unit_test/client/config_test.py: New test file covering fresh-write permissions, permission tightening, fail-closed data preservation, symlink replacement, and a POSIX-guarded round-trip content check.
  • docs/programming_guide/execution_api_type/3rd_party_integration.rst: Added deployment note warning operators that externally started trainers must share the FL client's OS user (or the file must be explicitly re-permissioned).

Confidence Score: 5/5

Safe to merge — the atomic write and permission-hardening logic is correct, the fail-closed contract holds, and the test suite thoroughly exercises the key paths.

The core change is a well-scoped security fix: to_json now writes credentials only after the temp file is locked to 0600, and os.replace ensures the original file is never modified on failure. The logic for tracking fd ownership, closing on exception, and cleaning up the temp file is sound. The two nits are minor robustness improvements to the cleanup path that do not affect correctness under normal operation.

No files require special attention — all three changed files are straightforward and correctly implemented.

Important Files Changed

Filename Overview
nvflare/client/config.py Rewrites to_json to use atomic temp-file + os.replace pattern with fchmod(0600) on POSIX; correctly handles fail-closed, data preservation, and symlink replacement. One minor robustness nit in the cleanup path.
tests/unit_test/client/config_test.py Good coverage of fresh write, permission tightening, fail-closed with data preservation, and symlink replacement. One skipif guard references O_NOFOLLOW which is no longer used in the implementation.
docs/programming_guide/execution_api_type/3rd_party_integration.rst Adds a well-placed deployment note explaining the 0600 permission requirement and the same-OS-user constraint for externally started trainers on POSIX, plus the Windows guidance.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[to_json called] --> B["mkstemp(dir=config_dir)\ncreates temp file, returns fd"]
    B --> C{os.name == posix?}
    C -- Yes --> D["fchmod(fd, 0o600)"]
    C -- No --> E["skip fchmod\n(Windows: rely on dir ACLs)"]
    D -->|fchmod raises| F[except BaseException]
    D --> E
    E --> G["fdopen(fd, 'w')\nfd_owned = False"]
    G --> H["json.dump(config, f)"]
    H -->|write raises| F
    H --> I["os.replace(tmp_path, config_file)\natomic rename"]
    I -->|replace raises| F
    I --> J[Done — config_file is\nowner-only regular file]
    F --> K["close fd if still owned\n(fd_owned == True)"]
    K --> L["os.remove(tmp_path)\nbest-effort cleanup"]
    L --> M["raise — original\nconfig_file untouched"]
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[to_json called] --> B["mkstemp(dir=config_dir)\ncreates temp file, returns fd"]
    B --> C{os.name == posix?}
    C -- Yes --> D["fchmod(fd, 0o600)"]
    C -- No --> E["skip fchmod\n(Windows: rely on dir ACLs)"]
    D -->|fchmod raises| F[except BaseException]
    D --> E
    E --> G["fdopen(fd, 'w')\nfd_owned = False"]
    G --> H["json.dump(config, f)"]
    H -->|write raises| F
    H --> I["os.replace(tmp_path, config_file)\natomic rename"]
    I -->|replace raises| F
    I --> J[Done — config_file is\nowner-only regular file]
    F --> K["close fd if still owned\n(fd_owned == True)"]
    K --> L["os.remove(tmp_path)\nbest-effort cleanup"]
    L --> M["raise — original\nconfig_file untouched"]
Loading

Reviews (4): Last reviewed commit: "Merge branch 'main' into yuantingh/clien..." | Re-trigger Greptile

Comment thread nvflare/client/config.py Outdated
Comment thread tests/unit_test/client/config_test.py Outdated
@codecov-commenter

codecov-commenter commented Jul 2, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 90.47619% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.97%. Comparing base (022b1eb) to head (7bde0dd).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
nvflare/client/config.py 90.47% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4855   +/-   ##
=======================================
  Coverage   56.97%   56.97%           
=======================================
  Files         969      969           
  Lines       92262    92281   +19     
=======================================
+ Hits        52563    52577   +14     
- Misses      39699    39704    +5     
Flag Coverage Δ
unit-tests 56.97% <90.47%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@YuanTingHsieh

Copy link
Copy Markdown
Collaborator Author

Context for reviewers: this PR is defense-in-depth, not a complete fix for AUTH_TOKEN exposure. The same credential is also leaked via the CJ command line (ps), the server INFO log, and container/orchestrator introspection, and the token itself is a replayable non-expiring bearer credential. Those higher-value items are tracked in #4858. Recommend this land as hardening, not advertised as closing the AUTH_TOKEN vulnerability (which would point at the still-open leaks).

…lure

Review fix (PR NVIDIA#4855): the previous approach opened the target with O_TRUNC
before fchmod, so the fail-closed path (fchmod denied) had already wiped the
original file to empty — data loss even though no secret was written.

Write to a sibling temp file secured with fchmod(0600) before any content,
then os.replace() into place. On failure the original file is never touched
(no truncation), and a planted symlink at the config path is replaced by a
regular owner-only file rather than being written through to its target.

Tests updated to assert the original content survives a failed write and that
a symlink target is never written through.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@pcnudde pcnudde enabled auto-merge (squash) July 2, 2026 21:37
@pcnudde pcnudde merged commit c04cdee into NVIDIA:main Jul 2, 2026
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants