Add Client API execution modes design docs and F3 aggregate transfer outcome#4853
Add Client API execution modes design docs and F3 aggregate transfer outcome#4853YuanTingHsieh wants to merge 6 commits into
Conversation
PR-0 + F3-1 of the Client API Execution Modes program (see docs/design/client_api_execution_modes_plan.md). Design docs: land the approved "Client API Execution Modes" design and its implementation plan in docs/design/ so subsequent program PRs can link them. F3 aggregate terminal outcome: TransactionDoneStatus.FINISHED only means a transaction reached its receiver count - a FAILED receiver still counts, and transaction_done_cb carries no per-receiver outcomes. This adds the normalized TransferOutcome next to the existing contract, additively: - compute_transfer_outcome(): COMPLETED only when every expected receiver of every ref succeeded (receiver truth wins - routine cleanup via delete_transaction after full success resolves COMPLETED); fails closed on receiver failure, missing receivers, no objects, timeout, or unknown receiver count. - new outcome_cb on new_transaction()/ObjectDownloader and a TTL-bounded outcome table behind get_transaction_outcome(), recorded before user callbacks run so pollers never see a terminated-but-unknown gap; a reused explicit tx_id purges the prior incarnation's record. - termination callbacks (obj.transaction_done, transaction_done_cb, outcome_cb) are now invoked exception-safe: a raising callback no longer kills the monitor thread or skips source release. - per-receiver status recording and terminal snapshots are lock-guarded; outcome table has its own lock and is cleared (and gated) on shutdown. - DownloadStatus/TransactionDoneStatus move to transfer_outcome.py with re-exports from download_service (import compatibility preserved). transaction_done_cb signature, TransactionDoneStatus values, and existing timeout/tombstone behavior are unchanged; existing gating tests pass unmodified. Shared DownloadService test helpers move to download_test_utils.py so the isolated-service subclass is defined once. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR lands the “Client API Execution Modes” design + implementation plan docs, and extends the F3 DownloadService with an additive, normalized terminal TransferOutcome that distinguishes “all receivers succeeded” from “transaction terminated”.
Changes:
- Add design reference docs for the Client API Execution Modes program (
client_api_execution_modes.md+ companion plan). - Introduce
transfer_outcome.pywithTransferOutcome/RefOutcomeand aggregation logic (compute_transfer_outcome()), re-exporting legacy names fromdownload_service. - Update
DownloadService/_Transactionto compute/record outcomes, add an optionaloutcome_cb, and make termination callbacks exception-safe; add comprehensive unit tests and shared test utilities.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
docs/design/client_api_execution_modes.md |
Adds the approved execution-modes design doc as in-repo reference material. |
docs/design/client_api_execution_modes_plan.md |
Adds the decomposed PR-by-PR implementation plan for the program. |
nvflare/fuel/f3/streaming/transfer_outcome.py |
New module defining normalized terminal transfer outcome and aggregation rules. |
nvflare/fuel/f3/streaming/download_service.py |
Computes/records outcomes, adds outcome_cb, callback exception-safety, and outcome polling/TTL. |
nvflare/fuel/f3/streaming/obj_downloader.py |
Threads the new optional outcome_cb through ObjectDownloader. |
tests/unit_test/fuel/f3/streaming/transfer_outcome_test.py |
New tests covering outcome aggregation, ordering, exception safety, TTL, shutdown, tx_id reuse. |
tests/unit_test/fuel/f3/streaming/download_test_utils.py |
New shared DownloadService test helpers (isolated subclass + monitor runner). |
tests/unit_test/fuel/f3/streaming/download_service_test.py |
Refactors tests to use shared helpers from download_test_utils.py. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Greptile SummaryThis PR adds two pieces: design documentation for the Client API Execution Modes program (
Confidence Score: 5/5Safe to merge. The locking strategy is consistent (no deadlock potential), the outcome aggregation logic is correct, callback exception safety is properly implemented across all three termination paths, and the tx-id reuse guard correctly prevents stale outcomes from shadowing live retries. The core outcome computation is stateless and thoroughly unit-tested. The new per-receiver-status lock in No files require special attention. The Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[transaction terminates
FINISHED / TIMEOUT / DELETED] --> B{done_status
recognized?}
B -- No --> F1[FAILED
UNKNOWN_DONE_STATUS]
B -- Yes --> C{all expected receivers
of every ref succeeded?}
C -- Yes --> OK[COMPLETED
ALL_RECEIVERS_SUCCEEDED]
C -- No --> D{done_status?}
D -- DELETED --> AB[ABORTED
DELETED]
D -- TIMEOUT --> F2[FAILED
TIMEOUT]
D -- FINISHED --> E{num_receivers
> 0?}
E -- No --> F3[FAILED
UNKNOWN_RECEIVER_COUNT]
E -- Yes --> G{refs present?}
G -- No --> F4[FAILED
NO_OBJECTS]
G -- Yes --> F5[FAILED
RECEIVER_FAILED]
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A[transaction terminates
FINISHED / TIMEOUT / DELETED] --> B{done_status
recognized?}
B -- No --> F1[FAILED
UNKNOWN_DONE_STATUS]
B -- Yes --> C{all expected receivers
of every ref succeeded?}
C -- Yes --> OK[COMPLETED
ALL_RECEIVERS_SUCCEEDED]
C -- No --> D{done_status?}
D -- DELETED --> AB[ABORTED
DELETED]
D -- TIMEOUT --> F2[FAILED
TIMEOUT]
D -- FINISHED --> E{num_receivers
> 0?}
E -- No --> F3[FAILED
UNKNOWN_RECEIVER_COUNT]
E -- Yes --> G{refs present?}
G -- No --> F4[FAILED
NO_OBJECTS]
G -- Yes --> F5[FAILED
RECEIVER_FAILED]
Reviews (6): Last reviewed commit: "Plan: TE-1 freeze is vocabulary-only; au..." | Re-trigger Greptile |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4853 +/- ##
==========================================
+ Coverage 56.20% 57.01% +0.80%
==========================================
Files 967 970 +3
Lines 91978 92371 +393
==========================================
+ Hits 51699 52668 +969
+ Misses 40279 39703 -576
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Fixes from PR NVIDIA#4853 review: - Reused tx_id race: termination removes a transaction from _tx_table under _tx_lock but records its outcome afterward; a retry registering the same tx_id in that gap could have its purged table repopulated by the old incarnation's outcome (reproduced: live retry surfaced done_status="deleted"). new_transaction now registers the current incarnation under _outcome_lock and _record_outcome drops outcomes from any transaction that is no longer the current incarnation. - Unknown done_status fail-closed: status validation now precedes the receiver-truth check, so an unknown/future termination status with successful receivers can no longer certify COMPLETED. - Tombstone snapshot: _delete_tx now uses the lock-guarded snapshot_receiver_statuses() instead of copying receiver_statuses unguarded while obj_downloaded() may mutate it. - _expire_outcomes: full scan instead of insertion-order early-break; concurrent recorders can insert slightly out of timestamp order. - num_receivers == 0 docstrings unified (unknown/unbounded, never certified finished, outcome never COMPLETED). - compute_transfer_outcome timestamp annotated Optional[float]. Adds regression tests for the incarnation race and the unknown-status-with-successful-receivers case. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Follow-up doc updates for the Client API Execution Modes program: - Configuration Surface + legacy-knob disposition table: add the executor arguments surfaced while building EX-2 (task_script_path/args, the task-name mapping powering flare.is_train()/is_evaluate()/ is_submit_model(), memory_gc_rounds/cuda_empty_cache) and note that mode-scoped args are validated at construction. - Plan F3-3 row: note the min_responses/quorum surface for fan-out (k-of-N receivers) is settled there; the TransferOutcome.completed certificate stays strictly all-receivers. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…sign - external_process mode: state its primary driver (multi-GPU via torchrun/Deepspeed/Horovod/mpirun; NVFlare shells out and talks only to rank 0 rather than reimplementing inter-rank comm), and derive from it why external_process auth is a localhost launch-token and why per-task trainer launch stays out of JobLauncherSpec. - Alternatives Considered: add "authenticate the trainer with transport-layer mTLS instead of a session token" with the ergonomics-vs-reuse tradeoff, and strengthen the JobLauncher rejection with the same rationale. - Plan: TE-1 trimmed to the stateless proof toolkit; the stateful SessionTokenManager moves to AT-2 (attach-only). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Reference Epic FLARE-2698 in the design doc status and plan header. - Configuration Surface: drop params_exchange_format/transfer_type/ server_expected_format/converter-ids from the executor; add the "no parameter converters on the executor" rationale (conversion moves to send/receive filters at the client edge; transfer type stays a Client API concern). Disposition table updated to match. - Plan: add EX-5 (converter->filter migration) as its own tracked PR. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
## What Export `get_task_name` from the `nvflare.client` package. It exists in `nvflare/client/api.py` but was never re-exported, so `flare.get_task_name()` failed despite being part of the Client API control surface. ## Program context Client API Execution Modes (2.9) — **EX-1, Wave 0** of the plan. Design: `docs/design/client_api_execution_modes.md` § "Client API Backends" / rank contract (which lists `get_task_name()` among the control-rank APIs) Plan: `docs/design/client_api_execution_modes_plan.md` (PR #4853) Depends on: none · Unblocks: nothing directly (removes a latent gap the rank contract assumes). ## Testing New `package_exports_test.py` asserts, by **identity** (`flare.X is api.X`), that all 10 control-API names resolve to the `api` module functions — so a future refactor that rebinds a name to the wrong object fails CI. 10 pass; black/isort/flake8 clean. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Reflect the NVIDIA#4856 trim: TE-1 (interface freeze #1) is now the Cell control-protocol vocabulary only. The auth mechanism is not frozen ahead of its decision — the proof helpers land with EP-3 (which owns the host-trust / auth-strength call: rendezvous-only vs one-round HMAC), the generic token/nonce/digest generators go to FLARE-3017, and the stateful SessionTokenManager stays attach-only (AT-2). Updated the TE-1, EP-3, AT-2 rows and the interface-freeze note accordingly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
## What Write the Client API config (`client_api_config.json`) owner-only on POSIX. It embeds live `AUTH_TOKEN`/`AUTH_TOKEN_SIGNATURE` and was written with the default umask — world-readable on most systems. `ClientConfig.to_json` (the single choke point for every writer: `write_config_to_file`, `ClientAPILauncherExecutor.prepare_config_for_launch`, `ExternalConfigurator`) now: - creates the file with `O_NOFOLLOW` (rejects a planted symlink at the config path); - `fchmod`s the open descriptor to `0600` — applies whether the file is newly created or pre-existing (an `O_CREAT` mode only takes effect on creation); - **fails closed**: if the mode cannot be set (e.g. a pre-existing file owned by another user), it raises rather than writing credentials into an exposed file; - on Windows, POSIX modes don't map to NTFS ACLs, so `fchmod` is skipped and this is documented honestly — protection there relies on directory ACLs. ## Program context Client API Execution Modes (2.9) — **EP-1, Wave 0** of the plan. Design: `docs/design/client_api_execution_modes.md` § "Appendix B — Bootstrap config protection" Plan: `docs/design/client_api_execution_modes_plan.md` (PR #4853) Depends on: none · Unblocks: TE-2 (bootstrap config writer reuses this). Fixes a **live exposure** on today's external_process path independent of the rest of the program. ## Behavior note (release-relevant) Deployments where an externally started trainer runs as a **different OS user** than the FL client must now explicitly re-permission the config (or run same-user) — by design per the bootstrap-config protection contract. The 3rd-party integration doc is updated with this note. ## Testing New `config_test.py`: fresh write is `0600`; pre-existing `0644` file tightened to `0600`; **fail-closed** when `fchmod` is denied (token not written); symlink target rejected and left untouched. POSIX-only assertions guarded with `skipif`. 7 new + 377 regression pass; style clean. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
What
Two pieces of the Client API Execution Modes program, combined per plan discussion:
PR-0 — program reference material. Lands the approved "Client API Execution Modes" design (
docs/design/client_api_execution_modes.md) and its implementation plan (docs/design/client_api_execution_modes_plan.md) in-repo so every subsequent program PR links them instead of re-explaining.F3-1 — aggregate terminal transfer outcome.
TransactionDoneStatus.FINISHEDonly means a transaction reached its receiver count: a receiver that FAILED still counts towardnum_receivers, andtransaction_done_cbcarries no per-receiver outcomes — so "the transaction terminated" is indistinguishable from "every receiver got the bytes". This PR adds the normalizedTransferOutcomenext to the existing contract, additively:compute_transfer_outcome(): COMPLETED only when every expected receiver of every ref succeeded — receiver truth wins, so routine cleanup viadelete_transaction()after full success resolves COMPLETED; everything else fails closed (receiver failure, missing receivers, no objects, timeout, unknown receiver count).outcome_cbonDownloadService.new_transaction()/ObjectDownloader, plus a TTL-bounded outcome table behindget_transaction_outcome(). Outcomes are recorded before user callbacks run, so pollers never observe a terminated-but-unknown gap; a reused explicittx_idpurges the prior incarnation's record (the design's retry rule reusestransfer_idastx_id).obj.transaction_done,transaction_done_cb,outcome_cb) are now invoked exception-safe: a raising callback no longer kills the transaction monitor thread or skips source release._tx_lockhot path) and is cleared and gated onshutdown().DownloadStatus/TransactionDoneStatusmove totransfer_outcome.pywith re-exports fromdownload_service— all existing imports keep working.Compatibility:
transaction_done_cbsignature,TransactionDoneStatusvalues, and existing timeout/tombstone behavior are unchanged; the pre-existing gating suites (test_download_complete_gating,test_reverse_result_upload_progress_wait,task_exchanger_stream_progress_test) pass unmodified.Program context
Client API Execution Modes (2.9) — PR-0 + F3-1, Wave 0 of the plan.
Design:
docs/design/client_api_execution_modes.md§ "Payload Lifecycle State Machine" / "Terminal transfer outcome — the first contract to build" / "CellNet Boundary and Dependencies"Plan:
docs/design/client_api_execution_modes_plan.md(interface freeze #3)Depends on: none · Unblocks: F3-2 (receiver-confirmed completion), F3-3 (per-receiver budgets), F3-4 (awaitable facade) — all parallel behind this — and, transitively, the trainer send contract (TE-4) and external_process backend (EP-4).
Design contracts implemented
TransferProgressStateterminal names — no fourth status vocabulary.Out of scope (and where it lands instead)
flare.send()/ session-close gating) — F3-4Testing
31 new unit tests (
transfer_outcome_test.py) covering the aggregation matrix, callback ordering/exception safety, tx_id-reuse purge, TTL expiry (monitor + lazy), and shutdown gating; shared DownloadService test helpers extracted todownload_test_utils.pyso the isolated-service subclass is defined once. Fulltests/unit_test/fuel+client+app_common/ccwfsweep: 1290 passed. black/isort/flake8 clean.🤖 Generated with Claude Code