Skip to content

Add Client API execution modes design docs and F3 aggregate transfer outcome#4853

Open
YuanTingHsieh wants to merge 6 commits into
NVIDIA:mainfrom
YuanTingHsieh:yuantingh/client-api-f3-transfer-outcome
Open

Add Client API execution modes design docs and F3 aggregate transfer outcome#4853
YuanTingHsieh wants to merge 6 commits into
NVIDIA:mainfrom
YuanTingHsieh:yuantingh/client-api-f3-transfer-outcome

Conversation

@YuanTingHsieh

Copy link
Copy Markdown
Collaborator

What

Two pieces of the Client API Execution Modes program, combined per plan discussion:

PR-0 — program reference material. Lands the approved "Client API Execution Modes" design (docs/design/client_api_execution_modes.md) and its implementation plan (docs/design/client_api_execution_modes_plan.md) in-repo so every subsequent program PR links them instead of re-explaining.

F3-1 — aggregate terminal transfer outcome. TransactionDoneStatus.FINISHED only means a transaction reached its receiver count: a receiver that FAILED still counts toward num_receivers, and transaction_done_cb carries no per-receiver outcomes — so "the transaction terminated" is indistinguishable from "every receiver got the bytes". This PR adds the normalized TransferOutcome next to the existing contract, additively:

  • compute_transfer_outcome(): COMPLETED only when every expected receiver of every ref succeeded — receiver truth wins, so routine cleanup via delete_transaction() after full success resolves COMPLETED; everything else fails closed (receiver failure, missing receivers, no objects, timeout, unknown receiver count).
  • New optional outcome_cb on DownloadService.new_transaction() / ObjectDownloader, plus a TTL-bounded outcome table behind get_transaction_outcome(). Outcomes are recorded before user callbacks run, so pollers never observe a terminated-but-unknown gap; a reused explicit tx_id purges the prior incarnation's record (the design's retry rule reuses transfer_id as tx_id).
  • Termination callbacks (obj.transaction_done, transaction_done_cb, outcome_cb) are now invoked exception-safe: a raising callback no longer kills the transaction monitor thread or skips source release.
  • Per-receiver status recording and terminal snapshots are lock-guarded; the outcome table has its own lock (off the chunk-serving _tx_lock hot path) and is cleared and gated on shutdown().
  • DownloadStatus / TransactionDoneStatus move to transfer_outcome.py with re-exports from download_service — all existing imports keep working.

Compatibility: transaction_done_cb signature, TransactionDoneStatus values, and existing timeout/tombstone behavior are unchanged; the pre-existing gating suites (test_download_complete_gating, test_reverse_result_upload_progress_wait, task_exchanger_stream_progress_test) pass unmodified.

Program context

Client API Execution Modes (2.9) — PR-0 + F3-1, Wave 0 of the plan.
Design: docs/design/client_api_execution_modes.md § "Payload Lifecycle State Machine" / "Terminal transfer outcome — the first contract to build" / "CellNet Boundary and Dependencies"
Plan: docs/design/client_api_execution_modes_plan.md (interface freeze #3)
Depends on: none · Unblocks: F3-2 (receiver-confirmed completion), F3-3 (per-receiver budgets), F3-4 (awaitable facade) — all parallel behind this — and, transitively, the trainer send contract (TE-4) and external_process backend (EP-4).

Design contracts implemented

  • One normalized terminal outcome per transfer, distinguishing all-receivers-success from completed-with-receiver-failure (design: "the shared payload layer must expose that distinction first").
  • Outcome status vocabulary reuses the TransferProgressState terminal names — no fourth status vocabulary.
  • Fail-closed aggregation, including the mid-assembly race (FINISHED with zero refs must not certify success).

Out of scope (and where it lands instead)

  • Receiver-confirmed terminal statuses (statuses here are producer-served at EOF) — F3-2
  • Per-receiver acquire/idle budgets and receiver identity (num_receivers is a count) — F3-3
  • The awaitable producer-facing wait (flare.send() / session-close gating) — F3-4
  • Bounded post-completion linger before producer release — F3-5

Testing

31 new unit tests (transfer_outcome_test.py) covering the aggregation matrix, callback ordering/exception safety, tx_id-reuse purge, TTL expiry (monitor + lazy), and shutdown gating; shared DownloadService test helpers extracted to download_test_utils.py so the isolated-service subclass is defined once. Full tests/unit_test/fuel + client + app_common/ccwf sweep: 1290 passed. black/isort/flake8 clean.

🤖 Generated with Claude Code

PR-0 + F3-1 of the Client API Execution Modes program (see
docs/design/client_api_execution_modes_plan.md).

Design docs: land the approved "Client API Execution Modes" design and
its implementation plan in docs/design/ so subsequent program PRs can
link them.

F3 aggregate terminal outcome: TransactionDoneStatus.FINISHED only means
a transaction reached its receiver count - a FAILED receiver still counts,
and transaction_done_cb carries no per-receiver outcomes. This adds the
normalized TransferOutcome next to the existing contract, additively:

- compute_transfer_outcome(): COMPLETED only when every expected receiver
  of every ref succeeded (receiver truth wins - routine cleanup via
  delete_transaction after full success resolves COMPLETED); fails closed
  on receiver failure, missing receivers, no objects, timeout, or unknown
  receiver count.
- new outcome_cb on new_transaction()/ObjectDownloader and a TTL-bounded
  outcome table behind get_transaction_outcome(), recorded before user
  callbacks run so pollers never see a terminated-but-unknown gap; a
  reused explicit tx_id purges the prior incarnation's record.
- termination callbacks (obj.transaction_done, transaction_done_cb,
  outcome_cb) are now invoked exception-safe: a raising callback no
  longer kills the monitor thread or skips source release.
- per-receiver status recording and terminal snapshots are lock-guarded;
  outcome table has its own lock and is cleared (and gated) on shutdown.
- DownloadStatus/TransactionDoneStatus move to transfer_outcome.py with
  re-exports from download_service (import compatibility preserved).

transaction_done_cb signature, TransactionDoneStatus values, and existing
timeout/tombstone behavior are unchanged; existing gating tests pass
unmodified. Shared DownloadService test helpers move to
download_test_utils.py so the isolated-service subclass is defined once.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings July 2, 2026 00:51

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR lands the “Client API Execution Modes” design + implementation plan docs, and extends the F3 DownloadService with an additive, normalized terminal TransferOutcome that distinguishes “all receivers succeeded” from “transaction terminated”.

Changes:

  • Add design reference docs for the Client API Execution Modes program (client_api_execution_modes.md + companion plan).
  • Introduce transfer_outcome.py with TransferOutcome/RefOutcome and aggregation logic (compute_transfer_outcome()), re-exporting legacy names from download_service.
  • Update DownloadService / _Transaction to compute/record outcomes, add an optional outcome_cb, and make termination callbacks exception-safe; add comprehensive unit tests and shared test utilities.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
docs/design/client_api_execution_modes.md Adds the approved execution-modes design doc as in-repo reference material.
docs/design/client_api_execution_modes_plan.md Adds the decomposed PR-by-PR implementation plan for the program.
nvflare/fuel/f3/streaming/transfer_outcome.py New module defining normalized terminal transfer outcome and aggregation rules.
nvflare/fuel/f3/streaming/download_service.py Computes/records outcomes, adds outcome_cb, callback exception-safety, and outcome polling/TTL.
nvflare/fuel/f3/streaming/obj_downloader.py Threads the new optional outcome_cb through ObjectDownloader.
tests/unit_test/fuel/f3/streaming/transfer_outcome_test.py New tests covering outcome aggregation, ordering, exception safety, TTL, shutdown, tx_id reuse.
tests/unit_test/fuel/f3/streaming/download_test_utils.py New shared DownloadService test helpers (isolated subclass + monitor runner).
tests/unit_test/fuel/f3/streaming/download_service_test.py Refactors tests to use shared helpers from download_test_utils.py.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread nvflare/fuel/f3/streaming/download_service.py
Comment thread nvflare/fuel/f3/streaming/download_service.py
@greptile-apps

greptile-apps Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds two pieces: design documentation for the Client API Execution Modes program (docs/design/) and the F3-1 feature — a normalized TransferOutcome layer on top of DownloadService that distinguishes all-receivers-success from mere transaction termination.

  • transfer_outcome.py (new): defines TransferOutcome, compute_transfer_outcome, and RefOutcome; moves DownloadStatus/TransactionDoneStatus here with re-exports so all existing imports keep working.
  • download_service.py: adds per-receiver-status locking in _Ref.obj_downloaded, exception-safe callback wrappers (_invoke_cb_safely), an outcome table (_tx_outcomes/_tx_incarnations) with TTL expiry, get_transaction_outcome(), and the on_outcome/outcome_cb hooks wired to all three termination paths (TIMEOUT, FINISHED, DELETED).
  • Tests (transfer_outcome_test.py, download_test_utils.py): 31 new unit tests covering the aggregation matrix, callback ordering/exception safety, tx_id reuse/race protection, TTL expiry (both monitor-driven and lazy), and shutdown gating.

Confidence Score: 5/5

Safe to merge. The locking strategy is consistent (no deadlock potential), the outcome aggregation logic is correct, callback exception safety is properly implemented across all three termination paths, and the tx-id reuse guard correctly prevents stale outcomes from shadowing live retries.

The core outcome computation is stateless and thoroughly unit-tested. The new per-receiver-status lock in _Ref.obj_downloaded correctly guards receiver_statuses and _downloaded_to_all_called while keeping user callbacks outside the lock. The outcome table has its own lock (_outcome_lock) that is never held simultaneously with _tx_lock, eliminating deadlock risk. The only notable item is a dead-code timed_out() helper that was not updated to thread on_outcome through — it cannot cause a regression today because it is never called, but would silently bypass outcome recording if resurrected.

No files require special attention. The download_service.py changes are the most complex but are well-guarded by the 31 new unit tests in transfer_outcome_test.py.

Important Files Changed

Filename Overview
nvflare/fuel/f3/streaming/transfer_outcome.py New module defining the normalized outcome types and compute_transfer_outcome. Aggregation logic is correct and well-tested: unknown done-status fails closed before receiver truth is evaluated, DELETED/TIMEOUT with full-receiver-success correctly resolve COMPLETED, and the frozen dataclasses prevent accidental mutation.
nvflare/fuel/f3/streaming/download_service.py Main change file: adds outcome table with TTL, per-receiver-status locking in _Ref.obj_downloaded, exception-safe callbacks for all three termination paths, and get_transaction_outcome(). One dead-code method (timed_out) bypasses the new on_outcome parameter but is never called.
tests/unit_test/fuel/f3/streaming/transfer_outcome_test.py 367-line test suite covering the aggregation matrix, callback ordering invariant (on_outcome before done_cb before outcome_cb), raising-callback safety, tx-id reuse purge, TTL expiry paths, and shutdown gating — comprehensive and all assertions are correct.
tests/unit_test/fuel/f3/streaming/download_test_utils.py Shared test helpers extracted from download_service_test.py. The isolated service subclass now correctly overrides all new class-level tables (_tx_outcomes, _tx_incarnations, _outcome_lock, _accept_outcomes), preventing state leakage between tests.
nvflare/fuel/f3/streaming/obj_downloader.py Minimal change — threads outcome_cb through to DownloadService.new_transaction. Docstring is accurate and the parameter is correctly forwarded.
tests/unit_test/fuel/f3/streaming/download_service_test.py Refactored to import shared helpers from download_test_utils.py; no existing test behavior changed.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[transaction terminates
FINISHED / TIMEOUT / DELETED] --> B{done_status
recognized?}
    B -- No --> F1[FAILED
UNKNOWN_DONE_STATUS]
    B -- Yes --> C{all expected receivers
of every ref succeeded?}
    C -- Yes --> OK[COMPLETED
ALL_RECEIVERS_SUCCEEDED]
    C -- No --> D{done_status?}
    D -- DELETED --> AB[ABORTED
DELETED]
    D -- TIMEOUT --> F2[FAILED
TIMEOUT]
    D -- FINISHED --> E{num_receivers
> 0?}
    E -- No --> F3[FAILED
UNKNOWN_RECEIVER_COUNT]
    E -- Yes --> G{refs present?}
    G -- No --> F4[FAILED
NO_OBJECTS]
    G -- Yes --> F5[FAILED
RECEIVER_FAILED]
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[transaction terminates
FINISHED / TIMEOUT / DELETED] --> B{done_status
recognized?}
    B -- No --> F1[FAILED
UNKNOWN_DONE_STATUS]
    B -- Yes --> C{all expected receivers
of every ref succeeded?}
    C -- Yes --> OK[COMPLETED
ALL_RECEIVERS_SUCCEEDED]
    C -- No --> D{done_status?}
    D -- DELETED --> AB[ABORTED
DELETED]
    D -- TIMEOUT --> F2[FAILED
TIMEOUT]
    D -- FINISHED --> E{num_receivers
> 0?}
    E -- No --> F3[FAILED
UNKNOWN_RECEIVER_COUNT]
    E -- Yes --> G{refs present?}
    G -- No --> F4[FAILED
NO_OBJECTS]
    G -- Yes --> F5[FAILED
RECEIVER_FAILED]
Loading

Reviews (6): Last reviewed commit: "Plan: TE-1 freeze is vocabulary-only; au..." | Re-trigger Greptile

Comment thread nvflare/fuel/f3/streaming/download_service.py
Comment thread nvflare/fuel/f3/streaming/transfer_outcome.py
@codecov-commenter

codecov-commenter commented Jul 2, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 99.27007% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 57.01%. Comparing base (1cff2b3) to head (2bd13f9).
⚠️ Report is 14 commits behind head on main.

Files with missing lines Patch % Lines
nvflare/fuel/f3/streaming/transfer_outcome.py 98.41% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4853      +/-   ##
==========================================
+ Coverage   56.20%   57.01%   +0.80%     
==========================================
  Files         967      970       +3     
  Lines       91978    92371     +393     
==========================================
+ Hits        51699    52668     +969     
+ Misses      40279    39703     -576     
Flag Coverage Δ
unit-tests 57.01% <99.27%> (+0.80%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

YuanTingHsieh and others added 2 commits July 1, 2026 18:17
Fixes from PR NVIDIA#4853 review:

- Reused tx_id race: termination removes a transaction from _tx_table
  under _tx_lock but records its outcome afterward; a retry registering
  the same tx_id in that gap could have its purged table repopulated by
  the old incarnation's outcome (reproduced: live retry surfaced
  done_status="deleted"). new_transaction now registers the current
  incarnation under _outcome_lock and _record_outcome drops outcomes
  from any transaction that is no longer the current incarnation.
- Unknown done_status fail-closed: status validation now precedes the
  receiver-truth check, so an unknown/future termination status with
  successful receivers can no longer certify COMPLETED.
- Tombstone snapshot: _delete_tx now uses the lock-guarded
  snapshot_receiver_statuses() instead of copying receiver_statuses
  unguarded while obj_downloaded() may mutate it.
- _expire_outcomes: full scan instead of insertion-order early-break;
  concurrent recorders can insert slightly out of timestamp order.
- num_receivers == 0 docstrings unified (unknown/unbounded, never
  certified finished, outcome never COMPLETED).
- compute_transfer_outcome timestamp annotated Optional[float].

Adds regression tests for the incarnation race and the
unknown-status-with-successful-receivers case.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Follow-up doc updates for the Client API Execution Modes program:

- Configuration Surface + legacy-knob disposition table: add the executor
  arguments surfaced while building EX-2 (task_script_path/args, the
  task-name mapping powering flare.is_train()/is_evaluate()/
  is_submit_model(), memory_gc_rounds/cuda_empty_cache) and note that
  mode-scoped args are validated at construction.
- Plan F3-3 row: note the min_responses/quorum surface for fan-out
  (k-of-N receivers) is settled there; the TransferOutcome.completed
  certificate stays strictly all-receivers.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
YuanTingHsieh and others added 2 commits July 2, 2026 11:54
…sign

- external_process mode: state its primary driver (multi-GPU via
  torchrun/Deepspeed/Horovod/mpirun; NVFlare shells out and talks only to
  rank 0 rather than reimplementing inter-rank comm), and derive from it why
  external_process auth is a localhost launch-token and why per-task trainer
  launch stays out of JobLauncherSpec.
- Alternatives Considered: add "authenticate the trainer with transport-layer
  mTLS instead of a session token" with the ergonomics-vs-reuse tradeoff, and
  strengthen the JobLauncher rejection with the same rationale.
- Plan: TE-1 trimmed to the stateless proof toolkit; the stateful
  SessionTokenManager moves to AT-2 (attach-only).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Reference Epic FLARE-2698 in the design doc status and plan header.
- Configuration Surface: drop params_exchange_format/transfer_type/
  server_expected_format/converter-ids from the executor; add the
  "no parameter converters on the executor" rationale (conversion moves to
  send/receive filters at the client edge; transfer type stays a Client API
  concern). Disposition table updated to match.
- Plan: add EX-5 (converter->filter migration) as its own tracked PR.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
pcnudde pushed a commit that referenced this pull request Jul 2, 2026
## What

Export `get_task_name` from the `nvflare.client` package. It exists in
`nvflare/client/api.py` but was never re-exported, so
`flare.get_task_name()` failed despite being part of the Client API
control surface.

## Program context

Client API Execution Modes (2.9) — **EX-1, Wave 0** of the plan.
Design: `docs/design/client_api_execution_modes.md` § "Client API
Backends" / rank contract (which lists `get_task_name()` among the
control-rank APIs)
Plan: `docs/design/client_api_execution_modes_plan.md` (PR #4853)
Depends on: none · Unblocks: nothing directly (removes a latent gap the
rank contract assumes).

## Testing

New `package_exports_test.py` asserts, by **identity** (`flare.X is
api.X`), that all 10 control-API names resolve to the `api` module
functions — so a future refactor that rebinds a name to the wrong object
fails CI. 10 pass; black/isort/flake8 clean.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Reflect the NVIDIA#4856 trim: TE-1 (interface freeze #1) is now the Cell
control-protocol vocabulary only. The auth mechanism is not frozen ahead
of its decision — the proof helpers land with EP-3 (which owns the
host-trust / auth-strength call: rendezvous-only vs one-round HMAC), the
generic token/nonce/digest generators go to FLARE-3017, and the stateful
SessionTokenManager stays attach-only (AT-2). Updated the TE-1, EP-3, AT-2
rows and the interface-freeze note accordingly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
pcnudde pushed a commit that referenced this pull request Jul 2, 2026
## What

Write the Client API config (`client_api_config.json`) owner-only on
POSIX. It embeds live `AUTH_TOKEN`/`AUTH_TOKEN_SIGNATURE` and was
written with the default umask — world-readable on most systems.

`ClientConfig.to_json` (the single choke point for every writer:
`write_config_to_file`,
`ClientAPILauncherExecutor.prepare_config_for_launch`,
`ExternalConfigurator`) now:
- creates the file with `O_NOFOLLOW` (rejects a planted symlink at the
config path);
- `fchmod`s the open descriptor to `0600` — applies whether the file is
newly created or pre-existing (an `O_CREAT` mode only takes effect on
creation);
- **fails closed**: if the mode cannot be set (e.g. a pre-existing file
owned by another user), it raises rather than writing credentials into
an exposed file;
- on Windows, POSIX modes don't map to NTFS ACLs, so `fchmod` is skipped
and this is documented honestly — protection there relies on directory
ACLs.

## Program context

Client API Execution Modes (2.9) — **EP-1, Wave 0** of the plan.
Design: `docs/design/client_api_execution_modes.md` § "Appendix B —
Bootstrap config protection"
Plan: `docs/design/client_api_execution_modes_plan.md` (PR #4853)
Depends on: none · Unblocks: TE-2 (bootstrap config writer reuses this).
Fixes a **live exposure** on today's external_process path independent
of the rest of the program.

## Behavior note (release-relevant)

Deployments where an externally started trainer runs as a **different OS
user** than the FL client must now explicitly re-permission the config
(or run same-user) — by design per the bootstrap-config protection
contract. The 3rd-party integration doc is updated with this note.

## Testing

New `config_test.py`: fresh write is `0600`; pre-existing `0644` file
tightened to `0600`; **fail-closed** when `fchmod` is denied (token not
written); symlink target rejected and left untouched. POSIX-only
assertions guarded with `skipif`. 7 new + 377 regression pass; style
clean.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants