Skip to content

Add ClientAPIExecutor skeleton and backend spec#4857

Open
YuanTingHsieh wants to merge 2 commits into
NVIDIA:mainfrom
YuanTingHsieh:yuantingh/client-api-ex2-executor-skeleton
Open

Add ClientAPIExecutor skeleton and backend spec#4857
YuanTingHsieh wants to merge 2 commits into
NVIDIA:mainfrom
YuanTingHsieh:yuantingh/client-api-ex2-executor-skeleton

Conversation

@YuanTingHsieh

@YuanTingHsieh YuanTingHsieh commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

What

The public ClientAPIExecutor (at nvflare/app_common/executors/client_api_executor.py, the module path the design's example config pins) with execution_mode dispatch (in_process / external_process / attach), a ClientAPIBackendSpec ABC, and a frozen ClientAPIBackendContext handed to each backend. The three backends land in follow-up PRs; until then each mode fails the job cleanly via system_panic.

This is the frozen configuration surface for the program. Mode-scoped args are validated symmetrically — an arg set for a mode that ignores it is rejected with a clear error, not silently dropped.

Scope update (FLARE-2698 bullet 2): no param converters on the executor

Per Epic FLARE-2698 ("remove all params converter, let filters handle it"), the converter/format args are deliberately excluded from the frozen constructor: params_exchange_format, params_transfer_type, server_expected_format, from_nvflare_converter_id, to_nvflare_converter_id (also dropped from ClientAPIBackendContext). Because this is an interface freeze, freezing args we intend to delete would make their later removal a breaking change — so they're out now.

Conversion between the framework-agnostic aggregation representation (numpy) and framework-native tensors moves to send/receive filters at the client edge (tracked as EX-5 / FLARE-3015 — the converter→filter migration; the PT quantizer/dequantizer are the existing filter precedent). The executor and Cell layers pass through. Transfer type (FULL/DIFF) is not a converter — it stays a Client API concern (model_registry), decided separately.

The frozen surface still carries the load-bearing args both legacy executors expose: task_script_path/task_script_args (in_process entry point), the task-name mapping (train/evaluate/submit_model_task_name + train_with_evaluation, powering flare.is_train()/is_evaluate()/is_submit_model()), and memory_gc_rounds/cuda_empty_cache.

Program context

Client API Execution Modes (2.9) — EX-2, Wave 0, interface freeze #2 (Epic FLARE-2698 / Story FLARE-3010).
Design: docs/design/client_api_execution_modes.md § "What We Propose" / "Execution Modes" / "Configuration Surface".
Unblocks: EX-3 (in_process backend), EP-4 (external_process backend), AT-2 (attach backend).

Design contracts implemented

Analytics ownership moves to the executor: local path fires the un-prefixed event + ConvertToFedEvent (today's in-process behavior); fed path fires fed.analytix_log_stats (today's MetricRelay behavior). UnsafeJobError propagates out of execute() so ClientRunner's UNSAFE_JOB handling still fires. No existing class is modified.

Testing

83 unit tests: full mode-scoped validation matrix, per-mode clean-panic dispatch, execute-no-hang, backend-context plumbing, both analytics fire paths, UnsafeJobError propagation vs UnsafeComponentErrorEXECUTION_EXCEPTION, and a surface-freeze test pinning the constructor arg list + order. black/isort/flake8 clean.

🤖 Generated with Claude Code

EX-2 (Wave 0, interface freeze #2) of the Client API Execution Modes
program (see docs/design/client_api_execution_modes_plan.md on PR NVIDIA#4853).

New public ClientAPIExecutor at nvflare/app_common/executors/
client_api_executor.py with execution_mode dispatch (in_process /
external_process / attach), plus a ClientAPIBackendSpec ABC and a frozen
ClientAPIBackendContext the executor hands each backend (config + a
back-reference for analytics). The three backends land in follow-up PRs
(in_process, external_process, attach); until then each mode fails the
job cleanly via system_panic rather than hanging.

The constructor is the frozen configuration surface for the program:
mode-scoped args are validated symmetrically (an arg set for a mode that
ignores it is rejected with a clear error, not silently dropped). Beyond
the design's V1 Configuration Surface list it also carries
task_script_path/task_script_args (in_process trainer entry point),
train/evaluate/submit_model_task_name + train_with_evaluation (power
flare.is_train()/is_evaluate()/is_submit_model()), and memory_gc_rounds/
cuda_empty_cache, all forwarded from today's InProcessClientAPIExecutor/
ScriptRunner so existing jobs map without loss. The design doc's
Configuration Surface and disposition table are updated to match.

Analytics: the executor owns LOG-to-analytics conversion; the local path
fires the un-prefixed event + ConvertToFedEvent (today's in-process
behavior), the fed path fires "fed.analytix_log_stats" (today's
MetricRelay behavior). UnsafeJobError propagates out of execute() so
ClientRunner's UNSAFE_JOB handling still fires. No existing class is
modified.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings July 2, 2026 16:40

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces the new public ClientAPIExecutor entry point and its internal backend contract as an interface-freeze skeleton for the “Client API Execution Modes” program. It establishes a frozen constructor/configuration surface, dispatch scaffolding for in_process / external_process / attach, and a backend context/spec that follow-up PRs will implement.

Changes:

  • Add ClientAPIExecutor skeleton with execution-mode validation, backend creation hooks, clean failure behavior (system_panic on START_RUN), and executor-owned analytics event firing.
  • Add internal backend contract types: ClientAPIBackendSpec ABC and frozen ClientAPIBackendContext snapshot handed to backends.
  • Add a comprehensive unit test suite that pins constructor signature/order/defaults and validates mode-scoped argument handling and failure semantics.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
nvflare/app_common/executors/client_api_executor.py Adds the public executor skeleton with frozen configuration surface, backend dispatch hooks, and analytics ownership.
nvflare/app_common/executors/client_api/backend_spec.py Introduces the internal backend interface (ClientAPIBackendSpec) and frozen backend context snapshot (ClientAPIBackendContext).
nvflare/app_common/executors/client_api/__init__.py Marks the new client_api package.
tests/unit_test/app_common/executors/client_api_executor_test.py Adds unit tests for validation matrix, dispatch failure behavior, analytics event firing paths, and surface-freeze signature pinning.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +52 to +60
The fields mirror the frozen ``ClientAPIExecutor`` constructor surface one-to-one. ``executor``
is a back-reference so a backend can:

- call ``executor.fire_log_analytics(fl_ctx, dxo)`` for every trainer LOG message (the single
LOG-to-analytics ownership point; see design "Configuration Surface"), and
- select the federation-scoped analytics path when appropriate by setting
``executor._analytics_fire_fed_event = True`` in ``initialize()`` (Cell backends do this when
no ConvertToFedEvent widget is configured), and
- use the executor's FLComponent logging helpers.
Comment on lines +79 to +89
# data / params
params_exchange_format: str = "numpy"
params_transfer_type: str = "FULL"
server_expected_format: str = "numpy"
from_nvflare_converter_id: Optional[str] = None
to_nvflare_converter_id: Optional[str] = None
# task-name / rank contract (all modes)
train_task_name: str = "train"
evaluate_task_name: str = "validate"
submit_model_task_name: str = "submit_model"
train_with_evaluation: bool = False
@greptile-apps

greptile-apps Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR introduces the unified ClientAPIExecutor skeleton, ClientAPIBackendSpec ABC, and frozen ClientAPIBackendContext dataclass as the interface-freeze foundation for the Client API execution-modes redesign (plan EX-2). No existing class is modified; the three backends land in follow-up PRs.

  • Mode dispatch skeleton: The executor validates all 25 constructor parameters, rejects mode-scoped args set in the wrong mode (while accepting frozen defaults), and fails cleanly via system_panic for all three unimplemented backends instead of hanging.
  • Backend contract: ClientAPIBackendSpec defines a strict lifecycle (initialize → execute → handle_event → finalize), with documented self-unwind requirements for partial-initialization failures, and ClientAPIBackendContext provides a frozen config snapshot plus executor back-reference.
  • Analytics ownership: fire_log_analytics() centralises LOG-to-analytics event firing with two selectable paths (local + ConvertToFedEvent vs. direct federation-scoped), verified by the test suite against the exact "fed.analytix_log_stats" string.

Confidence Score: 5/5

Safe to merge — this is an additive skeleton with no modifications to existing classes, comprehensive test coverage, and clean failure semantics for all three unimplemented backends.

The change introduces four new files (no existing code is touched). The constructor validation matrix is exhaustive and the execute/handle_event paths have well-defined failure modes: backends that fail to initialize cause an immediate system_panic rather than hanging; tasks dispatched without a backend get an EXECUTION_EXCEPTION reply. The 83-test suite pins the frozen constructor surface, both analytics fire paths, and UnsafeJobError propagation. The three open findings from prior review threads are minor type-normalization and internal-API design concerns that do not affect the skeleton's correctness.

No files require special attention. The main executor file has a known double-call to _backend_registry() in the error path (flagged in a prior review thread) that is worth cleaning up when the first backend PR lands.

Important Files Changed

Filename Overview
nvflare/app_common/executors/client_api/init.py New package init, intentionally empty — the module is internal-only and backend_spec is imported by fully-qualified path.
nvflare/app_common/executors/client_api/backend_spec.py Introduces ClientAPIBackendContext (frozen dataclass) and ClientAPIBackendSpec (ABC). TYPE_CHECKING-guarded import correctly avoids the circular dependency with client_api_executor at runtime. Lifecycle contracts are thoroughly documented, including the self-unwind requirement on initialize().
nvflare/app_common/executors/client_api_executor.py New ClientAPIExecutor with full mode-scoped validation, analytics ownership via fire_log_analytics(), correct UnsafeJobError propagation, and clean system_panic-on-unimplemented dispatch. The _backend_registry() double-call in the error-message path (already noted in prior review) is the only structural rough edge.
tests/unit_test/app_common/executors/client_api_executor_test.py 83-test suite covers the full validation matrix, per-mode panic dispatch, backend lifecycle plumbing, both analytics fire paths, UnsafeJobError propagation, and a surface-freeze test pinning all 25 constructor args and defaults. Coverage is comprehensive for a skeleton.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant FR as NVFlare Framework
    participant EX as ClientAPIExecutor
    participant BE as ClientAPIBackendSpec

    FR->>EX: handle_event(START_RUN)
    EX->>EX: super().handle_event(START_RUN)
    EX->>EX: _create_backend()
    alt NotImplementedError
        EX->>FR: system_panic(mode not yet implemented)
    else initialize raises
        EX->>BE: initialize(ClientAPIBackendContext, fl_ctx)
        BE-->>EX: raises
        EX->>FR: system_panic(initialization failed)
    else success
        EX->>BE: initialize(ClientAPIBackendContext, fl_ctx)
        BE-->>EX: ok
    end

    loop Per task
        FR->>EX: execute(task_name, shareable, fl_ctx, signal)
        alt backend is None
            EX-->>FR: make_reply(EXECUTION_EXCEPTION)
        else UnsafeJobError
            EX->>BE: execute(task_name, shareable, fl_ctx, signal)
            BE-->>EX: UnsafeJobError
            EX-->>FR: UnsafeJobError propagated
        else backend exception or bad result
            EX->>BE: execute(task_name, shareable, fl_ctx, signal)
            BE-->>EX: exception or non-Shareable
            EX-->>FR: make_reply(EXECUTION_EXCEPTION)
        else success
            EX->>BE: execute(task_name, shareable, fl_ctx, signal)
            BE-->>EX: result Shareable
            EX-->>FR: result Shareable
        end
    end

    FR->>EX: handle_event(OTHER_EVENT)
    EX->>BE: handle_event(OTHER_EVENT, fl_ctx)
    EX->>EX: super().handle_event(OTHER_EVENT)

    FR->>EX: handle_event(END_RUN)
    EX->>BE: finalize(fl_ctx)
    EX->>EX: super().handle_event(END_RUN)
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant FR as NVFlare Framework
    participant EX as ClientAPIExecutor
    participant BE as ClientAPIBackendSpec

    FR->>EX: handle_event(START_RUN)
    EX->>EX: super().handle_event(START_RUN)
    EX->>EX: _create_backend()
    alt NotImplementedError
        EX->>FR: system_panic(mode not yet implemented)
    else initialize raises
        EX->>BE: initialize(ClientAPIBackendContext, fl_ctx)
        BE-->>EX: raises
        EX->>FR: system_panic(initialization failed)
    else success
        EX->>BE: initialize(ClientAPIBackendContext, fl_ctx)
        BE-->>EX: ok
    end

    loop Per task
        FR->>EX: execute(task_name, shareable, fl_ctx, signal)
        alt backend is None
            EX-->>FR: make_reply(EXECUTION_EXCEPTION)
        else UnsafeJobError
            EX->>BE: execute(task_name, shareable, fl_ctx, signal)
            BE-->>EX: UnsafeJobError
            EX-->>FR: UnsafeJobError propagated
        else backend exception or bad result
            EX->>BE: execute(task_name, shareable, fl_ctx, signal)
            BE-->>EX: exception or non-Shareable
            EX-->>FR: make_reply(EXECUTION_EXCEPTION)
        else success
            EX->>BE: execute(task_name, shareable, fl_ctx, signal)
            BE-->>EX: result Shareable
            EX-->>FR: result Shareable
        end
    end

    FR->>EX: handle_event(OTHER_EVENT)
    EX->>BE: handle_event(OTHER_EVENT, fl_ctx)
    EX->>EX: super().handle_event(OTHER_EVENT)

    FR->>EX: handle_event(END_RUN)
    EX->>BE: finalize(fl_ctx)
    EX->>EX: super().handle_event(END_RUN)
Loading

Reviews (2): Last reviewed commit: "EX-2: drop param converters from the fro..." | Re-trigger Greptile

Comment on lines +250 to +274
self._execution_mode = execution_mode
self._command = command
self._task_script_path = task_script_path
self._task_script_args = task_script_args
self._launch_once = launch_once
self._launch_timeout = launch_timeout
self._shutdown_timeout = shutdown_timeout
self._stop_grace_period = stop_grace_period
self._heartbeat_interval = heartbeat_interval
self._heartbeat_timeout = heartbeat_timeout
self._task_wait_timeout = task_wait_timeout
self._result_wait_timeout = result_wait_timeout
self._params_exchange_format = params_exchange_format
self._params_transfer_type = params_transfer_type
self._server_expected_format = server_expected_format
self._from_nvflare_converter_id = from_nvflare_converter_id
self._to_nvflare_converter_id = to_nvflare_converter_id
self._train_task_name = train_task_name
self._evaluate_task_name = evaluate_task_name
self._submit_model_task_name = submit_model_task_name
self._train_with_evaluation = train_with_evaluation
self._memory_gc_rounds = memory_gc_rounds
self._cuda_empty_cache = cuda_empty_cache
self._attach_timeout = attach_timeout
self._allow_reconnect = allow_reconnect

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 allow_reconnect falsy non-bool values are accepted by the wrong-mode guard but stored and forwarded to ClientAPIBackendContext without normalization. When allow_reconnect=None or allow_reconnect=0 is passed (in any mode), self._allow_reconnect holds the non-bool value, and ClientAPIBackendContext.allow_reconnect: bool receives it. A backend in a follow-up PR checking context.allow_reconnect == False or isinstance(context.allow_reconnect, bool) would silently get the wrong answer. Since the comment already calls out the intentional truthy check, normalizing to bool here would preserve the guard semantics while ensuring the field always satisfies its type annotation.

Suggested change
self._execution_mode = execution_mode
self._command = command
self._task_script_path = task_script_path
self._task_script_args = task_script_args
self._launch_once = launch_once
self._launch_timeout = launch_timeout
self._shutdown_timeout = shutdown_timeout
self._stop_grace_period = stop_grace_period
self._heartbeat_interval = heartbeat_interval
self._heartbeat_timeout = heartbeat_timeout
self._task_wait_timeout = task_wait_timeout
self._result_wait_timeout = result_wait_timeout
self._params_exchange_format = params_exchange_format
self._params_transfer_type = params_transfer_type
self._server_expected_format = server_expected_format
self._from_nvflare_converter_id = from_nvflare_converter_id
self._to_nvflare_converter_id = to_nvflare_converter_id
self._train_task_name = train_task_name
self._evaluate_task_name = evaluate_task_name
self._submit_model_task_name = submit_model_task_name
self._train_with_evaluation = train_with_evaluation
self._memory_gc_rounds = memory_gc_rounds
self._cuda_empty_cache = cuda_empty_cache
self._attach_timeout = attach_timeout
self._allow_reconnect = allow_reconnect
self._execution_mode = execution_mode
self._command = command
self._task_script_path = task_script_path
self._task_script_args = task_script_args
self._launch_once = launch_once
self._launch_timeout = launch_timeout
self._shutdown_timeout = shutdown_timeout
self._stop_grace_period = stop_grace_period
self._heartbeat_interval = heartbeat_interval
self._heartbeat_timeout = heartbeat_timeout
self._task_wait_timeout = task_wait_timeout
self._result_wait_timeout = result_wait_timeout
self._params_exchange_format = params_exchange_format
self._params_transfer_type = params_transfer_type
self._server_expected_format = server_expected_format
self._from_nvflare_converter_id = from_nvflare_converter_id
self._to_nvflare_converter_id = to_nvflare_converter_id
self._train_task_name = train_task_name
self._evaluate_task_name = evaluate_task_name
self._submit_model_task_name = submit_model_task_name
self._train_with_evaluation = train_with_evaluation
self._memory_gc_rounds = memory_gc_rounds
self._cuda_empty_cache = cuda_empty_cache
self._attach_timeout = attach_timeout
self._allow_reconnect = bool(allow_reconnect)

Comment on lines +453 to +462
def _create_backend(self) -> ClientAPIBackendSpec:
factory = self._backend_registry().get(self._execution_mode)
if factory is None:
# Unreachable via the public constructor (execution_mode is validated there);
# guards subclasses that override _backend_registry().
raise ValueError(
f"no backend factory registered for execution_mode '{self._execution_mode}': "
f"registered modes are {list(self._backend_registry().keys())}"
)
return factory()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 _backend_registry() is called twice: once to fetch the factory and once inside the error-message f-string. For the base class this only rebuilds a cheap dict, but the comment above explicitly frames this as a subclass extension point ("guards subclasses that override _backend_registry()"), making the double-call observable for any override with state or side-effects. Storing the result in a local variable eliminates the redundant call and the risk.

Suggested change
def _create_backend(self) -> ClientAPIBackendSpec:
factory = self._backend_registry().get(self._execution_mode)
if factory is None:
# Unreachable via the public constructor (execution_mode is validated there);
# guards subclasses that override _backend_registry().
raise ValueError(
f"no backend factory registered for execution_mode '{self._execution_mode}': "
f"registered modes are {list(self._backend_registry().keys())}"
)
return factory()
def _create_backend(self) -> ClientAPIBackendSpec:
registry = self._backend_registry()
factory = registry.get(self._execution_mode)
if factory is None:
# Unreachable via the public constructor (execution_mode is validated there);
# guards subclasses that override _backend_registry().
raise ValueError(
f"no backend factory registered for execution_mode '{self._execution_mode}': "
f"registered modes are {list(registry.keys())}"
)
return factory()

Comment on lines +82 to +85
server_expected_format: str = "numpy"
from_nvflare_converter_id: Optional[str] = None
to_nvflare_converter_id: Optional[str] = None
# task-name / rank contract (all modes)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Private attribute documented as a backend protocol

The docstring canonizes executor._analytics_fire_fed_event = True as the supported way for Cell backends to switch analytics paths — but _analytics_fire_fed_event is a private attribute with no accessor. Future refactors or renames of that attribute won't show up in any public API contract check, and a backend implementer may not find it via normal API exploration. Consider exposing a small public setter (e.g., executor.use_fed_analytics()) so the contract is part of the executor's public surface and can be tracked through the interface-freeze test.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@codecov-commenter

codecov-commenter commented Jul 2, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 95.36082% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.63%. Comparing base (1cff2b3) to head (9b0d045).
⚠️ Report is 10 commits behind head on main.

Files with missing lines Patch % Lines
...vflare/app_common/executors/client_api_executor.py 96.75% 5 Missing ⚠️
...re/app_common/executors/client_api/backend_spec.py 90.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4857      +/-   ##
==========================================
+ Coverage   56.20%   56.63%   +0.42%     
==========================================
  Files         967      971       +4     
  Lines       91978    92455     +477     
==========================================
+ Hits        51699    52364     +665     
+ Misses      40279    40091     -188     
Flag Coverage Δ
unit-tests 56.63% <95.36%> (+0.42%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

FLARE-2698 bullet 2 removes params converters in favor of filters. Since
EX-2 is an interface freeze, freezing args we intend to delete would make
their later removal a breaking change — so exclude them now.

Removed from the ClientAPIExecutor constructor: params_exchange_format,
params_transfer_type, server_expected_format, from_nvflare_converter_id,
to_nvflare_converter_id (and from ClientAPIBackendContext). Param conversion
between the framework-agnostic aggregation representation (numpy) and the
framework-native training representation moves to send/receive filters at
the client edge (tracked as EX-5 / the converter->filter migration); the
executor and Cell layers pass through. Transfer type FULL/DIFF is not a
converter and stays a Client API (model_registry) concern.

Surface-freeze test updated to the trimmed arg list. 83 tests pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants