Add ClientAPIExecutor skeleton and backend spec#4857
Conversation
EX-2 (Wave 0, interface freeze #2) of the Client API Execution Modes program (see docs/design/client_api_execution_modes_plan.md on PR NVIDIA#4853). New public ClientAPIExecutor at nvflare/app_common/executors/ client_api_executor.py with execution_mode dispatch (in_process / external_process / attach), plus a ClientAPIBackendSpec ABC and a frozen ClientAPIBackendContext the executor hands each backend (config + a back-reference for analytics). The three backends land in follow-up PRs (in_process, external_process, attach); until then each mode fails the job cleanly via system_panic rather than hanging. The constructor is the frozen configuration surface for the program: mode-scoped args are validated symmetrically (an arg set for a mode that ignores it is rejected with a clear error, not silently dropped). Beyond the design's V1 Configuration Surface list it also carries task_script_path/task_script_args (in_process trainer entry point), train/evaluate/submit_model_task_name + train_with_evaluation (power flare.is_train()/is_evaluate()/is_submit_model()), and memory_gc_rounds/ cuda_empty_cache, all forwarded from today's InProcessClientAPIExecutor/ ScriptRunner so existing jobs map without loss. The design doc's Configuration Surface and disposition table are updated to match. Analytics: the executor owns LOG-to-analytics conversion; the local path fires the un-prefixed event + ConvertToFedEvent (today's in-process behavior), the fed path fires "fed.analytix_log_stats" (today's MetricRelay behavior). UnsafeJobError propagates out of execute() so ClientRunner's UNSAFE_JOB handling still fires. No existing class is modified. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR introduces the new public ClientAPIExecutor entry point and its internal backend contract as an interface-freeze skeleton for the “Client API Execution Modes” program. It establishes a frozen constructor/configuration surface, dispatch scaffolding for in_process / external_process / attach, and a backend context/spec that follow-up PRs will implement.
Changes:
- Add
ClientAPIExecutorskeleton with execution-mode validation, backend creation hooks, clean failure behavior (system_panicon START_RUN), and executor-owned analytics event firing. - Add internal backend contract types:
ClientAPIBackendSpecABC and frozenClientAPIBackendContextsnapshot handed to backends. - Add a comprehensive unit test suite that pins constructor signature/order/defaults and validates mode-scoped argument handling and failure semantics.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
nvflare/app_common/executors/client_api_executor.py |
Adds the public executor skeleton with frozen configuration surface, backend dispatch hooks, and analytics ownership. |
nvflare/app_common/executors/client_api/backend_spec.py |
Introduces the internal backend interface (ClientAPIBackendSpec) and frozen backend context snapshot (ClientAPIBackendContext). |
nvflare/app_common/executors/client_api/__init__.py |
Marks the new client_api package. |
tests/unit_test/app_common/executors/client_api_executor_test.py |
Adds unit tests for validation matrix, dispatch failure behavior, analytics event firing paths, and surface-freeze signature pinning. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| The fields mirror the frozen ``ClientAPIExecutor`` constructor surface one-to-one. ``executor`` | ||
| is a back-reference so a backend can: | ||
|
|
||
| - call ``executor.fire_log_analytics(fl_ctx, dxo)`` for every trainer LOG message (the single | ||
| LOG-to-analytics ownership point; see design "Configuration Surface"), and | ||
| - select the federation-scoped analytics path when appropriate by setting | ||
| ``executor._analytics_fire_fed_event = True`` in ``initialize()`` (Cell backends do this when | ||
| no ConvertToFedEvent widget is configured), and | ||
| - use the executor's FLComponent logging helpers. |
| # data / params | ||
| params_exchange_format: str = "numpy" | ||
| params_transfer_type: str = "FULL" | ||
| server_expected_format: str = "numpy" | ||
| from_nvflare_converter_id: Optional[str] = None | ||
| to_nvflare_converter_id: Optional[str] = None | ||
| # task-name / rank contract (all modes) | ||
| train_task_name: str = "train" | ||
| evaluate_task_name: str = "validate" | ||
| submit_model_task_name: str = "submit_model" | ||
| train_with_evaluation: bool = False |
Greptile SummaryThis PR introduces the unified
Confidence Score: 5/5Safe to merge — this is an additive skeleton with no modifications to existing classes, comprehensive test coverage, and clean failure semantics for all three unimplemented backends. The change introduces four new files (no existing code is touched). The constructor validation matrix is exhaustive and the execute/handle_event paths have well-defined failure modes: backends that fail to initialize cause an immediate system_panic rather than hanging; tasks dispatched without a backend get an EXECUTION_EXCEPTION reply. The 83-test suite pins the frozen constructor surface, both analytics fire paths, and UnsafeJobError propagation. The three open findings from prior review threads are minor type-normalization and internal-API design concerns that do not affect the skeleton's correctness. No files require special attention. The main executor file has a known double-call to _backend_registry() in the error path (flagged in a prior review thread) that is worth cleaning up when the first backend PR lands. Important Files Changed
Sequence Diagram%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant FR as NVFlare Framework
participant EX as ClientAPIExecutor
participant BE as ClientAPIBackendSpec
FR->>EX: handle_event(START_RUN)
EX->>EX: super().handle_event(START_RUN)
EX->>EX: _create_backend()
alt NotImplementedError
EX->>FR: system_panic(mode not yet implemented)
else initialize raises
EX->>BE: initialize(ClientAPIBackendContext, fl_ctx)
BE-->>EX: raises
EX->>FR: system_panic(initialization failed)
else success
EX->>BE: initialize(ClientAPIBackendContext, fl_ctx)
BE-->>EX: ok
end
loop Per task
FR->>EX: execute(task_name, shareable, fl_ctx, signal)
alt backend is None
EX-->>FR: make_reply(EXECUTION_EXCEPTION)
else UnsafeJobError
EX->>BE: execute(task_name, shareable, fl_ctx, signal)
BE-->>EX: UnsafeJobError
EX-->>FR: UnsafeJobError propagated
else backend exception or bad result
EX->>BE: execute(task_name, shareable, fl_ctx, signal)
BE-->>EX: exception or non-Shareable
EX-->>FR: make_reply(EXECUTION_EXCEPTION)
else success
EX->>BE: execute(task_name, shareable, fl_ctx, signal)
BE-->>EX: result Shareable
EX-->>FR: result Shareable
end
end
FR->>EX: handle_event(OTHER_EVENT)
EX->>BE: handle_event(OTHER_EVENT, fl_ctx)
EX->>EX: super().handle_event(OTHER_EVENT)
FR->>EX: handle_event(END_RUN)
EX->>BE: finalize(fl_ctx)
EX->>EX: super().handle_event(END_RUN)
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
participant FR as NVFlare Framework
participant EX as ClientAPIExecutor
participant BE as ClientAPIBackendSpec
FR->>EX: handle_event(START_RUN)
EX->>EX: super().handle_event(START_RUN)
EX->>EX: _create_backend()
alt NotImplementedError
EX->>FR: system_panic(mode not yet implemented)
else initialize raises
EX->>BE: initialize(ClientAPIBackendContext, fl_ctx)
BE-->>EX: raises
EX->>FR: system_panic(initialization failed)
else success
EX->>BE: initialize(ClientAPIBackendContext, fl_ctx)
BE-->>EX: ok
end
loop Per task
FR->>EX: execute(task_name, shareable, fl_ctx, signal)
alt backend is None
EX-->>FR: make_reply(EXECUTION_EXCEPTION)
else UnsafeJobError
EX->>BE: execute(task_name, shareable, fl_ctx, signal)
BE-->>EX: UnsafeJobError
EX-->>FR: UnsafeJobError propagated
else backend exception or bad result
EX->>BE: execute(task_name, shareable, fl_ctx, signal)
BE-->>EX: exception or non-Shareable
EX-->>FR: make_reply(EXECUTION_EXCEPTION)
else success
EX->>BE: execute(task_name, shareable, fl_ctx, signal)
BE-->>EX: result Shareable
EX-->>FR: result Shareable
end
end
FR->>EX: handle_event(OTHER_EVENT)
EX->>BE: handle_event(OTHER_EVENT, fl_ctx)
EX->>EX: super().handle_event(OTHER_EVENT)
FR->>EX: handle_event(END_RUN)
EX->>BE: finalize(fl_ctx)
EX->>EX: super().handle_event(END_RUN)
Reviews (2): Last reviewed commit: "EX-2: drop param converters from the fro..." | Re-trigger Greptile |
| self._execution_mode = execution_mode | ||
| self._command = command | ||
| self._task_script_path = task_script_path | ||
| self._task_script_args = task_script_args | ||
| self._launch_once = launch_once | ||
| self._launch_timeout = launch_timeout | ||
| self._shutdown_timeout = shutdown_timeout | ||
| self._stop_grace_period = stop_grace_period | ||
| self._heartbeat_interval = heartbeat_interval | ||
| self._heartbeat_timeout = heartbeat_timeout | ||
| self._task_wait_timeout = task_wait_timeout | ||
| self._result_wait_timeout = result_wait_timeout | ||
| self._params_exchange_format = params_exchange_format | ||
| self._params_transfer_type = params_transfer_type | ||
| self._server_expected_format = server_expected_format | ||
| self._from_nvflare_converter_id = from_nvflare_converter_id | ||
| self._to_nvflare_converter_id = to_nvflare_converter_id | ||
| self._train_task_name = train_task_name | ||
| self._evaluate_task_name = evaluate_task_name | ||
| self._submit_model_task_name = submit_model_task_name | ||
| self._train_with_evaluation = train_with_evaluation | ||
| self._memory_gc_rounds = memory_gc_rounds | ||
| self._cuda_empty_cache = cuda_empty_cache | ||
| self._attach_timeout = attach_timeout | ||
| self._allow_reconnect = allow_reconnect |
There was a problem hiding this comment.
allow_reconnect falsy non-bool values are accepted by the wrong-mode guard but stored and forwarded to ClientAPIBackendContext without normalization. When allow_reconnect=None or allow_reconnect=0 is passed (in any mode), self._allow_reconnect holds the non-bool value, and ClientAPIBackendContext.allow_reconnect: bool receives it. A backend in a follow-up PR checking context.allow_reconnect == False or isinstance(context.allow_reconnect, bool) would silently get the wrong answer. Since the comment already calls out the intentional truthy check, normalizing to bool here would preserve the guard semantics while ensuring the field always satisfies its type annotation.
| self._execution_mode = execution_mode | |
| self._command = command | |
| self._task_script_path = task_script_path | |
| self._task_script_args = task_script_args | |
| self._launch_once = launch_once | |
| self._launch_timeout = launch_timeout | |
| self._shutdown_timeout = shutdown_timeout | |
| self._stop_grace_period = stop_grace_period | |
| self._heartbeat_interval = heartbeat_interval | |
| self._heartbeat_timeout = heartbeat_timeout | |
| self._task_wait_timeout = task_wait_timeout | |
| self._result_wait_timeout = result_wait_timeout | |
| self._params_exchange_format = params_exchange_format | |
| self._params_transfer_type = params_transfer_type | |
| self._server_expected_format = server_expected_format | |
| self._from_nvflare_converter_id = from_nvflare_converter_id | |
| self._to_nvflare_converter_id = to_nvflare_converter_id | |
| self._train_task_name = train_task_name | |
| self._evaluate_task_name = evaluate_task_name | |
| self._submit_model_task_name = submit_model_task_name | |
| self._train_with_evaluation = train_with_evaluation | |
| self._memory_gc_rounds = memory_gc_rounds | |
| self._cuda_empty_cache = cuda_empty_cache | |
| self._attach_timeout = attach_timeout | |
| self._allow_reconnect = allow_reconnect | |
| self._execution_mode = execution_mode | |
| self._command = command | |
| self._task_script_path = task_script_path | |
| self._task_script_args = task_script_args | |
| self._launch_once = launch_once | |
| self._launch_timeout = launch_timeout | |
| self._shutdown_timeout = shutdown_timeout | |
| self._stop_grace_period = stop_grace_period | |
| self._heartbeat_interval = heartbeat_interval | |
| self._heartbeat_timeout = heartbeat_timeout | |
| self._task_wait_timeout = task_wait_timeout | |
| self._result_wait_timeout = result_wait_timeout | |
| self._params_exchange_format = params_exchange_format | |
| self._params_transfer_type = params_transfer_type | |
| self._server_expected_format = server_expected_format | |
| self._from_nvflare_converter_id = from_nvflare_converter_id | |
| self._to_nvflare_converter_id = to_nvflare_converter_id | |
| self._train_task_name = train_task_name | |
| self._evaluate_task_name = evaluate_task_name | |
| self._submit_model_task_name = submit_model_task_name | |
| self._train_with_evaluation = train_with_evaluation | |
| self._memory_gc_rounds = memory_gc_rounds | |
| self._cuda_empty_cache = cuda_empty_cache | |
| self._attach_timeout = attach_timeout | |
| self._allow_reconnect = bool(allow_reconnect) |
| def _create_backend(self) -> ClientAPIBackendSpec: | ||
| factory = self._backend_registry().get(self._execution_mode) | ||
| if factory is None: | ||
| # Unreachable via the public constructor (execution_mode is validated there); | ||
| # guards subclasses that override _backend_registry(). | ||
| raise ValueError( | ||
| f"no backend factory registered for execution_mode '{self._execution_mode}': " | ||
| f"registered modes are {list(self._backend_registry().keys())}" | ||
| ) | ||
| return factory() |
There was a problem hiding this comment.
_backend_registry() is called twice: once to fetch the factory and once inside the error-message f-string. For the base class this only rebuilds a cheap dict, but the comment above explicitly frames this as a subclass extension point ("guards subclasses that override _backend_registry()"), making the double-call observable for any override with state or side-effects. Storing the result in a local variable eliminates the redundant call and the risk.
| def _create_backend(self) -> ClientAPIBackendSpec: | |
| factory = self._backend_registry().get(self._execution_mode) | |
| if factory is None: | |
| # Unreachable via the public constructor (execution_mode is validated there); | |
| # guards subclasses that override _backend_registry(). | |
| raise ValueError( | |
| f"no backend factory registered for execution_mode '{self._execution_mode}': " | |
| f"registered modes are {list(self._backend_registry().keys())}" | |
| ) | |
| return factory() | |
| def _create_backend(self) -> ClientAPIBackendSpec: | |
| registry = self._backend_registry() | |
| factory = registry.get(self._execution_mode) | |
| if factory is None: | |
| # Unreachable via the public constructor (execution_mode is validated there); | |
| # guards subclasses that override _backend_registry(). | |
| raise ValueError( | |
| f"no backend factory registered for execution_mode '{self._execution_mode}': " | |
| f"registered modes are {list(registry.keys())}" | |
| ) | |
| return factory() |
| server_expected_format: str = "numpy" | ||
| from_nvflare_converter_id: Optional[str] = None | ||
| to_nvflare_converter_id: Optional[str] = None | ||
| # task-name / rank contract (all modes) |
There was a problem hiding this comment.
Private attribute documented as a backend protocol
The docstring canonizes executor._analytics_fire_fed_event = True as the supported way for Cell backends to switch analytics paths — but _analytics_fire_fed_event is a private attribute with no accessor. Future refactors or renames of that attribute won't show up in any public API contract check, and a backend implementer may not find it via normal API exploration. Consider exposing a small public setter (e.g., executor.use_fed_analytics()) so the contract is part of the executor's public surface and can be tracked through the interface-freeze test.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4857 +/- ##
==========================================
+ Coverage 56.20% 56.63% +0.42%
==========================================
Files 967 971 +4
Lines 91978 92455 +477
==========================================
+ Hits 51699 52364 +665
+ Misses 40279 40091 -188
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
FLARE-2698 bullet 2 removes params converters in favor of filters. Since EX-2 is an interface freeze, freezing args we intend to delete would make their later removal a breaking change — so exclude them now. Removed from the ClientAPIExecutor constructor: params_exchange_format, params_transfer_type, server_expected_format, from_nvflare_converter_id, to_nvflare_converter_id (and from ClientAPIBackendContext). Param conversion between the framework-agnostic aggregation representation (numpy) and the framework-native training representation moves to send/receive filters at the client edge (tracked as EX-5 / the converter->filter migration); the executor and Cell layers pass through. Transfer type FULL/DIFF is not a converter and stays a Client API (model_registry) concern. Surface-freeze test updated to the trimmed arg list. 83 tests pass. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
What
The public
ClientAPIExecutor(atnvflare/app_common/executors/client_api_executor.py, the module path the design's example config pins) withexecution_modedispatch (in_process / external_process / attach), aClientAPIBackendSpecABC, and a frozenClientAPIBackendContexthanded to each backend. The three backends land in follow-up PRs; until then each mode fails the job cleanly viasystem_panic.This is the frozen configuration surface for the program. Mode-scoped args are validated symmetrically — an arg set for a mode that ignores it is rejected with a clear error, not silently dropped.
Scope update (FLARE-2698 bullet 2): no param converters on the executor
Per Epic FLARE-2698 ("remove all params converter, let filters handle it"), the converter/format args are deliberately excluded from the frozen constructor:
params_exchange_format,params_transfer_type,server_expected_format,from_nvflare_converter_id,to_nvflare_converter_id(also dropped fromClientAPIBackendContext). Because this is an interface freeze, freezing args we intend to delete would make their later removal a breaking change — so they're out now.Conversion between the framework-agnostic aggregation representation (numpy) and framework-native tensors moves to send/receive filters at the client edge (tracked as EX-5 / FLARE-3015 — the converter→filter migration; the PT quantizer/dequantizer are the existing filter precedent). The executor and Cell layers pass through. Transfer type (FULL/DIFF) is not a converter — it stays a Client API concern (
model_registry), decided separately.The frozen surface still carries the load-bearing args both legacy executors expose:
task_script_path/task_script_args(in_process entry point), the task-name mapping (train/evaluate/submit_model_task_name+train_with_evaluation, poweringflare.is_train()/is_evaluate()/is_submit_model()), andmemory_gc_rounds/cuda_empty_cache.Program context
Client API Execution Modes (2.9) — EX-2, Wave 0, interface freeze #2 (Epic FLARE-2698 / Story FLARE-3010).
Design:
docs/design/client_api_execution_modes.md§ "What We Propose" / "Execution Modes" / "Configuration Surface".Unblocks: EX-3 (in_process backend), EP-4 (external_process backend), AT-2 (attach backend).
Design contracts implemented
Analytics ownership moves to the executor: local path fires the un-prefixed event +
ConvertToFedEvent(today's in-process behavior); fed path firesfed.analytix_log_stats(today'sMetricRelaybehavior).UnsafeJobErrorpropagates out ofexecute()soClientRunner'sUNSAFE_JOBhandling still fires. No existing class is modified.Testing
83 unit tests: full mode-scoped validation matrix, per-mode clean-panic dispatch, execute-no-hang, backend-context plumbing, both analytics fire paths,
UnsafeJobErrorpropagation vsUnsafeComponentError→EXECUTION_EXCEPTION, and a surface-freeze test pinning the constructor arg list + order. black/isort/flake8 clean.🤖 Generated with Claude Code