Add opt-in PyTorch Auto-SCAFFOLD mode by holgerroth · Pull Request #4642 · NVIDIA/NVFlare

holgerroth · 2026-05-19T18:57:59Z

Summary

Add auto_scaffold=True for PyTorch ScaffoldRecipe to support standard FedAvg-style Client API scripts.
Add an Auto-SCAFFOLD runtime hook manager that detects model loading, applies correction after optimizer steps, and injects scaffold_c_diff on send.
Add generic task_exchange_config plumbing through ScriptRunner/client executors, plus docs and recipe/example guidance.

Validation

python3 -m pytest tests/unit_test/app_opt/pt/scaffold_auto_patch_test.py tests/unit_test/recipe/scaffold_recipe_test.py -q
python3 -m pytest tests/unit_test/job_config/script_runner_test.py tests/unit_test/app_common/executors/task_script_runner_test.py -q
python3 -m py_compile nvflare/app_opt/pt/scaffold_auto_patch.py nvflare/app_opt/pt/recipes/scaffold.py nvflare/job_config/script_runner.py nvflare/client/in_process/api.py nvflare/client/ex_process/api.py tests/unit_test/app_opt/pt/scaffold_auto_patch_test.py
git diff --check

holgerroth · 2026-05-20T02:52:11Z

@greptileai review

greptile-apps · 2026-05-20T02:57:17Z

Greptile Summary

This PR introduces an opt-in Auto-SCAFFOLD mode for PyTorch, letting existing FedAvg-style Client API scripts participate in SCAFFOLD without manually instrumenting control-variate code. It also threads a generic task_exchange_config dict through the full executor/runner stack so arbitrary per-job flags can reach the client process.

scaffold_auto_patch.py: New per-thread PTScaffoldAutoPatchManager that monkey-patches torch.nn.Module.load_state_dict and Optimizer.step process-wide, routes calls to the current thread's manager via threading.local, validates a constant LR across the local round, and guards against manual PTScaffoldHelper calls while auto mode is active.
ScaffoldRecipe + executor chain: Adds auto_scaffold=True parameter that sets {PT_SCAFFOLD_AUTO_PATCH: True} in task_exchange_config; the flag is propagated through ScriptRunner → ClientAPILauncherExecutor/InProcessClientAPIExecutor → config file or task metadata → ExProcessClientAPI/InProcessClientAPI → lazy import + patch activation.
TaskScriptRunner: Adds a finally-block cleanup that lazily calls disable_pt_scaffold_auto_patch() to restore PyTorch methods even if the training script raises.

Confidence Score: 5/5

The change is safe to merge; the new Auto-SCAFFOLD path is opt-in and the default behaviour (auto_scaffold=False) is completely unchanged.

All three issues raised in the previous review round have been addressed with thread-local state, a shared constants module, and a first-step LR check that raises on change. The plumbing additions are additive and backward-compatible. The single remaining nit (a dead _last_lr assignment) has no effect on correctness.

No files require special attention; the core logic is in scaffold_auto_patch.py which has thorough test coverage.

Important Files Changed

Filename	Overview
nvflare/app_opt/pt/scaffold_auto_patch.py	New Auto-SCAFFOLD runtime hook manager; thread-local state correctly isolates concurrent simulation clients; one dead assignment (_last_lr) remains.
nvflare/app_opt/pt/recipes/scaffold.py	Adds auto_scaffold parameter to ScaffoldRecipe; correctly validates launch_once constraint and propagates PT_SCAFFOLD_AUTO_PATCH flag through task_exchange_config.
nvflare/client/in_process/api.py	Adds lazy Auto-SCAFFOLD activation via PT_SCAFFOLD_AUTO_PATCH constant; correctly hooks receive/send lifecycle and disables manager on shutdown.
nvflare/client/ex_process/api.py	Mirrors in_process/api.py changes for external-process client; activation outside the rank==0 guard is safe because non-rank-0 managers stay inactive.
nvflare/job_config/script_runner.py	Adds task_exchange_config plumbing through BaseScriptRunner and ScriptRunner; config is copied defensively and forwarded to both executor paths.
nvflare/app_common/executors/task_script_runner.py	Adds lazy disable_pt_scaffold_auto_patch() call in finally block to ensure thread-local manager is cleaned up even when the training script exits with an exception.
nvflare/client/constants.py	Adds lightweight PT_SCAFFOLD_AUTO_PATCH constant to avoid forcing PyTorch import in generic client init path.
tests/unit_test/app_opt/pt/scaffold_auto_patch_test.py	Comprehensive tests covering the training loop, multi-round persistence, thread-local isolation, idempotent patching, and all error paths.

_{Reviews (2): Last reviewed commit: "Address Auto-SCAFFOLD review feedback" | Re-trigger Greptile}

holgerroth · 2026-05-20T16:01:31Z

Follow-up pushed in 5cc85b4b6 for the Greptile review summary as well:

Auto-SCAFFOLD state is now thread-local while hooks remain process-wide.
PT_SCAFFOLD_AUTO_PATCH now lives in lightweight nvflare.client.constants and is shared by recipe/client runtime code.
Auto mode now fails fast if the LR changes within a local round, with docs/tests covering the constant-LR requirement.
task_exchange_config now adds algorithm flags without being able to override executor-owned Client API settings.

Validation run:

python3 -m pytest tests/unit_test/app_opt/pt/scaffold_auto_patch_test.py tests/unit_test/recipe/scaffold_recipe_test.py -q
python3 -m pytest tests/unit_test/job_config/script_runner_test.py tests/unit_test/app_common/executors/task_script_runner_test.py tests/unit_test/app_common/executors/in_process_client_api_executor_test.py tests/unit_test/app_common/executors/client_api_launcher_executor_test.py -q
python3 -m py_compile nvflare/app_opt/pt/scaffold_auto_patch.py nvflare/app_opt/pt/recipes/scaffold.py nvflare/job_config/script_runner.py nvflare/client/in_process/api.py nvflare/client/ex_process/api.py nvflare/client/constants.py tests/unit_test/app_opt/pt/scaffold_auto_patch_test.py
git diff --check

holgerroth · 2026-05-20T16:03:20Z

@greptileai review again

holgerroth · 2026-06-29T17:20:35Z

Closing this draft as superseded by #4838.

We decided not to pursue process-wide PyTorch monkey-patching for arbitrary Client API scripts. Inferring the SCAFFOLD protocol from load_state_dict(), optimizer.step(), and flare.send() is too implicit and fragile for the recipe layer.

#4838 provides the supported automatic path through nvflare.client.lightning.patch(trainer), where Lightning exposes reliable lifecycle hooks around optimizer steps and result submission. Raw PyTorch Client API loops will continue to use PTScaffoldHelper explicitly and return AlgorithmConstants.SCAFFOLD_CTRL_DIFF. Other standardized trainer integrations can add automatic support in the future when they provide similarly reliable hooks.

holgerroth and others added 2 commits May 19, 2026 14:57

Add opt-in PyTorch Auto-SCAFFOLD mode

a768a5d

Merge branch 'main' into codex/flare-2768-auto-scaffold

f676688

greptile-apps Bot reviewed May 20, 2026

View reviewed changes

Comment thread nvflare/app_opt/pt/scaffold_auto_patch.py Outdated

Comment thread nvflare/client/in_process/api.py

Comment thread nvflare/app_opt/pt/scaffold_auto_patch.py

Address Auto-SCAFFOLD review feedback

5cc85b4

holgerroth and others added 2 commits May 20, 2026 14:47

Merge branch 'main' into codex/flare-2768-auto-scaffold

85ff83a

Guard optional Auto-SCAFFOLD manager access

a9bb12c

holgerroth requested a review from YuanTingHsieh June 25, 2026 15:24

holgerroth closed this Jun 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add opt-in PyTorch Auto-SCAFFOLD mode#4642

Add opt-in PyTorch Auto-SCAFFOLD mode#4642
holgerroth wants to merge 5 commits into
NVIDIA:mainfrom
holgerroth:codex/flare-2768-auto-scaffold

holgerroth commented May 19, 2026

Uh oh!

holgerroth commented May 20, 2026

Uh oh!

greptile-apps Bot commented May 20, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

holgerroth commented May 20, 2026

Uh oh!

holgerroth commented May 20, 2026

Uh oh!

holgerroth commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

holgerroth commented May 19, 2026

Summary

Validation

Uh oh!

holgerroth commented May 20, 2026

Uh oh!

greptile-apps Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

Uh oh!

Uh oh!

Uh oh!

holgerroth commented May 20, 2026

Uh oh!

holgerroth commented May 20, 2026

Uh oh!

holgerroth commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented May 20, 2026 •

edited

Loading