Skip to content

Add opt-in PyTorch Auto-SCAFFOLD mode#4642

Closed
holgerroth wants to merge 5 commits into
NVIDIA:mainfrom
holgerroth:codex/flare-2768-auto-scaffold
Closed

Add opt-in PyTorch Auto-SCAFFOLD mode#4642
holgerroth wants to merge 5 commits into
NVIDIA:mainfrom
holgerroth:codex/flare-2768-auto-scaffold

Conversation

@holgerroth

Copy link
Copy Markdown
Collaborator

Summary

  • Add auto_scaffold=True for PyTorch ScaffoldRecipe to support standard FedAvg-style Client API scripts.
  • Add an Auto-SCAFFOLD runtime hook manager that detects model loading, applies correction after optimizer steps, and injects scaffold_c_diff on send.
  • Add generic task_exchange_config plumbing through ScriptRunner/client executors, plus docs and recipe/example guidance.

Validation

  • python3 -m pytest tests/unit_test/app_opt/pt/scaffold_auto_patch_test.py tests/unit_test/recipe/scaffold_recipe_test.py -q
  • python3 -m pytest tests/unit_test/job_config/script_runner_test.py tests/unit_test/app_common/executors/task_script_runner_test.py -q
  • python3 -m py_compile nvflare/app_opt/pt/scaffold_auto_patch.py nvflare/app_opt/pt/recipes/scaffold.py nvflare/job_config/script_runner.py nvflare/client/in_process/api.py nvflare/client/ex_process/api.py tests/unit_test/app_opt/pt/scaffold_auto_patch_test.py
  • git diff --check

@holgerroth

Copy link
Copy Markdown
Collaborator Author

@greptileai review

@greptile-apps

greptile-apps Bot commented May 20, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR introduces an opt-in Auto-SCAFFOLD mode for PyTorch, letting existing FedAvg-style Client API scripts participate in SCAFFOLD without manually instrumenting control-variate code. It also threads a generic task_exchange_config dict through the full executor/runner stack so arbitrary per-job flags can reach the client process.

  • scaffold_auto_patch.py: New per-thread PTScaffoldAutoPatchManager that monkey-patches torch.nn.Module.load_state_dict and Optimizer.step process-wide, routes calls to the current thread's manager via threading.local, validates a constant LR across the local round, and guards against manual PTScaffoldHelper calls while auto mode is active.
  • ScaffoldRecipe + executor chain: Adds auto_scaffold=True parameter that sets {PT_SCAFFOLD_AUTO_PATCH: True} in task_exchange_config; the flag is propagated through ScriptRunner → ClientAPILauncherExecutor/InProcessClientAPIExecutor → config file or task metadata → ExProcessClientAPI/InProcessClientAPI → lazy import + patch activation.
  • TaskScriptRunner: Adds a finally-block cleanup that lazily calls disable_pt_scaffold_auto_patch() to restore PyTorch methods even if the training script raises.

Confidence Score: 5/5

The change is safe to merge; the new Auto-SCAFFOLD path is opt-in and the default behaviour (auto_scaffold=False) is completely unchanged.

All three issues raised in the previous review round have been addressed with thread-local state, a shared constants module, and a first-step LR check that raises on change. The plumbing additions are additive and backward-compatible. The single remaining nit (a dead _last_lr assignment) has no effect on correctness.

No files require special attention; the core logic is in scaffold_auto_patch.py which has thorough test coverage.

Important Files Changed

Filename Overview
nvflare/app_opt/pt/scaffold_auto_patch.py New Auto-SCAFFOLD runtime hook manager; thread-local state correctly isolates concurrent simulation clients; one dead assignment (_last_lr) remains.
nvflare/app_opt/pt/recipes/scaffold.py Adds auto_scaffold parameter to ScaffoldRecipe; correctly validates launch_once constraint and propagates PT_SCAFFOLD_AUTO_PATCH flag through task_exchange_config.
nvflare/client/in_process/api.py Adds lazy Auto-SCAFFOLD activation via PT_SCAFFOLD_AUTO_PATCH constant; correctly hooks receive/send lifecycle and disables manager on shutdown.
nvflare/client/ex_process/api.py Mirrors in_process/api.py changes for external-process client; activation outside the rank==0 guard is safe because non-rank-0 managers stay inactive.
nvflare/job_config/script_runner.py Adds task_exchange_config plumbing through BaseScriptRunner and ScriptRunner; config is copied defensively and forwarded to both executor paths.
nvflare/app_common/executors/task_script_runner.py Adds lazy disable_pt_scaffold_auto_patch() call in finally block to ensure thread-local manager is cleaned up even when the training script exits with an exception.
nvflare/client/constants.py Adds lightweight PT_SCAFFOLD_AUTO_PATCH constant to avoid forcing PyTorch import in generic client init path.
tests/unit_test/app_opt/pt/scaffold_auto_patch_test.py Comprehensive tests covering the training loop, multi-round persistence, thread-local isolation, idempotent patching, and all error paths.

Reviews (2): Last reviewed commit: "Address Auto-SCAFFOLD review feedback" | Re-trigger Greptile

Comment thread nvflare/app_opt/pt/scaffold_auto_patch.py Outdated
Comment thread nvflare/client/in_process/api.py
Comment thread nvflare/app_opt/pt/scaffold_auto_patch.py

Copy link
Copy Markdown
Collaborator Author

Follow-up pushed in 5cc85b4b6 for the Greptile review summary as well:

  • Auto-SCAFFOLD state is now thread-local while hooks remain process-wide.
  • PT_SCAFFOLD_AUTO_PATCH now lives in lightweight nvflare.client.constants and is shared by recipe/client runtime code.
  • Auto mode now fails fast if the LR changes within a local round, with docs/tests covering the constant-LR requirement.
  • task_exchange_config now adds algorithm flags without being able to override executor-owned Client API settings.

Validation run:

  • python3 -m pytest tests/unit_test/app_opt/pt/scaffold_auto_patch_test.py tests/unit_test/recipe/scaffold_recipe_test.py -q
  • python3 -m pytest tests/unit_test/job_config/script_runner_test.py tests/unit_test/app_common/executors/task_script_runner_test.py tests/unit_test/app_common/executors/in_process_client_api_executor_test.py tests/unit_test/app_common/executors/client_api_launcher_executor_test.py -q
  • python3 -m py_compile nvflare/app_opt/pt/scaffold_auto_patch.py nvflare/app_opt/pt/recipes/scaffold.py nvflare/job_config/script_runner.py nvflare/client/in_process/api.py nvflare/client/ex_process/api.py nvflare/client/constants.py tests/unit_test/app_opt/pt/scaffold_auto_patch_test.py
  • git diff --check

@holgerroth

Copy link
Copy Markdown
Collaborator Author

@greptileai review again

@holgerroth holgerroth requested a review from YuanTingHsieh June 25, 2026 15:24

Copy link
Copy Markdown
Collaborator Author

Closing this draft as superseded by #4838.

We decided not to pursue process-wide PyTorch monkey-patching for arbitrary Client API scripts. Inferring the SCAFFOLD protocol from load_state_dict(), optimizer.step(), and flare.send() is too implicit and fragile for the recipe layer.

#4838 provides the supported automatic path through nvflare.client.lightning.patch(trainer), where Lightning exposes reliable lifecycle hooks around optimizer steps and result submission. Raw PyTorch Client API loops will continue to use PTScaffoldHelper explicitly and return AlgorithmConstants.SCAFFOLD_CTRL_DIFF. Other standardized trainer integrations can add automatic support in the future when they provide similarly reliable hooks.

@holgerroth holgerroth closed this Jun 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant