Add automatic Lightning SCAFFOLD support #4838
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #4838 +/- ##
==========================================
+ Coverage 56.52% 56.69% +0.16%
==========================================
Files 969 971 +2
Lines 92255 92457 +202
==========================================
+ Hits 52151 52417 +266
+ Misses 40104 40040 -64
Signed-off-by: Holger Roth <hroth@nvidia.com>
ebf1743 to 5c31e8b
chesterxgchen left a comment
It seems the SCAFFOLD handling logic leaks into the generic Lightning training process.
@chesterxgchen Addressed the requested changes in
The Lightning API is now algorithm-neutral:
The other review items were addressed as follows:
Validation completed on this commit:
I also replied point-by-point on each inline review thread with the corresponding implementation and validation details, and requested re-review.
Summary
- nvflare.client.lightning.patch()
- PTScaffoldHelper around Lightning optimizer steps and return SCAFFOLD_CTRL_DIFF
Motivation
ScaffoldRecipe requires every client to apply control-variate corrections during local training and return a control delta. Raw PyTorch clients already do this explicitly with PTScaffoldHelper, but the Lightning patch() callback previously only handled model receive/send. As a result, changing a patched Lightning job from FedAvgRecipe to ScaffoldRecipe left out the client-side algorithm and caused server aggregation to fail.
This change closes that contract gap without changing the public patch() API. Patched Lightning clients now activate SCAFFOLD automatically when the received FLModel contains global controls.
Availability
This feature targets NVFlare 2.9.0. The Hello Lightning requirement pins nvflare~=2.9.0rc, and the example documentation directs main-branch users to install NVFlare from this repository until the 2.9 package is published. The feature is intentionally not listed in the 2.8 release notes.
Implementation
- FLCallback algorithm-neutral through a private algorithm-handler manager.
- SCAFFOLD_CTRL_GLOBAL metadata is received; FedAvg never creates a SCAFFOLD handler.
- PTScaffoldHelper once per SCAFFOLD client and preserve its local controls across rounds.
- receive().
- on_before_optimizer_step and apply corrections after completed optimizer steps, including with gradient accumulation.
- SCAFFOLD_CTRL_DIFF.
The automatic path supports Lightning automatic optimization with one optimizer and precision="32-true" or precision="bf16-mixed". Raw PyTorch loops, manual Lightning optimization, and scaler-backed mixed precision must use an explicit receive/train/send loop with PTScaffoldHelper; patch() is not used for that path. FedProx support is intentionally out of scope.
Validation
- ./runtest.sh -s --skip-install: passed
- ./build_doc.sh --html --skip-api: passed (existing documentation warnings only)
H100 end-to-end evaluation
The evaluation harness and results are not included in this PR.
Commit d11f4198d was tested on NVIDIA H100 NVL GPUs with the advanced CIFAR-10 example setup:
- NUM_STEPS_CURRENT_ROUND, exercising automatic SCAFFOLD step accounting
- ModerateCNN, 8 clients, Dirichlet alpha 0.1, seed 0
SCAFFOLD improved final accuracy by 2.15 percentage points with 16.9% additional runtime. Successful aggregation in every SCAFFOLD round verifies that all eight patched Lightning clients returned the required parameter-only control delta and per-round step metadata automatically. The FedAvg run verifies that the generic manager preserves behavior without loading a SCAFFOLD handler.
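For readers unfamiliar with the contract this PR describes, the following is a rough, framework-free sketch of the two behaviors discussed above: a handler that is created only when global controls arrive in the round metadata, and the standard SCAFFOLD client update that yields a control delta. It is not the code in this PR and does not use NVFlare or Lightning. The class names, the `run_round` method, and the metadata keys are hypothetical stand-ins for SCAFFOLD_CTRL_GLOBAL, SCAFFOLD_CTRL_DIFF, and NUM_STEPS_CURRENT_ROUND, and plain SGD on Python lists is assumed in place of a real optimizer.

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]

# Hypothetical metadata keys; stand-ins for SCAFFOLD_CTRL_GLOBAL,
# SCAFFOLD_CTRL_DIFF, and NUM_STEPS_CURRENT_ROUND.
CTRL_GLOBAL = "ctrl_global"
CTRL_DIFF = "ctrl_diff"
NUM_STEPS = "num_steps"


class ScaffoldHandlerSketch:
    """Illustrative client-side SCAFFOLD state (not NVFlare's PTScaffoldHelper)."""

    def __init__(self, num_params: int) -> None:
        # Local controls start at zero and persist across rounds.
        self.c_local: Vector = [0.0] * num_params

    def train_round(self, global_w: Vector, c_global: Vector,
                    grad_fn: Callable[[Vector], Vector],
                    lr: float, num_steps: int) -> Tuple[Vector, Vector]:
        w = list(global_w)
        for _ in range(num_steps):
            grads = grad_fn(w)
            # Completed optimizer step (plain SGD assumed here).
            w = [wi - lr * gi for wi, gi in zip(w, grads)]
            # Correction applied after the step: w <- w - lr * (c_global - c_local).
            w = [wi - lr * (cg - cl)
                 for wi, cg, cl in zip(w, c_global, self.c_local)]
        # New local control: c_local - c_global + (global_w - w) / (num_steps * lr).
        c_new = [cl - cg + (gw - wi) / (num_steps * lr)
                 for cl, cg, gw, wi in zip(self.c_local, c_global, global_w, w)]
        ctrl_diff = [cn - cl for cn, cl in zip(c_new, self.c_local)]
        self.c_local = c_new
        return w, ctrl_diff


class HandlerManagerSketch:
    """Algorithm-neutral dispatcher: SCAFFOLD state exists only if controls arrive."""

    def __init__(self) -> None:
        self.scaffold: Optional[ScaffoldHandlerSketch] = None

    def run_round(self, global_w: Vector, meta: Dict[str, Vector],
                  grad_fn: Callable[[Vector], Vector],
                  lr: float, num_steps: int) -> Tuple[Vector, Dict[str, object]]:
        out_meta: Dict[str, object] = {NUM_STEPS: num_steps}
        if CTRL_GLOBAL not in meta:
            # FedAvg-style round: plain local SGD, no handler is ever created.
            w = list(global_w)
            for _ in range(num_steps):
                w = [wi - lr * gi for wi, gi in zip(w, grad_fn(w))]
            return w, out_meta
        if self.scaffold is None:
            # Created once per client, then reused so local controls survive.
            self.scaffold = ScaffoldHandlerSketch(len(global_w))
        w, diff = self.scaffold.train_round(
            global_w, meta[CTRL_GLOBAL], grad_fn, lr, num_steps)
        out_meta[CTRL_DIFF] = diff
        return w, out_meta
```

In this sketch the control delta depends on the number of completed optimizer steps, which is why the step count is returned alongside it. With gradient accumulation, `num_steps` would count optimizer steps rather than batches.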