Add torchrun CPU Client API integration test #4834
Follow-up added in
Local validation:
Targeted pytest was not runnable in this shell because |
fb4646e to d5d26e3
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage Diff

| Metric | main | #4834 | +/- |
|---|---|---|---|
| Coverage | 56.52% | 56.53% | |
| Files | 969 | 969 | |
| Lines | 92255 | 92255 | |
| Hits | 52151 | 52158 | +7 |
| Misses | 40104 | 40097 | -7 |
Pull request overview
Adds a new CPU-only torchrun-based integration job to exercise multi-process Client API behavior (control rank vs non-control ranks) and updates docs/examples to clarify using global distributed rank (RANK) for NVFlare Client API vs LOCAL_RANK for GPU placement.
Changes:
- Added unit tests to verify `ExProcessClientAPI` defaults rank based on `RANK` env var (and does not treat `LOCAL_RANK` alone as a control-rank signal).
- Registered a new standalone integration test/job (`pt_client_api_torchrun_cpu`) that launches 2 local `gloo` ranks per site via `torch.distributed.run`.
- Updated Client API docstring and the multi-GPU PyTorch example/README to use global rank for `flare.init(...)` and `LOCAL_RANK` for CUDA device selection.
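The global/local rank split described in the changes above can be sketched in plain Python. The helpers `resolve_ranks` and `is_control_rank` are hypothetical names used only for illustration and are not part of NVFlare; the sketch assumes torchrun-style `RANK` and `LOCAL_RANK` environment variables and that the control rank is global rank 0.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_ranks(env: Optional[Mapping[str, str]] = None) -> Tuple[int, int]:
    """Return (global_rank, local_rank) from torchrun-style environment variables.

    Hypothetical helper for illustration; not part of NVFlare.
    """
    env = os.environ if env is None else env
    # RANK is unique across the whole job; default to 0 for single-process runs.
    global_rank = int(env.get("RANK", "0"))
    # LOCAL_RANK is unique only within one node; use it for device placement only.
    local_rank = int(env.get("LOCAL_RANK", "0"))
    return global_rank, local_rank


def is_control_rank(global_rank: int) -> bool:
    """Assumption: the control rank is global rank 0."""
    return global_rank == 0


# First process on the second node of a 2-node x 2-process job:
# LOCAL_RANK is 0, but RANK is 2, so this process is not the control rank.
global_rank, local_rank = resolve_ranks({"RANK": "2", "LOCAL_RANK": "0"})
print(global_rank, local_rank, is_control_rank(global_rank))
```

In a real training script, `global_rank` would be the value passed to `flare.init(rank=...)`, while `local_rank` would only select the CUDA device.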
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/unit_test/client/ex_process/api_test.py | Adds unit tests asserting RANK drives default control-rank behavior. |
| tests/integration_test/data/test_configs/standalone_job/client_api.yml | Registers the new pt_client_api_torchrun_cpu standalone integration test. |
| tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/meta.conf | Declares the new integration job with 2 required clients. |
| tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/custom/torchrun_client.py | Implements the torchrun CPU distributed client script used by the job. |
| tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/custom/net.py | Provides the minimal PyTorch model referenced by job config. |
| tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/config/config_fed_server.conf | Adds server-side scatter-and-gather workflow config for the new job. |
| tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/config/config_fed_client.conf | Adds client-side launcher config to run 2-process torch.distributed.run per site. |
| nvflare/client/api.py | Clarifies docstring guidance: use global distributed rank (e.g., RANK) for Client API rank. |
| examples/advanced/multi-gpu/pt/README.md | Documents global vs local rank usage for NVFlare Client API + CUDA placement. |
| examples/advanced/multi-gpu/pt/client.py | Updates example to use global_rank for Client API and LOCAL_RANK for device/DDP binding. |
Greptile Summary
This PR adds a CPU-only torchrun integration test covering the multi-process NVFlare Client API contract (rank 0 receives FLModel, non-zero ranks receive None, rank 0 sends result), and fixes the multi-GPU example scripts to correctly broadcast the run-state so non-zero ranks can exit the training loop alongside rank 0.
Confidence Score: 5/5
Safe to merge. All changes are additive (new test job + example fixes) and touch no production control paths.
The new integration test correctly implements the rank-0 control contract with a try/finally for group teardown and a dist.barrier() before flare.send(). The multi-GPU example loop fix resolves a pre-existing infinite-loop bug for non-rank-0 processes without altering any shared library code. The public api.py already normalises int to str ranks before they reach the ex_process comparison, so no type-mismatch risk exists. Unit tests cover the RANK vs LOCAL_RANK env var cases explicitly.
No files require special attention.
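The run-state broadcast mentioned in the summary can be illustrated without a process group. This is a minimal sketch, not the example's actual code: `sync_running_state` and `make_fake_broadcast` are hypothetical names, and the fake broadcast stands in for `dist.broadcast_object_list(state, src=0)`, with `flare.is_running()` assumed to be the rank 0 source of truth.

```python
from typing import Callable, Dict, List, Optional


def sync_running_state(
    rank: int,
    is_running: Callable[[], bool],
    broadcast: Callable[[List[Optional[bool]], int], None],
) -> bool:
    """Rank 0 decides whether the job is still running; every rank gets its answer."""
    # Only rank 0 talks to NVFlare, so only rank 0 evaluates is_running().
    state: List[Optional[bool]] = [is_running() if rank == 0 else None]
    # In real DDP code this would be dist.broadcast_object_list(state, src=0).
    broadcast(state, 0)
    return bool(state[0])


def make_fake_broadcast(mailbox: Dict[str, Optional[bool]], rank: int):
    """Hypothetical stand-in for a collective: the source writes, the others read."""

    def broadcast(objs: List[Optional[bool]], src: int) -> None:
        if rank == src:
            mailbox["value"] = objs[0]
        else:
            objs[0] = mailbox["value"]

    return broadcast


# Simulate the end of a job: rank 0 sees "not running" and rank 1 follows it,
# even though rank 1's own callback (never called) would have said True.
mailbox: Dict[str, Optional[bool]] = {}
r0 = sync_running_state(0, lambda: False, make_fake_broadcast(mailbox, 0))
r1 = sync_running_state(1, lambda: True, make_fake_broadcast(mailbox, 1))
print(r0, r1)
```

Without the broadcast, a nonzero rank has no way to learn that the job ended, which is consistent with the infinite-loop bug described above.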
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant NVFlare as NVFlare Server
participant R0 as torchrun rank 0
participant R1 as torchrun rank 1
NVFlare->>R0: "launch subprocess (torchrun --nproc_per_node=2)"
Note over R0,R1: dist.init_process_group(gloo)
R0->>R0: "flare.init(rank=0) starts FlareAgent"
R1->>R1: "flare.init(rank=1) no FlareAgent"
NVFlare->>R0: send FLModel via CellPipe
R0->>R0: flare.receive() returns FLModel
R1->>R1: flare.receive() returns None
R0->>R1: "broadcast_object_list(FLModel, src=0)"
Note over R0,R1: all_reduce(rank_contribution) asserts sum=3.0
Note over R0,R1: dist.barrier()
R0->>NVFlare: flare.send(output_model)
Note over R0,R1: finally: dist.destroy_process_group()
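The sequence in the diagram could look roughly like the following client script. This is a sketch under stated assumptions, not the contents of `torchrun_client.py`: it assumes each rank contributes `rank + 1` to the all-reduce (which gives 3.0 for two ranks), that the result simply echoes the received parameters, and that `flare.init` accepts the integer rank as the review notes above suggest. `expected_sum` is a hypothetical helper.

```python
import os


def expected_sum(world_size: int) -> float:
    """Sum of (rank + 1) over all ranks; 3.0 for two ranks (assumed contribution)."""
    return float(sum(rank + 1 for rank in range(world_size)))


def main() -> None:
    # Imported lazily so the helper above can be checked without torch or NVFlare.
    import torch
    import torch.distributed as dist
    import nvflare.client as flare

    dist.init_process_group(backend="gloo")  # torchrun supplies RANK/WORLD_SIZE via env
    try:
        rank = dist.get_rank()
        flare.init(rank=rank)  # global rank, not LOCAL_RANK

        model = flare.receive()  # FLModel on rank 0, None on other ranks
        payload = [model]
        dist.broadcast_object_list(payload, src=0)  # share rank 0's model with everyone
        model = payload[0]

        contribution = torch.tensor([float(rank + 1)])
        dist.all_reduce(contribution)  # default reduction is SUM
        assert contribution.item() == expected_sum(dist.get_world_size())

        # Barrier before send: per the commit notes, with launch_once=false
        # flare.send() may end the process on rank 0.
        dist.barrier()
        if rank == 0:
            flare.send(flare.FLModel(params=model.params))
    finally:
        dist.destroy_process_group()


# torchrun sets RANK; only run the distributed flow when launched that way.
if __name__ == "__main__" and "RANK" in os.environ:
    main()
```

Placing the barrier ahead of `flare.send()` keeps every collective reachable on both ranks, which is the ordering the diagram shows.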
Reviews (5): Last reviewed commit: "Merge branch 'main' into codex/client-ap..."
- Move dist.barrier() before flare.send() in the torchrun test script: with launch_once=false the Client API exits the process inside flare.send() after the result upload, so the trailing barrier was unreachable on rank 0 and rank 1 always crashed (launcher recorded COMPLETE_FAILED every run)
- Build the result FLModel only on rank 0 and drop unused metrics/meta (accuracy, torchrun_world_size, torchrun_rank)
- Remove the dead IntimeModelSelector from the test server config (it skips round 0 and the job only runs round 0)
- Apply the corrected rank docstring to APISpec.init, ExProcessClientAPI.init and InProcessClientAPI.init
- Apply the global/local rank split to both pt-ddp-docker example copies to match the updated multi-gpu example
- Broadcast the running state from rank 0 in the DDP examples so nonzero ranks exit the training loop cleanly at job end

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
What changed
Adds a CPU-only `torchrun` integration job for Client API coverage. The job launches two local `gloo` ranks per site and verifies the rank-aware boundary:
- rank 0 receives `FLModel` from NVFlare
- nonzero ranks receive `None` from Client API

The job is registered in the existing `client_api` standalone integration suite.

This PR also clarifies PyTorch DDP example/docs guidance around global rank versus local rank:
- `RANK`/`dist.get_rank()` is the global distributed rank and should be passed to `flare.init(rank=...)`
- `LOCAL_RANK` is only for node-local CUDA device placement

Why
This locks down the multi-process Client API behavior before the execution-mode refactor, without requiring GPU/NCCL CI hardware or adding a higher-level distributed abstraction.
Validation
- `PYTHONPYCACHEPREFIX=/tmp/nvflare_pycache python3 -m py_compile tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/custom/torchrun_client.py tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/custom/net.py`
- `git diff --check`

The full local style gate was attempted with `./runtest.sh -s`, but this shell cannot complete it because the Homebrew-managed Python environment blocks pip dependency installation via PEP 668. GitHub CI style and unit checks are passing.