Add torchrun CPU Client API integration test by YuanTingHsieh · Pull Request #4834 · NVIDIA/NVFlare

YuanTingHsieh · 2026-06-25T20:44:26Z

What changed

Adds a CPU-only torchrun integration job for Client API coverage. The job launches two local gloo ranks per site and verifies the rank-aware boundary:

rank 0 receives the FLModel from NVFlare
nonzero ranks receive None from Client API
the user script owns distributed broadcast/all-reduce
rank 0 is the NVFlare control rank and sends the result back to NVFlare

The job is registered in the existing client_api standalone integration suite.

This PR also clarifies PyTorch DDP example/docs guidance around global rank versus local rank:

RANK / dist.get_rank() is the global distributed rank and should be passed to flare.init(rank=...)
LOCAL_RANK is only for node-local CUDA device placement

Why

This locks down the multi-process Client API behavior before the execution-mode refactor, without requiring GPU/NCCL CI hardware or adding a higher-level distributed abstraction.

Validation

PYTHONPYCACHEPREFIX=/tmp/nvflare_pycache python3 -m py_compile tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/custom/torchrun_client.py tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/custom/net.py
git diff --check

The full local style gate was attempted with ./runtest.sh -s, but this shell cannot complete it because the Homebrew-managed Python environment blocks pip dependency installation via PEP 668. GitHub CI style and unit checks are passing.

YuanTingHsieh · 2026-06-25T21:05:19Z

Follow-up added in fb4646e81:

Updates the PyTorch multi-GPU example to use global rank for flare.init(...) and LOCAL_RANK for CUDA placement.
Clarifies the Client API rank docstring.
Adds ex-process Client API unit coverage for RANK vs LOCAL_RANK default behavior.

Local validation:

PYTHONPYCACHEPREFIX=/tmp/nvflare_pycache python3 -m py_compile examples/advanced/multi-gpu/pt/client.py tests/unit_test/client/ex_process/api_test.py nvflare/client/api.py
git diff --cached --check

Targeted pytest was not runnable in this shell because pytest is not installed.

codecov-commenter · 2026-06-25T21:42:38Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.53%. Comparing base (a6c83e2) to head (0520750).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #4834   +/-   ##
=======================================
  Coverage   56.52%   56.53%           
=======================================
  Files         969      969           
  Lines       92255    92255           
=======================================
+ Hits        52151    52158    +7     
+ Misses      40104    40097    -7

Flag	Coverage Δ
unit-tests	`56.53% <ø> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copilot

Pull request overview

Adds a new CPU-only torchrun-based integration job to exercise multi-process Client API behavior (control rank vs non-control ranks) and updates docs/examples to clarify using global distributed rank (RANK) for NVFlare Client API vs LOCAL_RANK for GPU placement.

Changes:

Added unit tests to verify ExProcessClientAPI defaults rank based on RANK env var (and does not treat LOCAL_RANK alone as a control-rank signal).
Registered a new standalone integration test/job (pt_client_api_torchrun_cpu) that launches 2 local gloo ranks per site via torch.distributed.run.
Updated Client API docstring and the multi-GPU PyTorch example/README to use global rank for flare.init(...) and LOCAL_RANK for CUDA device selection.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
tests/unit_test/client/ex_process/api_test.py	Adds unit tests asserting `RANK` drives default control-rank behavior.
tests/integration_test/data/test_configs/standalone_job/client_api.yml	Registers the new `pt_client_api_torchrun_cpu` standalone integration test.
tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/meta.conf	Declares the new integration job with 2 required clients.
tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/custom/torchrun_client.py	Implements the torchrun CPU distributed client script used by the job.
tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/custom/net.py	Provides the minimal PyTorch model referenced by job config.
tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/config/config_fed_server.conf	Adds server-side scatter-and-gather workflow config for the new job.
tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/config/config_fed_client.conf	Adds client-side launcher config to run 2-process `torch.distributed.run` per site.
nvflare/client/api.py	Clarifies docstring guidance: use global distributed rank (e.g., `RANK`) for Client API rank.
examples/advanced/multi-gpu/pt/README.md	Documents global vs local rank usage for NVFlare Client API + CUDA placement.
examples/advanced/multi-gpu/pt/client.py	Updates example to use `global_rank` for Client API and `LOCAL_RANK` for device/DDP binding.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

greptile-apps · 2026-06-30T01:16:51Z

Greptile Summary

This PR adds a CPU-only torchrun integration test covering the multi-process NVFlare Client API contract (rank 0 receives FLModel, non-zero ranks receive None, rank 0 sends result), and fixes the multi-GPU example scripts to correctly broadcast the run-state so non-zero ranks can exit the training loop alongside rank 0.

New integration job (pt_client_api_torchrun_cpu): two gloo ranks per site, verifies rank-0 control semantics, collective assertions, and barrier-before-send ordering; registered in client_api.yml standalone suite.
Multi-GPU example fixes: replaces while flare.is_running(): (which looped forever on non-rank-0 processes since no agent is attached) with a broadcast_object_list-based loop; separates global_rank from local_rank for correct CUDA device placement.
Doc + type updates: docstrings clarified to say "global rank" (torchrun RANK) not LOCAL_RANK; unit tests added to confirm RANK env var drives default rank selection, not LOCAL_RANK.

Confidence Score: 5/5

Safe to merge. All changes are additive (new test job + example fixes) and touch no production control paths.

The new integration test correctly implements the rank-0 control contract with a try/finally for group teardown and a dist.barrier() before flare.send(). The multi-GPU example loop fix resolves a pre-existing infinite-loop bug for non-rank-0 processes without altering any shared library code. The public api.py already normalises int to str ranks before they reach the ex_process comparison, so no type-mismatch risk exists. Unit tests cover the RANK vs LOCAL_RANK env var cases explicitly.

No files require special attention.

Important Files Changed

Filename	Overview
tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/custom/torchrun_client.py	Core integration test: correct try/finally teardown, dist.barrier() before flare.send(), proper rank-0 control semantics, and all_reduce assertion at 3.0.
examples/advanced/multi-gpu/pt/client.py	Fixes infinite-loop bug for non-rank-0 processes; separates global_rank from local_rank for CUDA placement; broadcast_object_list pattern is correct.
examples/docker/jobs/pt-ddp-docker/app_site-1/custom/client.py	Identical fix to multi-gpu example; global_rank/local_rank split and broadcast loop pattern applied consistently.
examples/docker/jobs/pt-ddp-docker/app_site-2/custom/client.py	Same as app_site-1; correct mirror of the DDP fix.
nvflare/client/ex_process/api.py	Docstring clarification only; rank comparison logic (rank == "0") is already safe because the public api.py normalises int to str before this point.
tests/unit_test/client/ex_process/api_test.py	Two new tests correctly verify RANK env var drives default rank and LOCAL_RANK alone does not suppress the FlareAgent; MagicMock import moved to top-level correctly.
tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/config/config_fed_client.conf	launch_once=false, nproc_per_node=2 (gloo), heartbeat_timeout=60, shutdown_timeout=30 — values consistent with a CPU-only single-round test.
tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/config/config_fed_server.conf	ScatterAndGather workflow, num_rounds=1, min_clients=2, train_timeout=0 — correct for a single-round CPU integration test with two federated clients.
tests/integration_test/data/test_configs/standalone_job/client_api.yml	New test entry registered with correct event sequence; no extra setup required since the job uses a trivial Linear(2,1) net with no dataset.
nvflare/client/api.py	Docstring update only; the public int to str normalisation at line 67-68 is already present and ensures downstream rank == "0" comparisons remain correct.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant NVFlare as NVFlare Server
    participant R0 as torchrun rank 0
    participant R1 as torchrun rank 1

    NVFlare->>R0: "launch subprocess (torchrun --nproc_per_node=2)"
    Note over R0,R1: dist.init_process_group(gloo)
    R0->>R0: "flare.init(rank=0) starts FlareAgent"
    R1->>R1: "flare.init(rank=1) no FlareAgent"

    NVFlare->>R0: send FLModel via CellPipe
    R0->>R0: flare.receive() returns FLModel
    R1->>R1: flare.receive() returns None

    R0->>R1: "broadcast_object_list(FLModel, src=0)"
    Note over R0,R1: all_reduce(rank_contribution) asserts sum=3.0

    Note over R0,R1: dist.barrier()

    R0->>NVFlare: flare.send(output_model)
    Note over R0,R1: finally: dist.destroy_process_group()

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant NVFlare as NVFlare Server
    participant R0 as torchrun rank 0
    participant R1 as torchrun rank 1

    NVFlare->>R0: "launch subprocess (torchrun --nproc_per_node=2)"
    Note over R0,R1: dist.init_process_group(gloo)
    R0->>R0: "flare.init(rank=0) starts FlareAgent"
    R1->>R1: "flare.init(rank=1) no FlareAgent"

    NVFlare->>R0: send FLModel via CellPipe
    R0->>R0: flare.receive() returns FLModel
    R1->>R1: flare.receive() returns None

    R0->>R1: "broadcast_object_list(FLModel, src=0)"
    Note over R0,R1: all_reduce(rank_contribution) asserts sum=3.0

    Note over R0,R1: dist.barrier()

    R0->>NVFlare: flare.send(output_model)
    Note over R0,R1: finally: dist.destroy_process_group()

_{Reviews (5): Last reviewed commit: "Merge branch 'main' into codex/client-ap..." | Re-trigger Greptile}

- Move dist.barrier() before flare.send() in the torchrun test script: with launch_once=false the Client API exits the process inside flare.send() after the result upload, so the trailing barrier was unreachable on rank 0 and rank 1 always crashed (launcher recorded COMPLETE_FAILED every run) - Build the result FLModel only on rank 0 and drop unused metrics/meta (accuracy, torchrun_world_size, torchrun_rank) - Remove the dead IntimeModelSelector from the test server config (it skips round 0 and the job only runs round 0) - Apply the corrected rank docstring to APISpec.init, ExProcessClientAPI.init and InProcessClientAPI.init - Apply the global/local rank split to both pt-ddp-docker example copies to match the updated multi-gpu example - Broadcast the running state from rank 0 in the DDP examples so nonzero ranks exit the training loop cleanly at job end Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

YuanTingHsieh added 2 commits June 25, 2026 14:31

Add torchrun CPU Client API integration test

39da42b

Clarify Client API rank handling for torchrun

d5d26e3

YuanTingHsieh force-pushed the codex/client-api-torchrun-cpu-test branch from fb4646e to d5d26e3 Compare June 25, 2026 21:34

Align torchrun test with rank-zero send contract

7ab2a10

YuanTingHsieh changed the title ~~[codex] Add torchrun CPU Client API integration test~~ Add torchrun CPU Client API integration test Jun 30, 2026

YuanTingHsieh marked this pull request as ready for review June 30, 2026 01:09

Copilot AI review requested due to automatic review settings June 30, 2026 01:09

Merge branch 'main' into codex/client-api-torchrun-cpu-test

600cf09

Copilot started reviewing on behalf of YuanTingHsieh June 30, 2026 01:10 View session

Copilot AI reviewed Jun 30, 2026

View reviewed changes

Comment thread tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/custom/torchrun_client.py Outdated

Comment thread examples/advanced/multi-gpu/pt/README.md

greptile-apps Bot reviewed Jun 30, 2026

View reviewed changes

Comment thread tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/custom/torchrun_client.py Outdated

Comment thread tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/custom/torchrun_client.py Outdated

YuanTingHsieh and others added 4 commits June 30, 2026 13:33

Address torchrun Client API review comments

235c9e8

Merge branch 'main' into codex/client-api-torchrun-cpu-test

64017eb

Merge branch 'main' into codex/client-api-torchrun-cpu-test

0520750

YuanTingHsieh added examples Label for all example related work cicd continuous integration/continuous development labels Jul 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add torchrun CPU Client API integration test#4834

Add torchrun CPU Client API integration test#4834
YuanTingHsieh wants to merge 8 commits into
NVIDIA:mainfrom
YuanTingHsieh:codex/client-api-torchrun-cpu-test

YuanTingHsieh commented Jun 25, 2026 •

edited

Loading

Uh oh!

YuanTingHsieh commented Jun 25, 2026

Uh oh!

codecov-commenter commented Jun 25, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot commented Jun 30, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

YuanTingHsieh commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changed

Why

Validation

Uh oh!

YuanTingHsieh commented Jun 25, 2026

Uh oh!

codecov-commenter commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

YuanTingHsieh commented Jun 25, 2026 •

edited

Loading

codecov-commenter commented Jun 25, 2026 •

edited

Loading

greptile-apps Bot commented Jun 30, 2026 •

edited

Loading