Skip to content

Add torchrun CPU Client API integration test#4834

Open
YuanTingHsieh wants to merge 8 commits into
NVIDIA:mainfrom
YuanTingHsieh:codex/client-api-torchrun-cpu-test
Open

Add torchrun CPU Client API integration test#4834
YuanTingHsieh wants to merge 8 commits into
NVIDIA:mainfrom
YuanTingHsieh:codex/client-api-torchrun-cpu-test

Conversation

@YuanTingHsieh

@YuanTingHsieh YuanTingHsieh commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

What changed

Adds a CPU-only torchrun integration job for Client API coverage. The job launches two local gloo ranks per site and verifies the rank-aware boundary:

  • rank 0 receives the FLModel from NVFlare
  • nonzero ranks receive None from Client API
  • the user script owns distributed broadcast/all-reduce
  • rank 0 is the NVFlare control rank and sends the result back to NVFlare

The job is registered in the existing client_api standalone integration suite.

This PR also clarifies PyTorch DDP example/docs guidance around global rank versus local rank:

  • RANK / dist.get_rank() is the global distributed rank and should be passed to flare.init(rank=...)
  • LOCAL_RANK is only for node-local CUDA device placement

Why

This locks down the multi-process Client API behavior before the execution-mode refactor, without requiring GPU/NCCL CI hardware or adding a higher-level distributed abstraction.

Validation

  • PYTHONPYCACHEPREFIX=/tmp/nvflare_pycache python3 -m py_compile tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/custom/torchrun_client.py tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/custom/net.py
  • git diff --check

The full local style gate was attempted with ./runtest.sh -s, but this shell cannot complete it because the Homebrew-managed Python environment blocks pip dependency installation via PEP 668. GitHub CI style and unit checks are passing.

Copy link
Copy Markdown
Collaborator Author

Follow-up added in fb4646e81:

  • Updates the PyTorch multi-GPU example to use global rank for flare.init(...) and LOCAL_RANK for CUDA placement.
  • Clarifies the Client API rank docstring.
  • Adds ex-process Client API unit coverage for RANK vs LOCAL_RANK default behavior.

Local validation:

  • PYTHONPYCACHEPREFIX=/tmp/nvflare_pycache python3 -m py_compile examples/advanced/multi-gpu/pt/client.py tests/unit_test/client/ex_process/api_test.py nvflare/client/api.py
  • git diff --cached --check

Targeted pytest was not runnable in this shell because pytest is not installed.

@YuanTingHsieh YuanTingHsieh force-pushed the codex/client-api-torchrun-cpu-test branch from fb4646e to d5d26e3 Compare June 25, 2026 21:34
@codecov-commenter

codecov-commenter commented Jun 25, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.53%. Comparing base (a6c83e2) to head (0520750).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4834   +/-   ##
=======================================
  Coverage   56.52%   56.53%           
=======================================
  Files         969      969           
  Lines       92255    92255           
=======================================
+ Hits        52151    52158    +7     
+ Misses      40104    40097    -7     
Flag Coverage Δ
unit-tests 56.53% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@YuanTingHsieh YuanTingHsieh changed the title [codex] Add torchrun CPU Client API integration test Add torchrun CPU Client API integration test Jun 30, 2026
@YuanTingHsieh YuanTingHsieh marked this pull request as ready for review June 30, 2026 01:09
Copilot AI review requested due to automatic review settings June 30, 2026 01:09

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new CPU-only torchrun-based integration job to exercise multi-process Client API behavior (control rank vs non-control ranks) and updates docs/examples to clarify using global distributed rank (RANK) for NVFlare Client API vs LOCAL_RANK for GPU placement.

Changes:

  • Added unit tests to verify ExProcessClientAPI defaults rank based on RANK env var (and does not treat LOCAL_RANK alone as a control-rank signal).
  • Registered a new standalone integration test/job (pt_client_api_torchrun_cpu) that launches 2 local gloo ranks per site via torch.distributed.run.
  • Updated Client API docstring and the multi-GPU PyTorch example/README to use global rank for flare.init(...) and LOCAL_RANK for CUDA device selection.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/unit_test/client/ex_process/api_test.py Adds unit tests asserting RANK drives default control-rank behavior.
tests/integration_test/data/test_configs/standalone_job/client_api.yml Registers the new pt_client_api_torchrun_cpu standalone integration test.
tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/meta.conf Declares the new integration job with 2 required clients.
tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/custom/torchrun_client.py Implements the torchrun CPU distributed client script used by the job.
tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/custom/net.py Provides the minimal PyTorch model referenced by job config.
tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/config/config_fed_server.conf Adds server-side scatter-and-gather workflow config for the new job.
tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/config/config_fed_client.conf Adds client-side launcher config to run 2-process torch.distributed.run per site.
nvflare/client/api.py Clarifies docstring guidance: use global distributed rank (e.g., RANK) for Client API rank.
examples/advanced/multi-gpu/pt/README.md Documents global vs local rank usage for NVFlare Client API + CUDA placement.
examples/advanced/multi-gpu/pt/client.py Updates example to use global_rank for Client API and LOCAL_RANK for device/DDP binding.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread examples/advanced/multi-gpu/pt/README.md
@greptile-apps

greptile-apps Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a CPU-only torchrun integration test covering the multi-process NVFlare Client API contract (rank 0 receives FLModel, non-zero ranks receive None, rank 0 sends result), and fixes the multi-GPU example scripts to correctly broadcast the run-state so non-zero ranks can exit the training loop alongside rank 0.

  • New integration job (pt_client_api_torchrun_cpu): two gloo ranks per site, verifies rank-0 control semantics, collective assertions, and barrier-before-send ordering; registered in client_api.yml standalone suite.
  • Multi-GPU example fixes: replaces while flare.is_running(): (which looped forever on non-rank-0 processes since no agent is attached) with a broadcast_object_list-based loop; separates global_rank from local_rank for correct CUDA device placement.
  • Doc + type updates: docstrings clarified to say "global rank" (torchrun RANK) not LOCAL_RANK; unit tests added to confirm RANK env var drives default rank selection, not LOCAL_RANK.

Confidence Score: 5/5

Safe to merge. All changes are additive (new test job + example fixes) and touch no production control paths.

The new integration test correctly implements the rank-0 control contract with a try/finally for group teardown and a dist.barrier() before flare.send(). The multi-GPU example loop fix resolves a pre-existing infinite-loop bug for non-rank-0 processes without altering any shared library code. The public api.py already normalises int to str ranks before they reach the ex_process comparison, so no type-mismatch risk exists. Unit tests cover the RANK vs LOCAL_RANK env var cases explicitly.

No files require special attention.

Important Files Changed

Filename Overview
tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/custom/torchrun_client.py Core integration test: correct try/finally teardown, dist.barrier() before flare.send(), proper rank-0 control semantics, and all_reduce assertion at 3.0.
examples/advanced/multi-gpu/pt/client.py Fixes infinite-loop bug for non-rank-0 processes; separates global_rank from local_rank for CUDA placement; broadcast_object_list pattern is correct.
examples/docker/jobs/pt-ddp-docker/app_site-1/custom/client.py Identical fix to multi-gpu example; global_rank/local_rank split and broadcast loop pattern applied consistently.
examples/docker/jobs/pt-ddp-docker/app_site-2/custom/client.py Same as app_site-1; correct mirror of the DDP fix.
nvflare/client/ex_process/api.py Docstring clarification only; rank comparison logic (rank == "0") is already safe because the public api.py normalises int to str before this point.
tests/unit_test/client/ex_process/api_test.py Two new tests correctly verify RANK env var drives default rank and LOCAL_RANK alone does not suppress the FlareAgent; MagicMock import moved to top-level correctly.
tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/config/config_fed_client.conf launch_once=false, nproc_per_node=2 (gloo), heartbeat_timeout=60, shutdown_timeout=30 — values consistent with a CPU-only single-round test.
tests/integration_test/data/jobs/pt_client_api_torchrun_cpu/app/config/config_fed_server.conf ScatterAndGather workflow, num_rounds=1, min_clients=2, train_timeout=0 — correct for a single-round CPU integration test with two federated clients.
tests/integration_test/data/test_configs/standalone_job/client_api.yml New test entry registered with correct event sequence; no extra setup required since the job uses a trivial Linear(2,1) net with no dataset.
nvflare/client/api.py Docstring update only; the public int to str normalisation at line 67-68 is already present and ensures downstream rank == "0" comparisons remain correct.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant NVFlare as NVFlare Server
    participant R0 as torchrun rank 0
    participant R1 as torchrun rank 1

    NVFlare->>R0: "launch subprocess (torchrun --nproc_per_node=2)"
    Note over R0,R1: dist.init_process_group(gloo)
    R0->>R0: "flare.init(rank=0) starts FlareAgent"
    R1->>R1: "flare.init(rank=1) no FlareAgent"

    NVFlare->>R0: send FLModel via CellPipe
    R0->>R0: flare.receive() returns FLModel
    R1->>R1: flare.receive() returns None

    R0->>R1: "broadcast_object_list(FLModel, src=0)"
    Note over R0,R1: all_reduce(rank_contribution) asserts sum=3.0

    Note over R0,R1: dist.barrier()

    R0->>NVFlare: flare.send(output_model)
    Note over R0,R1: finally: dist.destroy_process_group()
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant NVFlare as NVFlare Server
    participant R0 as torchrun rank 0
    participant R1 as torchrun rank 1

    NVFlare->>R0: "launch subprocess (torchrun --nproc_per_node=2)"
    Note over R0,R1: dist.init_process_group(gloo)
    R0->>R0: "flare.init(rank=0) starts FlareAgent"
    R1->>R1: "flare.init(rank=1) no FlareAgent"

    NVFlare->>R0: send FLModel via CellPipe
    R0->>R0: flare.receive() returns FLModel
    R1->>R1: flare.receive() returns None

    R0->>R1: "broadcast_object_list(FLModel, src=0)"
    Note over R0,R1: all_reduce(rank_contribution) asserts sum=3.0

    Note over R0,R1: dist.barrier()

    R0->>NVFlare: flare.send(output_model)
    Note over R0,R1: finally: dist.destroy_process_group()
Loading

Reviews (5): Last reviewed commit: "Merge branch 'main' into codex/client-ap..." | Re-trigger Greptile

YuanTingHsieh and others added 4 commits June 30, 2026 13:33
- Move dist.barrier() before flare.send() in the torchrun test script: with launch_once=false the Client API exits the process inside flare.send() after the result upload, so the trailing barrier was unreachable on rank 0 and rank 1 always crashed (launcher recorded COMPLETE_FAILED every run)
- Build the result FLModel only on rank 0 and drop unused metrics/meta (accuracy, torchrun_world_size, torchrun_rank)
- Remove the dead IntimeModelSelector from the test server config (it skips round 0 and the job only runs round 0)
- Apply the corrected rank docstring to APISpec.init, ExProcessClientAPI.init and InProcessClientAPI.init
- Apply the global/local rank split to both pt-ddp-docker example copies to match the updated multi-gpu example
- Broadcast the running state from rank 0 in the DDP examples so nonzero ranks exit the training loop cleanly at job end

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@YuanTingHsieh YuanTingHsieh added examples Label for all example related work cicd continuous integration/continuous development labels Jul 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cicd continuous integration/continuous development examples Label for all example related work

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants