[DoNotMerge] Modularity/wire weight sync TransferQueue validation by saumishr · Pull Request #2690 · NVIDIA-NeMo/RL

saumishr · 2026-06-04T17:56:38Z

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

Adds Transfer-Queue (data_plane.impl=transfer_queue) coverage to the nightly suite across the simple and mooncake_cpu backends.

For each of 18 GRPO/ProRLv2 recipes that are exercised under TQ, this introduces a paired (.yaml, .sh) at the conventional locations:

  examples/configs/recipes/llm/<base>-tq_{simple,mooncake}.yaml
    Inherits from the base recipe via `defaults:` and enables data_plane (minimized: `impl: transfer_queue` and the simple backend are inherited defaults).

  tests/test_suites/llm/<base>-tq_{simple,mooncake}.sh
    Has its own CONFIG block (mirrors the base recipe's Slurm resource params) so tools/launch can allocate the job, then sources common-tq.env (which sets BASE_RECIPE / TQ_EXP_NAME and enforces a grpo|dapo|prorlv2 prefix) and delegates to the base recipe script:

      export EXP_NAME="$TQ_EXP_NAME"
      bash "$SCRIPT_DIR/$BASE_RECIPE.sh" "$@"

    The export lets common.env (in the base subshell) use the TQ identity for $EXP_DIR / $LOG_DIR / $CKPT_DIR / $JSON_METRICS / wandb.name; the auto-derived $CONFIG_PATH lands on the TQ YAML, which inherits the base config and enables data_plane.

Coverage matrix (18 wrappers, 10 simple + 8 mooncake) spans:

  algo:       grpo, prorlv2
  strategy:   fsdp2tp1, fsdp2tp2, megatron, megatron-fp8 (rollouts + e2e), megatron-eagle3, megatron-pack-cp
  model:      qwen2.5-math-1.5B, llama3.2-1B, llama3.1-8B, gemma3-1B, deepscaler-1.5B, gspo-deepscaler-1.5B, moonlight-16BA3B, qwen3-1.7B, nanov3-30BA3B, qwen3-8B-base
  scale:      1n8g (×13), 2n8g (×4), 4n8g (×1)
  features:   FP8 rollouts, FP8 e2e, non-colocated, eagle3 spec, custom sampling (temp/top-p/top-k), megatron_generation, context-parallel + sequence packing, LoRA
  TQ backend: simple, mooncake_cpu

common.env: one-line semantic change. `EXP_NAME` accepts a caller-provided value via `${EXP_NAME:-…}`, falling back to `basename $0 .sh` when unset. Backward-compatible: base recipes run unchanged.

tests/test_suites/nightly.txt: new "Transfer Queue (TQ) coverage" section listing the 18 wrapper paths so they pick up the existing nightly schedule.

tests/unit/test_recipes_and_test_suites.py: raise nightly compute cap 1360 → 1800 to absorb the ~427 GPU-hr the TQ wrappers add. The YAML 1:1 invariant requires no exemption: each wrapper has a paired YAML.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
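The `EXP_NAME` fallback and the paired-file naming convention described in this commit can be illustrated with a small Python sketch. This is not code from the PR, which implements the behavior in shell. `resolve_exp_name` and `wrapper_paths` are hypothetical helpers, and the recipe name `grpo-example-1n8g` is made up. The sketch assumes the usual `${VAR:-default}` semantics, where the default is used when the variable is unset or empty.

```python
import os
from pathlib import PurePosixPath

# Hypothetical helpers for illustration only; the PR does this in shell.
ALLOWED_PREFIXES = ("grpo", "dapo", "prorlv2")  # prefixes named in the commit message
BACKENDS = ("simple", "mooncake")               # suffixes used in the wrapper file names


def resolve_exp_name(script_path, env=None):
    """Mimic a `${EXP_NAME:-<fallback>}` expansion.

    The fallback is the script's basename with a trailing `.sh` removed,
    matching `basename $0 .sh`.
    """
    env = os.environ if env is None else env
    fallback = PurePosixPath(script_path).name.removesuffix(".sh")
    # `:-` falls back when the variable is unset OR empty, hence `or`.
    return env.get("EXP_NAME") or fallback


def wrapper_paths(base, backend):
    """Return the paired (yaml, sh) paths for a TQ wrapper of `base`."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown TQ backend suffix: {backend!r}")
    if not base.startswith(ALLOWED_PREFIXES):
        raise ValueError(f"recipe {base!r} lacks a grpo|dapo|prorlv2 prefix")
    name = f"{base}-tq_{backend}"
    return (
        f"examples/configs/recipes/llm/{name}.yaml",
        f"tests/test_suites/llm/{name}.sh",
    )


if __name__ == "__main__":
    yaml_path, sh_path = wrapper_paths("grpo-example-1n8g", "simple")
    print(yaml_path)
    print(sh_path)
    # A wrapper exports EXP_NAME, so the base script keeps the TQ identity.
    print(resolve_exp_name("grpo-example-1n8g.sh",
                           env={"EXP_NAME": "grpo-example-1n8g-tq_simple"}))
```

Because the fallback only applies when no value is supplied, a base recipe launched directly still derives its name from its own file name, which is why the change is backward-compatible.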

…refit logic

Replace refit_policy_generation() calls and NEED_REFIT/POLICY_GENERATION_STALE flags in grpo.py and distillation.py with WeightSynchronizer method calls (sync_weights, mark_stale, is_stale). The setup() functions now create and initialize the appropriate WeightSynchronizer and return it in the tuple.

Key changes:
- grpo.py setup(): create WeightSynchronizer via factory, replace inline init_collective/prepare_refit_info with weight_sync.init_communicator()
- grpo_train/async_grpo_train: accept weight_sync param, use it for all weight transfer instead of refit_policy_generation()
- distillation.py: same treatment for distillation_train()
- refit_policy_generation(): kept with deprecation warning for external users
- Factory: keep NotImplementedError for non-colocated SGLang (SGLang's update_weights_from_collective() is a no-op, would silently skip transfer)
- All example scripts updated to thread weight_sync through

Signed-off-by: Saurabh Mishra <sauramishra@nvidia.com>
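The shape of this refactor can be sketched as follows. This is a minimal illustration, not the PR's implementation: only the method names (`init_communicator`, `sync_weights`, `mark_stale`, `is_stale`) and the behaviors listed above come from the commit message. The signatures, the `CountingSynchronizer` stand-in, the `create_weight_synchronizer` factory name and arguments, and the choice to skip a transfer when weights are already fresh are all assumptions.

```python
import warnings
from abc import ABC, abstractmethod


class WeightSynchronizer(ABC):
    """Hypothetical interface; replaces ad-hoc NEED_REFIT-style flags."""

    def __init__(self):
        self._stale = True    # assumed: generation weights start out of date
        self._ready = False

    def init_communicator(self):
        # Stands in for collective setup and refit-metadata preparation.
        self._ready = True

    def mark_stale(self):
        self._stale = True

    def is_stale(self):
        return self._stale

    def sync_weights(self):
        if not self._ready:
            raise RuntimeError("call init_communicator() before sync_weights()")
        if self._stale:       # assumed: no transfer when already up to date
            self._transfer()
            self._stale = False

    @abstractmethod
    def _transfer(self):
        """Move policy weights to the generation workers."""


class CountingSynchronizer(WeightSynchronizer):
    """Stand-in transport that only counts transfers."""

    def __init__(self):
        super().__init__()
        self.transfers = 0

    def _transfer(self):
        self.transfers += 1


def create_weight_synchronizer(backend, colocated):
    """Hypothetical factory; fails loudly rather than silently skipping."""
    if backend == "sglang" and not colocated:
        raise NotImplementedError("non-colocated SGLang weight sync is unsupported")
    return CountingSynchronizer()


def refit_policy_generation(weight_sync):
    """Deprecated shim kept for external callers (signature assumed)."""
    warnings.warn(
        "refit_policy_generation() is deprecated; use sync_weights()",
        DeprecationWarning,
        stacklevel=2,
    )
    weight_sync.sync_weights()


if __name__ == "__main__":
    weight_sync = create_weight_synchronizer("vllm", colocated=True)
    weight_sync.init_communicator()
    for _ in range(3):
        weight_sync.sync_weights()   # refresh generation weights if stale
        # ... rollout and policy update would happen here ...
        weight_sync.mark_stale()     # training changed the policy weights
    print(weight_sync.transfers)
```

Keeping the staleness flag inside the synchronizer means the training loop no longer has to carry module-level state between steps, which is the main point of threading `weight_sync` through the call sites.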

copy-pr-bot · 2026-06-04T17:56:42Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…ation

When generation uses the megatron framework backend, policy_generation is None, so setup() does not create a WeightSynchronizer and therefore never calls init_communicator() (which is where prepare_refit_info() now runs for the vLLM/SGLang/non-colocated paths). As a result the policy's refit metadata was left uninitialized (MegatronPolicyWorker.refit_conversion_tasks stays None), and the in-place weight conversion on the first rollout dereferenced uninitialized state, crashing with "CUDA: illegal memory access".

Previously the prepare_refit_info() call was unconditional in setup(); wiring it into the WeightSynchronizer inadvertently gated it behind `policy_generation is not None`. Restore it via an else branch for the megatron-generation case.

ZhiyuLi-Nvidia and others added 3 commits June 1, 2026 16:08

Merge branch 'pr-2616' into modularity/wire-weight-sync-tq

581f181

saumishr wants to merge 4 commits into NVIDIA-NeMo:main from saumishr:modularity/wire-weight-sync-tq

2 participants
