[DoNotMerge] Modularity/wire weight sync TransferQueue validation#2690
Draft
saumishr wants to merge 4 commits into
Adds Transfer-Queue (data_plane.impl=transfer_queue) coverage to the
nightly suite across the simple and mooncake_cpu backends.
For each of 18 GRPO/ProRLv2 recipes that are exercised under TQ, this
introduces a paired (.yaml, .sh) at the conventional locations:
  examples/configs/recipes/llm/<base>-tq_{simple,mooncake}.yaml
      Inherits from the base recipe via `defaults:` and enables
      data_plane (minimized: `impl: transfer_queue` and the simple
      backend are inherited defaults).

  tests/test_suites/llm/<base>-tq_{simple,mooncake}.sh
      Has its own CONFIG block (mirrors the base recipe's Slurm
      resource params) so tools/launch can allocate the job, then
      sources common-tq.env (which sets BASE_RECIPE / TQ_EXP_NAME and
      enforces a grpo|dapo|prorlv2 prefix) and delegates to the base
      recipe script:

          export EXP_NAME="$TQ_EXP_NAME"
          bash "$SCRIPT_DIR/$BASE_RECIPE.sh" "$@"

      The export lets common.env (in the base subshell) use the TQ
      identity for $EXP_DIR / $LOG_DIR / $CKPT_DIR / $JSON_METRICS /
      wandb.name; the auto-derived $CONFIG_PATH lands on the TQ YAML,
      which inherits the base config and enables data_plane.

Coverage matrix (18 wrappers, 10 simple + 8 mooncake) spans:
  algo:       grpo, prorlv2
  strategy:   fsdp2tp1, fsdp2tp2, megatron, megatron-fp8 (rollouts +
              e2e), megatron-eagle3, megatron-pack-cp
  model:      qwen2.5-math-1.5B, llama3.2-1B, llama3.1-8B, gemma3-1B,
              deepscaler-1.5B, gspo-deepscaler-1.5B, moonlight-16BA3B,
              qwen3-1.7B, nanov3-30BA3B, qwen3-8B-base
  scale:      1n8g (×13), 2n8g (×4), 4n8g (×1)
  features:   FP8 rollouts, FP8 e2e, non-colocated, eagle3 spec,
              custom sampling (temp/top-p/top-k), megatron_generation,
              context-parallel + sequence packing, LoRA
  TQ backend: simple, mooncake_cpu

common.env: one-line semantic change — `EXP_NAME` accepts a caller-
provided value via `${EXP_NAME:-…}`, falling back to `basename $0 .sh`
when unset. Backward-compatible — base recipes run unchanged.
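
The fallback described above can be mirrored in Python to make the semantics explicit. This is only an illustrative sketch, not code from the PR: `resolve_exp_name` is a hypothetical helper, and the recipe names in the usage note are made up. It assumes the shell form is `${EXP_NAME:-$(basename $0 .sh)}`, in which the `:-` operator falls back when the variable is unset or empty.

```python
import os
from typing import Mapping, Optional


def resolve_exp_name(script_path: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical Python analogue of ${EXP_NAME:-$(basename $0 .sh)}."""
    env = os.environ if env is None else env

    # `basename $0 .sh`: drop the directory part, then a trailing ".sh"
    # (basename leaves the name alone when it is exactly the suffix).
    name = os.path.basename(script_path)
    if name.endswith(".sh") and name != ".sh":
        name = name[: -len(".sh")]

    # `:-` treats an empty value like an unset one, which `or` reproduces.
    return env.get("EXP_NAME") or name
```

With no `EXP_NAME` in the environment, a base recipe such as the made-up `tests/test_suites/llm/grpo-example.sh` resolves to `grpo-example`, as before. When a wrapper exports `EXP_NAME` first, the exported value wins, which is what lets the base script run under the TQ identity.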
tests/test_suites/nightly.txt: new "Transfer Queue (TQ) coverage"
section listing the 18 wrapper paths so they pick up the existing
nightly schedule.
tests/unit/test_recipes_and_test_suites.py: raise nightly compute cap
1360 → 1800 to absorb the ~427 GPU-hr the TQ wrappers add. The YAML
1:1 invariant requires no exemption — each wrapper has a paired YAML.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
…refit logic

Replace refit_policy_generation() calls and NEED_REFIT/POLICY_GENERATION_STALE flags in grpo.py and distillation.py with WeightSynchronizer method calls (sync_weights, mark_stale, is_stale). The setup() functions now create and initialize the appropriate WeightSynchronizer and return it in the tuple.

Key changes:
- grpo.py setup(): create WeightSynchronizer via factory, replace inline init_collective/prepare_refit_info with weight_sync.init_communicator()
- grpo_train/async_grpo_train: accept weight_sync param, use it for all weight transfer instead of refit_policy_generation()
- distillation.py: same treatment for distillation_train()
- refit_policy_generation(): kept with deprecation warning for external users
- Factory: keep NotImplementedError for non-colocated SGLang (SGLang's update_weights_from_collective() is a no-op, would silently skip transfer)
- All example scripts updated to thread weight_sync through

Signed-off-by: Saurabh Mishra <sauramishra@nvidia.com>
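
The shape of the change described in this commit can be sketched as follows. This is a minimal illustration, not the PR's implementation. Only the method names (`sync_weights`, `mark_stale`, `is_stale`, `init_communicator`) come from the commit message. The class layout, `CountingSynchronizer`, `create_weight_synchronizer`, `train_loop`, their signatures, and the choice to skip the transfer when weights are fresh are all assumptions.

```python
from abc import ABC, abstractmethod


class WeightSynchronizer(ABC):
    """Hypothetical interface; only the public method names are from the PR."""

    def __init__(self) -> None:
        # Assumption: generation workers start without the policy weights.
        self._stale = True

    def init_communicator(self) -> None:
        """One-time setup hook (the PR text says refit info is prepared here)."""

    def mark_stale(self) -> None:
        self._stale = True

    def is_stale(self) -> bool:
        return self._stale

    def sync_weights(self) -> None:
        # Assumption: syncing is skipped when the weights are already fresh.
        if not self._stale:
            return
        self._transfer()
        self._stale = False

    @abstractmethod
    def _transfer(self) -> None:
        """Backend-specific weight transfer."""


class CountingSynchronizer(WeightSynchronizer):
    """Stand-in backend that only counts transfers."""

    def __init__(self) -> None:
        super().__init__()
        self.transfers = 0

    def _transfer(self) -> None:
        self.transfers += 1


def create_weight_synchronizer(backend: str, colocated: bool) -> WeightSynchronizer:
    """Hypothetical factory mirroring the guard described in the commit."""
    if backend == "sglang" and not colocated:
        # Fail loudly rather than silently skipping the transfer.
        raise NotImplementedError("non-colocated SGLang weight sync is not supported")
    return CountingSynchronizer()


def train_loop(weight_sync: WeightSynchronizer, steps: int) -> None:
    """Replaces module-level staleness flags with calls on the synchronizer."""
    for _ in range(steps):
        if weight_sync.is_stale():
            weight_sync.sync_weights()
        # ... rollout with fresh weights, then a policy update ...
        weight_sync.mark_stale()  # the update invalidates generation weights
```

Keeping the staleness state on the object rather than in module-level flags means each training entry point receives one `weight_sync` argument and cannot drift out of step with another copy of the flag.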
…ation

When generation uses the megatron framework backend, policy_generation is None, so setup() does not create a WeightSynchronizer and therefore never calls init_communicator() (which is where prepare_refit_info() now runs for the vLLM/SGLang/non-colocated paths). As a result the policy's refit metadata was left uninitialized (MegatronPolicyWorker.refit_conversion_tasks stays None), and the in-place weight conversion on the first rollout dereferenced uninitialized state, crashing with "CUDA: illegal memory access".

Previously this call was unconditional in setup(); wiring it into the WeightSynchronizer inadvertently gated it behind policy_generation is not None. Restore it via an else branch for the megatron-generation case.
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Additional Information