fix: resolve qwen3.5-35ba3b megatron ep16 OOM via TP=2 (#2619) #2668
Open
sharonyu-115 wants to merge 2 commits into
The grpo-qwen3.5-35ba3b-2n8g-megatron-ep16 nightly began OOMing on the distributed-logprob path after the vLLM 0.17.1->0.20.0 bump (NVIDIA-NeMo#2384). Per issue NVIDIA-NeMo#2619, moving the MCore path from TP=1 to TP=2 (with expert_tensor_parallel_size=1 and sequence_parallel=True) halves the per-rank logprob tensor and clears the OOM.

Verified: steps 1 and 2 complete with zero OOM, including the step-2 logprob that previously failed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
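To illustrate the halving claim above, here is a rough back-of-the-envelope sketch. It assumes the saving comes from sharding the vocabulary dimension of the logits across TP ranks, which the description does not spell out. The helper names, token count, vocabulary size, and dtype width are all hypothetical values chosen for illustration, not figures from the nightly config or from #2619.

```python
def per_rank_logits_bytes(num_tokens: int, vocab_size: int, tp_size: int,
                          bytes_per_elem: int = 4) -> int:
    """Bytes held by one TP rank for a [num_tokens, vocab_size / tp_size] shard.

    Assumption: the vocabulary dimension is split evenly across TP ranks.
    """
    if vocab_size % tp_size != 0:
        raise ValueError("vocab_size must be divisible by tp_size in this sketch")
    return num_tokens * (vocab_size // tp_size) * bytes_per_elem


def dense_data_parallel_size(world_size: int, tp_size: int,
                             pp_size: int = 1, cp_size: int = 1) -> int:
    """Data-parallel size left over for the dense layers (hypothetical helper)."""
    model_parallel = tp_size * pp_size * cp_size
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp * pp * cp")
    return world_size // model_parallel


# Hypothetical numbers, NOT taken from the actual recipe.
NUM_TOKENS = 8192      # tokens in one logprob micro-batch (illustrative)
VOCAB_SIZE = 150_000   # illustrative vocabulary size, divisible by 2
WORLD_SIZE = 16        # 2 nodes x 8 GPUs, per the "2n8g" in the job name

tp1_bytes = per_rank_logits_bytes(NUM_TOKENS, VOCAB_SIZE, tp_size=1)
tp2_bytes = per_rank_logits_bytes(NUM_TOKENS, VOCAB_SIZE, tp_size=2)

print(f"TP=1 shard: {tp1_bytes / 2**30:.2f} GiB per rank")
print(f"TP=2 shard: {tp2_bytes / 2**30:.2f} GiB per rank")
print(f"dense DP size at TP=2: {dense_data_parallel_size(WORLD_SIZE, tp_size=2)}")
```

Under these assumptions the TP=2 shard is exactly half the TP=1 shard, and moving to TP=2 on 16 GPUs leaves a dense data-parallel size of 8. Actual savings depend on the real vocabulary size, batch shape, and dtype.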
yuki-97
approved these changes
Jun 2, 2026
Contributor
yuki-97
left a comment
Thanks @sharonyu-115 @zpqiu. Based on the investigation in #2619, the changes make sense to me.
@terrykong could you check if the changes make sense to you?
zpqiu
approved these changes
Jun 2, 2026
terrykong
approved these changes
Jun 8, 2026
Collaborator
/ok to test 8756321
Collaborator
/ok to test 8756321
Contributor
/ok to test 504fbfc