fix: resolve qwen3.5-35ba3b megatron ep16 OOM via TP=2 (#2619) by sharonyu-115 · Pull Request #2668 · NVIDIA-NeMo/RL

sharonyu-115 · 2026-06-02T06:01:20Z

The grpo-qwen3.5-35ba3b-2n8g-megatron-ep16 nightly began OOMing on the distributed-logprob path after the vLLM 0.17.1->0.20.0 bump (#2384). Per issue #2619, moving the MCore path from TP=1 to TP=2 (with expert_tensor_parallel_size=1 and sequence_parallel=True) halves the per-rank logprob tensor and clears the OOM.
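To make the "halves the per-rank logprob tensor" claim concrete, here is a rough back-of-the-envelope sketch. It assumes the logits feeding the logprob computation are sharded along the vocabulary dimension across tensor-parallel ranks and held in fp32. The token count and vocabulary size are illustrative placeholders, not values taken from this recipe, and `per_rank_logits_bytes` is a hypothetical helper, not a NeMo RL or Megatron-Core function.

```python
def per_rank_logits_bytes(
    num_tokens: int,
    vocab_size: int,
    tp_size: int,
    bytes_per_elem: int = 4,  # assumption: fp32 logits
) -> int:
    """Bytes held by one rank for its shard of a [num_tokens, vocab_size] logits tensor.

    Assumes the vocabulary dimension is split evenly across tp_size ranks.
    """
    if tp_size < 1 or vocab_size % tp_size != 0:
        raise ValueError("vocab_size must be divisible by a positive tp_size")
    return num_tokens * (vocab_size // tp_size) * bytes_per_elem


# Illustrative sizes only (not taken from the PR or the recipe).
NUM_TOKENS = 32_768
VOCAB_SIZE = 131_072

tp1_bytes = per_rank_logits_bytes(NUM_TOKENS, VOCAB_SIZE, tp_size=1)
tp2_bytes = per_rank_logits_bytes(NUM_TOKENS, VOCAB_SIZE, tp_size=2)

print(f"TP=1: {tp1_bytes / 2**30:.1f} GiB per rank")
print(f"TP=2: {tp2_bytes / 2**30:.1f} GiB per rank")
```

Under these assumptions the per-rank shard scales as 1/TP, so going from TP=1 to TP=2 cuts this one tensor in half. Actual peak memory also depends on activations, optimizer state, and whatever vLLM keeps resident.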

The grpo-qwen3.5-35ba3b-2n8g-megatron-ep16 nightly began OOMing on the distributed-logprob path after the vLLM 0.17.1->0.20.0 bump (NVIDIA-NeMo#2384). Per issue NVIDIA-NeMo#2619, moving the MCore path from TP=1 to TP=2 (with expert_tensor_parallel_size=1 and sequence_parallel=True) halves the per-rank logprob tensor and clears the OOM.

Verified: steps 1 and 2 complete with zero OOM, including the step-2 logprob that previously failed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
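As a sanity check on why TP=2 is paired with expert_tensor_parallel_size=1, the sketch below works through the rank arithmetic. It reads "2n8g" in the recipe name as 2 nodes x 8 GPUs (16 ranks) and assumes pipeline and context parallel sizes of 1, neither of which the PR states. It also assumes the common Megatron-Core style convention that dense layers need world_size divisible by TP x PP x CP and expert layers need it divisible by ETP x EP x PP. `ParallelLayout` is a hypothetical helper for illustration, not part of NeMo RL or Megatron-Core.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical helper: checks that parallel sizes tile the world size."""

    world_size: int
    tp: int      # tensor parallel size for dense layers
    ep: int      # expert parallel size
    etp: int     # expert tensor parallel size
    pp: int = 1  # assumption: no pipeline parallelism
    cp: int = 1  # assumption: no context parallelism

    def dense_dp(self) -> int:
        """Data-parallel replicas of the dense (attention) layers."""
        denom = self.tp * self.pp * self.cp
        if self.world_size % denom != 0:
            raise ValueError(f"world_size {self.world_size} not divisible by {denom}")
        return self.world_size // denom

    def expert_dp(self) -> int:
        """Data-parallel replicas of the expert (MoE) layers."""
        denom = self.etp * self.ep * self.pp
        if self.world_size % denom != 0:
            raise ValueError(f"world_size {self.world_size} not divisible by {denom}")
        return self.world_size // denom


# Assumed world size: 2 nodes x 8 GPUs.
before = ParallelLayout(world_size=16, tp=1, ep=16, etp=1)
after = ParallelLayout(world_size=16, tp=2, ep=16, etp=1)

print("before: dense DP =", before.dense_dp(), "expert DP =", before.expert_dp())
print("after:  dense DP =", after.dense_dp(), "expert DP =", after.expert_dp())

# With ETP=2 and EP=16 the expert layers would need 32 ranks, which 16 GPUs cannot tile.
try:
    ParallelLayout(world_size=16, tp=2, ep=16, etp=2).expert_dp()
except ValueError as err:
    print("invalid layout:", err)
```

Under these assumptions, keeping the expert tensor parallel size at 1 lets the experts stay spread across all 16 ranks while only the dense path is split two ways.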

copy-pr-bot · 2026-06-02T06:01:24Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yuki-97

Thanks @sharonyu-115 @zpqiu. Based on the investigation in #2619, the changes make sense to me.
@terrykong, could you check whether they make sense to you as well?

terrykong · 2026-06-08T04:34:10Z

/ok to test 8756321

terrykong · 2026-06-08T04:34:37Z

/ok to test 8756321

yuki-97 · 2026-06-08T06:54:04Z

/ok to test 504fbfc

sharonyu-115 requested a review from a team as a code owner June 2, 2026 06:01

sharonyu-115 requested review from yuki-97 and zpqiu June 2, 2026 07:13

yuki-97 approved these changes Jun 2, 2026

yuki-97 requested a review from terrykong June 2, 2026 08:12

zpqiu approved these changes Jun 2, 2026

terrykong approved these changes Jun 8, 2026

terrykong added the CI:Lfast label (runs a fast test suite and reuses the nightly `main` container, but syncs dependencies to the PR's version) Jun 8, 2026

copy-pr-bot Bot temporarily deployed to public June 8, 2026 04:34 Inactive

terrykong enabled auto-merge (squash) June 8, 2026 04:34

copy-pr-bot Bot temporarily deployed to public June 8, 2026 04:34 Inactive

copy-pr-bot Bot temporarily deployed to public June 8, 2026 04:39 Inactive

Merge branch 'main' into qwen35-35ba3b-megatron-ep16-oom-fix

504fbfc

copy-pr-bot Bot temporarily deployed to public June 8, 2026 06:54 Inactive

copy-pr-bot Bot temporarily deployed to public June 8, 2026 07:05 Inactive

sharonyu-115 wants to merge 2 commits into NVIDIA-NeMo:main from sharonyu-115:qwen35-35ba3b-megatron-ep16-oom-fix
