fix: Fix fp8 memory fragmentation by ashors1 · Pull Request #2670 · NVIDIA-NeMo/RL

ashors1 · 2026-06-02T18:12:25Z

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

closes #2003

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

copy-pr-bot · 2026-06-02T18:12:29Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

ZhiyuLi-Nvidia · 2026-06-05T03:09:27Z

+            if hasattr(te, "module") and hasattr(te.module.base, "clear_workspace"):
+                te.module.base.clear_workspace()
+        except ImportError:
+            pass


Hi @ashors1, this might be my issue. I can't find this in TE, and I don't think this piece of code helps with memory savings. Maybe we can simply remove it?
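
One way to check locally whether an optional hook like the one in the hunk above exists is a small guarded lookup. This is a minimal sketch, not code from the PR: `call_optional_hook` is a hypothetical helper, and the module path in the usage comment is an assumption about what `te` refers to in the diff.

```python
import importlib


def call_optional_hook(module_name, attr_path):
    """Call module_name.<attr_path>() if it exists; return True only if it was called."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        # Module is not installed, so there is nothing to call.
        return False
    for name in attr_path.split("."):
        obj = getattr(obj, name, None)
        if obj is None:
            # Attribute is missing somewhere along the dotted path.
            return False
    if not callable(obj):
        return False
    obj()
    return True


# Hypothetical usage (assumes `te` in the diff is transformer_engine.pytorch):
# called = call_optional_hook("transformer_engine.pytorch", "module.base.clear_workspace")
```

If the helper returns `False` for the installed library version, the guarded call in the diff is a no-op, which would support removing it.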

ZhiyuLi-Nvidia · 2026-06-05T03:18:08Z

+        gc.collect()
+        torch.cuda.empty_cache()


Would it be more efficient to run gc.collect() and torch.cuda.empty_cache() just once after all references are cleared? Calling them frequently might introduce unnecessary overhead to the training pipeline.
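
The suggestion above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `release_cached_memory` and its `holders` argument are hypothetical names, and the `torch` import is made optional so the sketch also runs without PyTorch installed.

```python
import gc

try:
    import torch
except ImportError:  # torch is optional for this sketch
    torch = None


def release_cached_memory(holders):
    """Drop every reference first, then collect and empty the CUDA cache once."""
    for holder in holders:
        # Only drop references here; no per-holder gc.collect() or empty_cache().
        holder.clear()
    # A single collection pass after all references are gone.
    collected = gc.collect()
    # A single cache release, only when CUDA is actually usable.
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
    return collected
```

Here `holders` is any iterable of containers with a `clear()` method, such as dicts or lists that hold tensors. Whether one final pass frees as much memory as repeated calls depends on the allocation pattern, so it is worth measuring both variants.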

ZhiyuLi-Nvidia · 2026-06-05T03:19:09Z

+            gc.collect()
+            torch.cuda.empty_cache()


Same as above:
Would it be more efficient to run gc.collect() and torch.cuda.empty_cache() just once after all references are cleared? Calling them frequently might introduce unnecessary overhead to the training pipeline.

ZhiyuLi-Nvidia

Thank you @ashors1 for driving the fix!
Just left some nits.
LGTM!

ZhiyuLi-Nvidia and others added 4 commits June 2, 2026 10:53

additional memory clear

ea5beff

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com> WIP oom Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com> for test Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com> memory

fixes to reduce allocated/reserved memory after offload

8a7da38

Signed-off-by: Anna Shors <ashors@nvidia.com>

revert config changes

37294c4

Signed-off-by: ashors1 <ashors@nvidia.com>

small cleanup of comments

10469c3

Signed-off-by: ashors1 <ashors@nvidia.com>

make memory optimizations configurable via clear_memory_caches_after_…

97efa3a

…refit Signed-off-by: Anna Shors <ashors@nvidia.com>

ashors1 marked this pull request as ready for review June 5, 2026 02:53

ashors1 requested review from a team as code owners June 5, 2026 02:53

ashors1 requested a review from ZhiyuLi-Nvidia June 5, 2026 02:53

ZhiyuLi-Nvidia reviewed Jun 5, 2026

ZhiyuLi-Nvidia previously approved these changes Jun 5, 2026

ashors1 added 2 commits June 5, 2026 12:20

Merge branch 'main' of github.com:NVIDIA-NeMo/RL into ashors/fix-fp8-…

9e8f799

…memory

address comments

468ee20

Signed-off-by: Anna Shors <ashors@nvidia.com>

ashors1 dismissed ZhiyuLi-Nvidia’s stale review via 468ee20 June 5, 2026 21:48

remove redundant gc.collect

ec41a13

Signed-off-by: Anna Shors <ashors@nvidia.com>

ZhiyuLi-Nvidia previously approved these changes Jun 5, 2026

improve documentation

378c101

Signed-off-by: ashors1 <ashors@nvidia.com>

ashors1 dismissed ZhiyuLi-Nvidia’s stale review via 378c101 June 6, 2026 01:53

ashors1 wants to merge 9 commits into main from ashors/fix-fp8-memory

2 participants
