[release/2.12] fix leak in CUDAGraph::capture_end (#180395) by dnikolaev-amd · Pull Request #3357 · ROCm/pytorch

dnikolaev-amd · 2026-06-23T14:01:58Z

CUDAGraph::capture_end() previously performed a CUDA error check (AT_CUDA_CHECK(endCaptureErr)) immediately after calling cudaStreamEndCapture. If this CUDA call returned an error, the check would throw an exception, bypassing the subsequent calls to:

- c10::cuda::CUDACachingAllocator::endAllocateToPool
- at::getHostAllocator(at::kCUDA)->end_allocate_to_pool

This left the CUDACachingAllocator in a state where it believed a capture was still underway for that specific memory pool. Any subsequent attempt to synchronize (e.g., during garbage collection of a MemPool object) would then trigger the captures_underway.empty() assertion failure and crash the process.

libc++abi: terminating due to uncaught exception of type c10::Error: captures_underway.empty() INTERNAL ASSERT FAILED at "third_party/py/torch/c10/cuda/CUDACachingAllocator.cpp":3941, please report a bug to PyTorch.
Exception raised from synchronize_and_free_events at [third_party/py/torch/c10/cuda/CUDACachingAllocator.cpp:3941](https://cs.corp.google.com/piper///depot/google3/third_party/py/torch/c10/cuda/CUDACachingAllocator.cpp?l=3941&ws=ddelgadovargas/171324&snapshot=5186) (most recent call first):
C++ CapturedTraceback:
#4 0x5583bafd4db5: c10::Error::Error(c10::SourceLocation, std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>) from ??:0
#5 0x5583bafd2077: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0
#6 0x5583a3d50749: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, c10::detail::CompileTimeEmptyString) from ??:0
#7 0x5583bac90543: c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::synchronize_and_free_events(std::__u::shared_ptr<c10::GatheredContext> const&, c10::cuda::CUDACachingAllocator::Native::(anonymous namespace)::PrivatePool*) from ??:0
#8 0x5583bac7f720: c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::release_cached_blocks(std::__u::shared_ptr<c10::GatheredContext> const&, std::__u::pair<unsigned long long, unsigned long long>) from ??:0
#9 0x5583bac93670: c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::emptyCache(std::__u::pair<unsigned long long, unsigned long long>) from ??:0
#10 0x5583bac6355f: c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::emptyCache(std::__u::pair<unsigned long long, unsigned long long>) from ??:0
#11 0x5583a9758035: at::cuda::MemPool::~MemPool() from ??:0

This change reorders the logic in CUDAGraph::capture_end() to call the allocator's "end capture" notification before checking the error status of the CUDA call. This ensures that the allocator's internal state is always synchronized with the actual state of the CUDA stream, preventing "zombie" captures from leaking and causing crashes in unrelated code paths.
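The ordering issue can be illustrated with a toy model in Python. This is only a sketch: `ToyAllocator`, `stream_end_capture`, `check`, and both `capture_end_*` functions are hypothetical stand-ins invented for illustration, not the actual PyTorch C++ code or its API. It assumes the capture call reports failure through a return code and that the allocator tracks open captures in a set.

```python
class CaptureError(RuntimeError):
    """Stand-in for the exception thrown by the CUDA error check."""


class ToyAllocator:
    """Hypothetical model of the allocator's capture bookkeeping."""

    def __init__(self):
        self.captures_underway = set()

    def begin_allocate_to_pool(self, pool_id):
        self.captures_underway.add(pool_id)

    def end_allocate_to_pool(self, pool_id):
        self.captures_underway.discard(pool_id)

    def synchronize_and_free_events(self):
        # Models the captures_underway.empty() internal assert in the trace.
        if self.captures_underway:
            raise AssertionError("captures_underway.empty() failed")


def stream_end_capture(fail):
    """Returns an error code rather than raising, like a C API call."""
    return 1 if fail else 0


def check(err):
    """Raises if the error code is non-zero, like an error-check macro."""
    if err != 0:
        raise CaptureError(f"end capture failed with code {err}")


def capture_end_old(alloc, pool_id, fail):
    err = stream_end_capture(fail)
    check(err)  # raises on failure, so the cleanup below is skipped
    alloc.end_allocate_to_pool(pool_id)


def capture_end_new(alloc, pool_id, fail):
    err = stream_end_capture(fail)
    alloc.end_allocate_to_pool(pool_id)  # always runs, even on failure
    check(err)  # the error is still reported to the caller


def run(capture_end):
    """Fails a capture, then synchronizes as a later MemPool cleanup would."""
    alloc = ToyAllocator()
    alloc.begin_allocate_to_pool("pool0")
    try:
        capture_end(alloc, "pool0", fail=True)
    except CaptureError:
        pass  # the caller handles the failed capture and moves on
    try:
        alloc.synchronize_and_free_events()
        return "ok"
    except AssertionError:
        return "assert failed"
```

In this model, `run(capture_end_old)` returns `"assert failed"` because the pool is never removed from `captures_underway`, while `run(capture_end_new)` returns `"ok"`. Both variants still surface the capture error to the caller; only the order of cleanup and error check differs.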

Pull Request resolved: pytorch#180395
Approved by: https://github.com/eee4017, https://github.com/Skylion007

(cherry picked from commit 6fa359b)

A minimal reproducer is to run these two tests in a single process:

  python test_cuda.py -v \
    TestCuda.test_graph_rng_after_failed_capture \
    TestMemPool.test_deleted_mempool_not_used_on_oom

Cherry-picked to release/2.11 branch via #3367

Cherry-picked to release/2.10 branch via #3368


rocm-repo-management-api · 2026-06-23T14:14:20Z

Jenkins build for 9d4d043fd0a194b97dad3d733d55be11abae4751 commit finished as FAILURE

dnikolaev-amd · 2026-06-24T13:21:18Z

!cherry-pick --onto release/2.11 release/2.10

rocm-repo-management-api-6 · 2026-06-24T13:37:16Z

Created branch autogenerated/release/2.11_cherry-pick_pr-3357 and #3367. It contains a merge conflict. Please resolve it.

Created branch autogenerated/release/2.10_cherry-pick_pr-3357 and #3368. It contains a merge conflict. Please resolve it.

Comment processed by Build

dnikolaev-amd requested a review from pragupta June 23, 2026 14:01

dnikolaev-amd changed the title ~~fix: leak in CUDAGraph::capture_end (#180395)~~ [release/2.12] fix leak in CUDAGraph::capture_end (#180395) Jun 23, 2026

pragupta merged commit 08b5c32 into ROCm:release/2.12 Jun 23, 2026
0 of 2 checks passed

This was referenced Jun 24, 2026

[AUTOGENERATED] [release/2.11] fix leak in CUDAGraph::capture_end (#180395) #3367

Closed

[AUTOGENERATED] [release/2.10] fix leak in CUDAGraph::capture_end (#180395) #3368

Closed

pragupta merged 1 commit into ROCm:release/2.12 from dnikolaev-amd:dnikolaev/fix_leak_cuda_graph_capture_end_2.12
