[release/2.12] fix leak in CUDAGraph::capture_end (#180395) #3357

Merged: pragupta merged 1 commit into ROCm:release/2.12 from dnikolaev-amd:dnikolaev/fix_leak_cuda_graph_capture_end_2.12 on Jun 23, 2026
@dnikolaev-amd dnikolaev-amd commented Jun 23, 2026

CUDAGraph::capture_end() previously performed a CUDA error check (AT_CUDA_CHECK(endCaptureErr)) immediately after calling cudaStreamEndCapture. If this CUDA call returned an error, the check would throw an exception, bypassing the subsequent calls to:

  • c10::cuda::CUDACachingAllocator::endAllocateToPool
  • at::getHostAllocator(at::kCUDA)->end_allocate_to_pool

This left the CUDACachingAllocator in a state where it believed a capture was still underway for that specific memory pool. Any subsequent attempt to synchronize (e.g., during garbage collection of a MemPool object) would then trigger the captures_underway.empty() assertion failure and crash the process.

libc++abi: terminating due to uncaught exception of type c10::Error: captures_underway.empty() INTERNAL ASSERT FAILED at "third_party/py/torch/c10/cuda/CUDACachingAllocator.cpp":3941, please report a bug to PyTorch.
Exception raised from synchronize_and_free_events at [third_party/py/torch/c10/cuda/CUDACachingAllocator.cpp:3941](https://cs.corp.google.com/piper///depot/google3/third_party/py/torch/c10/cuda/CUDACachingAllocator.cpp?l=3941&ws=ddelgadovargas/171324&snapshot=5186) (most recent call first):
C++ CapturedTraceback:
#4 0x5583bafd4db5: c10::Error::Error(c10::SourceLocation, std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>) from ??:0
#5 0x5583bafd2077: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0
#6 0x5583a3d50749: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, c10::detail::CompileTimeEmptyString) from ??:0
#7 0x5583bac90543: c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::synchronize_and_free_events(std::__u::shared_ptr<c10::GatheredContext> const&, c10::cuda::CUDACachingAllocator::Native::(anonymous namespace)::PrivatePool*) from ??:0
#8 0x5583bac7f720: c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::release_cached_blocks(std::__u::shared_ptr<c10::GatheredContext> const&, std::__u::pair<unsigned long long, unsigned long long>) from ??:0
#9 0x5583bac93670: c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::emptyCache(std::__u::pair<unsigned long long, unsigned long long>) from ??:0
#10 0x5583bac6355f: c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::emptyCache(std::__u::pair<unsigned long long, unsigned long long>) from ??:0
#11 0x5583a9758035: at::cuda::MemPool::~MemPool() from ??:0
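
The failure mode described above can be reproduced in miniature without a GPU. The following is only a sketch under stated assumptions: `MockAllocator`, `check_status`, and the integer status code are hypothetical stand-ins for the caching allocator, `AT_CUDA_CHECK`, and the CUDA error value. None of it is the actual PyTorch code. It only shows how throwing before the cleanup call leaves the pool registered as "capture underway".

```cpp
#include <set>
#include <stdexcept>

namespace before_fix {

// Hypothetical stand-in for the allocator's capture bookkeeping.
struct MockAllocator {
  std::set<int> captures_underway;  // ids of pools with an active capture
  void beginAllocateToPool(int pool) { captures_underway.insert(pool); }
  void endAllocateToPool(int pool) { captures_underway.erase(pool); }
};

// Hypothetical stand-in for an error-check macro: throws on non-zero status.
inline void check_status(int status) {
  if (status != 0) {
    throw std::runtime_error("end of stream capture failed");
  }
}

// Old ordering: the status check runs first, so a failure skips the cleanup.
inline void capture_end(MockAllocator& alloc, int pool, int end_capture_status) {
  check_status(end_capture_status);  // may throw here...
  alloc.endAllocateToPool(pool);     // ...in which case this never runs
}

// Returns true if the pool is still registered after a failed capture_end.
inline bool leaks_capture_on_failure() {
  MockAllocator alloc;
  alloc.beginAllocateToPool(7);
  try {
    capture_end(alloc, 7, /*end_capture_status=*/1);
  } catch (const std::runtime_error&) {
    // The caller recovers, but the allocator was never told the capture ended.
  }
  return !alloc.captures_underway.empty();
}

}  // namespace before_fix
```

In this mock, `before_fix::leaks_capture_on_failure()` returns `true`: the stale entry stays behind and is the analogue of the state that later trips the `captures_underway.empty()` assertion.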

This change reorders the logic in CUDAGraph::capture_end() to call the allocator's "end capture" notification before checking the error status of the CUDA call. This ensures that the allocator's internal state is always synchronized with the actual state of the CUDA stream, preventing "zombie" captures from leaking and causing crashes in unrelated code paths.
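
The reordering can be sketched the same way. Again, this is an illustrative mock and not the code from the patch: `MockAllocator`, `check_status`, `Outcome`, and the integer status are hypothetical names introduced here. The only point is the ordering, with the allocator notified first and the error raised afterwards.

```cpp
#include <set>
#include <stdexcept>

namespace after_fix {

// Hypothetical stand-in for the allocator's capture bookkeeping.
struct MockAllocator {
  std::set<int> captures_underway;  // ids of pools with an active capture
  void beginAllocateToPool(int pool) { captures_underway.insert(pool); }
  void endAllocateToPool(int pool) { captures_underway.erase(pool); }
};

// Hypothetical stand-in for an error-check macro: throws on non-zero status.
inline void check_status(int status) {
  if (status != 0) {
    throw std::runtime_error("end of stream capture failed");
  }
}

// New ordering: always tell the allocator the capture is over, then check.
inline void capture_end(MockAllocator& alloc, int pool, int end_capture_status) {
  alloc.endAllocateToPool(pool);     // bookkeeping is cleared unconditionally
  check_status(end_capture_status);  // the error is still reported to the caller
}

struct Outcome {
  bool error_reported;   // did the failure still reach the caller?
  bool allocator_clean;  // is the capture bookkeeping empty afterwards?
};

// Runs a failing capture_end and reports what the caller and allocator see.
inline Outcome run_failed_capture() {
  MockAllocator alloc;
  alloc.beginAllocateToPool(7);
  bool threw = false;
  try {
    capture_end(alloc, 7, /*end_capture_status=*/1);
  } catch (const std::runtime_error&) {
    threw = true;
  }
  return Outcome{threw, alloc.captures_underway.empty()};
}

}  // namespace after_fix
```

With this ordering the caller still sees the exception, but the mock allocator no longer holds a stale capture entry, so later cleanup paths start from a consistent state.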

Pull Request resolved: pytorch#180395
Approved by: https://github.com/eee4017, https://github.com/Skylion007

(cherry picked from commit 6fa359b)

A minimal reproducer is to run these two tests in a single process:

  python test_cuda.py -v \
    TestCuda.test_graph_rng_after_failed_capture \
    TestMemPool.test_deleted_mempool_not_used_on_oom

Cherry-picked to release/2.11 branch via #3367

Cherry-picked to release/2.10 branch via #3368

@dnikolaev-amd dnikolaev-amd requested a review from pragupta June 23, 2026 14:01
@dnikolaev-amd dnikolaev-amd changed the title from "fix: leak in CUDAGraph::capture_end (#180395)" to "[release/2.12] fix leak in CUDAGraph::capture_end (#180395)" Jun 23, 2026

rocm-repo-management-api Bot commented Jun 23, 2026

Jenkins build for 9d4d043fd0a194b97dad3d733d55be11abae4751 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@pragupta pragupta merged commit 08b5c32 into ROCm:release/2.12 Jun 23, 2026
0 of 2 checks passed
@dnikolaev-amd


!cherry-pick --onto release/2.11 release/2.10

@rocm-repo-management-api-6

Created branch autogenerated/release/2.11_cherry-pick_pr-3357 and #3367. It contains a merge conflict. Please resolve it.

Created branch autogenerated/release/2.10_cherry-pick_pr-3357 and #3368. It contains a merge conflict. Please resolve it.

Comment processed by Build
