
[DO NOT MERGE] [ROCm] device sync on pytorch module exit #3369

Draft
dnikolaev-amd wants to merge 1 commit into ROCm:release/2.12 from dnikolaev-amd:dnikolaev/fix_core_dump_on_pytorch_exit_2.12

[ROCm] sync on pytorch module exit

0173828
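
For readers unfamiliar with the idea in the commit title, the following is a minimal Python-level sketch of "sync on exit". It is not the change made in this PR, whose diff is not shown here. The helper name `sync_device_on_exit` is hypothetical. The sketch assumes that waiting for queued GPU work before interpreter teardown is the intent, and it relies only on `atexit.register`, `torch.cuda.is_initialized`, and `torch.cuda.synchronize` (on ROCm builds the `torch.cuda` namespace is backed by HIP).

```python
import atexit


def sync_device_on_exit():
    """Hypothetical helper: wait for queued GPU work before the process exits.

    Returns True if a synchronize call was issued, False otherwise.
    """
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is nothing to synchronize.
        return False
    # Only synchronize if a GPU context already exists, so that exiting
    # does not create one as a side effect.
    if not torch.cuda.is_initialized():
        return False
    torch.cuda.synchronize()
    return True


# Run the helper when the interpreter shuts down normally.
atexit.register(sync_device_on_exit)
```

Handlers registered with `atexit` do not run when the process is killed by a signal, so a sketch like this only covers normal interpreter shutdown.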
ROCm Repo Management API / Tests / Tests / Test Distributed / Run pytorch_distributed_1 failed Jun 24, 2026 in 0s

failed: 2, skipped: 82, passed: 252


TestDistBackendWithSpawn.test_ddp_apply_optim_in_backward_ignored_params

AssertionError: Scalars are not equal!

Expected 0 but got -6.
Absolute difference: 6
Relative difference: inf
Expected exit code 0 but got -6 for pid: 2609177
Stack trace
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/testing/_internal/common_distributed.py", line 816, in wrapper
    self._join_processes(fn)
  File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/testing/_internal/common_distributed.py", line 1086, in _join_processes
    self._check_return_codes(fn, elapsed_time)
  File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/testing/_internal/common_distributed.py", line 1162, in _check_return_codes
    self.assertEqual(
  File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 4441, in assertEqual
    raise error_metas.pop()[0].to_error(  # type: ignore[index]
AssertionError: Scalars are not equal!

Expected 0 but got -6.
Absolute difference: 6
Relative difference: inf
Expected exit code 0 but got -6 for pid: 2609177
Standard error
/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/backends/cudnn/__init__.py:175: UserWarning: cuDNN Benchmark limit is not supported in MIOpen and will have no effect. (Triggered internally at /var/lib/jenkins/pytorch/torch/csrc/cuda/Module.cpp:1990.)
  torch._C._cuda_set_cudnn_benchmark_limit(_benchmark_limit)

TestDistBackendWithSpawn.test_ddp_apply_optim_in_backward_ignored_params

AssertionError: Scalars are not equal!

Expected 0 but got -6.
Absolute difference: 6
Relative difference: inf
Expected exit code 0 but got -6 for pid: 2612682
Stack trace
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/testing/_internal/common_distributed.py", line 816, in wrapper
    self._join_processes(fn)
  File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/testing/_internal/common_distributed.py", line 1086, in _join_processes
    self._check_return_codes(fn, elapsed_time)
  File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/testing/_internal/common_distributed.py", line 1162, in _check_return_codes
    self.assertEqual(
  File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 4441, in assertEqual
    raise error_metas.pop()[0].to_error(  # type: ignore[index]
AssertionError: Scalars are not equal!

Expected 0 but got -6.
Absolute difference: 6
Relative difference: inf
Expected exit code 0 but got -6 for pid: 2612682
Standard error
/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/backends/cudnn/__init__.py:175: UserWarning: cuDNN Benchmark limit is not supported in MIOpen and will have no effect. (Triggered internally at /var/lib/jenkins/pytorch/torch/csrc/cuda/Module.cpp:1990.)
  torch._C._cuda_set_cudnn_benchmark_limit(_benchmark_limit)
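
Both failures above report exit code -6 for a spawned worker. Assuming the test harness follows the Python `multiprocessing` convention, a negative exit code -N means the child was terminated by signal N. On Linux signal 6 is SIGABRT, which is consistent with the core dump named in the PR branch. The helper `describe_exit_code` below is hypothetical and only illustrates how such a code can be decoded.

```python
import signal


def describe_exit_code(code):
    """Hypothetical helper: turn a child exit code into a readable string."""
    if code < 0:
        # Negative codes follow the multiprocessing/subprocess convention:
        # the child was terminated by signal number -code.
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {-code} ({name})"
    if code == 0:
        return "exited normally"
    return f"exited with status {code}"


# Decode the value reported by the failing tests.
print(describe_exit_code(-6))
```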