Replace rsmi_init with amdsmi_init (via dlsym) in intra_node_comm #3299

Open
adam360x wants to merge 62 commits into develop from users/adam360x/fix-rsmi-init-interposition

Replace all rsmi_* usage with AMDSMI (via dlsym) in intra_node_comm

edd012e
ROCm Repo Management API / Tests / Tests / Test Distributed / Run pytorch_distributed_1 failed Jun 15, 2026 in 0s

TestDistBackendWithSpawn.test_ddp_apply_optim_in_backward_ignored_params failed

TestDistBackendWithSpawn.test_ddp_apply_optim_in_backward_ignored_params

RuntimeError: Process 0 terminated or timed out after 305.0152750015259 seconds
Stack trace
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/testing/_internal/common_distributed.py", line 787, in wrapper
    self._join_processes(fn)
  File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/testing/_internal/common_distributed.py", line 1057, in _join_processes
    self._check_return_codes(fn, elapsed_time)
  File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/testing/_internal/common_distributed.py", line 1102, in _check_return_codes
    raise RuntimeError(
RuntimeError: Process 0 terminated or timed out after 305.0152750015259 seconds
Standard error
/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/backends/cudnn/__init__.py:175: UserWarning: cuDNN Benchmark limit is not supported in MIOpen and will have no effect. (Triggered internally at /var/lib/jenkins/pytorch/torch/csrc/cuda/Module.cpp:2002.)
  torch._C._cuda_set_cudnn_benchmark_limit(_benchmark_limit)
Standard out
Timing out after 300 seconds and killing subprocesses.