Replace rsmi_init with amdsmi_init (via dlsym) in intra_node_comm#3299
adam360x wants to merge 1 commit
Jenkins build for df95aa5c0d12e2db9f9f91644fc4e3e323910e5f commit finished as FAILURE. Detected error during PyTorch building.
Force-pushed from df95aa5 to 3315c81
Jenkins build for 3315c812c921428a48ba172144c7a356a7f1d411 commit finished as FAILURE. Detected error during PyTorch building.
Force-pushed from 3315c81 to 9cf4004
Jenkins build for 9cf4004684351e9dad64cc3b329d8d7355cb710e commit finished as FAILURE
Force-pushed from 9cf4004 to 7094c1a
Jenkins build for 7094c1a6333859e38c13263e84c6a5dcd11320e2 commit finished as FAILURE
Failing tests seem unrelated to the PR.
libtorch_hip.so referenced rsmi_init and rsmi_is_P2P_accessible as undefined symbols without listing librocm_smi64.so as a NEEDED dependency. When libamd_smi.so (which also exports these symbols for backward compatibility) was loaded with RTLD_GLOBAL, the dynamic linker interposed them over libamd_smi.so's internal copies. This caused AMDSMI's RSMI singleton to remain uninitialized, resulting in zero devices or sentinel values (e.g. gfxffffffffffffffff) when amdsmi_init() was called after torch.

Replace all rsmi_* calls (rsmi_init, rsmi_is_P2P_accessible) with their AMDSMI equivalents (amdsmi_init, amdsmi_is_P2P_accessible), resolved at runtime via dlsym. This:
- removes all rsmi_*/amdsmi_* undefined symbols from libtorch_hip.so
- avoids any link-time NEEDED dependency on libamd_smi.so
- allows libamd_smi.so to drop the rsmi_* exports entirely
- removes the #include <rocm_smi/rocm_smi.h> dependency
- gracefully degrades if libamd_smi.so is not loaded
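The runtime lookup described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in intra_node_comm.cpp: the helper name try_init_via_dlsym is hypothetical, and the function-pointer type assumes that amdsmi_init takes a single uint64_t flags argument and returns a status value where 0 means success.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // RTLD_DEFAULT is a GNU extension on glibc
#endif
#include <dlfcn.h>
#include <cstdint>

// Assumed shape of the init entry point: status-returning function taking
// init flags. The status is treated as a plain int, with 0 meaning success.
using init_fn_t = int (*)(uint64_t);

// Hypothetical helper (name chosen for this sketch only).
// Looks the symbol up among the libraries already loaded into the process.
// Returns false without crashing if the symbol is not available, so the
// binary carries no undefined reference and no NEEDED entry for the library.
bool try_init_via_dlsym(const char* symbol, uint64_t flags) {
  void* sym = dlsym(RTLD_DEFAULT, symbol);
  if (sym == nullptr) {
    return false;  // library not loaded: degrade gracefully
  }
  auto init = reinterpret_cast<init_fn_t>(sym);
  return init(flags) == 0;
}
```

Because the lookup goes through RTLD_DEFAULT, a call such as try_init_via_dlsym("amdsmi_init", flags) only succeeds if some other component has already loaded libamd_smi.so with global visibility. Otherwise the helper returns false and the caller can fall back as it sees fit.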
Force-pushed from 7094c1a to edd012e
Jenkins build for edd012e63bcedf4261f65d44de95d87b8659a596 commit finished as FAILURE
Summary
- Replace the rsmi_init(0) call with dlsym(RTLD_DEFAULT, "amdsmi_init") in intra_node_comm.cpp
- Remove the rsmi_init undefined symbol from libtorch_hip.so
Problem
libtorch_hip.so calls rsmi_init() but does not list librocm_smi64.so as a NEEDED dependency, so rsmi_init is left as an undefined symbol resolved at runtime. libamd_smi.so exports rsmi_init for backward compatibility, so when it was loaded with RTLD_GLOBAL (via the ROCm SDK dependency chain), the dynamic linker interposed that rsmi_init over libamd_smi.so's own internal copy. This left AMDSMI's RSMI singleton uninitialized, resulting in zero devices or sentinel values (e.g. gfxffffffffffffffff) when users called amdsmi_init() after importing torch.
Reproducer: import torch before amdsmi_init() on navi31/navi48 with pip-installed amdsmi and ROCm 7.12/7.13 nightlies.
Root cause verification
Confirmed via isolated ctypes tests on a Radeon RX 7900 XT (gfx1100):
Fix
Use dlsym(RTLD_DEFAULT, "amdsmi_init") to call amdsmi_init at runtime instead of linking rsmi_init directly. This:
- Removes the rsmi_init undefined symbol from libtorch_hip.so, eliminating the interposition vector
- Avoids any link-time NEEDED dependency on libamd_smi.so
- Degrades gracefully if libamd_smi.so is not loaded
rsmi_is_P2P_accessible remains unchanged: amdsmi_init initializes the RSMI layer internally, so existing rsmi_* query calls continue to work.
Test results
Build 4: 103,130 passed, 3 failed, 30,322 skipped. All 3 failures (test_ddp_apply_optim_in_backward_ignored_params, test_host_memory_stats, test_cuda_graph_tensor_item_not_allowed) are pre-existing on develop (develop build #25 has 31 failures, including the same tests).
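For the interposition described under Problem, one way to see which loaded library a symbol such as rsmi_init or amdsmi_init resolves to in a given process is to pair dlsym with dladdr. The sketch below is only an illustration and is separate from the ctypes tests mentioned above. The helper name providing_library is hypothetical.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // dladdr and RTLD_DEFAULT are GNU extensions on glibc
#endif
#include <dlfcn.h>
#include <string>

// Hypothetical diagnostic helper (not part of the PR).
// Returns the path of the shared object that provides `symbol` under the
// default global lookup order, or an empty string if the symbol is not
// visible or its origin cannot be determined.
std::string providing_library(const char* symbol) {
  void* addr = dlsym(RTLD_DEFAULT, symbol);
  if (addr == nullptr) {
    return {};
  }
  Dl_info info{};
  if (dladdr(addr, &info) == 0 || info.dli_fname == nullptr) {
    return {};
  }
  return info.dli_fname;
}
```

Calling providing_library("rsmi_init") inside an affected process would be expected to report the library whose definition wins the global lookup. An empty result means no loaded library exports the symbol globally.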