Fix process exit crash in Clang builds by using RTLD_NODELETE for shared libraries #564
kbrameld wants to merge 1 commit into
Plugin code is routinely referenced from heap-resident objects that outlive the dlopen/dlclose cycle (shared_ptr/weak_ptr control-block vtables, std::function captures). If dlclose actually unmaps the library, any later virtual call or callback through those references faults. Under gcc this is hidden because STB_GNU_UNIQUE symbols pin the .so; under clang there is no such pinning and the use-after-free crashes reliably. NODELETE makes the behavior consistent across compilers.

Reproducer and full analysis: https://github.com/kbrameld/plugin_service_bug

Signed-off-by: Kenji Brameld <kbrameld@traclabs.com>
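The control-block lifetime that the commit message relies on can be illustrated without any plugin at all. The following is a minimal sketch, not code from this PR or from rclcpp: `Service`, `TeardownObservation`, and `observe_teardown` are hypothetical names. It shows that a `weak_ptr` keeps the shared control block alive after the managed object has been destroyed, so the final cleanup runs only when the last weak reference goes away.

```cpp
#include <memory>

// Hypothetical stand-in for an object created by plugin code.
struct Service {
  explicit Service(bool& alive) : alive_(alive) { alive_ = true; }
  ~Service() { alive_ = false; }
  bool& alive_;
};

// What we can observe after the last shared_ptr has gone away.
struct TeardownObservation {
  bool object_destroyed;   // ~Service has already run
  bool weak_ref_expired;   // the weak_ptr reports the object as gone
  long strong_count;       // read from the still-allocated control block
};

inline TeardownObservation observe_teardown() {
  bool alive = false;
  std::weak_ptr<Service> observer;
  {
    std::shared_ptr<Service> owner = std::make_shared<Service>(alive);
    observer = owner;
  }  // last shared_ptr dies here: ~Service runs, the control block does not go away

  // Both queries below read the control block, which is still allocated
  // because `observer` holds a weak reference to it.
  TeardownObservation result{!alive, observer.expired(), observer.use_count()};
  return result;
  // `observer` is destroyed on return. Only now is the control block released,
  // through a virtual call on the control block itself. If that control block
  // type had been instantiated inside an unmapped library, this is where the
  // crash would occur.
}
```

In the scenario described above, the equivalent of `observer` lives in the host process while the control block's vtable lives in the plugin, which is why the order of `dlclose` and the last weak release matters.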
fujitatomoya left a comment
@kbrameld thank you very much for providing the PR and the detailed rep.
AFAIS, this PR is just a workaround that makes the symptom go away by ensuring the library's memory is never actually unmapped, but it doesn't address the underlying design problem?
my guess for proper fix is to change the destruction order. the class loader shouldn't be destroyed before the node that holds references to plugin-owned objects. if the node (or its callback groups) had a strong reference to the class loader, (or to a "plugin keepalive" handle), the loader would outlive the node and dlclose would happen after ~Node finished. this is conceptually the cleanest fix, but requires some changes.
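A minimal sketch of the keepalive idea, under stated assumptions: `Loader`, `Node`, and `teardown_order` are hypothetical stand-ins, not rclcpp or class_loader types, and the "dlclose" is only a log entry. It records the order in which destructors run when the node holds a strong reference to the loader.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

using Log = std::vector<std::string>;

// Hypothetical stand-in for the class loader; a real one would call
// dlclose in its destructor.
struct Loader {
  explicit Loader(Log& log) : log_(log) {}
  ~Loader() { log_.push_back("loader: dlclose"); }
  Log& log_;
};

// Hypothetical stand-in for the node. It keeps the loader alive through
// a strong reference (the "plugin keepalive" handle).
struct Node {
  Node(Log& log, std::shared_ptr<Loader> keepalive)
  : log_(log), keepalive_(std::move(keepalive)) {}
  // The destructor body runs before any member is destroyed, so the
  // cleanup logged here happens while the loader is still alive.
  ~Node() { log_.push_back("node: cleanup"); }
  Log& log_;
  std::shared_ptr<Loader> keepalive_;
};

inline Log teardown_order() {
  Log log;
  auto loader = std::make_shared<Loader>(log);
  auto node = std::make_unique<Node>(log, loader);
  loader.reset();  // the application drops its loader handle first
  node.reset();    // node cleanup runs, then the keepalive releases the loader
  return log;
}
```

Members are destroyed in reverse declaration order, so in a real class the keepalive would have to be declared before any member whose destructor can reach plugin code.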
i can think of 2 possible downsides of the RTLD_NODELETE workaround:
- long-running processes that load many plugins over time will accumulate mappings indefinitely. (it is so unlikely that this could be a problem, i guess)
- plugin hot-reload becomes impossible. (this one could be major???) if we dlclose and then dlopen a newer version of the same .so, the old version's pages are still mapped and its symbols may still be visible via existing handles.
after all, i think having this work-around with notice would be just fine, that actually brings Clang and GCC into agreement. i would like to have a 2nd review on this from other maintainers.
@ahcorde @jmachowinski @mjcarroll wdyt?

Note if we take this fix, backport to downstream distros.

The explanation of what is going on is wrong.

You are right that the service
So the issue I am pointing at is not the plugin retaining a service
In general, a
Later, when the callback group is destroyed, its
In the reproducer, the unloaded plugin
It also contains the matching vtable:
So if the plugin is unloaded before the last weak reference is gone, the later weak-pointer cleanup can still leave a virtual call path into code/vtable data that lived in the unloaded plugin library.

@kbrameld Has it ever occurred to you that just copy & pasting chatbot answers is just plain rude? Apart from that, your chatbot answer is slightly off from what is really happening and mixes things up.
Description
Adds RTLD_NODELETE to the dlopen call in rcutils_load_shared_library so plugin libraries stay mapped after dlclose. The refcount still drops normally; only the kernel mapping is preserved.
This fixes a crash where a pluginlib plugin that registers a service on the host node causes a segfault at process exit when built with Clang (e.g., when compiling with export CC=clang and export CXX=clang++).
This issue is not specific to Kilted and Rolling; it also affects older ROS 2 distributions such as Jazzy.
Original report and reproducer:
This is most likely ignored by most users of clang - but such segmentation faults will cause test failures in CI systems during destruction.
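The change described above amounts to one extra flag on the dlopen call. The following is a rough sketch rather than the actual rcutils patch: `plugin_open_flags` and `open_plugin` are hypothetical helpers, and `RTLD_LAZY` as the base flag is an assumption that the PR text does not confirm.

```cpp
#include <dlfcn.h>

// Hypothetical helper: builds the flag set for loading a plugin.
// Assumption: RTLD_LAZY is the base flag; the PR text does not say which
// binding mode rcutils uses.
inline int plugin_open_flags(bool keep_mapped) {
  int flags = RTLD_LAZY;
  if (keep_mapped) {
    // RTLD_NODELETE: dlclose still drops the handle's reference count,
    // but the library is not unloaded from the address space.
    flags |= RTLD_NODELETE;
  }
  return flags;
}

// Hypothetical wrapper, not the rcutils API. Returns nullptr on failure;
// dlerror() then describes the problem.
inline void* open_plugin(const char* path) {
  return dlopen(path, plugin_open_flags(true));
}
```

On older glibc versions, code that calls dlopen may need to be linked with `-ldl`.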
Is this user-facing behavior change?
STB_GNU_UNIQUE behavior. .sos will now safely stay mapped for the process lifetime instead of being unmapped on dlclose, preventing a latent segmentation fault at shutdown.
Did you use Generative AI?
Yes, I used Claude Code to help debug the crash, identify the STB_GNU_UNIQUE mechanism, and prepare the PR. The patch and PR text were reviewed by me.
Additional Information
The Root Cause: Destruction Order & Cross-DSO UAF
The crash is a cross-DSO shared_ptr use-after-free (UAF) that occurs due to a common teardown sequence: the class loader is destroyed before the rclcpp::Node is.
- When the plugin calls node->create_service<T>(...), the resulting control block has its vtable inside the plugin .so (the templated factory is instantiated in the plugin's translation unit).
- rclcpp::CallbackGroup holds a weak_ptr<ServiceBase> to that control block.
- dlclose is invoked and the plugin is unmapped.
- During destruction of rclcpp::Node (~Node → ~CallbackGroup → ~weak_ptr), a virtual _M_destroy() is triggered on the control block. Because the plugin memory has already been unmapped, reading the vtable causes a segmentation fault.
Why This Only Happens with Clang
Under GCC, this latent bug is hidden. GCC emits STB_GNU_UNIQUE symbols for ordinary template instantiations (which the plugin picks up via std::variant in rclcpp/any_service_callback.hpp). By default, glibc refuses to unmap any DSO containing bound unique symbols. Consequently, GCC-built plugins have effectively been receiving RTLD_NODELETE behavior by accident.
Clang does not emit unique symbols by default. Therefore, dlclose successfully unmaps the library, reliably triggering the UAF crash at process exit.
Verification
Confirmed the fix using the provided reproducer compiled with Clang. The plugin no longer segfaults at exit, and GDB verifies that the plugin pages remain mapped through the entire ~Node execution.
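The same check can also be made from inside the process instead of from GDB. This is a small Linux-only sketch, not part of the PR; `mapped_regions` is a hypothetical helper that scans `/proc/self/maps` for lines containing a given substring, such as the plugin's file name.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (Linux only): returns every line of /proc/self/maps
// that contains `needle`, e.g. the file name of a plugin library.
// An empty result means no mapping with that name exists in this process.
inline std::vector<std::string> mapped_regions(const std::string& needle) {
  std::vector<std::string> hits;
  std::ifstream maps("/proc/self/maps");
  std::string line;
  while (std::getline(maps, line)) {
    if (line.find(needle) != std::string::npos) {
      hits.push_back(line);
    }
  }
  return hits;
}
```

Calling it with the plugin's file name before and after dlclose would show whether the mapping survived; with RTLD_NODELETE the entries are expected to remain.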