Fix process exit crash in Clang builds by using RTLD_NODELETE for shared libraries #564

Open
kbrameld wants to merge 1 commit into ros2:rolling from traclabs:plugin-rtld-nodelete

@kbrameld

Description

Adds RTLD_NODELETE to the dlopen call in rcutils_load_shared_library so plugin libraries stay mapped after dlclose. The refcount still drops normally — only the kernel mapping is preserved.
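
As a rough illustration of the flag's effect, here is a minimal sketch. It is not the rcutils implementation: `open_pinned` and `is_resident` are hypothetical helper names, and `RTLD_LAZY` is an assumed base flag. `RTLD_NOLOAD` is used only to probe whether a library is still resident.

```cpp
#include <dlfcn.h>

// Hypothetical helper (not the rcutils code): open a library so that a later
// dlclose() drops the reference count but leaves the library mapped.
void * open_pinned(const char * path)
{
  // RTLD_LAZY is an assumed base flag; RTLD_NODELETE is the flag this PR adds.
  return dlopen(path, RTLD_LAZY | RTLD_NODELETE);
}

// Returns true if `path` is still resident in the process. RTLD_NOLOAD asks
// the loader for an existing handle without loading anything new.
bool is_resident(const char * path)
{
  void * handle = dlopen(path, RTLD_LAZY | RTLD_NOLOAD);
  if (handle == nullptr) {
    return false;
  }
  dlclose(handle);  // Balance the reference taken by the probe above.
  return true;
}
```

With a real plugin path, calling `open_pinned()`, then `dlclose()` on the returned handle, then `is_resident()` should still report the library as resident. On older glibc versions the program may need to be linked with `-ldl`.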

This fixes a crash where a pluginlib plugin that registers a service on the host node causes a segfault at process exit when built with Clang (e.g., when compiling with export CC=clang and export CXX=clang++).

This issue is not specific to Kilted and Rolling; it also affects older ROS 2 distributions such as Jazzy.

Original report and reproducer:

Most Clang users probably never notice this, but a segmentation fault during destruction will cause test failures in CI systems.

Is this user-facing behavior change?

  • For GCC builds: No. GCC-built plugins have effectively always stayed mapped due to implicit STB_GNU_UNIQUE behavior.
  • For Clang builds: Yes, but as a bug fix/stability improvement. Plugin .sos will now safely stay mapped for the process lifetime instead of being unmapped on dlclose, preventing a latent segmentation fault at shutdown.

Did you use Generative AI?

Yes — I used Claude Code to help debug the crash, identify the STB_GNU_UNIQUE mechanism, and prepare the PR. The patch and PR text were reviewed by me.


Additional Information

The Root Cause: Destruction Order & Cross-DSO UAF

The crash is a cross-DSO shared_ptr use-after-free (UAF) that occurs due to a common teardown sequence: the class loader is destructed before the rclcpp::Node is.

  1. When a plugin calls node->create_service<T>(...), the resulting control block has its vtable inside the plugin .so (the templated factory is instantiated in the plugin's translation unit).
  2. rclcpp::CallbackGroup holds a weak_ptr<ServiceBase> to that control block.
  3. Because the class loader is destroyed first, dlclose is invoked and the plugin is unmapped.
  4. Finally, during the destructor of the rclcpp::Node (~Node → ~CallbackGroup → ~weak_ptr), a virtual _M_destroy() is triggered on the control block. Because the plugin memory has already been unmapped, reading the vtable causes a segmentation fault.
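
The lifetime rule behind this sequence can be observed without any plugin. The sketch below uses stand-in names (`FakeService`, `LoggingAllocator`, `run_sequence` are all hypothetical) and a logging allocator to show that the control block is released only when the last weak_ptr goes away, well after the managed object is destroyed. It does not unload anything; it only marks the point where the unload would happen.

```cpp
#include <cstddef>
#include <memory>
#include <new>
#include <string>
#include <vector>

// Event log shared by the allocator and the payload type.
inline std::vector<std::string> & events()
{
  static std::vector<std::string> log;
  return log;
}

// Minimal allocator that records when the combined object/control-block
// storage created by std::allocate_shared is actually released.
template <typename T>
struct LoggingAllocator
{
  using value_type = T;
  LoggingAllocator() = default;
  template <typename U>
  LoggingAllocator(const LoggingAllocator<U> &) {}

  T * allocate(std::size_t n)
  {
    return static_cast<T *>(::operator new(n * sizeof(T)));
  }
  void deallocate(T * p, std::size_t)
  {
    events().push_back("control block freed");
    ::operator delete(p);
  }
};

template <typename T, typename U>
bool operator==(const LoggingAllocator<T> &, const LoggingAllocator<U> &) { return true; }
template <typename T, typename U>
bool operator!=(const LoggingAllocator<T> &, const LoggingAllocator<U> &) { return false; }

// Stand-in for a service created inside the plugin.
struct FakeService
{
  ~FakeService() { events().push_back("service destroyed"); }
};

// Mimics the teardown: strong owner drops first, weak observer drops last.
std::vector<std::string> run_sequence()
{
  events().clear();
  std::weak_ptr<FakeService> observer;  // plays the callback group's weak_ptr
  {
    auto service = std::allocate_shared<FakeService>(LoggingAllocator<FakeService>{});
    observer = service;
  }  // last shared_ptr gone: object destroyed, control block still alive
  events().push_back("plugin would be unloaded here");
  observer.reset();  // last weak_ptr gone: control block destroyed and freed
  return events();
}
```

In the real crash, the code that runs at the final `observer.reset()` step lives in the plugin, which is why unmapping the plugin before that step is fatal.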

Why This Only Happens with Clang

Under GCC, this latent bug is hidden. GCC emits STB_GNU_UNIQUE symbols for ordinary template instantiations (which the plugin picks up via std::variant in rclcpp/any_service_callback.hpp). By default, glibc refuses to unmap any DSO containing bound unique symbols. Consequently, GCC-built plugins have effectively been receiving RTLD_NODELETE behavior by accident.

Clang does not emit unique symbols by default. Therefore, dlclose successfully unmaps the library, reliably triggering the UAF crash at process exit.
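
For orientation, GCC's documentation for -fno-gnu-unique describes the unique binding as applying to template static data members and static local variables in inline functions. The sketch below shows both shapes with hypothetical names (`Registry`, `next_global_id`, `next_id_for`); whether these are the exact entities pulled in through rclcpp headers is not something this example establishes.

```cpp
// Hypothetical header-style code that a plugin and its host might both include.
template <typename T>
struct Registry
{
  static int next_id;  // static data member of a class template
};

template <typename T>
int Registry<T>::next_id = 0;

inline int next_global_id()
{
  static int counter = 0;  // static local variable in an inline function
  return ++counter;
}

template <typename T>
int next_id_for()
{
  return ++Registry<T>::next_id;
}
```

Building such code into a shared library and running `readelf -Ws` on it should list these variables with UNIQUE binding under a default GCC/glibc toolchain, and with a different binding (typically WEAK) under Clang.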

Verification

Confirmed the fix using the provided reproducer compiled with Clang. The plugin no longer segfaults at exit, and GDB verifies that the plugin pages remain mapped through the entire ~Node execution.

Plugin code is routinely referenced from heap-resident objects that outlive
the dlopen/dlclose cycle (shared_ptr/weak_ptr control-block vtables,
std::function captures). If dlclose actually unmaps the library, any later
virtual call or callback through those references faults.

Under gcc this is hidden because STB_GNU_UNIQUE symbols pin the .so; under
clang there is no such pinning and the use-after-free crashes reliably.
NODELETE makes the behavior consistent across compilers.

Reproducer and full analysis: https://github.com/kbrameld/plugin_service_bug

Signed-off-by: Kenji Brameld <kbrameld@traclabs.com>
Collaborator

@fujitatomoya fujitatomoya left a comment

@kbrameld thank you very much for providing the PR and the detailed repro.

AFAIS, this PR is just a workaround: it makes the symptom go away by ensuring the plugin's memory is never actually unmapped, but it doesn't address the underlying design problem?
My guess for a proper fix is to change the destruction order. The class loader shouldn't be destroyed before the node that holds references to plugin-owned objects. If the node (or its callback groups) had a strong reference to the class loader (or to a "plugin keepalive" handle), the loader would outlive the node and dlclose would happen after ~Node finished. This is conceptually the cleanest fix, but requires some changes.

I can think of 2 possible downsides to keeping libraries mapped:

  • Long-running processes that load many plugins over time will accumulate mappings indefinitely. (It is very unlikely that this could be a problem, I guess.)
  • Plugin hot-reload becomes impossible. (This one could be major???) If we dlclose and then dlopen a newer version of the same .so, the old version's pages are still mapped and its symbols may still be visible via existing handles.

After all, I think having this workaround with a notice would be just fine, since it actually brings Clang and GCC into agreement. I would like to have a 2nd review on this from other maintainers.
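
The "plugin keepalive" idea mentioned above could look roughly like the following. This is only a sketch with stand-in types: `FakeLoader`, `FakeNode`, `add_keepalive`, and `run_teardown` are hypothetical and are not part of rclcpp or class_loader.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a class loader; a real one would call dlclose() here.
struct FakeLoader
{
  explicit FakeLoader(std::vector<std::string> & log) : log_(log) {}
  ~FakeLoader() { log_.push_back("loader destroyed (dlclose)"); }
  std::vector<std::string> & log_;
};

// Stand-in for a node that holds strong references to whatever must outlive it.
struct FakeNode
{
  explicit FakeNode(std::vector<std::string> & log) : log_(log) {}
  ~FakeNode() { log_.push_back("node destroyed"); }

  void add_keepalive(std::shared_ptr<void> handle)
  {
    keepalive_.push_back(std::move(handle));
  }

  // Declared first so it is destroyed last, after every other member.
  std::vector<std::shared_ptr<void>> keepalive_;
  std::vector<std::string> & log_;
};

std::vector<std::string> run_teardown()
{
  std::vector<std::string> log;
  {
    auto node = std::make_unique<FakeNode>(log);
    auto loader = std::make_shared<FakeLoader>(log);
    node->add_keepalive(loader);
    loader.reset();  // The application drops its loader first...
    log.push_back("application released loader");
    node.reset();    // ...but the loader survives until the node is gone.
  }
  return log;
}
```

The design choice here is that ownership, not declaration order in user code, decides when dlclose runs: the loader is released only after the node's destructor body and its later-declared members have finished.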

@ahcorde @jmachowinski @mjcarroll wdyt?

@fujitatomoya
Collaborator

Note

if we take this fix, backport to downstream distros.

@kbrameld kbrameld marked this pull request as draft May 20, 2026 13:38
@kbrameld
Author

kbrameld commented May 20, 2026

Converting to draft to test a couple of things.

@kbrameld kbrameld marked this pull request as ready for review May 20, 2026 14:49
@jmachowinski

The explanation of what is going on is wrong.
The test example you provided does not save the shared_ptr of the service:
https://github.com/kbrameld/plugin_service_bug/blob/57f427ea7d1f5a0acae9aead7fa5268030fffc2a/plugin_service_bug_plugin/src/bar_plugin.cpp#L15
Therefore it should be deleted immediately and basically have no effect.
Therefore I doubt that the suspected reason for the segfault is correct, and something else is going on.

@kbrameld
Author

You are right that the service shared_ptr itself is not stored by the plugin, but create_service() stores a service weak_ptr in the node's callback group, which is the problem. Here's the sequence for that:

  1. rclcpp::create_service() registers the service with NodeServices
  2. NodeServices::add_service() forwards it to the callback group
  3. CallbackGroup::add_service() stores it in service_ptrs_
  4. service_ptrs_ is a vector of ServiceBase::WeakPtr

So the issue I am pointing at is not the plugin retaining a service shared_ptr. It is that creating the service from inside the plugin can leave behind an expired weak_ptr in the node callback group.

In general, a weak_ptr is allowed to outlive the shared_ptr. The important part is that the weak_ptr keeps the shared-pointer control block alive after the service object is gone. A shared_ptr control block is the separate bookkeeping object used by shared_ptr and weak_ptr; it stores the strong reference count, weak reference count, and the type-specific cleanup/deallocation logic. The managed service object can already be destroyed while the control block remains alive, as long as at least one weak_ptr still exists.

Later, when the callback group is destroyed, its std::vector<std::weak_ptr<ServiceBase>> is destroyed, which destroys each stored weak_ptr. That calls std::__weak_count::~__weak_count(), which calls _M_weak_release() on the control block. If this is the last weak reference, _M_weak_release() calls the control block's virtual _M_destroy() method.

In the reproducer, the unloaded plugin .so contains the concrete control-block virtual target:

std::_Sp_counted_ptr_inplace<
  rclcpp::Service<std_srvs::srv::Trigger>,
  std::allocator<void>,
  (__gnu_cxx::_Lock_policy)2
>::_M_destroy()

It also contains the matching vtable:

vtable for std::_Sp_counted_ptr_inplace<
  rclcpp::Service<std_srvs::srv::Trigger>,
  std::allocator<void>,
  (__gnu_cxx::_Lock_policy)2
>

So if the plugin is unloaded before the last weak reference is gone, the later weak-pointer cleanup can still leave a virtual call path into code/vtable data that lived in the unloaded plugin library.

@jmachowinski

@kbrameld Has it ever occurred to you that just copying and pasting chatbot answers is plain rude?
It also makes me question whether you are in violation of our AI guideline and have properly reviewed anything.

Apart from that, your chatbot answer is slightly off from what is really happening and is mixing things up.
