Skip to content

Predictable raft::resources#3005

Draft
achirkin wants to merge 4 commits intorapidsai:mainfrom
achirkin:enh-predictable-resources
Draft

Predictable raft::resources#3005
achirkin wants to merge 4 commits intorapidsai:mainfrom
achirkin:enh-predictable-resources

Conversation

@achirkin
Copy link
Copy Markdown
Contributor

@achirkin achirkin commented Apr 29, 2026

Make the raft::resources resource initialization semantics more predictable.

  1. All resources still are still initialized lazily on-demand (no change), but behave as-if they were initialized during the raft::resources handle construction: all copies of the raft::resources handle point to the same resources (not a breaking change, fixes the re-initialization issue).
  2. set_resource changes to non-const semantics (breaking change).
    Before: all set_xxx resource-updating calls were operating on const handle
    Now: all set_xxx resources require a non-const handle.

Thread-safety

First and foremost, thread-safety of using a specific resource depends on that resource. Here we discuss the thread-safety of using raft::resources handle itself (get_resource and set_resource functions).

Accessing the same resource by const ref

Updates (set_resource) are not possible (a user should copy a handle and modify the new one). All concurrent get_resource calls are atomic and safe, even if they initialize the factories and resources under the hood. The worst can happen is the same resource being initialized concurrently in two threads but only one being stored in the handle (the other one is discarded).

Accessing the same resource by non-const ref

Using the same object by non-const ref from multiple threads is always unsafe.

Accessing copies of the same resource

The resources and factories are updated atomically. Modifying any resource doesn't propagate to the copies. Accessing a resource while another threads access or modifies the same resource via another handle is thus safe.

Implementation details

The PR adds one more layer of indirection (one extra shared_ptr) to each resource, which may cause an extra runtime overhead. This is unavoidable if we want to allow lazy-initialized resources back-propagate across handles.
On the other hand, the PR removes the handle mutex in favor of C++20 std::atomic<shared_ptr>, which may reduce the runtime overheads a little bit.

Breaking changes

  • All resource setters change the function signature - this is a big breaking change. However, all known use-sites already call the resources setters on non-const raft::resources handles.
  • std::atomic<shared_ptr> requires enforcing C++20 for all dependent libraries. An alternative would be to put a mutex per raft::resource_cell.

@achirkin achirkin self-assigned this Apr 29, 2026
@achirkin achirkin added improvement Improvement / enhancement to an existing function breaking Breaking change labels Apr 29, 2026
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 29, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@achirkin
Copy link
Copy Markdown
Contributor Author

/ok to test

@achirkin
Copy link
Copy Markdown
Contributor Author

/ok to test

@achirkin achirkin moved this to In Progress in Unstructured Data Processing Apr 29, 2026
Comment thread cpp/tests/core/handle.cpp
// Move construct
raft::resources b(std::move(a));
ASSERT_TRUE(b.has_resource_factory(resource::resource_type::CUDA_STREAM_VIEW));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to have another block like

// check moved-from object
ASSERT_FALSE(a.has_resource_factory(resource::resource_type::CUDA_STREAM_VIEW));

and similarly, after b has been moved into c?

My concern is that if raft::resources lacked move constructor and assignment altogether, this test would still pass as the construction and assignment can fall back to copying.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, moved-from should have valid but unspecified state, so technically we cannot rely on ASSERT_FALSE. Since we know our implementation is std::vector<std::shared_ptr>, it is indeed false (the pointers become empty after move), but... does it really matter for us?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just thought it would ensure that a move, not a copy, has in fact taken place.

Comment on lines +75 to +76
* Creates a new resource_cell with the given factory. Other copies of this
* handle continue to point at the old cell, so the change does not propagate.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to enforce that add_resource_factory is only used with copies of a handle? Perhaps we could add a flag is_copy in resources that is default-initialized to false but gets set to true by the copy constructor, then add a RAFT_EXPECTS(is_copy) in add_resource_factory. This way, there's no risk of accidentally overwriting the resources in the original handle:

  • ensure_resource_factory for when you want to update the original handle (works on either original or copy).
  • add_resource_factory for when you want to make an isolated change to a copy of a handle (does not work on original).

Copy link
Copy Markdown
Contributor Author

@achirkin achirkin Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we shouldn't distinguish between the "original" and "copied" handles - it doesn't really matter who's the first. Also it's not possible to overwrite a resource from one copy to another: add_resource_factory simply replaces the outer shared_ptr to a one that points to a new resource_cell (NB: copy constructor of the raft::resources copies the contents of the cells_ vector same as before; the outer shared_ptrs are not shared among resources handles).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay it's probably an overreach to make that distinction. Thanks for elaborating.

@achirkin
Copy link
Copy Markdown
Contributor Author

Because of requiring C++20, this PR is blocked on completing the migration of cuVS to C++20 (see PRs #1795, #1796, #1799).

@achirkin achirkin moved this from In Progress to Blocked in Unstructured Data Processing Apr 29, 2026
@tarang-jain
Copy link
Copy Markdown
Contributor

Note that rapidsai/cuvs#1796 is blocked. https://nvbugspro.nvidia.com/bug/5356084 is the existing issue for the 12.9 compiler issue. Nothing we can do as the fix was merged into 13.1.

@achirkin
Copy link
Copy Markdown
Contributor Author

achirkin commented May 6, 2026

/ok to test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking Breaking change improvement Improvement / enhancement to an existing function

Projects

Status: Blocked

Development

Successfully merging this pull request may close these issues.

3 participants