Closes #21326: Defer global search cache updates to a background job#22481
Closes #21326: Defer global search cache updates to a background job#22481jnovinger wants to merge 10 commits into
Conversation
Previously the global search cache (CachedValue) was updated synchronously
inside the post_save/post_delete signal handlers, adding latency to every write
and, for bulk operations, one synchronous re-index per object.
The signal handlers now buffer dirty objects and defer the work to a background
job that runs after the transaction commits:
* netbox/netbox/search/deferred.py coalesces dirty objects per (database alias,
transaction). The buffer lives inside a transaction.on_commit callback's
closure, so a single flush is scheduled per alias per transaction (collapsing
repeated operations; a deletion supersedes a pending create/update). Because
Django clears run_on_commit on both commit and rollback, no buffered state can
survive a rolled-back transaction. In autocommit (no open transaction), the
indexing runs inline immediately.
* On flush, when an RQ worker is available the work is dispatched as a
SearchCacheJob (netbox/netbox/search/tasks.py); otherwise it runs inline, so
installs without a running worker behave as before.
* The database alias is captured from the signal's `using` kwarg and replayed on
the deferred read and write via .using(alias). This keeps cache entries routed
to the originating schema (e.g. a branch schema under netbox-branching) even
though the worker has no active routing context, without core depending on the
plugin. Only primitives (the alias string and {object_type_id: [pk]} maps)
cross the job boundary.
* SearchBackend.cache()/remove() gain a `using` parameter, and a shared
_remove_by_id() performs the raw delete by content type and object IDs. The
worker re-fetches live instances on `using`, removes stale entries, and
re-indexes within a single transaction so an object is never left with no
cache rows. A missing schema (e.g. a branch merged/deprovisioned between
enqueue and execution) is skipped; other database errors propagate so the job
fails visibly.
remove() now resolves the object type via the indexer's (concrete) model rather
than ContentType.objects.get_for_model(instance), matching the content type that
cache() writes under so the two cannot diverge for a proxy/MTI searchable model.
No registered search index is currently a proxy model, so this is not an
observable change today.
Cover the deferred indexing behavior introduced for #21326: * The existing save/delete tests now drive the post-commit flush via captureOnCommitCallbacks; with no RQ worker registered they exercise the synchronous fallback. * Coalescing: a bulk save schedules a single flush carrying every object, and a create-then-delete in one transaction collapses to a removal. * Rollback safety: a rolled-back transaction leaves no scheduled flush, and a commit on the same connection after a rollback still indexes (no stale buffer state suppresses it). * Worker path: when a worker is available the flush enqueues a SearchCacheJob, and the originating database alias is forwarded to the job. (Cross-schema routing itself is covered by the netbox-branching test suite; here we assert only that the alias is captured and forwarded.) * The deferred task tolerates an object deleted between enqueue and execution. * AutocommitCachingTestCase uses TransactionTestCase so saves run outside an atomic block, exercising the inline-indexing branch that TestCase masks.
Search index updates are now applied by a background task after the response is sent, so a newly created or changed object may not appear in search results for a brief period. Update the global search documentation accordingly (and note the synchronous fallback when no background worker is running).
|
Some additional design context for reviewers (and the bot). Collapsed since it's deeper than most reviews will need. Design detailsCoalescing buffer
Coalescing collapses repeated operations on the same object, and a deletion supersedes a pending create/update (the object ends up removed from the index). Rollback safetyDjango clears a connection's In autocommit (no open transaction), Schema routing without a plugin dependency
Rather than have core know about branching, the database alias is captured from the signal's Worker-gated fallbackOn flush, if a Why a
|
|
@claude /review Only report on issues of potential concern; we don't need commentary on anything that doesn't require further consideration. |
|
Claude finished @jnovinger's task in 2m 49s —— View job Review
1. Redis outage turns a successful save into a 500
transaction.on_commit(flush, using=alias)
# Should be:
transaction.on_commit(flush, using=alias, robust=True)
2.
|
* Register the on_commit flush with robust=True so a failure while flushing (e.g. Redis being unreachable when checking for an available worker) is logged rather than propagated, preventing a committed save from returning a 500. * Guard removal_handler's get_for_model() lookup with the same ProgrammingError handling as caching_handler, for deletes that fire during migrations. * Add a test asserting Django's connection.run_on_commit entry shape (savepoint_ids, func, robust), so a future Django change fails with a clear pointer to _pending_batch() rather than an opaque unpack error.
Fixed. The
Fixed.
Agreed it's a Django internal with no public alternative (short of module-level state, which we avoid for rollback safety). Added a test that asserts the
Staying as-is. |
|
@claude /review Only report on issues of potential concern; we don't need commentary on anything that doesn't require further consideration. See previous comments for review response. |
|
Claude finished @jnovinger's task in 3m 18s —— View job Review
Previous issues #1 and #2 are correctly fixed. Issue #3 is acknowledged with a test. Issue #4 is staying as-is (noted). One remaining concern: Autocommit path lacks the same Redis-error protection as the transactional path
if not connection.in_atomic_block:
_flush({(object_type_id, pk): op}, alias)
returnSuggested fix — wrap the autocommit flush to match the transactional path's intent: if not connection.in_atomic_block:
try:
_flush({(object_type_id, pk): op}, alias)
except Exception:
logger.exception("Error flushing search cache (autocommit)")
return |
Make _flush the single guarded dispatcher so a search cache update can never turn an already-committed save into an error, on either the transactional or the autocommit path: * Wrap the worker-availability probe and the job enqueue (both of which touch Redis, and a worker can die between them) in one try/except RedisError, falling back to inline indexing. The search index lives in PostgreSQL, so the inline fallback has no Redis dependency. * Guard the autocommit branch of mark_dirty() so a flush failure there is logged rather than propagated to the post_save/post_delete caller. The transactional path keeps transaction.on_commit(..., robust=True) as defense in depth. * Document that a broker failure mid-enqueue can leave a stranded PENDING Job row (the index is still written inline; the row ages out via housekeeping). * Note why the removal loop is not wrapped in a transaction (single DELETE per content type) and that a dropped update is recovered by reindex, not retry. Tests: add coverage for the inline fallback when the broker is unreachable and when enqueue itself fails after the probe succeeds (pinning the decision to guard the whole dispatch, not just the probe).
|
@claude /review Only report on issues of potential concern; we don't need commentary on anything that doesn't require further consideration. Since the last review, the autocommit-path Redis gap you flagged is fixed: |
|
Claude finished @jnovinger's task in 4m 5s —— View job Review
All previous issues are resolved. No remaining concerns. The three structural fixes hold up:
The psycopg |
Move the search deferral imports to module top level wherever possible, leaving a single function-local import for the one genuinely cycle-bound case: * backends.py imports OP_CACHE/OP_REMOVE/mark_dirty from .deferred at module level (the handlers no longer each import them locally). * deferred.py imports RQ_QUEUE_DEFAULT and any_workers_for_queue at module level. * The only remaining function-local import is _flush's `from netbox.search.tasks import SearchCacheJob, update_search_cache`, which is required because tasks.py imports the search_backend singleton defined at the bottom of backends.py (which itself imports this module). A cleaner fix (extracting the singleton) is tracked as a follow-up. Because any_workers_for_queue is now bound in deferred's namespace, the tests patch netbox.search.deferred.any_workers_for_queue (where it is looked up) rather than utilities.rqworker.any_workers_for_queue.
* Rename mark_dirty() to mark_for_deferred_indexing(), which says what it does (schedule deferred search indexing) rather than borrowing vague cache- invalidation vocabulary. Update the docstring's "dirty" wording to match. * Expand the comment on the on_commit robust=True flag to record why it is required rather than optional: Django runs on_commit callbacks synchronously as the atomic block exits (after COMMIT), so an exception escaping the flush would surface as a 500 on an already-committed write. _flush handles the recoverable Redis fault itself; robust=True keeps any other failure from becoming that post-commit 500.
|
@claude /review Only report on issues of potential concern; we don't need commentary on anything that doesn't require further consideration. Since the last review, a few follow-up items identified during local review have been addressed:
Full detail is in the commit messages and the collapsed design-details comment above; please consult them as needed. |
|
Claude finished @jnovinger's task in 2m 18s —— View job Review
All previous issues have been resolved. The three changes described in the trigger comment are all correct:
No remaining concerns. |
pheus
left a comment
There was a problem hiding this comment.
Thanks! The direction looks good to me.
I do think this needs a bit more adjustment before merge, mainly around transaction correctness and code organization.
Nested atomic() rollback behavior
I think the current batching logic can leak rolled-back work from a nested transaction into an outer batch.
_pending_batch() appears to find an existing pending callback by database alias only. That means an operation performed inside an inner atomic() block can be added to a batch that belongs to an outer on_commit() callback.
For example:
with transaction.atomic():
site_a.save() # schedules an outer search-cache batch
try:
with transaction.atomic():
site_b.delete() # added to the existing outer batch
raise RuntimeError()
except RuntimeError:
pass
# outer transaction commitsAt the database level, site_b still exists because the inner savepoint rolled back. But because the delete was appended to the outer batch, the search cache update can still remove site_b after the outer transaction commits.
Could we make the pending batch lookup savepoint-aware, or otherwise avoid reusing a batch registered outside the current savepoint scope?
I think this should have a regression test covering:
- schedule a search-cache update in an outer transaction;
- delete or update another searchable object inside an inner
atomic(); - roll back the inner savepoint;
- commit the outer transaction;
- confirm the rolled-back inner operation did not affect the search cache.
Code organization / ownership
I also think the new deferred search-cache path would be easier to maintain with clearer ownership boundaries.
Right now the import/dependency shape feels a bit circular:
search.backends -> search.deferred -> search.tasks -> search.backends
The local import avoids the runtime circular import, but I think it is a sign that the responsibilities are not quite separated yet. deferred.py is handling transaction batching, callback coalescing, enqueue-vs-inline behavior, and dispatching. Then the task/job code reaches back into backend-specific cache internals.
I would prefer something closer to:
netbox/search/
├── backends.py # search backend implementations
├── deferred.py # transaction/on_commit batching and coalescing
└── jobs.py # SearchCacheJob(JobRunner)
In particular, I think SearchCacheJob(JobRunner) should live in netbox/search/jobs.py, matching the usual NetBox job organization. The job itself can stay very thin and delegate the actual cache mutation to the backend.
For example:
# netbox/search/jobs.py
class SearchCacheJob(JobRunner):
class Meta:
name = "Search cache update"
def run(self, using=None, cache_groups=None, remove_groups=None, **kwargs):
from netbox.search.backends import search_backend
search_backend.apply_deferred_updates(
using=using,
cache_groups=cache_groups,
remove_groups=remove_groups,
log=self.logger,
)Then the CachedValue-specific behavior can remain owned by CachedValueSearchBackend:
class CachedValueSearchBackend(SearchBackend):
def apply_deferred_updates(self, using=None, cache_groups=None, remove_groups=None, log=None):
# owns CachedValue lookups, removals, cache(..., using=...), etc.
...That would also make the custom backend contract clearer. Before this change, custom SEARCH_BACKEND implementations mostly needed to provide the public cache() / remove() behavior used by the signal handlers. With the deferred path, the job now appears to assume CachedValueSearchBackend internals such as _remove_by_id() and a cache(..., using=...) signature. Keeping the deferred implementation on CachedValueSearchBackend would avoid making those assumptions globally.
Comments
Small style note: I think some of the comments in this path could be trimmed once the responsibilities are split up.
The deferred search-cache buffer used a single on_commit callback per (database alias, transaction), found by scanning run_on_commit for the alias tag and mutating its batch. An op buffered inside a nested atomic() that later rolled back was appended to the outer callback's batch, which Django's savepoint_rollback never inspects (it prunes callbacks by the savepoint-id set captured at registration, not by their closure). The rolled-back op survived and flushed on the outer commit, for example removing from the cache an object whose delete was rolled back. Scope each flush callback to the savepoint stack active when it is registered (tuple(connection.savepoint_ids)); _pending_batch now matches on alias and scope. Each savepoint scope gets its own callback and batch, so a rolled-back savepoint's callback is pruned by Django automatically. Coalescing still holds within a scope; cross-scope ops produce one job per scope, and correctness is preserved because the worker re-fetches live rows at flush time. Add regression coverage for the nested-rollback leak plus committed nested scopes, cross-scope save-then-delete, sibling savepoints, and deep nesting with a middle rollback.
Relocate SearchCacheJob into netbox/search/jobs.py, matching the per-app job convention (core/jobs.py, extras/jobs.py); tasks.py was the lone violation. The job is now thin: it delegates to the backend. Move the deferred-update logic (the remove loop, the re-fetch + atomic remove-then-cache loop, and the missing-schema skip) from the free function update_search_cache onto CachedValueSearchBackend as apply_deferred_updates(), with _is_missing_schema and the Postgres SQLSTATE set as private members of that class. The CachedValue specifics (_remove_by_id, cache(..., using=...)) are now reached only from within the backend that owns them, rather than from the job and the inline fallback reaching into backend internals. SearchBackend gains an abstract apply_deferred_updates() (raising NotImplementedError like its siblings) so the deferred contract is explicit for custom SEARCH_BACKEND classes. deferred._flush now dispatches via SearchCacheJob (worker path) or search_backend.apply_deferred_updates() (inline fallback). This renames and relocates the backends/deferred/jobs import cycle but does not break it; the single function-local import in _flush remains, and the clean fix (extracting the search_backend singleton) is still tracked in #22485.
|
Both addressed, in two commits now pushed. Nested Fix: scope each flush callback to the savepoint stack active when it is registered ( Code organization ( One thing this does not fully resolve: it renames and relocates the |
pheus
left a comment
There was a problem hiding this comment.
Thanks for the updates! This looks very close.
I have two small follow-ups.
For the delete/removal tests, could we make sure the object PK is captured before calling delete() wherever we assert against CachedValue.object_id? Assertions using site.pk after site.delete() can accidentally query for object_id=None instead of the deleted object’s original ID.
It would also be useful for at least one delete/removal test to seed existing CachedValue rows before the delete and then assert that rows for the captured PK are removed.
One other API question: do we want deferred indexing to become part of the base SearchBackend contract? apply_deferred_updates() makes sense for CachedValueSearchBackend, but it is a new expectation for custom SEARCH_BACKEND implementations that may previously have only implemented cache(), remove(), search(), and clear().
Other than that, this looks good to me. Thanks again for iterating on it.
Closes: #21326
Updates to the global search cache (
CachedValue) currently happen synchronously inside thepost_save/post_deletesignal handlers, adding latency to every write. This defers that work to a background job that runs after the response is sent.SearchCacheJob, rather than running inline in the signal handlers.netbox-branchingwithout core depending on the plugin.docs/features/search.mdis updated to note that search results are now eventually consistent (a newly created or changed object may not appear for a brief moment).I'll follow up with a comment covering the design details (the coalescing buffer, the alias capture/replay, and the worker-gated fallback).