fix(plugin-multi-tenant): forward req in afterTenantDelete cascade#16505
Open
philipdaveby wants to merge 1 commit intopayloadcms:mainfrom
Open
fix(plugin-multi-tenant): forward req in afterTenantDelete cascade#16505philipdaveby wants to merge 1 commit intopayloadcms:mainfrom
philipdaveby wants to merge 1 commit intopayloadcms:mainfrom
Conversation
The afterTenantDelete hook receives `req` from the caller (which carries
the parent transaction's `transactionID`) but does not forward it into
its three internal Payload calls — the cascade `payload.delete` per
tenant-scoped collection, the users-find, and the users-update.
Each cascade call therefore grabs a fresh pool connection and runs
OUTSIDE the caller's transaction. When the caller (e.g. a Kafka
consumer that opens a transaction to delete a tenant) holds row locks
via its open tx, the cascade's sibling tx tries to acquire the same
locks and blocks. The parent awaits the cascade's promise; the cascade
waits on the parent's locks; PostgreSQL can't detect the cycle because
the parent connection is at `ClientRead` (waiting on Node, not on a
Lock) — producing multi-minute idle-in-transaction leaks visible in
`pg_stat_activity`.
We hit this in our staging/dev environment when deleting a tenant
(QPortal instance) via a Kafka consumer that opens its own tx and calls
`payload.delete(tenantsSlug, id, req)`. The leak shape is:
pid X (idle in tx, ClientRead) — last query
DELETE FROM payload_preferences WHERE key IN ($1)
pid Y (active, blocked on transactionid Lock) — query
DELETE FROM static_assets WHERE id = $1
pid X is the parent (its last visible statement is Payload's
`deleteUserPreferences` cleanup at the end of the deleteByID flow,
right before the multi-tenant `afterDelete` hook fires). pid Y is one
of the cascade's sibling txs. Postgres sees the lock cycle but reports
the parent as `ClientRead` rather than `Lock`, so the cycle isn't
broken automatically — the deadlock persists until something external
(idle_in_transaction_session_timeout, NAT eviction, or our own
process-level watchdog) terminates a connection.
Fix: pass the caller's `req` into all three Payload calls in
afterTenantDelete so the cascade and the users-cleanup share the
parent's transactionID. Three single-line additions.
The bug has been present in every shipped 3.x version (verified by
extracting `dist/hooks/afterTenantDelete.js` from npm tarballs for
3.30.0, 3.50.0, 3.64.0, 3.84.1).
Beyond fixing the deadlock, sharing the tx also restores
transactional consistency: a half-failed cascade no longer commits
"tenant deleted, child rows orphaned" or "child rows deleted, tenant
still present".
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What?
@payloadcms/plugin-multi-tenant'safterTenantDeletehook (packages/plugin-multi-tenant/src/hooks/afterTenantDelete.ts) receivesreqfrom the caller but does not forward it into the three internal Payload calls inside the hook:req.payload.delete({ collection: slug, where: {...} })— per tenant-scoped collection (cascade)req.payload.find({ collection: usersSlug, ... })— users-with-tenant lookupreq.payload.update({ collection: usersSlug, ... })— per affected userEach call therefore grabs a fresh pool connection and runs outside the caller's transaction.
Why?
When the caller (e.g. a Kafka consumer that opens a transaction to delete a tenant) holds row locks via its open tx, the cascade's sibling tx tries to acquire the same locks and blocks. The parent awaits the cascade's promise; the cascade waits on the parent's locks; Postgres can't detect the cycle because the parent connection is at
ClientRead(waiting on Node, not on a Lock) — producing multi-minuteidle in transactionleaks visible inpg_stat_activity.The leak shape we observed:
```
pid X (idle in tx, ClientRead) — last query:
DELETE FROM payload_preferences WHERE key IN ($1)
pid Y (active, blocked on transactionid Lock) — query:
DELETE FROM static_assets WHERE id = $1
```
pid Xis the parent (its last visible statement is Payload'sdeleteUserPreferencescleanup at the end of thedeleteByIDflow, right before the multi-tenantafterDeletehook fires).pid Yis one of the cascade's sibling txs. The cycle persists until something external (idle_in_transaction_session_timeout, NAT eviction, or a process-level watchdog) terminates a connection.Beyond fixing the deadlock, sharing the tx also restores transactional consistency: a half-failed cascade no longer commits "tenant deleted, child rows orphaned" or "child rows deleted, tenant still present".
How?
Pass the caller's
reqinto all three Payload calls so the cascade and the users-cleanup share the parent'stransactionID. Three single-line additions:```diff
cleanupPromises.push(
req.payload.delete({
collection: slug,
)
```
```diff
const usersWithTenant = (await req.payload.find({
collection: usersSlug,
depth: 0,
limit: 0,
where: { ... },
})) as PaginatedDocs
```
```diff
cleanupPromises.push(
req.payload.update({
id: user.id,
collection: usersSlug,
data: { ... },
```
Reproduction notes
dist/hooks/afterTenantDelete.jsfrom npm tarballs for 3.30.0, 3.50.0, 3.64.0 (current popular pin), and 3.84.1 (latest at time of writing) — all identical.payload.deletetouches a row that the parent's tx has already touched (or vice versa).payload.delete(tenantsSlug, id, req)afterpayload.db.beginTransaction().Test plan
addTenantCleanupwith areqcarrying atransactionID, assert the innerpayload.delete/find/updatecalls receive the sametransactionID(e.g. by spying on the payload mock).I'm happy to add the regression test if helpful — let me know which test file you'd prefer it in (
packages/plugin-multi-tenant/test/looks like the right place, but I haven't audited the existing test layout).Risk
Minimal. The change only thread the existing
reqthrough. Behaviour is unchanged when the caller doesn't open a transaction (req.transactionIDis undefined and Payload's adapter treats it as a plain pool acquisition).