db: add DeleteSuffixRange API by dt · Pull Request #6046 · cockroachdb/pebble

dt · 2026-05-13T16:31:53Z

DB.DeleteSuffixRange(ctx, span, lower, upper) deletes all point keys and range key entries within a key span whose suffixes fall within the range (lower, upper], as determined by the key schema. Affected keys are hidden from all future iterators and eventually dropped during compaction.

Internally, the deletion is implemented by attaching a suffix mask to the metadata of overlapping SSTables. If the memtable overlaps the span, it is flushed first. Fully-contained SSTs are replaced with virtual tables that have the mask set. Straddling SSTs are split into virtual tables at the span boundary so the mask applies only to the portion within the span.

Also adds DB.FlushIfOverlapping(span) which flushes the memtable only if any flushable in the queue possibly overlaps the given key range.

cockroach-teamcity · 2026-05-13T16:32:05Z

This change is

RaduBerinde · 2026-05-13T17:16:01Z

The API is equivalent to deleting all existing keys within the span with suffix between A and B, correct? We should name and define the API in terms of the functionality, not how it's implemented (e.g. we could also choose to implement via a new special range del/key).

@sumeerbhola what do you think about the implementation strategy? The high-level operation is closer to what we use range keys for, though I can't see a big problem with this method (given that all our snapshots are EFOS now). In fact, would it make things a lot simpler if we used the same idea to provide the functionality we currently use range keys for?

dt · 2026-05-13T17:36:09Z

deleting all existing keys within the span with suffix between A and B

Correct. Will do.

if we used the same idea to provide the functionality we currently use range keys for

I had a similar thought. For pebble rangedels I'd argue we already sorta did in fact: excise is basically the same idea of doing this by file in range and splitting straddlers using vtables to act on only the contained halves, but excise then gets to just drop the in-span tables entirely rather than needing to attach metadata to them, but in some sense, dropped is metadata.

CRDB's MVCC deletion rangekeys though I think are legitimately more like "values" that the caller has set and expects to read back, since they need be "visible" as a value in iterators over any part of the span, eg for backups or rangefeeds, so keeping them as values that get put in blocks in files as opposed to file metadata doesn't feel too wrong.

dt · 2026-05-13T17:38:25Z

We should name and define the API in terms of the functionality

DeleteBySuffix() ? DeleteSuffixRange ?

github-actions · 2026-05-13T17:44:22Z

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

sumeerbhola

The high-level operation is closer to what we use range keys for, though I can't see a big problem with this method (given that all our snapshots are EFOS now). In fact, would it make things a lot simpler if we used the same idea to provide the functionality we currently use range keys for?

I haven't look at the implementation yet, but this is a destructive operation, yes? In that the LSM is immediately altered. Which is different from MVCC range keys that someone can read below until MVCC GC runs.

@sumeerbhola made 1 comment.
Reviewable status: 0 of 19 files reviewed, all discussions resolved (waiting on RaduBerinde).

github-actions · 2026-05-15T13:34:33Z

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

dt · 2026-05-15T16:15:19Z

this is a destructive operation, yes? In that the LSM is immediately altered

Yes, correct, it is not an mvcc write and feels distinct from writing an mvcc rangekey.

I do think this is similar to how we have replaced/are replacing pebble rangedels / crdb's ClearRange with Excise though where we clip SSTs down to the parts we do and don't care about via virtual ssts and then act on the parts we do care about via manifest edits instead of by putting new values in blocks in new files.

github-actions · 2026-05-19T03:38:02Z

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

github-actions · 2026-05-19T14:07:59Z

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

github-actions · 2026-05-19T14:27:29Z

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

github-actions · 2026-05-19T15:04:57Z

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

github-actions · 2026-05-19T15:13:53Z

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

RaduBerinde

The approach looks good to me overall.

We need to add metamorphic test support for the new operation. I would add a randomized configuration flag as to whether we actually use the operation, or we do an equivalent scan-and-delete operation (resulting in a very powerful cross-check between configurations).

Some of these commits have glaring test lapses in the relevant package, and I don't think you would have skipped those if you weren't using AI. Please review the changes and make sure they are up to your standards beforehand. Even just asking Claude to review the commits one more time would improve things a lot, for example here's what it had to say about the row-level SuffixMask filtering in DataBlockIter commit: https://gist.github.com/RaduBerinde/9ecd4cec116843768e77d76aa22e62ed

@RaduBerinde made 12 comments.
Reviewable status: 0 of 21 files reviewed, 11 unresolved discussions (waiting on dt).

db.go line 2043 at r10 (raw file):

// possibly overlaps with the given key range. If no flushable overlaps, this
// is a no-op.
func (d *DB) FlushIfOverlapping(span KeyRange) error {

[nit] Add an explanation of why someone might want to call this.

internal/manifest/table_metadata.go line 909 at r11 (raw file):

	}
	if m.SuffixMask.IsSet() {
		fmt.Fprintf(&b, " suffix-mask-lower:%x suffix-mask-upper:%x", m.SuffixMask.Lower, m.SuffixMask.Upper)

[nit] suffix-mask: [%s-%s) and pass `format(Lower), format(Upper)?

We should update ParseTableMetadataDebug to parse this to keep it consistent (might help with testing).

internal/manifest/version_edit.go line 95 at r11 (raw file):

	// customTagBlobReferences2 contains BackingValueSize for each BlobReference.
	customTagBlobReferences2 = 70
	// customTagSuffixMask encodes the lower and upper bounds of the suffix

We need a new format major version.

internal/manifest/version_edit.go line 501 at r11 (raw file):

						}

					case customTagSuffixMask:

This need some test coverage in version_edit_test.go

sstable/blockiter/transforms.go line 15 at r11 (raw file):

)

// SuffixMask defines a range of suffixes to mask during iteration. Lower and

Explain how this interacts with synthetic suffix.

sstable/blockiter/transforms.go line 16 at r11 (raw file):

// SuffixMask defines a range of suffixes to mask during iteration. Lower and
// Upper are in ascending suffix magnitude (e.g. wall-time order for MVCC

Pebble code is agnostic to how things are sorted in the comparer. We should swap these fields, Lower should be <= Upper according to the comparer, and the masked range should be [Lower, Upper).

sstable/blockiter/transforms.go line 19 at r11 (raw file):

// timestamps), NOT in ComparePointSuffixes order. Keys whose suffix is
// strictly greater than Lower and at most Upper (in suffix magnitude) are
// hidden. Suffixless keys are never hidden. Both bounds must be set.

Leave a TODO here to do what we did for SyntheticPrefixAndSuffix to reduce the size of the TableMetadata.

sstable/colblk/data_block.go line 195 at r12 (raw file):

}

// SuffixMaskChecker is an optional interface that KeySeeker

In my experience, this pattern ends up being regretted most of the time. It doesn't allow neat wrapping, and is very subtle to someone reading the code for the first time. I think any KeySeeker should implement this anyway. Definitely we want it in the defaulte key schema for easy testing.

sstable/colblk/data_block.go line 1597 at r12 (raw file):

		}
	}
	if i.transforms.SuffixMask.IsSet() {

I don't see any tests for these changes..

sstable/rowblk/rowblk_iter.go line 501 at r13 (raw file):

			i.ikv.K.SetSeqNum(base.SeqNum(n))
		}
		if !hiddenPoint && i.transforms.SuffixMask.IsSet() && i.suffixCmp != nil {

No tests?

suffix_mask_test.go line 1 at r14 (raw file):

// Copyright 2026 The LevelDB-Go and Pebble Authors. All rights reserved. Use

Much of the test code in this file is very dense and inscrutable. Please make a pass over it and see how much of it is necessary vs how much of it belongs in a lower level. I'd consider replacing many of these with a single randomized test that uses a simple crosscheck (e.g. two stores, one in which we just Scan and Delete the relevant keys).

`looseLeftTableBounds` and `looseRightTableBounds` synthesize loose bounds for the post-excise virtual tables when the caller chose not to consult the original SST for tight bounds. Two latent issues: 1. Both helpers unconditionally extended bounds for each key type present on the original table, even when all keys of that type fell entirely outside the excised portion. The synthesized largest (left table, clamped at `exciseSpanStart`) or smallest (right table, clamped at `exciseSpanEnd`) could then end up on the wrong side of the original's preserved bound, producing a table where `smallest > largest` and failing `TableMetadata.Validate`. This change gates each key-type contribution on whether the original actually extends past the excise boundary on the relevant side. 2. `looseRightTableBounds` copied the original's smallest internal key verbatim, which carries the original's trailer (often a tiny `seqnum=0, kind=DELSIZED`). A real range tombstone in the file at the same user key with a larger trailer then violated the rangeDelIter's lower-bound assertion in invariants builds. Since the bound is loose either way, synthesize the smallest with the maximum non-sentinel trailer (`SeqNumMax-1`, `InternalKeyKindMaxForSSTable`); strictly safer. Testdata in `testdata/excise` and `testdata/excise_bounds` is updated to reflect the new synthesized trailer in the right-table-loose case.

`determineExcisedTableBlobReferences` scales each blob reference's `ValueSize` proportionally from the original table to the excised table. The scaling factor (`excisedTable.Size / originalSize`) is normally <= 1, but size-estimate noise can produce a factor > 1 — for example when `excisedTable.Size` is the loose estimate from `determineExcisedTableSize` and exceeds the (also-estimated) `originalSize`. In that case the scaled `ValueSize` exceeds the physical blob file's `ValueSize`, tripping the invariant check in `MakeBlobReference` on manifest replay. Cap the scaled value at the original reference's `ValueSize`. The excised table is a sub-range of the original; its share of any referenced blob file cannot legitimately exceed the original's share, so the cap is a strictly-tighter bound rather than a behavior change.

Adds a private `flushIfOverlapping(span, extraBoundsLocked)` helper for features that mutate file metadata in a key range and need any data still in the memtable for that range to be moved to SSTs first. `extraBoundsLocked`, if non-nil, is invoked with `d.mu` held and may read from `DB.mu`-protected state to produce additional overlap-check bounds — for example, a caller can extend the overlap check with snapshot-protected ranges so that a snapshot reader's view of the LSM is preserved across the metadata mutation. The callback returns internal `bounded` values, so the helper is package-private. The helper itself has no knowledge of its callers' semantics; the callback is the seam. Foundation for `DeleteSuffixRange` (queued in this branch) and the planned cloning-with-prefix-replacements feature, both of which act on file metadata and need a memtable flush as a prerequisite. Each will pass its own `extraBoundsLocked` callback as appropriate. Tests in `flush_test.go` cover the `nil` callback path (basic span-only overlap check against point keys, range deletes, range keys, and boundary semantics), the non-nil callback path (extra bounds trigger a flush when the span itself wouldn't), and error paths (closed DB, read-only, invalid `KeyRange`).

Add `SuffixMask` (a `[Lower, Upper)` range over the comparer's suffix ordering) and an ordered list `SuffixMasks` to `blockiter.Transforms`, `blockiter.FragmentTransforms`, and `manifest.TableMetadata`. Keys whose suffix falls in any of the listed ranges are hidden during iteration; keys with no suffix are never hidden. The list shape lets repeated `DeleteSuffixRange` calls accumulate per file: contiguous ranges merge in place via `expandSuffixMask`, disjoint ranges append. The list and the byte slices it holds are immutable by convention — any mutator clones first. This is documented on each field. `Lower`/`Upper` refer to the comparer's ordering, not wall-time magnitude. For a comparer like CockroachDB's that sorts newest-first, `Lower` is the newest timestamp (inclusive) and `Upper` is the oldest (exclusive). Encoding: a single new `customTagSuffixMask` (tag 71) on the existing `NewFile4` record carries one (Lower, Upper) pair. The encoder emits one tag per list entry; the decoder appends each occurrence. No on-wire structure change beyond the new tag. The new `FormatSuffixMask` format major version gates writers from emitting the tag at all. `DebugString` / `ParseTableMetadataDebug` render each entry as `suffixmask:[loHex-upHex)`; the field is one token because `-` is a debug-parser separator. Tests: `TestVersionEditRoundTripSuffixMask` covers a hand-picked set plus a randomized roundtrip generating 0..N masks per file (catches length-prefix arithmetic and asymmetric bound sizes). A `suffixmask` case is added to `TestTableMetadata_ParseRoundTrip`.

Wires the runtime portion of `SuffixMask`: a data-block iterator (row-based or columnar) configured with `Transforms.SuffixMasks` hides every row whose effective suffix falls within any mask's `[Lower, Upper)` range, as ordered by `ComparePointSuffixes`. Rows with an empty suffix are never hidden, matching the documented DSR contract. Three components: 1. Per-row check in both data-block iterators. The columnar path (`colblk/data_block.go::isSuffixMasked`) delegates to a new `KeySeeker.IsMaskedBySuffixMask` method; the row-based path (`rowblk/rowblk_iter.go`) inlines the comparison at four positioning sites in lockstep with the file's existing manually- inlined hot-path style. Both correctly handle `SyntheticSuffix`: the effective suffix of a non-empty stored suffix is the synthetic one, so the mask check operates against the substituted suffix. 2. Interleaving with `HideObsoletePoints`. `skipHiddenForward`/ `skipHiddenBackward` (colblk) advance past mask-hidden rows while also respecting the obsolete-point predicate, with the per-call `nextObsoletePoint` cursor reinitialized on mask-induced jumps so the two filters compose correctly. 3. IndexIter must NOT apply `SuffixMasks`. The row-based index iter (`rowblk/rowblk_index_iter.go`) wraps `rowblk.Iter`; without this fix it forwarded `Transforms.SuffixMasks` to the underlying iter, which then evaluated the mask against the index block's separator keys and silently dropped data blocks whose separator's suffix happened to fall in the masked range. `Init`/`InitHandle` now clear `SuffixMasks` before delegating. `TestIndexIterIgnoresSuffixMask` is the regression test. A nil-check is added in `rowblk_iter.go::SeekLT`'s synthetic-suffix correction branch: `First()` can now return nil if every row in the block is hidden (by the mask or by `HideObsoletePoints`), and the prior code unconditionally dereferenced the returned key. Init-time assertions in `Iter.Init`/`Iter.InitHandle` require a non-nil `ComparePointSuffixes` whenever `SuffixMasks` is non-empty — silently leaking masked keys due to a missing comparer would be a correctness regression rather than a missing feature. Oracle tests in `data_block_test.go::TestDataBlockIterSuffixMaskOracle` and `rowblk_iter_test.go::TestBlockIterSuffixMaskOracle` cross-product mask × obsolete × all positioning methods; cockroachkvs gains a fast- path `IsMaskedBySuffixMask` that inline-decodes MVCC suffix bounds.

Implement SuffixMask filtering for range key and range deletion fragment iterators in both columnar (`keyspan.go`) and row-oriented (`rowblk_fragment_iter.go`) paths. Entries whose suffix falls within `[Lower, Upper)` are removed from spans during iteration. If all entries in a span are masked, the span is skipped entirely. `keyspanIter.init` now takes the suffix comparer alongside the key comparer (previously it was assigned separately by `NewKeyspanIter`). Both `keyspanIter.init` and `rowblk.NewFragmentIter` fail loudly when a `SuffixMask` is configured without a non-nil `ComparePointSuffixes`; this prevents the mask from silently leaking range key entries that should be hidden. Add `suffix-mask-{lower,upper}` directives to `TestKeyspanBlock`'s data-driven harness, plus testdata cases that exercise: - partial mask filtering within a span, - fully-masked-span skip in both `Next`/`Prev` directions, - `SeekGE` / `SeekLT` landing inside a fully-masked span, - `RANGEKEYDEL` (suffixless) preservation alongside masked `RANGEKEYSET` entries, - all-keys-masked across every span. Add the parallel multi-span / mixed-kind cases to the row-oriented `testdata/rowblk_fragment_iter`.

Adds `DB.DeleteSuffixRange(ctx, span, lower, upper)`, which hides every point key and range-key entry within `span` whose suffix falls in `[lower, upper)` per `ComparePointSuffixes`. The masked entries are invisible to all subsequent iterators and are physically dropped during compaction. The operation is metadata-only at call time (no row-level writes); per-row filtering is applied at iteration via the `Transforms.SuffixMasks` machinery added in prior commits. This commit lands the core shape: a public API that, for every overlapping SST, attaches the new mask to the (possibly virtual) inside portion of the file. Files fully contained in the span gain a mask directly; straddling files are split via `exciseTable` into up to three virtual tables so the mask applies only to the inside one. Repeated calls on the same file accumulate masks in a per-table list (`expandSuffixMask` merges contiguous ranges; disjoint ranges append). A new comment in `file_cache.go::newPointIter` documents why no block- or file-level property filter is installed for `SuffixMask` (the collector doesn't currently track suffixless-key presence, so a filter could incorrectly skip blocks containing unfiltered suffixless keys). A later commit adds a file-level skip optimization gated on a caller- supplied per-file aggregate check. Several follow-up layers are deliberately deferred to subsequent commits so each can be reviewed in isolation as an enhancement to the core: - `SyntheticSuffix` file-level excise optimization (the file's point keys all share an effective suffix, so the per-row path can be short-circuited to an excise when the synth suffix is in mask). - EFOS-protected-range overlap handling (DSR must transition any pre-file-only EFOS that overlaps the span before attaching its mask, otherwise the EFOS observes a post-DSR view). - In-progress compaction cancellation (a compaction picked against the pre-DSR version may hit a manifest blob-ref assertion when it tries to delete a table DSR has replaced with a virtual one). - Copy-compaction `SuffixMasks` propagation (copy-compactions don't iterate through the per-row filter, so the mask must ride along on the destination metadata). - Block-property-aggregate file-level skip optimization. - `SuffixMasks` on `ExternalFile`/`SharedSSTMeta` so DSR-attached masks survive shared/external `Replicate`. Tests in `suffix_mask_*_test.go` cover validation, the `expandSuffixMask` semantics across same-length and variable-length suffix encodings, the per-row filter machinery via SSTable-direct iteration (rowblk + colblk), oracle correctness against a user-space implementation of the same predicate, and the straddling-file splitting via `exciseTable`. The data-driven test disables automatic compactions to keep test sequencing deterministic; later commits add explicit cancellation regression tests with compactions enabled. `excise.go` propagates `SuffixMasks` onto the left and right virtual tables produced by `exciseTable` so that masks attached to a file survive that file being split — required for both the DSR straddling path here and any subsequent `Excise` on a DSR-masked file.

A file with a non-empty `SyntheticSuffix` substitutes the synth in for every point key's effective suffix at iteration time. `RangeKeySet` entries with non-empty original suffix get the same substitution; entries with empty original suffix retain empty (per the `SyntheticSuffix` contract — empty-suffix `RangeKeySet` entries are never masked, matching the broader DSR contract for suffixless keys). `RangeKeyDelete` entries have no per-key suffix and are also never masked. `RangeKeyUnset` is not allowed on a `SyntheticSuffix` file at all (see the assertion in `rowblk_fragment_iter.go::applySpanTransforms`). The mask answer for such a file is therefore uniform across nearly every row: every non-empty stored suffix maps to the same effective suffix. Three cases: 1. synth NOT in mask: no row in the file matches the mask. Skip the file entirely — no version edit, no mask attachment. 2. synth IN mask AND no range keys: every point key is uniformly masked. Excise the file's overlap; a per-row mask would produce the same observable behavior but at iteration-time cost. The new `exciseOverlap` helper handles the fully-contained vs straddling cases. 3. synth IN mask AND has range keys: cannot safely excise — excising would drop `RangeKeyDelete` entries and any empty-suffix `RangeKeySet` entries (both never-masked per the DSR contract). Fall through to per-row mask attachment, which honors empty suffixes via the existing filter machinery. The optimization is an enhancement on top of the core DSR API added in the prior commit: that commit's per-row attachment is correct (and the Path B per-row check in `colblk,cockroachkvs,rowblk` is synth-suffix-aware), so this commit could land later as a pure perf improvement. It lands here because it's tightly coupled to the documented file-level decision tree and is small enough to merit inclusion alongside the API. The conservative fall-through on `HasRangeKeys` is wider than strictly necessary — a synth-suffix file with range keys but no suffixless entries could still be excised — but distinguishing those cases requires `TableMetadata` bits that don't exist yet. A TODO records the optimization opportunity. `TestDeleteSuffixRangeExciseOverlap` covers the case-2 excise path end-to-end; `TestDeleteSuffixRangeSyntheticSuffixWithEmptyRangeKey` in `suffix_mask_synth_rangekey_test.go` regression-tests case 3, verifying that an empty-suffix `RangeKeySet` entry on a synth-suffix file in the mask range remains visible after DSR (i.e., we fell through to per-row instead of excising).

A pre-file-only `EventuallyFileOnlySnapshot` (EFOS) reads via `snapshotIterOpts{seqNum: es.seqNum}` — the live LSM filtered by seqnum, NOT a pinned `*Version`. The seqnum filter is not a defense against metadata changes like mask attachment (those don't bump seqnums), so a pre-file-only EFOS whose protected range overlaps a DSR span would observe our mask immediately, violating the EFOS contract that protected-range reads remain consistent across non-additive LSM mutations. The fix has two parts: 1. Extend `flushIfOverlapping`'s overlap check with the protected ranges of any pre-file-only EFOS whose protected ranges overlap our span. This forces a flush of any EFOS-overlapping memtable content, and the flush completion's `maybeTransitionSnapshotsToFileOnlyLocked` transitions the EFOS to file-only — pinning a pre-DSR `*Version` — before we attach the mask. 2. Inside DSR's `UpdateVersionLocked` updateFn, call `maybeTransitionSnapshotsToFileOnlyLocked` explicitly. The flush path covers EFOSes that needed a flush to unblock the transition; this explicit call covers EFOSes that didn't need a flush (no blocking memtable content) but still need to transition before our VE attaches the mask. Without the explicit call, such an EFOS would remain pre-file-only past our `UpdateVersionLocked` and transition later (triggered by an unrelated flush), at which point the version it pins already includes our mask. Both parts together close the EFOS-vs-DSR race the metamorphic test surfaced as cross-config divergence (one config's EFOS transitioned before our mask, another after, producing different read results for the same op stream). Classic `Snapshot` observers cannot be protected the same way (no protected ranges, no transition mechanism); their post-DSR reads observe the masked LSM, matching the documented excise-with-classic-snapshot behavior. A deterministic single-process unit test for this race is awkward — it requires interleaving an EFOS-blocking flush with a DSR's version edit application, which depends on goroutine scheduling. Reverting either change in isolation causes the metamorphic `TestMetaCockroachKVs` to flake at a ~37% rate within ~3 iterations; both together produce ≥120 consecutive clean runs across all metamorphic variants. The metamorphic coverage is the regression test.

A compaction picked against the pre-DSR version holds a `*TableMetadata` pointer for each input file. When that compaction later applies its version edit, the manifest's blob-reference tracker (keyed by pointer in `internal/manifest/blob_metadata.go`) looks up each deleted table's blob references. If DSR has, in the meantime, replaced the input file with a virtual table — a different `*TableMetadata` pointer — the tracker no longer has an entry for the original pointer and fires: pebble: deleted table NNNN's reference to blob file BNNN not known Fix: inside DSR's `UpdateVersionLocked` updateFn (which holds the manifest log lock), iterate `d.mu.compact.inProgress` and call `Cancel()` on any compaction whose `Bounds()` overlap the DSR span. The cancelled compaction's own apply step checks `c.cancel` under the same log lock and returns `ErrCancelledCompaction` instead of applying its conflicting version edit. The retry loop in `Compact` re-picks against the post-DSR version. An earlier shape waited for `compactingCount == 0` before entering `UpdateVersionLocked`. That's insufficient because `UpdateVersionLocked`'s `logLock` may release `DB.mu` while waiting for another writer to finish, during which a new compaction may be scheduled (incrementing `compactingCount`). The cancellation pattern is correct and mirrors what `IngestAndExcise` does. This commit also adds a test-only hook `Options.private.testingDuringCompactionIOFunc`, invoked from a default table compaction's I/O phase (after the compaction has written its outputs and constructed its `VersionEdit`, but before re-acquiring `DB.mu` to apply it). The hook provides a deterministic window for tests to inject a DSR call that races with the in-progress compaction. `TestDeleteSuffixRangeCancelsOverlappingCompaction` uses the hook to pause a compaction mid-I/O, runs DSR over the same span, and verifies that the compaction's apply observes the cancel cleanly (no manifest assertion fires). Without the cancel logic in this commit, the test deadlocks via the original `compactingCount` wait OR (with the wait removed) trips the `deleted table reference to blob file not known` assertion, depending on which intermediate shape is being exercised.

`runCopyCompaction` produces an output table whose backing is a byte-for-byte copy of an input table; it does NOT iterate over the input through the per-row `SuffixMask` filter. If the input has `SuffixMasks` attached, those masks must ride along on the output's `TableMetadata` — otherwise the copy would silently un-hide rows that the source had masked. This affects the `Download` path (external -> local), which is the only caller of `runCopyCompaction` in production. This is in contrast to `runDefaultTableCompaction`, which iterates inputs through the standard iterator stack (including the per-row mask filter) and so produces a post-filter output that must NOT carry the mask forward. A direct unit test is awkward because `runCopyCompaction` is only chosen for a narrow combination of conditions (external backing, no per-row work). Coverage is via the metamorphic test in the test infrastructure stage, where `Download` interleaved with DSR exercises the path.

`DeleteSuffixRange` previously attached a `SuffixMask` to every sstable whose user-key bounds overlapped the requested span, even when the file demonstrably contained no suffixes in the mask's `[lower, upper)` range. The masks were observationally no-ops but incurred a manifest edit, a virtual-table allocation, and per-row filter overhead for the lifetime of the file. Add an optional `Options.SuffixRangeIntersects` hook consulted on each overlapping file before a mask is attached. When the hook returns false, the file is skipped entirely (no version edit, no mask, no excise). When unset, `DeleteSuffixRange` behaves as before. `cockroachkvs.SuffixRangeIntersectsTable` is the ready-to-use implementation: it decodes the file's `MVCCTimeInterval` block- property aggregate (already collected by `cockroachkvs.BlockPropertyCollectors`) and tests it against the mask's wall-time range. It fails open (assumes intersection) on missing aggregates, undecodable inputs, or non-MVCC suffixes. Under invariants builds, a 25% metamorphic bypass attaches a no-op mask anyway, keeping the unfiltered code path exercised even when the optimization would normally elide it. A package-level `suffixMaskSkipBypassDisabled` toggle lets focused tests disable the randomness for deterministic assertions. Tests: - `cockroachkvs.TestSuffixRangeIntersectsTable`: synthesizes aggregates and verifies intersection at boundary cases. - `TestDeleteSuffixRangeSkipsNonOverlappingFiles`: builds files with disjoint wall-time ranges and asserts only the matching file receives a mask (with the bypass disabled). - `TestDeleteSuffixRangeSkipBypassExercisesPath`: statistically verifies the 25% bypass eventually fires and that the resulting no-op mask doesn't change visible scan output.

`DeleteSuffixRange` attaches per-file `SuffixMasks` to a vsst's `TableMetadata` to hide rows whose suffix falls in a configured range. Masks are a read-time presentation of the file (analogous to `SyntheticSuffix`), not a property of the backing bytes — the destination of a `Replicate` must reconstruct the same masks on its vsst to observe the source's view. Previously, the source-side emission paths (`SharedSSTMeta.cloneFromFileMeta` and the `ExternalFile` builder in `ScanInternal`) dropped masks, and the destination-side ingest paths (`ingestSynthesizeShared`, `ingestLoad1External`) had nowhere to receive them. A shared/external Replicate of a masked vsst silently un-hid the masked rows on the destination — a real correctness gap that the metamorphic test worked around with a test-side punt (`t.dsrInOps`) disabling shared/external Replicate when DSR was in the op stream. This change adds `SuffixMasks []sstable.SuffixMask` to both `SharedSSTMeta` and `ExternalFile`, deep-clones the source's mask list on emission, and propagates it onto the destination's `TableMetadata` on ingest. The destination's `IngestExternalFiles`/`IngestAndExcise` paths gain an FMV gate refusing non-empty `SuffixMasks` when the destination is below `FormatSuffixMask` — silent loss of masking would be a correctness regression rather than a missing feature. A new test (`TestExternalFileSuffixMasksRoundTrip`) exercises the full emit-then-ingest cycle: source DB ingests + applies DSR, `ScanInternal` emits an `ExternalFile` carrying the mask, destination DB ingests via `IngestAndExcise` and reproduces the source's hidden/visible pattern. A companion test (`TestExternalFileSuffixMasksFMVGate`) verifies the FMV refusal. Note: no CockroachDB caller currently sets these fields. The feature is added because it conceptually completes the DSR API surface (a vsst property must survive cross-instance Replicate to preserve the user's view) and because it lets the metamorphic test stop punting on shared/external Replicate when DSR is present — see the follow-up commit that removes the `dsrInOps` gate.

Wires `DeleteSuffixRange` into the metamorphic harness as `deleteSuffixRangeOp`: parser/formatter, generator, key-manager update hook, and execution. The op is enabled at weight 5 in `DefaultOpConfig`, `ReadOpConfig`, and `WriteOpConfig` — typical parity with the other span-shaped DB ops. DSR requires `FormatSuffixMask` to execute. Configs in the metamorphic test vary in their starting FMV, so the generator emits a paired `RatchetFormatMajorVersion(FormatSuffixMask)` op immediately before each DSR op. Without the ratchet, configs at lower FMVs would error on DSR while configs at higher FMVs would succeed; the cross-config compare would then trip on the diverging histories. The ratchet is a no-op on configs already at or above `FormatSuffixMask`. A later commit broadens this to ratchet every DB instance (not just the DSR target) once `SuffixMasks` can flow across DBs via Replicate. DSR mutates the LSM non-additively (attaches per-file masks). Classic `Snapshot` observers cannot pin a pre-DSR `*Version` and so are unsafe in DSR's presence, similar to the documented hazard around excise. The `dsrInOps` flag (set in `test.go` if the op stream contains any DSR) is consumed by `newSnapshotOp.run` to upgrade every snapshot to an EFOS — EFOSes can transition to file-only before DSR runs (via the EFOS-protected-range overlap handling in `DB.DeleteSuffixRange`). `keyManager.update` treats DSR as a no-op for the per-key state it tracks: DSR hides at iteration time, not at the physical-record level, so the SingleDelete eligibility classification (which is about physical SET / DELETE records) is unaffected by which keys DSR has masked from view. Tests: - `delete_suffix_range_test.go`: parser format roundtrip, generator invariants, and a focused execution test exercising basic DSR semantics through the metamorphic op harness. - `testdata/parser`: format/parse roundtrip cases.

The metamorphic test previously masked the shared/external Replicate + DSR cross product two ways: 1. `replicateOp.run` had a `t.dsrInOps` gate that forced the iter-based replicate fallback whenever any `DeleteSuffixRange` op appeared in the stream. The iter-based path filters source-masked rows at iter time, so masks were effectively baked into the shipped data and the destination never had to know about masks. This worked but meant the production shared/external `Replicate` paths were never exercised when DSR was in play. 2. `dbDeleteSuffixRange` emitted a `RatchetFormatMajorVersion(FormatSuffixMask)` op only for the DB that DSR runs on. In a two-instance test, DSR on `db1` ratched only `db1` while `db2` stayed at its config's starting FMV. A subsequent `Replicate` from `db1` (with masks) to `db2` (at FMV < `FormatSuffixMask`) was unreachable behind the gate above. With `SuffixMasks` now plumbed through `ExternalFile`/`SharedSSTMeta` (see prior commit), the workaround in (1) is no longer needed. This change removes the `t.dsrInOps` gate so the shared and external replicate paths actually run when DSR is in the op stream — surfacing any future bugs in the wire-format propagation. The dependency in (2) becomes load-bearing: now that the destination DB's ingest enforces an FMV gate on `SuffixMasks`, a `Replicate` to a non-ratched destination would error. `dbDeleteSuffixRange` is updated to ratchet every DB in `g.dbs` before the DSR. The ratchet is a no-op on configs already at `FormatSuffixMask` or above, and the change is necessary regardless of how many DB instances the test happens to create — masks may flow to any DB via `Replicate`. The `t.dsrInOps` variable itself is retained because the EFOS-forcing logic in `newSnapshotOp` still uses it (DSR mutates the LSM non-additively, so any snapshot taken in a stream containing DSR must be an EFOS to have any chance of pinning a pre-DSR view).

Adds a `useScanDeleteForDSR` test option, randomized per config at 50% frequency, that routes each `deleteSuffixRangeOp` through an explicit scan-and-delete equivalent instead of calling `DB.DeleteSuffixRange`. The recorded op history is identical in both modes (the op records as a DSR regardless), so the metamorphic harness's existing cross-config compare exercises the implementation equivalence as a property: any divergence in subsequent reads catches a bug in either DSR or in the scan-and-delete baseline. The equivalent path scans the span with `IterKeyTypePointsAndRanges`, issues `Delete(key)` per visible point key whose suffix lies in `[lower, upper)` per `ComparePointSuffixes`, and `RangeKeyUnset(start, end, suffix)` per visible range-key entry with a matching non-empty suffix. Empty-suffix entries and suffixless point keys are left alone, matching the documented DSR contract. Equivalence is at the observable-read level, not the LSM-state level: real DSR attaches per-file metadata; the scan-and-delete path writes physical tombstones at a new seqnum. Both implementations should hide exactly the same set of rows from every subsequent read; that's what the cross-config compare verifies. Some implementation-level differences that intentionally do NOT cause divergence (because they don't surface in op results): - LSM file layout: DSR's masked files vs scan+delete's tombstoned rows compact differently. Compaction op results are recorded as success/failure, not LSM contents. - Compaction cancellation: real DSR cancels overlapping in-progress compactions; scan+delete doesn't. Visible behavior agrees because cancelled compactions retry. - Snapshot semantics: a snapshot taken before either DSR or scan+delete sees the pre-mutation state in both. A snapshot taken after either should see the post-mutation state (modulo the EFOS file-only-transition machinery, which DSR handles explicitly and scan+delete gets for free because point tombstones don't perturb snapshots). The equivalent path is approximate for `SyntheticSuffix` files: scan returns synth-substituted keys, so `Delete(iter.Key())` deletes by the substituted user key. DSR with synth-in-mask excises the whole file's overlap. Observable outcome agrees (all matching rows are gone) but the LSM state diverges substantially. This has not produced spurious divergences across the runs exercised so far (11 iterations across TestMeta + TestMetaCockroachKVs, ~50/50 split on the toggle, all clean), but it's the most likely source of any future flake.

dt · 2026-06-01T21:59:15Z

+}
+
+// randNewPCG returns a closure-based PCG-style RNG. We avoid the
+// math/rand/v2 import to keep this test file's import surface minimal —


why are you like this claude

dt · 2026-06-01T22:02:38Z

+// method's correctness must match the oracle row-for-row; the test would
+// catch any positioning method that forgets to apply the mask check, any
+// boundary-semantics divergence, and any obsolete×mask interaction bug.
+func TestBlockIterSuffixMaskOracle(t *testing.T) {


let's clarify this name, e.g. "TestRandomBlockIterSuffixMaskAgainstNaiveFilter" -- we can use the comment to spell out that we'll push a randomly generated test-case through suffix masking and through a naive filter that serves as our "oracle" against which to compare the mask result.

Second question: How is this getting us that much coverage though if the seed is hard-coded?

dt requested review from RaduBerinde and sumeerbhola May 13, 2026 16:31

dt requested a review from a team as a code owner May 13, 2026 16:31

dt changed the title ~~Suffix upper bound~~ @dt db: add ApplySuffixMask API May 13, 2026

dt changed the title ~~@dt db: add ApplySuffixMask API~~ db: add ApplySuffixMask API May 13, 2026

dt force-pushed the suffix-upper-bound branch from 0514fde to 64b43c1 Compare May 13, 2026 17:38

dt force-pushed the suffix-upper-bound branch from 64b43c1 to 96d83fc Compare May 13, 2026 17:41

github-actions Bot added the O-AI-Review-Potential-Issue-Detected label May 13, 2026

dt force-pushed the suffix-upper-bound branch from 96d83fc to c608ef7 Compare May 13, 2026 17:59

dt changed the title ~~db: add ApplySuffixMask API~~ db: add DeleteSuffixRange API May 13, 2026

dt force-pushed the suffix-upper-bound branch from c608ef7 to 5956b42 Compare May 15, 2026 13:28

sumeerbhola reviewed May 15, 2026

View reviewed changes

dt added the O-AI-Review-Not-Helpful label May 15, 2026

dt force-pushed the suffix-upper-bound branch from 5956b42 to 2c77df3 Compare May 19, 2026 03:30

dt force-pushed the suffix-upper-bound branch from 2c77df3 to c40a55a Compare May 19, 2026 13:59

dt force-pushed the suffix-upper-bound branch from c40a55a to 9d66978 Compare May 19, 2026 14:21

dt force-pushed the suffix-upper-bound branch from 9d66978 to df13c90 Compare May 19, 2026 14:57

dt force-pushed the suffix-upper-bound branch from df13c90 to 51891bb Compare May 19, 2026 15:08

RaduBerinde reviewed May 19, 2026

View reviewed changes

dt force-pushed the suffix-upper-bound branch 4 times, most recently from aa5b6ed to c3d718e Compare May 24, 2026 17:58

dt added 16 commits May 24, 2026 18:52

dt force-pushed the suffix-upper-bound branch from c3d718e to cb65f01 Compare May 24, 2026 18:58

dt commented Jun 1, 2026

View reviewed changes

Conversation

dt commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented May 13, 2026

Uh oh!

RaduBerinde commented May 13, 2026

Uh oh!

dt commented May 13, 2026

Uh oh!

dt commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 13, 2026

Potential Bug(s) Detected

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 15, 2026

Potential Bug(s) Detected

Uh oh!

dt commented May 15, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Potential Bug(s) Detected

Uh oh!

github-actions Bot commented May 19, 2026

Potential Bug(s) Detected

Uh oh!

github-actions Bot commented May 19, 2026

Potential Bug(s) Detected

Uh oh!

github-actions Bot commented May 19, 2026

Potential Bug(s) Detected

Uh oh!

github-actions Bot commented May 19, 2026

Potential Bug(s) Detected

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

dt Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

dt Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

dt commented May 13, 2026 •

edited

Loading

dt commented May 13, 2026 •

edited

Loading