Skip to content

db: add DeleteSuffixRange API#6046

Open
dt wants to merge 16 commits into
cockroachdb:masterfrom
dt:suffix-upper-bound
Open

db: add DeleteSuffixRange API#6046
dt wants to merge 16 commits into
cockroachdb:masterfrom
dt:suffix-upper-bound

Conversation

@dt

@dt dt commented May 13, 2026

Copy link
Copy Markdown
Contributor

DB.DeleteSuffixRange(ctx, span, lower, upper) deletes all point keys and range key entries within a key span whose suffixes fall within the range (lower, upper], as determined by the key schema. Affected keys are hidden from all future iterators and eventually dropped during compaction.

Internally, the deletion is implemented by attaching a suffix mask to the metadata of overlapping SSTables. If the memtable overlaps the span, it is flushed first. Fully-contained SSTs are replaced with virtual tables that have the mask set. Straddling SSTs are split into virtual tables at the span boundary so the mask applies only to the portion within the span.

Also adds DB.FlushIfOverlapping(span) which flushes the memtable only if any flushable in the queue possibly overlaps the given key range.

@dt dt requested review from RaduBerinde and sumeerbhola May 13, 2026 16:31
@dt dt requested a review from a team as a code owner May 13, 2026 16:31
@cockroach-teamcity

Copy link
Copy Markdown
Member

This change is Reviewable

@dt dt changed the title Suffix upper bound @dt db: add ApplySuffixMask API May 13, 2026
@dt dt changed the title @dt db: add ApplySuffixMask API db: add ApplySuffixMask API May 13, 2026
@RaduBerinde

Copy link
Copy Markdown
Member

The API is equivalent to deleting all existing keys within the span with suffix between A and B, correct? We should name and define the API in terms of the functionality, not how it's implemented (e.g. we could also choose to implement via a new special range del/key).

@sumeerbhola what do you think about the implementation strategy? The high-level operation is closer to what we use range keys for, though I can't see a big problem with this method (given that all our snapshots are EFOS now). In fact, would it make things a lot simpler if we used the same idea to provide the functionality we currently use range keys for?

@dt

dt commented May 13, 2026

Copy link
Copy Markdown
Contributor Author

deleting all existing keys within the span with suffix between A and B

Correct. Will do.

if we used the same idea to provide the functionality we currently use range keys for

I had a similar thought. For pebble rangedels I'd argue we already sorta did in fact: excise is basically the same idea of doing this by file in range and splitting straddlers using vtables to act on only the contained halves, but excise then gets to just drop the in-span tables entirely rather than needing to attach metadata to them, but in some sense, dropped is metadata.

CRDB's MVCC deletion rangekeys though I think are legitimately more like "values" that the caller has set and expects to read back, since they need be "visible" as a value in iterators over any part of the span, eg for backups or rangefeeds, so keeping them as values that get put in blocks in files as opposed to file metadata doesn't feel too wrong.

@dt dt force-pushed the suffix-upper-bound branch from 0514fde to 64b43c1 Compare May 13, 2026 17:38
@dt

dt commented May 13, 2026

Copy link
Copy Markdown
Contributor Author

We should name and define the API in terms of the functionality

DeleteBySuffix() ? DeleteSuffixRange ?

@dt dt force-pushed the suffix-upper-bound branch from 64b43c1 to 96d83fc Compare May 13, 2026 17:41
@github-actions

Copy link
Copy Markdown

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@dt dt force-pushed the suffix-upper-bound branch from 96d83fc to c608ef7 Compare May 13, 2026 17:59
@dt dt changed the title db: add ApplySuffixMask API db: add DeleteSuffixRange API May 13, 2026
@dt dt force-pushed the suffix-upper-bound branch from c608ef7 to 5956b42 Compare May 15, 2026 13:28

@sumeerbhola sumeerbhola left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The high-level operation is closer to what we use range keys for, though I can't see a big problem with this method (given that all our snapshots are EFOS now). In fact, would it make things a lot simpler if we used the same idea to provide the functionality we currently use range keys for?

I haven't look at the implementation yet, but this is a destructive operation, yes? In that the LSM is immediately altered. Which is different from MVCC range keys that someone can read below until MVCC GC runs.

@sumeerbhola made 1 comment.
Reviewable status: 0 of 19 files reviewed, all discussions resolved (waiting on RaduBerinde).

@github-actions

Copy link
Copy Markdown

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@dt

dt commented May 15, 2026

Copy link
Copy Markdown
Contributor Author

this is a destructive operation, yes? In that the LSM is immediately altered

Yes, correct, it is not an mvcc write and feels distinct from writing an mvcc rangekey.

I do think this is similar to how we have replaced/are replacing pebble rangedels / crdb's ClearRange with Excise though where we clip SSTs down to the parts we do and don't care about via virtual ssts and then act on the parts we do care about via manifest edits instead of by putting new values in blocks in new files.

@github-actions

Copy link
Copy Markdown

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@dt dt force-pushed the suffix-upper-bound branch from 2c77df3 to c40a55a Compare May 19, 2026 13:59
@github-actions

Copy link
Copy Markdown

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@dt dt force-pushed the suffix-upper-bound branch from c40a55a to 9d66978 Compare May 19, 2026 14:21
@github-actions

Copy link
Copy Markdown

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@dt dt force-pushed the suffix-upper-bound branch from 9d66978 to df13c90 Compare May 19, 2026 14:57
@github-actions

Copy link
Copy Markdown

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@dt dt force-pushed the suffix-upper-bound branch from df13c90 to 51891bb Compare May 19, 2026 15:08
@github-actions

Copy link
Copy Markdown

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@RaduBerinde RaduBerinde left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach looks good to me overall.

We need to add metamorphic test support for the new operation. I would add a randomized configuration flag as to whether we actually use the operation, or we do an equivalent scan-and-delete operation (resulting in a very powerful cross-check between configurations).

Some of these commits have glaring test lapses in the relevant package, and I don't think you would have skipped those if you weren't using AI. Please review the changes and make sure they are up to your standards beforehand. Even just asking Claude to review the commits one more time would improve things a lot, for example here's what it had to say about the row-level SuffixMask filtering in DataBlockIter commit: https://gist.github.com/RaduBerinde/9ecd4cec116843768e77d76aa22e62ed

@RaduBerinde made 12 comments.
Reviewable status: 0 of 21 files reviewed, 11 unresolved discussions (waiting on dt).


db.go line 2043 at r10 (raw file):

// possibly overlaps with the given key range. If no flushable overlaps, this
// is a no-op.
func (d *DB) FlushIfOverlapping(span KeyRange) error {

[nit] Add an explanation of why someone might want to call this.


internal/manifest/table_metadata.go line 909 at r11 (raw file):

	}
	if m.SuffixMask.IsSet() {
		fmt.Fprintf(&b, " suffix-mask-lower:%x suffix-mask-upper:%x", m.SuffixMask.Lower, m.SuffixMask.Upper)

[nit] suffix-mask: [%s-%s) and pass `format(Lower), format(Upper)?

We should update ParseTableMetadataDebug to parse this to keep it consistent (might help with testing).


internal/manifest/version_edit.go line 95 at r11 (raw file):

	// customTagBlobReferences2 contains BackingValueSize for each BlobReference.
	customTagBlobReferences2 = 70
	// customTagSuffixMask encodes the lower and upper bounds of the suffix

We need a new format major version.


internal/manifest/version_edit.go line 501 at r11 (raw file):

						}

					case customTagSuffixMask:

This need some test coverage in version_edit_test.go


sstable/blockiter/transforms.go line 15 at r11 (raw file):

)

// SuffixMask defines a range of suffixes to mask during iteration. Lower and

Explain how this interacts with synthetic suffix.


sstable/blockiter/transforms.go line 16 at r11 (raw file):

// SuffixMask defines a range of suffixes to mask during iteration. Lower and
// Upper are in ascending suffix magnitude (e.g. wall-time order for MVCC

Pebble code is agnostic to how things are sorted in the comparer. We should swap these fields, Lower should be <= Upper according to the comparer, and the masked range should be [Lower, Upper).


sstable/blockiter/transforms.go line 19 at r11 (raw file):

// timestamps), NOT in ComparePointSuffixes order. Keys whose suffix is
// strictly greater than Lower and at most Upper (in suffix magnitude) are
// hidden. Suffixless keys are never hidden. Both bounds must be set.

Leave a TODO here to do what we did for SyntheticPrefixAndSuffix to reduce the size of the TableMetadata.


sstable/colblk/data_block.go line 195 at r12 (raw file):

}

// SuffixMaskChecker is an optional interface that KeySeeker

In my experience, this pattern ends up being regretted most of the time. It doesn't allow neat wrapping, and is very subtle to someone reading the code for the first time. I think any KeySeeker should implement this anyway. Definitely we want it in the defaulte key schema for easy testing.


sstable/colblk/data_block.go line 1597 at r12 (raw file):

		}
	}
	if i.transforms.SuffixMask.IsSet() {

I don't see any tests for these changes..


sstable/rowblk/rowblk_iter.go line 501 at r13 (raw file):

			i.ikv.K.SetSeqNum(base.SeqNum(n))
		}
		if !hiddenPoint && i.transforms.SuffixMask.IsSet() && i.suffixCmp != nil {

No tests?


suffix_mask_test.go line 1 at r14 (raw file):

// Copyright 2026 The LevelDB-Go and Pebble Authors. All rights reserved. Use

Much of the test code in this file is very dense and inscrutable. Please make a pass over it and see how much of it is necessary vs how much of it belongs in a lower level. I'd consider replacing many of these with a single randomized test that uses a simple crosscheck (e.g. two stores, one in which we just Scan and Delete the relevant keys).

@dt dt force-pushed the suffix-upper-bound branch 4 times, most recently from aa5b6ed to c3d718e Compare May 24, 2026 17:58
dt added 16 commits May 24, 2026 18:52
`looseLeftTableBounds` and `looseRightTableBounds` synthesize loose
bounds for the post-excise virtual tables when the caller chose not to
consult the original SST for tight bounds. Two latent issues:

1. Both helpers unconditionally extended bounds for each key type
present on the original table, even when all keys of that type fell
entirely outside the excised portion. The synthesized largest (left
table, clamped at `exciseSpanStart`) or smallest (right table, clamped
at `exciseSpanEnd`) could then end up on the wrong side of the
original's preserved bound, producing a table where `smallest >
largest` and failing `TableMetadata.Validate`. This change gates each
key-type contribution on whether the original actually extends past
the excise boundary on the relevant side.

2. `looseRightTableBounds` copied the original's smallest internal key
verbatim, which carries the original's trailer (often a tiny
`seqnum=0, kind=DELSIZED`). A real range tombstone in the file at the
same user key with a larger trailer then violated the rangeDelIter's
lower-bound assertion in invariants builds. Since the bound is loose
either way, synthesize the smallest with the maximum non-sentinel
trailer (`SeqNumMax-1`, `InternalKeyKindMaxForSSTable`); strictly safer.

Testdata in `testdata/excise` and `testdata/excise_bounds` is updated to
reflect the new synthesized trailer in the right-table-loose case.
`determineExcisedTableBlobReferences` scales each blob reference's
`ValueSize` proportionally from the original table to the excised
table. The scaling factor (`excisedTable.Size / originalSize`) is
normally <= 1, but size-estimate noise can produce a factor > 1 — for
example when `excisedTable.Size` is the loose estimate from
`determineExcisedTableSize` and exceeds the (also-estimated)
`originalSize`. In that case the scaled `ValueSize` exceeds the
physical blob file's `ValueSize`, tripping the invariant check in
`MakeBlobReference` on manifest replay.

Cap the scaled value at the original reference's `ValueSize`. The
excised table is a sub-range of the original; its share of any
referenced blob file cannot legitimately exceed the original's share,
so the cap is a strictly-tighter bound rather than a behavior change.
Adds a private `flushIfOverlapping(span, extraBoundsLocked)` helper for
features that mutate file metadata in a key range and need any data
still in the memtable for that range to be moved to SSTs first.

`extraBoundsLocked`, if non-nil, is invoked with `d.mu` held and may
read from `DB.mu`-protected state to produce additional overlap-check
bounds — for example, a caller can extend the overlap check with
snapshot-protected ranges so that a snapshot reader's view of the LSM
is preserved across the metadata mutation. The callback returns
internal `bounded` values, so the helper is package-private.

The helper itself has no knowledge of its callers' semantics; the
callback is the seam. Foundation for `DeleteSuffixRange` (queued in
this branch) and the planned cloning-with-prefix-replacements feature,
both of which act on file metadata and need a memtable flush as a
prerequisite. Each will pass its own `extraBoundsLocked` callback as
appropriate.

Tests in `flush_test.go` cover the `nil` callback path (basic span-only
overlap check against point keys, range deletes, range keys, and
boundary semantics), the non-nil callback path (extra bounds trigger a
flush when the span itself wouldn't), and error paths (closed DB,
read-only, invalid `KeyRange`).
Add `SuffixMask` (a `[Lower, Upper)` range over the comparer's suffix
ordering) and an ordered list `SuffixMasks` to `blockiter.Transforms`,
`blockiter.FragmentTransforms`, and `manifest.TableMetadata`. Keys
whose suffix falls in any of the listed ranges are hidden during
iteration; keys with no suffix are never hidden.

The list shape lets repeated `DeleteSuffixRange` calls accumulate per
file: contiguous ranges merge in place via `expandSuffixMask`, disjoint
ranges append. The list and the byte slices it holds are immutable by
convention — any mutator clones first. This is documented on each
field.

`Lower`/`Upper` refer to the comparer's ordering, not wall-time
magnitude. For a comparer like CockroachDB's that sorts newest-first,
`Lower` is the newest timestamp (inclusive) and `Upper` is the oldest
(exclusive).

Encoding: a single new `customTagSuffixMask` (tag 71) on the existing
`NewFile4` record carries one (Lower, Upper) pair. The encoder emits
one tag per list entry; the decoder appends each occurrence. No
on-wire structure change beyond the new tag. The new `FormatSuffixMask`
format major version gates writers from emitting the tag at all.

`DebugString` / `ParseTableMetadataDebug` render each entry as
`suffixmask:[loHex-upHex)`; the field is one token because `-` is a
debug-parser separator.

Tests: `TestVersionEditRoundTripSuffixMask` covers a hand-picked set
plus a randomized roundtrip generating 0..N masks per file (catches
length-prefix arithmetic and asymmetric bound sizes). A `suffixmask`
case is added to `TestTableMetadata_ParseRoundTrip`.
Wires the runtime portion of `SuffixMask`: a data-block iterator
(row-based or columnar) configured with `Transforms.SuffixMasks`
hides every row whose effective suffix falls within any mask's
`[Lower, Upper)` range, as ordered by `ComparePointSuffixes`. Rows
with an empty suffix are never hidden, matching the documented DSR
contract.

Three components:

1. Per-row check in both data-block iterators. The columnar path
   (`colblk/data_block.go::isSuffixMasked`) delegates to a new
   `KeySeeker.IsMaskedBySuffixMask` method; the row-based path
   (`rowblk/rowblk_iter.go`) inlines the comparison at four
   positioning sites in lockstep with the file's existing manually-
   inlined hot-path style. Both correctly handle `SyntheticSuffix`:
   the effective suffix of a non-empty stored suffix is the synthetic
   one, so the mask check operates against the substituted suffix.

2. Interleaving with `HideObsoletePoints`. `skipHiddenForward`/
   `skipHiddenBackward` (colblk) advance past mask-hidden rows while
   also respecting the obsolete-point predicate, with the per-call
   `nextObsoletePoint` cursor reinitialized on mask-induced jumps so
   the two filters compose correctly.

3. IndexIter must NOT apply `SuffixMasks`. The row-based index iter
   (`rowblk/rowblk_index_iter.go`) wraps `rowblk.Iter`; without this
   fix it forwarded `Transforms.SuffixMasks` to the underlying iter,
   which then evaluated the mask against the index block's separator
   keys and silently dropped data blocks whose separator's suffix
   happened to fall in the masked range. `Init`/`InitHandle` now
   clear `SuffixMasks` before delegating. `TestIndexIterIgnoresSuffixMask`
   is the regression test.

A nil-check is added in `rowblk_iter.go::SeekLT`'s synthetic-suffix
correction branch: `First()` can now return nil if every row in the
block is hidden (by the mask or by `HideObsoletePoints`), and the
prior code unconditionally dereferenced the returned key.

Init-time assertions in `Iter.Init`/`Iter.InitHandle` require a
non-nil `ComparePointSuffixes` whenever `SuffixMasks` is non-empty —
silently leaking masked keys due to a missing comparer would be a
correctness regression rather than a missing feature.

Oracle tests in `data_block_test.go::TestDataBlockIterSuffixMaskOracle`
and `rowblk_iter_test.go::TestBlockIterSuffixMaskOracle` cross-product
mask × obsolete × all positioning methods; cockroachkvs gains a fast-
path `IsMaskedBySuffixMask` that inline-decodes MVCC suffix bounds.
Implement SuffixMask filtering for range key and range deletion
fragment iterators in both columnar (`keyspan.go`) and row-oriented
(`rowblk_fragment_iter.go`) paths. Entries whose suffix falls within
`[Lower, Upper)` are removed from spans during iteration. If all
entries in a span are masked, the span is skipped entirely.

`keyspanIter.init` now takes the suffix comparer alongside the key
comparer (previously it was assigned separately by `NewKeyspanIter`).
Both `keyspanIter.init` and `rowblk.NewFragmentIter` fail loudly when a
`SuffixMask` is configured without a non-nil `ComparePointSuffixes`;
this prevents the mask from silently leaking range key entries that
should be hidden.

Add `suffix-mask-{lower,upper}` directives to `TestKeyspanBlock`'s
data-driven harness, plus testdata cases that exercise:
  - partial mask filtering within a span,
  - fully-masked-span skip in both `Next`/`Prev` directions,
  - `SeekGE` / `SeekLT` landing inside a fully-masked span,
  - `RANGEKEYDEL` (suffixless) preservation alongside masked
    `RANGEKEYSET` entries,
  - all-keys-masked across every span.

Add the parallel multi-span / mixed-kind cases to the row-oriented
`testdata/rowblk_fragment_iter`.
Adds `DB.DeleteSuffixRange(ctx, span, lower, upper)`, which hides every
point key and range-key entry within `span` whose suffix falls in
`[lower, upper)` per `ComparePointSuffixes`. The masked entries are
invisible to all subsequent iterators and are physically dropped during
compaction. The operation is metadata-only at call time (no row-level
writes); per-row filtering is applied at iteration via the
`Transforms.SuffixMasks` machinery added in prior commits.

This commit lands the core shape: a public API that, for every
overlapping SST, attaches the new mask to the (possibly virtual)
inside portion of the file. Files fully contained in the span gain a
mask directly; straddling files are split via `exciseTable` into up
to three virtual tables so the mask applies only to the inside one.
Repeated calls on the same file accumulate masks in a per-table list
(`expandSuffixMask` merges contiguous ranges; disjoint ranges append).

A new comment in `file_cache.go::newPointIter` documents why no block-
or file-level property filter is installed for `SuffixMask` (the
collector doesn't currently track suffixless-key presence, so a filter
could incorrectly skip blocks containing unfiltered suffixless keys).
A later commit adds a file-level skip optimization gated on a caller-
supplied per-file aggregate check.

Several follow-up layers are deliberately deferred to subsequent
commits so each can be reviewed in isolation as an enhancement to the
core:

  - `SyntheticSuffix` file-level excise optimization (the file's
    point keys all share an effective suffix, so the per-row path can
    be short-circuited to an excise when the synth suffix is in mask).
  - EFOS-protected-range overlap handling (DSR must transition any
    pre-file-only EFOS that overlaps the span before attaching its
    mask, otherwise the EFOS observes a post-DSR view).
  - In-progress compaction cancellation (a compaction picked against
    the pre-DSR version may hit a manifest blob-ref assertion when it
    tries to delete a table DSR has replaced with a virtual one).
  - Copy-compaction `SuffixMasks` propagation (copy-compactions don't
    iterate through the per-row filter, so the mask must ride along
    on the destination metadata).
  - Block-property-aggregate file-level skip optimization.
  - `SuffixMasks` on `ExternalFile`/`SharedSSTMeta` so DSR-attached
    masks survive shared/external `Replicate`.

Tests in `suffix_mask_*_test.go` cover validation, the
`expandSuffixMask` semantics across same-length and variable-length
suffix encodings, the per-row filter machinery via SSTable-direct
iteration (rowblk + colblk), oracle correctness against a user-space
implementation of the same predicate, and the straddling-file
splitting via `exciseTable`. The data-driven test disables automatic
compactions to keep test sequencing deterministic; later commits add
explicit cancellation regression tests with compactions enabled.

`excise.go` propagates `SuffixMasks` onto the left and right virtual
tables produced by `exciseTable` so that masks attached to a file
survive that file being split — required for both the DSR straddling
path here and any subsequent `Excise` on a DSR-masked file.
A file with a non-empty `SyntheticSuffix` substitutes the synth in for
every point key's effective suffix at iteration time. `RangeKeySet`
entries with non-empty original suffix get the same substitution;
entries with empty original suffix retain empty (per the
`SyntheticSuffix` contract — empty-suffix `RangeKeySet` entries are
never masked, matching the broader DSR contract for suffixless keys).
`RangeKeyDelete` entries have no per-key suffix and are also never
masked. `RangeKeyUnset` is not allowed on a `SyntheticSuffix` file at
all (see the assertion in `rowblk_fragment_iter.go::applySpanTransforms`).

The mask answer for such a file is therefore uniform across nearly
every row: every non-empty stored suffix maps to the same effective
suffix. Three cases:

  1. synth NOT in mask: no row in the file matches the mask. Skip the
     file entirely — no version edit, no mask attachment.
  2. synth IN mask AND no range keys: every point key is uniformly
     masked. Excise the file's overlap; a per-row mask would produce
     the same observable behavior but at iteration-time cost. The new
     `exciseOverlap` helper handles the fully-contained vs straddling
     cases.
  3. synth IN mask AND has range keys: cannot safely excise — excising
     would drop `RangeKeyDelete` entries and any empty-suffix
     `RangeKeySet` entries (both never-masked per the DSR contract).
     Fall through to per-row mask attachment, which honors empty
     suffixes via the existing filter machinery.

The optimization is an enhancement on top of the core DSR API added
in the prior commit: that commit's per-row attachment is correct
(and the Path B per-row check in `colblk,cockroachkvs,rowblk` is
synth-suffix-aware), so this commit could land later as a pure perf
improvement. It lands here because it's tightly coupled to the
documented file-level decision tree and is small enough to merit
inclusion alongside the API.

The conservative fall-through on `HasRangeKeys` is wider than
strictly necessary — a synth-suffix file with range keys but no
suffixless entries could still be excised — but distinguishing those
cases requires `TableMetadata` bits that don't exist yet. A TODO
records the optimization opportunity.

`TestDeleteSuffixRangeExciseOverlap` covers the case-2 excise path
end-to-end; `TestDeleteSuffixRangeSyntheticSuffixWithEmptyRangeKey`
in `suffix_mask_synth_rangekey_test.go` regression-tests case 3,
verifying that an empty-suffix `RangeKeySet` entry on a synth-suffix
file in the mask range remains visible after DSR (i.e., we fell
through to per-row instead of excising).
A pre-file-only `EventuallyFileOnlySnapshot` (EFOS) reads via
`snapshotIterOpts{seqNum: es.seqNum}` — the live LSM filtered by
seqnum, NOT a pinned `*Version`. The seqnum filter is not a defense
against metadata changes like mask attachment (those don't bump
seqnums), so a pre-file-only EFOS whose protected range overlaps a
DSR span would observe our mask immediately, violating the EFOS
contract that protected-range reads remain consistent across
non-additive LSM mutations.

The fix has two parts:

1. Extend `flushIfOverlapping`'s overlap check with the protected
   ranges of any pre-file-only EFOS whose protected ranges overlap
   our span. This forces a flush of any EFOS-overlapping memtable
   content, and the flush completion's
   `maybeTransitionSnapshotsToFileOnlyLocked` transitions the EFOS
   to file-only — pinning a pre-DSR `*Version` — before we attach
   the mask.

2. Inside DSR's `UpdateVersionLocked` updateFn, call
   `maybeTransitionSnapshotsToFileOnlyLocked` explicitly. The
   flush path covers EFOSes that needed a flush to unblock the
   transition; this explicit call covers EFOSes that didn't need a
   flush (no blocking memtable content) but still need to transition
   before our VE attaches the mask. Without the explicit call, such
   an EFOS would remain pre-file-only past our `UpdateVersionLocked`
   and transition later (triggered by an unrelated flush), at which
   point the version it pins already includes our mask.

Both parts together close the EFOS-vs-DSR race the metamorphic test
surfaced as cross-config divergence (one config's EFOS transitioned
before our mask, another after, producing different read results
for the same op stream).

Classic `Snapshot` observers cannot be protected the same way (no
protected ranges, no transition mechanism); their post-DSR reads
observe the masked LSM, matching the documented
excise-with-classic-snapshot behavior.

A deterministic single-process unit test for this race is awkward —
it requires interleaving an EFOS-blocking flush with a DSR's version
edit application, which depends on goroutine scheduling. Reverting
either change in isolation causes the metamorphic
`TestMetaCockroachKVs` to flake at a ~37% rate within ~3 iterations;
both together produce ≥120 consecutive clean runs across all
metamorphic variants. The metamorphic coverage is the regression
test.
A compaction picked against the pre-DSR version holds a
`*TableMetadata` pointer for each input file. When that compaction
later applies its version edit, the manifest's blob-reference
tracker (keyed by pointer in `internal/manifest/blob_metadata.go`)
looks up each deleted table's blob references. If DSR has, in the
meantime, replaced the input file with a virtual table — a different
`*TableMetadata` pointer — the tracker no longer has an entry for
the original pointer and fires:

    pebble: deleted table NNNN's reference to blob file BNNN not known

Fix: inside DSR's `UpdateVersionLocked` updateFn (which holds the
manifest log lock), iterate `d.mu.compact.inProgress` and call
`Cancel()` on any compaction whose `Bounds()` overlap the DSR span.
The cancelled compaction's own apply step checks `c.cancel` under
the same log lock and returns `ErrCancelledCompaction` instead of
applying its conflicting version edit. The retry loop in `Compact`
re-picks against the post-DSR version.

An earlier shape waited for `compactingCount == 0` before entering
`UpdateVersionLocked`. That's insufficient because
`UpdateVersionLocked`'s `logLock` may release `DB.mu` while waiting
for another writer to finish, during which a new compaction may be
scheduled (incrementing `compactingCount`). The cancellation
pattern is correct and mirrors what `IngestAndExcise` does.

This commit also adds a test-only hook
`Options.private.testingDuringCompactionIOFunc`, invoked from a
default table compaction's I/O phase (after the compaction has
written its outputs and constructed its `VersionEdit`, but before
re-acquiring `DB.mu` to apply it). The hook provides a deterministic
window for tests to inject a DSR call that races with the in-progress
compaction.

`TestDeleteSuffixRangeCancelsOverlappingCompaction` uses the hook to
pause a compaction mid-I/O, runs DSR over the same span, and verifies
that the compaction's apply observes the cancel cleanly (no manifest
assertion fires). Without the cancel logic in this commit, the test
deadlocks via the original `compactingCount` wait OR (with the wait
removed) trips the `deleted table reference to blob file not known`
assertion, depending on which intermediate shape is being exercised.
`runCopyCompaction` produces an output table whose backing is a
byte-for-byte copy of an input table; it does NOT iterate over the
input through the per-row `SuffixMask` filter. If the input has
`SuffixMasks` attached, those masks must ride along on the output's
`TableMetadata` — otherwise the copy would silently un-hide rows
that the source had masked. This affects the `Download` path
(external -> local), which is the only caller of `runCopyCompaction`
in production.

This is in contrast to `runDefaultTableCompaction`, which iterates
inputs through the standard iterator stack (including the per-row
mask filter) and so produces a post-filter output that must NOT
carry the mask forward.

A direct unit test is awkward because `runCopyCompaction` is only
chosen for a narrow combination of conditions (external backing,
no per-row work). Coverage is via the metamorphic test in the test
infrastructure stage, where `Download` interleaved with DSR
exercises the path.
`DeleteSuffixRange` previously attached a `SuffixMask` to every
sstable whose user-key bounds overlapped the requested span, even
when the file demonstrably contained no suffixes in the mask's
`[lower, upper)` range. The masks were observationally no-ops but
incurred a manifest edit, a virtual-table allocation, and per-row
filter overhead for the lifetime of the file.

Add an optional `Options.SuffixRangeIntersects` hook consulted on
each overlapping file before a mask is attached. When the hook
returns false, the file is skipped entirely (no version edit, no
mask, no excise). When unset, `DeleteSuffixRange` behaves as before.

`cockroachkvs.SuffixRangeIntersectsTable` is the ready-to-use
implementation: it decodes the file's `MVCCTimeInterval` block-
property aggregate (already collected by
`cockroachkvs.BlockPropertyCollectors`) and tests it against the
mask's wall-time range. It fails open (assumes intersection) on
missing aggregates, undecodable inputs, or non-MVCC suffixes.

Under invariants builds, a 25% metamorphic bypass attaches a no-op
mask anyway, keeping the unfiltered code path exercised even when
the optimization would normally elide it. A package-level
`suffixMaskSkipBypassDisabled` toggle lets focused tests disable
the randomness for deterministic assertions.

Tests:
  - `cockroachkvs.TestSuffixRangeIntersectsTable`: synthesizes
    aggregates and verifies intersection at boundary cases.
  - `TestDeleteSuffixRangeSkipsNonOverlappingFiles`: builds files
    with disjoint wall-time ranges and asserts only the matching
    file receives a mask (with the bypass disabled).
  - `TestDeleteSuffixRangeSkipBypassExercisesPath`: statistically
    verifies the 25% bypass eventually fires and that the resulting
    no-op mask doesn't change visible scan output.
`DeleteSuffixRange` attaches per-file `SuffixMasks` to a vsst's
`TableMetadata` to hide rows whose suffix falls in a configured range.
Masks are a read-time presentation of the file (analogous to
`SyntheticSuffix`), not a property of the backing bytes — the destination
of a `Replicate` must reconstruct the same masks on its vsst to observe
the source's view.

Previously, the source-side emission paths (`SharedSSTMeta.cloneFromFileMeta`
and the `ExternalFile` builder in `ScanInternal`) dropped masks, and the
destination-side ingest paths (`ingestSynthesizeShared`, `ingestLoad1External`)
had nowhere to receive them. A shared/external Replicate of a masked vsst
silently un-hid the masked rows on the destination — a real correctness
gap that the metamorphic test worked around with a test-side punt
(`t.dsrInOps`) disabling shared/external Replicate when DSR was in the
op stream.

This change adds `SuffixMasks []sstable.SuffixMask` to both
`SharedSSTMeta` and `ExternalFile`, deep-clones the source's mask list
on emission, and propagates it onto the destination's `TableMetadata`
on ingest. The destination's `IngestExternalFiles`/`IngestAndExcise`
paths gain an FMV gate refusing non-empty `SuffixMasks` when the
destination is below `FormatSuffixMask` — silent loss of masking would
be a correctness regression rather than a missing feature.

A new test (`TestExternalFileSuffixMasksRoundTrip`) exercises the full
emit-then-ingest cycle: source DB ingests + applies DSR, `ScanInternal`
emits an `ExternalFile` carrying the mask, destination DB ingests via
`IngestAndExcise` and reproduces the source's hidden/visible pattern.
A companion test (`TestExternalFileSuffixMasksFMVGate`) verifies the
FMV refusal.

Note: no CockroachDB caller currently sets these fields. The feature
is added because it conceptually completes the DSR API surface (a vsst
property must survive cross-instance Replicate to preserve the user's
view) and because it lets the metamorphic test stop punting on
shared/external Replicate when DSR is present — see the follow-up
commit that removes the `dsrInOps` gate.
Wires `DeleteSuffixRange` into the metamorphic harness as
`deleteSuffixRangeOp`: parser/formatter, generator, key-manager update
hook, and execution. The op is enabled at weight 5 in
`DefaultOpConfig`, `ReadOpConfig`, and `WriteOpConfig` — typical
parity with the other span-shaped DB ops.

DSR requires `FormatSuffixMask` to execute. Configs in the metamorphic
test vary in their starting FMV, so the generator emits a paired
`RatchetFormatMajorVersion(FormatSuffixMask)` op immediately before
each DSR op. Without the ratchet, configs at lower FMVs would error
on DSR while configs at higher FMVs would succeed; the cross-config
compare would then trip on the diverging histories. The ratchet is
a no-op on configs already at or above `FormatSuffixMask`. A later
commit broadens this to ratchet every DB instance (not just the DSR
target) once `SuffixMasks` can flow across DBs via Replicate.

DSR mutates the LSM non-additively (attaches per-file masks). Classic
`Snapshot` observers cannot pin a pre-DSR `*Version` and so are
unsafe in DSR's presence, similar to the documented hazard around
excise. The `dsrInOps` flag (set in `test.go` if the op stream
contains any DSR) is consumed by `newSnapshotOp.run` to upgrade
every snapshot to an EFOS — EFOSes can transition to file-only
before DSR runs (via the EFOS-protected-range overlap handling in
`DB.DeleteSuffixRange`).

`keyManager.update` treats DSR as a no-op for the per-key state it
tracks: DSR hides at iteration time, not at the physical-record
level, so the SingleDelete eligibility classification (which is
about physical SET / DELETE records) is unaffected by which keys
DSR has masked from view.

Tests:
- `delete_suffix_range_test.go`: parser format roundtrip, generator
  invariants, and a focused execution test exercising basic DSR
  semantics through the metamorphic op harness.
- `testdata/parser`: format/parse roundtrip cases.
The metamorphic test previously masked the shared/external Replicate +
DSR cross product two ways:

1. `replicateOp.run` had a `t.dsrInOps` gate that forced the iter-based
   replicate fallback whenever any `DeleteSuffixRange` op appeared in
   the stream. The iter-based path filters source-masked rows at iter
   time, so masks were effectively baked into the shipped data and the
   destination never had to know about masks. This worked but meant
   the production shared/external `Replicate` paths were never exercised
   when DSR was in play.

2. `dbDeleteSuffixRange` emitted a `RatchetFormatMajorVersion(FormatSuffixMask)`
   op only for the DB that DSR runs on. In a two-instance test, DSR on
   `db1` ratched only `db1` while `db2` stayed at its config's starting
   FMV. A subsequent `Replicate` from `db1` (with masks) to `db2` (at
   FMV < `FormatSuffixMask`) was unreachable behind the gate above.

With `SuffixMasks` now plumbed through `ExternalFile`/`SharedSSTMeta`
(see prior commit), the workaround in (1) is no longer needed. This
change removes the `t.dsrInOps` gate so the shared and external
replicate paths actually run when DSR is in the op stream — surfacing
any future bugs in the wire-format propagation.

The dependency in (2) becomes load-bearing: now that the destination
DB's ingest enforces an FMV gate on `SuffixMasks`, a `Replicate` to a
non-ratched destination would error. `dbDeleteSuffixRange` is updated
to ratchet every DB in `g.dbs` before the DSR. The ratchet is a no-op
on configs already at `FormatSuffixMask` or above, and the change is
necessary regardless of how many DB instances the test happens to
create — masks may flow to any DB via `Replicate`.

The `t.dsrInOps` variable itself is retained because the EFOS-forcing
logic in `newSnapshotOp` still uses it (DSR mutates the LSM
non-additively, so any snapshot taken in a stream containing DSR must
be an EFOS to have any chance of pinning a pre-DSR view).
Adds a `useScanDeleteForDSR` test option, randomized per config at 50%
frequency, that routes each `deleteSuffixRangeOp` through an explicit
scan-and-delete equivalent instead of calling `DB.DeleteSuffixRange`.
The recorded op history is identical in both modes (the op records as
a DSR regardless), so the metamorphic harness's existing cross-config
compare exercises the implementation equivalence as a property: any
divergence in subsequent reads catches a bug in either DSR or in the
scan-and-delete baseline.

The equivalent path scans the span with `IterKeyTypePointsAndRanges`,
issues `Delete(key)` per visible point key whose suffix lies in
`[lower, upper)` per `ComparePointSuffixes`, and `RangeKeyUnset(start,
end, suffix)` per visible range-key entry with a matching non-empty
suffix. Empty-suffix entries and suffixless point keys are left alone,
matching the documented DSR contract.

Equivalence is at the observable-read level, not the LSM-state level:
real DSR attaches per-file metadata; the scan-and-delete path writes
physical tombstones at a new seqnum. Both implementations should hide
exactly the same set of rows from every subsequent read; that's what
the cross-config compare verifies.

Some implementation-level differences that intentionally do NOT cause
divergence (because they don't surface in op results):

  - LSM file layout: DSR's masked files vs scan+delete's tombstoned
    rows compact differently. Compaction op results are recorded as
    success/failure, not LSM contents.
  - Compaction cancellation: real DSR cancels overlapping in-progress
    compactions; scan+delete doesn't. Visible behavior agrees because
    cancelled compactions retry.
  - Snapshot semantics: a snapshot taken before either DSR or
    scan+delete sees the pre-mutation state in both. A snapshot taken
    after either should see the post-mutation state (modulo the EFOS
    file-only-transition machinery, which DSR handles explicitly and
    scan+delete gets for free because point tombstones don't perturb
    snapshots).

The equivalent path is approximate for `SyntheticSuffix` files: scan
returns synth-substituted keys, so `Delete(iter.Key())` deletes by the
substituted user key. DSR with synth-in-mask excises the whole file's
overlap. Observable outcome agrees (all matching rows are gone) but
the LSM state diverges substantially. This has not produced spurious
divergences across the runs exercised so far (11 iterations across
TestMeta + TestMetaCockroachKVs, ~50/50 split on the toggle, all
clean), but it's the most likely source of any future flake.
@dt dt force-pushed the suffix-upper-bound branch from c3d718e to cb65f01 Compare May 24, 2026 18:58
}

// randNewPCG returns a closure-based PCG-style RNG. We avoid the
// math/rand/v2 import to keep this test file's import surface minimal —

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you like this claude

// method's correctness must match the oracle row-for-row; the test would
// catch any positioning method that forgets to apply the mask check, any
// boundary-semantics divergence, and any obsolete×mask interaction bug.
func TestBlockIterSuffixMaskOracle(t *testing.T) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's clarify this name, e.g. "TestRandomBlockIterSuffixMaskAgainstNaiveFilter" -- we can use the comment to spell out that we'll push a randomly generated test-case through suffix masking and through a naive filter that serves as our "oracle" against which to compare the mask result.

Second question: How is this getting us that much coverage though if the seed is hard-coded?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants