Identities, Collections, and Deduplication #286

Draft

jesserobbins wants to merge 20 commits into wesm:main from jesserobbins:jesse/dedup-v2
Conversation

Contributor

@jesserobbins jesserobbins commented Apr 22, 2026

Summary

Implements the identity, collection, and deduplication system from #278.

New commands

  • list-identities — auto-discover sent-from addresses across accounts using three signals (From header, OAuth, config); prints likely identities for configuring the [identity].addresses list.
  • collections — manage named source groups (create, list, show, add, remove, delete); a default All collection is seeded automatically and tracks every source.
  • deduplicate — find and merge duplicates across sources for the same account.

deduplicate behavior

  • Groups by RFC822 Message-ID; optional --content-hash also groups by normalized raw MIME.
  • Picks a survivor (honoring --prefer <source-types> and identity-based sent-copy preference), unions labels, soft-deletes the pruned copies.
  • Read-only by default (--dry-run). --apply required to write.
  • Reversible: every run writes a per-source manifest; --undo <batch-id> restores hidden rows. --undo is repeatable to undo multiple batches.
  • Atomic snapshot before merge via SQLite VACUUM INTO (skip with --no-backup).
  • --delete-dups-from-source-server additionally stages pruned copies for remote deletion (destructive, opt-in; remote sources only).

Query layer

  • SourceIDs filter propagated through DuckDB and SQLite query paths, plus soft-delete exclusion everywhere.
  • TUI account selector lists collections and filters results to the collection's sources.

Closes #278.

  • Proposal and implementation plans for the identity discovery, collections, and deduplication system.
  • Content-hash based duplicate detection across accounts with soft-delete merging. Three-signal identity discovery (From header, OAuth, config). CLI commands: deduplicate (dry-run default, --apply, --undo) and list-identities. All query paths exclude dedup-soft-deleted rows.
  • Named collections grouping multiple sources with a default "All" collection. SourceIDs filtering in both DuckDB and SQLite query paths. CLI collections command with CRUD operations. Includes fixes for buffer corruption in normalizeRawMIME, deep-copy in Clone(), and openStore consolidation.
  • Table-driven tests for dedup engine, collections CRUD, identity discovery, and source filter helpers. Incorporates review findings.

roborev-ci Bot commented Apr 22, 2026

roborev: Combined Review (e1e5be4)

I’ll consolidate the reports into one PR-ready comment, keeping only medium-or-higher findings and merging anything duplicated.
Medium-severity issues remain around DuckDB soft-delete filtering, default collection initialization, and TOML escaping.

Medium

  • internal/query/duckdb.go:682
    DuckDB aggregate/stat/filter query builders do not add deleted_at IS NULL, while SQLite paths now do. Until the cache is rebuilt, soft-deleted duplicates can still appear in DuckDB-backed aggregates, stats, and filtered views.
    Fix: Add the same soft-delete predicate to DuckDB buildWhereClause, buildFilterConditions, GetTotalStats, and search condition paths that read message rows.

  • internal/store/collections.go:57
    The default "All" collection is only created when EnsureDefaultCollection is called, but this change does not wire it into schema/store initialization or source creation. The CLI documents "All" as automatic, but normal installs can have no default collection, and newly created sources will not join it unless callers remember to invoke this method.
    Fix: Call EnsureDefaultCollection during schema initialization and ensure source creation/upsert paths add new sources to "All".

  • cmd/msgvault/cmd/list_identities.go:112
    writeIdentitiesTOML manually escapes only double quotes before printing archive-derived email addresses into a TOML config snippet. A crafted From: address containing a newline, backslash escape, or TOML syntax could break out of the string and inject additional config when pasted.
    Fix: Generate TOML with a TOML encoder, or use a correct TOML string escaping routine that handles quotes, backslashes, control characters, CR/LF, and non-printable bytes. Consider validating identity candidates as syntactically valid email addresses before including them in paste-ready config output.
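The escaping part of that fix can be sketched as follows. This is an illustrative helper, not the project's code: `escapeTOMLString` is a hypothetical name, and a real fix might instead use a full TOML encoder such as BurntSushi/toml. The point is that quotes, backslashes, and all control characters (including CR/LF) must be escaped so a crafted From: address cannot break out of the quoted value.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeTOMLString (hypothetical helper) renders s as a TOML basic string,
// escaping quotes, backslashes, and control characters so a crafted
// address cannot terminate the string and inject extra config lines.
func escapeTOMLString(s string) string {
	var b strings.Builder
	b.WriteByte('"')
	for _, r := range s {
		switch {
		case r == '"':
			b.WriteString(`\"`)
		case r == '\\':
			b.WriteString(`\\`)
		case r == '\n':
			b.WriteString(`\n`)
		case r == '\r':
			b.WriteString(`\r`)
		case r == '\t':
			b.WriteString(`\t`)
		case r < 0x20 || r == 0x7f:
			// Remaining control characters use TOML \uXXXX escapes.
			b.WriteString(fmt.Sprintf(`\u%04X`, r))
		default:
			b.WriteRune(r)
		}
	}
	b.WriteByte('"')
	return b.String()
}

func main() {
	// A newline in the address can no longer start a new TOML line.
	fmt.Println(escapeTOMLString("alice@example.com\"]\nevil = true"))
}
```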


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

- DuckDB: add deleted_at IS NULL predicate to buildWhereClause,
  buildFilterConditions, GetTotalStats, and buildSearchConditions
  so soft-deleted duplicates are excluded from Parquet-backed queries.
  Handle missing column in older Parquet files via parquetCTEs.

- Collections: call EnsureDefaultCollection during InitSchema so the
  "All" collection is always present. Add new sources to "All" in
  GetOrCreateSource so newly added accounts join automatically.

- TOML output: replace manual string escaping in writeIdentitiesTOML
  with BurntSushi/toml encoder to prevent injection via crafted
  From: addresses.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

roborev-ci Bot commented Apr 22, 2026

roborev: Combined Review (3080a0f)

I’ll merge these into one PR-review comment, deduplicating overlap and dropping the low-severity item as requested.
Summary verdict: High and Medium issues remain; the PR is not ready to merge until side effects, scoping, compile, and manifest path handling are fixed.

High

  • internal/dedup/dedup.go:184: Engine.Scan mutates the database before confirmation and during --dry-run by backfilling rfc822_message_id. It also performs the backfill globally through CountMessagesWithoutRFC822ID / BackfillRFC822IDs, even when deduplication is scoped to a specific account or collection.

    Fix: Make scan side-effect-free for dry runs and pre-confirmation flows, or move the backfill into the confirmed execution path. Scope the backfill/count to AccountSourceIDs.

  • internal/dedup/dedup.go in sanitizeAccount: The rune literal ' @' contains more than one character and will fail compilation with "invalid character literal (more than one character)".

    Fix: Change case r == ' @' || r == '.': to case r == '@' || r == '.':.

Medium

  • cmd/msgvault/cmd/deduplicate.go:112: A resolved collection with zero sources is treated the same as no --account, so deduplicate --account <empty-collection> deduplicates every source independently instead of doing nothing or erroring.

    Fix: Track whether --account was explicitly provided and return an error when a resolved collection/source scope expands to zero source IDs.

  • internal/query/source_filter.go:14: Empty SourceIDs falls through as “no filter,” so an explicitly empty collection passed into query/list/stat paths can show all accounts instead of no results.

    Fix: Distinguish nil from explicitly empty multi-source filters and add a 1=0 condition for explicitly empty scopes.

  • internal/dedup/dedup.go:615 and cmd/msgvault/cmd/deduplicate.go:203: batchID can include a raw source identifier in per-source dedup mode, and that value is later embedded into manifest.ID and used in filepath.Join(mgr.PendingDir(), manifest.ID+".json"). Source identifiers are user-controlled and may contain path separators, allowing manifest writes outside the intended pending directory or into unintended nested paths when --delete-dups-from-source-server is used.

    Fix: Sanitize the full manifest ID before using it as a filename, or construct manifest filenames only from a restricted character set. Also verify the cleaned path remains under mgr.PendingDir() after joining.
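That last fix, sketched under assumed names (`sanitizeManifestID` and `manifestPath` are illustrative, not the project's API): restrict the ID to a safe character set, then verify the joined path still resolves under the pending directory.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// sanitizeManifestID (hypothetical) maps a batch/manifest ID onto a
// restricted character set so user-controlled source identifiers cannot
// smuggle path separators into the manifest filename.
func sanitizeManifestID(id string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z',
			r >= '0' && r <= '9', r == '-', r == '_':
			return r
		default:
			return '_'
		}
	}, id)
}

// manifestPath joins the sanitized ID under pendingDir and double-checks
// that the cleaned result is still inside pendingDir (defense in depth).
func manifestPath(pendingDir, id string) (string, error) {
	p := filepath.Join(pendingDir, sanitizeManifestID(id)+".json")
	rel, err := filepath.Rel(pendingDir, p)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("manifest id escapes pending dir: %q", id)
	}
	return p, nil
}

func main() {
	// Path separators in the source identifier are replaced, so the
	// manifest stays under the pending directory.
	p, err := manifestPath("/var/msgvault/pending", "dedup-acct/../../etc")
	fmt.Println(p, err)
}
```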


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

- Error when --account resolves to zero sources instead of silently
  falling through to per-source mode
- Sanitize src.Identifier in batchID to prevent path separators in
  manifest filenames
- Distinguish nil from empty SourceIDs in query filter (empty = match
  nothing, nil = no filter)
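The nil-versus-empty convention described above can be pinned down in a small SQL-builder sketch. The function name and signature here are assumptions for illustration, not the project's `source_filter.go`:

```go
package main

import (
	"fmt"
	"strings"
)

// appendSourceFilter (illustrative sketch) distinguishes a nil scope
// ("no filter") from an explicitly empty scope ("match nothing"),
// emitting a 1=0 predicate for the latter so an empty collection
// returns no rows instead of all accounts.
func appendSourceFilter(conds []string, args []any, sourceIDs []int64) ([]string, []any) {
	switch {
	case sourceIDs == nil:
		return conds, args // unscoped: add no predicate
	case len(sourceIDs) == 0:
		return append(conds, "1=0"), args // explicitly empty: match nothing
	default:
		ph := strings.TrimRight(strings.Repeat("?,", len(sourceIDs)), ",")
		for _, id := range sourceIDs {
			args = append(args, id)
		}
		return append(conds, "source_id IN ("+ph+")"), args
	}
}

func main() {
	conds, args := appendSourceFilter(nil, nil, []int64{3, 7})
	fmt.Println(conds, args)

	conds, _ = appendSourceFilter(nil, nil, []int64{})
	fmt.Println(conds) // an empty collection filters everything out
}
```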

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

roborev-ci Bot commented Apr 22, 2026

roborev: Combined Review (b2aa98c)

Summary verdict: Medium-risk empty-scope and collection workflow issues remain; no Critical or High findings were reported.

Medium

  • internal/dedup/dedup.go:188
    Engine.Scan passes an empty AccountSourceIDs slice through to store duplicate queries as unscoped, so the documented empty/default config can group duplicates across all sources instead of strict per-source dedup. This can merge messages from different accounts for non-CLI callers.
    Fix: Reject empty AccountSourceIDs in Scan, or implement true per-source grouping when empty; reserve cross-source grouping for explicit source scopes.

  • cmd/msgvault/cmd/list_identities.go:50
    When --account resolves to a collection with zero sources, scope.SourceIDs() expands to an empty slice and ListLikelyIdentities(scopeIDs...) treats that as unscoped, returning identities from every source instead of none. Empty collections can occur after collections remove.
    Fix: Detect len(scopeIDs) == 0 after resolving an account/collection and return no candidates or a clear “zero sources” error; alternatively update the store API to distinguish empty scoped lists from unscoped calls.

  • cmd/msgvault/cmd/collections.go
    runCollectionsCreate and runCollectionsAdd only update collection associations and exit. The spec/implementation plan says creating a collection or adding accounts should automatically trigger deduplication.
    Fix: Invoke the dedup engine scoped to the collection’s sources after successful create/add operations.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

…lIDsByFilter

- Error when list-identities --account resolves to zero sources instead
  of returning unscoped results from all accounts
- Add missing deleted_at IS NULL filter in DuckDB GetGmailIDsByFilter
  fallback path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor Author

Fixed (5 total):

  • --account zero-source guard in deduplicate
  • --account zero-source guard in list-identities
  • batchID path traversal sanitization
  • Empty SourceIDs nil-vs-empty distinction in query filter
  • Missing deleted_at IS NULL in DuckDB GetGmailIDsByFilter

Disagree with the following:

  • Invalid rune ' @' — code was already correct
  • Scan backfill mutates during dry-run — idempotent enrichment, required for
    results
  • Scan empty AccountSourceIDs for non-CLI callers — speculative, CLI handles
    correctly
  • Collections create/add should auto-trigger dedup — dedup is a separate
    explicit operation
  • EnsureDefaultCollection not wired in — already called in InitSchema
  • TOML escaping — already uses proper toml.NewEncoder


roborev-ci Bot commented Apr 22, 2026

roborev: Combined Review (52f1cd4)

Deduplicated verdict: PR has 2 Medium findings around deduplication scope and unintended mutation; no High or Critical findings.

Medium

  • internal/dedup/dedup.go:204: Engine.Scan backfills rfc822_message_id before honoring dry-run behavior, and the backfill runs across the whole database even for scoped --account scans. This means msgvault deduplicate --dry-run or account-scoped scans can still mutate unrelated rows before confirmation or backup.

    Fix: Make the backfill respect DryRun, and pass AccountSourceIDs through CountMessagesWithoutRFC822ID / BackfillRFC822IDs so only scoped sources are touched.

  • internal/dedup/dedup.go:233: The engine contract says empty AccountSourceIDs means per-source deduplication, but Scan passes an empty source list to store methods, where it means “all sources.” A direct engine caller using the default config can therefore merge duplicates across accounts, contradicting the documented safety rule.

    Fix: Require a non-empty scope for engine scans, or implement true per-source grouping when AccountSourceIDs is empty.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Contributor Author

roborev: Combined Review (52f1cd4)

Deduplicated verdict: PR has 2 Medium findings around deduplication scope and unintended mutation; no High or Critical findings.

Medium

  • internal/dedup/dedup.go:204: Engine.Scan backfills rfc822_message_id before honoring dry-run behavior, and the backfill runs across the whole database even for scoped --account scans. This means msgvault deduplicate --dry-run or account-scoped scans can still mutate unrelated rows before confirmation or backup.
    Fix: Make the backfill respect DryRun, and pass AccountSourceIDs through CountMessagesWithoutRFC822ID / BackfillRFC822IDs so only scoped sources are touched.

Dry run, in my opinion, shouldn't do anything that writes or mutates... including backup. It should clearly disclose what it would have done, and what it can't do because of a potential mutate.

  • internal/dedup/dedup.go:233: The engine contract says empty AccountSourceIDs means per-source deduplication, but Scan passes an empty source list to store methods, where it means “all sources.” A direct engine caller using the default config can therefore merge duplicates across accounts, contradicting the documented safety rule.
    Fix: Require a non-empty scope for engine scans, or implement true per-source grouping when AccountSourceIDs is empty.

Okay, I agree with that.

…Scan

- Scan no longer backfills rfc822_message_id during dry-run; instead
  reports how many messages need backfill and notes they'll be included
  on --apply
- Backfill is now scoped to AccountSourceIDs, not global
- Engine.Scan requires non-empty AccountSourceIDs to prevent accidental
  cross-account grouping; CLI handles unscoped case via per-source
  iteration
- list-identities --account errors on zero-source resolution

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

roborev-ci Bot commented Apr 22, 2026

roborev: Combined Review (b5c3b86)

Verdict: One medium-severity issue should be fixed before merging; no critical or high findings were reported.

Medium

  • internal/config/config.go:132: IdentityAddressSet can insert blank or whitespace-only addresses into the identity set. That can make duplicate candidates with an empty FromEmail appear to match a configured identity, causing unknown or blank senders to influence survivor selection incorrectly.
    • Fix: Trim addresses first, skip empty values, then lowercase and insert them.
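The suggested trim-skip-lowercase order can be sketched as below. `buildIdentitySet` is an illustrative name, not the project's `IdentityAddressSet`:

```go
package main

import (
	"fmt"
	"strings"
)

// buildIdentitySet (illustrative) trims each configured address, skips
// blank entries, and lowercases before inserting, so a duplicate
// candidate with an empty FromEmail can never match a configured
// identity and skew survivor selection.
func buildIdentitySet(addrs []string) map[string]bool {
	set := make(map[string]bool, len(addrs))
	for _, a := range addrs {
		a = strings.ToLower(strings.TrimSpace(a))
		if a == "" {
			continue // blank or whitespace-only entries never enter the set
		}
		set[a] = true
	}
	return set
}

func main() {
	set := buildIdentitySet([]string{" Alice@Example.com ", "", "   "})
	fmt.Println(set["alice@example.com"], set[""]) // true false
}
```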

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

roborev-ci Bot commented Apr 22, 2026

roborev: Combined Review (5086043)

I’ll consolidate the review outputs into one PR-ready comment, keeping only medium-or-higher findings and merging overlaps without inventing missing line numbers.
Verdict: Changes need fixes before merge: one build-breaking issue and several medium-risk dedup/MIME behavior problems remain.

Critical

None.

High

  • internal/dedup/dedup.go, in SanitizeFilenameComponent
    • case r == ' @' || r == '.': appears to use an invalid multi-character rune literal, which will break the Go build.
    • Fix: Change ' @' to '@'.

Medium

  • internal/dedup/dedup.go:215

    • Scan backfills rfc822_message_id during non-dry-run mode before the CLI confirmation prompt, so aborting deduplication can still leave database changes behind.
    • Fix: Keep Scan read-only, prompt before backfill, or parse raw MIME in memory for reporting.
  • internal/dedup/dedup.go:209

    • Dry-run skips RFC822 backfill entirely, so it can report fewer duplicate groups than the subsequent real run will merge.
    • Fix: Make dry-run compute candidate Message-IDs from raw MIME without persisting them.
  • internal/dedup/dedup.go, in normalizeRawMIME

    • Header boundary detection checks for \r\n\r\n before \n\n; if LF-only headers contain CRLF delimiters later in the body, the body can be misclassified as headers.
    • Fix: Find both delimiter indices and use the smallest non-negative match.
  • internal/dedup/normalize_test.go and internal/store/dedup_test.go

    • Several raw MIME test literals appear to use broken line endings like rn instead of \r\n, plus obfuscated email addresses such as alice @example.com, which can break or mask MIME parsing behavior.
    • Fix: Restore proper MIME line endings and valid email addresses in test fixtures.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

jesserobbins and others added 2 commits April 22, 2026 08:16
When LF-only headers are followed by a body containing CRLF sequences,
the previous code could match \r\n\r\n in the body instead of \n\n at
the actual header boundary. Now finds both delimiters and uses the
earliest match.
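The boundary-detection fix described in that commit can be sketched like this; `headerBoundary` is an illustrative function, not the project's `normalizeRawMIME`:

```go
package main

import (
	"fmt"
	"strings"
)

// headerBoundary finds the header/body split by locating both possible
// delimiters and taking the earliest match, so a CRLF blank line later
// in an LF-only body cannot be mistaken for the header boundary.
func headerBoundary(raw string) (headers, body string) {
	crlf := strings.Index(raw, "\r\n\r\n")
	lf := strings.Index(raw, "\n\n")
	switch {
	case crlf >= 0 && (lf < 0 || crlf < lf):
		return raw[:crlf], raw[crlf+4:]
	case lf >= 0:
		return raw[:lf], raw[lf+2:]
	default:
		return raw, "" // no blank line: treat everything as headers
	}
}

func main() {
	// LF-only headers, but the body happens to contain a CRLF blank line.
	raw := "From: a@example.com\nSubject: hi\n\nbody line\r\n\r\nmore"
	h, b := headerBoundary(raw)
	fmt.Printf("headers=%q\nbody=%q\n", h, b)
}
```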

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…SourceIDs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

roborev-ci Bot commented Apr 22, 2026

roborev: Combined Review (ef35ac0)

Medium: UndoDedup does not fully reverse merge side effects.

Medium

  • internal/store/dedup.go:383 - UndoDedup only clears deleted_at and delete_batch_id. It does not undo labels copied to the survivor or raw MIME backfilled during MergeDuplicates, so undo can leave altered query/filter behavior behind.
    • Suggested fix: record per-batch merge side effects and reverse them during undo, or make the undo behavior explicitly limited to restoring hidden duplicate rows.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Label union and raw MIME backfill are additive enrichment that leaves
survivors strictly better off. Reversing them would require tracking
per-merge deltas for no user benefit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

roborev-ci Bot commented Apr 22, 2026

roborev: Combined Review (19caff5)

Deduplication feature has two Medium issues that should be addressed before merge.

Medium

  • [internal/dedup/dedup.go:204] Engine.Scan mutates the database during a normal scan by backfilling rfc822_message_id before CLI confirmation and before backup creation. If the user declines the prompt, the database has still changed, and the backup cannot restore the true pre-scan state.

    • Fix: Keep Scan read-only; move the backfill into the confirmed execution path after backup creation, or make it a separate confirmed preflight/backfill step.
  • [internal/dedup/dedup.go:259] The content-hash fallback excludes messages already found in an RFC822 duplicate group, which misses transitive duplicates where two copies share a Message-ID and a third has a rewritten or missing Message-ID but identical normalized MIME.

    • Fix: Include primary-pass survivors in the content-hash pass and merge overlapping duplicate groups, for example with a union-find over message IDs.
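The union-find suggestion can be sketched as follows. Names (`dsu`, `mergeGroups`) and the int64 message-ID representation are assumptions for illustration, not the project's types:

```go
package main

import "fmt"

// dsu is a tiny union-find over message IDs. Feeding every duplicate
// group (from the Message-ID pass and the content-hash pass) into it
// joins transitive duplicates: if A and B share a Message-ID and B and
// C share a content hash, A, B, and C land in one group.
type dsu struct{ parent map[int64]int64 }

func newDSU() *dsu { return &dsu{parent: map[int64]int64{}} }

func (d *dsu) find(x int64) int64 {
	if _, ok := d.parent[x]; !ok {
		d.parent[x] = x
	}
	for d.parent[x] != x {
		d.parent[x] = d.parent[d.parent[x]] // path halving
		x = d.parent[x]
	}
	return x
}

func (d *dsu) union(a, b int64) { d.parent[d.find(a)] = d.find(b) }

// mergeGroups folds overlapping duplicate groups into disjoint sets,
// keyed by each set's representative root.
func mergeGroups(groups [][]int64) map[int64][]int64 {
	d := newDSU()
	for _, g := range groups {
		for _, id := range g[1:] {
			d.union(g[0], id)
		}
	}
	out := map[int64][]int64{}
	for id := range d.parent {
		root := d.find(id)
		out[root] = append(out[root], id)
	}
	return out
}

func main() {
	// {1,2} share a Message-ID; {2,3} share a normalized content hash.
	merged := mergeGroups([][]int64{{1, 2}, {2, 3}})
	fmt.Println(len(merged)) // one merged group of three
}
```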

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

jesserobbins and others added 7 commits April 22, 2026 18:50
When a user passes --log-file, the startup warning previously printed
LogsDir even though logs were actually going to FilePath. Report the
path that was actually used so the warning is actionable.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add TestAppendSourceFilter cases for nil, empty, single, and multi-ID
inputs to pin the boundary behavior of the SQL builder. Add
TestEngine_Scan_RejectsEmptyAccountSourceIDs to ensure Engine.Scan
rejects an unscoped scan (both nil and empty-slice AccountSourceIDs).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Use strings.HasPrefix for the "(?i)" guard in list_identities.go
instead of byte slicing (safer on short inputs). Replace fmt.Sscanf
with strconv.ParseInt in parseInt64CSV so malformed rows like "123abc"
are rejected instead of silently accepted. Flag the SQLite-specific
GROUP_CONCAT in ListLikelyIdentities with a comment for the Postgres
dialect port.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wrap AddSourcesToCollection and RemoveSourcesFromCollection in
s.withTx so per-source inserts/deletes are atomic — previously a
mid-loop failure could leave the membership table partially updated.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Route datetime('now') through s.dialect.Now() and INSERT OR IGNORE
through s.dialect.InsertOrIgnore(...) so dedup queries port cleanly
to PostgreSQL. Wrap the has_sent_label EXISTS column in CAST(... AS
INTEGER) so int scans are dialect-agnostic. Scan archived_at as
sql.NullTime in GetDuplicateGroupMessages and GetAllRawMIMECandidates
so the survivor tiebreaker no longer depends on a hard-coded
timestamp layout. Also JOIN message_raw in CountMessagesWithoutRFC822ID
so the reported count matches what BackfillRFC822IDs actually
processes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Report: copy report.SampleGroups instead of aliasing report.Groups to
prevent silent mutation via future appends. Add SkippedDecompressionErrors
and log a warning per failure in scanNormalizedHashGroups; surface the
count in FormatReport. Count empty normalized Message-IDs as failed in
BackfillRFC822IDs so updated+failed matches the number of candidates
processed.

Manifest IDs: key remote manifest grouping by (account, source_type)
so an account spanning multiple source types gets a per-type manifest
with the correct SourceType label. Disambiguate manifest IDs only
when an account contributes duplicates from more than one source
type (preserves existing single-type IDs). On filename truncation,
append a 4-byte hash suffix in SanitizeFilenameComponent so distinct
accounts with identical 40-char prefixes produce unique manifest IDs.

Undo: continue through all pending manifests in Engine.Undo, joining
cancellation errors with errors.Join, and document the best-effort
semantics in godoc.

Methodology doc: note in FormatMethodology that content-hash is
byte-sensitive below the header boundary (CRLF vs LF body differences
will not match) and that merge only backfills raw MIME — point users
to repair-encoding / full cache rebuild for missing parsed bodies.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the dedup-backup file copy (main + -wal + -shm) with
VACUUM INTO, giving an atomic point-in-time snapshot.

Accept --undo multiple times (StringArray) and, in per-source mode,
print a consolidated footer listing all batch IDs for a single undo
command. On --undo error, print the restored count and any
in-progress manifests before returning the wrapped error so
best-effort partial success is visible. Surface cancelErrs from
Engine.Undo to stderr instead of hiding them inside the wrapped
error. Reword the still-running warning to say in-progress manifests
"cannot be cancelled" and factor the print block into
printStillRunningWarning.

Append a random run-XXXXXXXX suffix to single-run batch IDs so they
can never be a prefix of per-source batch IDs generated in the same
second.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
InitSchema returned early from the FTS5 branch when the sqlite3 build
lacks the fts5 module, which meant EnsureDefaultCollection never ran.
Without the collections tables, GetOrCreateSource's "add to All"
insert silently failed and the "All" collection was absent from
listings — causing TestCollections_CRUD to fail with list=1, want 2.
Fall through instead so collection setup still runs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

roborev-ci Bot commented Apr 23, 2026

roborev: Combined Review (4c5ff50)

One medium-severity issue remains: survivor selection can ignore archived_at when only one candidate has it.

Medium

  • internal/dedup/dedup.go:775 - isBetter only compares ArchivedAt when both timestamps are non-zero. If the current survivor has a zero ArchivedAt and a later candidate has a real timestamp, the function falls through to lower ID instead of applying the documented “earlier archived_at” tiebreaker. This can make survivor choice depend on row insertion order when only one side has archived_at.

    Suggested fix: Define explicit ordering for zero timestamps, such as treating non-zero ArchivedAt as better than zero before comparing actual timestamp values.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Development

Successfully merging this pull request may close these issues.

Enhancement Proposal: Identities, Collections, and Deduplication