Identities, Collections, and Deduplication #286
jesserobbins wants to merge 20 commits into wesm:main from
Conversation
Proposal and implementation plans for the identity discovery, collections, and deduplication system.
Content-hash based duplicate detection across accounts with soft-delete merging. Three-signal identity discovery (From header, OAuth, config). CLI commands: deduplicate (dry-run default, --apply, --undo) and list-identities. All query paths exclude dedup-soft-deleted rows.
Named collections grouping multiple sources with a default "All" collection. SourceIDs filtering in both DuckDB and SQLite query paths. CLI collections command with CRUD operations. Includes fixes for buffer corruption in normalizeRawMIME, deep-copy in Clone(), and openStore consolidation.
Table-driven tests for dedup engine, collections CRUD, identity discovery, and source filter helpers. Incorporates review findings.
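The content-hash detection described above can be sketched as hashing each message's normalized raw MIME bytes and bucketing by digest. This is a minimal illustration, not the PR's implementation: the `msg` type and `groupByContentHash` name are invented here, and the actual normalization step (`normalizeRawMIME`) is elided.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// msg is an illustrative record, not the PR's schema.
type msg struct {
	ID      int64
	RawMIME []byte
}

// groupByContentHash buckets messages whose (already normalized) raw MIME
// bytes hash identically: the core idea behind content-hash duplicate
// detection across accounts.
func groupByContentHash(msgs []msg) map[string][]int64 {
	groups := make(map[string][]int64)
	for _, m := range msgs {
		sum := sha256.Sum256(m.RawMIME)
		key := hex.EncodeToString(sum[:])
		groups[key] = append(groups[key], m.ID)
	}
	return groups
}

func main() {
	g := groupByContentHash([]msg{
		{1, []byte("same body")},
		{2, []byte("same body")},
		{3, []byte("different")},
	})
	for _, ids := range g {
		if len(ids) > 1 {
			fmt.Println("duplicate group:", ids)
		}
	}
}
```

Any group with more than one ID is a candidate duplicate set; the survivor/prune decision happens downstream.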
roborev: Combined Review
jesserobbins force-pushed from e1e5be4 to 68c6d34
- DuckDB: add a deleted_at IS NULL predicate to buildWhereClause, buildFilterConditions, GetTotalStats, and buildSearchConditions so soft-deleted duplicates are excluded from Parquet-backed queries. Handle the missing column in older Parquet files via parquetCTEs.
- Collections: call EnsureDefaultCollection during InitSchema so the "All" collection is always present. Add new sources to "All" in GetOrCreateSource so newly added accounts join automatically.
- TOML output: replace manual string escaping in writeIdentitiesTOML with the BurntSushi/toml encoder to prevent injection via crafted From: addresses.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
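The soft-delete exclusion pattern this commit applies can be sketched as follows. The function shape is assumed for illustration, not the PR's exact code: the point is that every query builder appends the predicate unconditionally, so hidden duplicates never surface regardless of which other filters are active.

```go
package main

import (
	"fmt"
	"strings"
)

// buildWhereClause is a minimal sketch of the pattern: whatever conditions
// the caller supplies, the soft-delete exclusion is always appended last.
func buildWhereClause(conds []string) string {
	conds = append(conds, "deleted_at IS NULL")
	return "WHERE " + strings.Join(conds, " AND ")
}

func main() {
	fmt.Println(buildWhereClause([]string{"account = ?"}))
	fmt.Println(buildWhereClause(nil))
}
```

Older Parquet files that predate the column need special handling (the commit routes that through parquetCTEs), since referencing a missing column would otherwise fail the whole query.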
- Error when --account resolves to zero sources instead of silently falling through to per-source mode
- Sanitize src.Identifier in batchID to prevent path separators in manifest filenames
- Distinguish nil from empty SourceIDs in the query filter (empty = match nothing, nil = no filter)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
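The nil-versus-empty distinction pinned down here can be sketched like this. The function name and signature are assumptions for illustration; only the three-way semantics come from the commit: nil means no filter at all, an empty slice means match nothing, and a non-empty slice becomes an IN clause.

```go
package main

import (
	"fmt"
	"strings"
)

// appendSourceFilter illustrates the boundary behavior:
//   nil      -> no filter clause (all sources)
//   empty    -> a clause that matches nothing
//   len > 0  -> IN (?, ...) with one placeholder per ID
func appendSourceFilter(where []string, args []any, sourceIDs []int64) ([]string, []any) {
	if sourceIDs == nil {
		return where, args // no filter requested
	}
	if len(sourceIDs) == 0 {
		return append(where, "1 = 0"), args // filter present but empty: match nothing
	}
	ph := make([]string, len(sourceIDs))
	for i, id := range sourceIDs {
		ph[i] = "?"
		args = append(args, id)
	}
	return append(where, "source_id IN ("+strings.Join(ph, ", ")+")"), args
}

func main() {
	w, a := appendSourceFilter(nil, nil, []int64{7, 9})
	fmt.Println(w, a)
}
```

Collapsing the empty case into "no filter" would be the dangerous bug: a collection that happens to contain zero sources would silently match every message.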
…lIDsByFilter

- Error when list-identities --account resolves to zero sources instead of returning unscoped results from all accounts
- Add the missing deleted_at IS NULL filter in the DuckDB GetGmailIDsByFilter fallback path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixed (5 total):
Disagree with the following:
Dry run, in my opinion, shouldn't do anything that writes or mutates... including backup. It should clearly disclose what it would have done, and what it can't do because of a potential mutate.
Okay, I agree with that.
…Scan

- Scan no longer backfills rfc822_message_id during dry-run; instead it reports how many messages need backfill and notes they'll be included on --apply
- Backfill is now scoped to AccountSourceIDs, not global
- Engine.Scan requires non-empty AccountSourceIDs to prevent accidental cross-account grouping; the CLI handles the unscoped case via per-source iteration
- list-identities --account errors on zero-source resolution

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When LF-only headers are followed by a body containing CRLF sequences, the previous code could match \r\n\r\n in the body instead of \n\n at the actual header boundary. Now finds both delimiters and uses the earliest match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
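The earliest-match fix can be sketched as below. The function name is invented for illustration; the real normalizeRawMIME does more than locate the boundary. The key point is that searching only for \r\n\r\n can skip past an LF-only header block and land inside the body, so both delimiters are found and the earlier one wins.

```go
package main

import (
	"bytes"
	"fmt"
)

// headerBoundary returns the index where the header/body delimiter starts,
// handling mixed line endings: it looks for both CRLF-CRLF and LF-LF and
// takes whichever occurs first, so a CRLF pair inside the body can no
// longer mask an earlier LF-only header boundary.
func headerBoundary(raw []byte) (int, bool) {
	crlf := bytes.Index(raw, []byte("\r\n\r\n"))
	lf := bytes.Index(raw, []byte("\n\n"))
	switch {
	case crlf < 0 && lf < 0:
		return 0, false
	case crlf < 0:
		return lf, true
	case lf < 0:
		return crlf, true
	case lf < crlf:
		return lf, true
	default:
		return crlf, true
	}
}

func main() {
	// LF-only headers, CRLF sequences in the body: boundary is the \n\n.
	end, ok := headerBoundary([]byte("A: b\nC: d\n\nbody\r\n\r\nmore"))
	fmt.Println(end, ok)
}
```

Note that when headers really are CRLF-terminated, the \n\n inside \r\n\r\n occurs one byte later than the \r\n\r\n itself, so the CRLF delimiter still wins correctly.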
…SourceIDs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Label union and raw MIME backfill are additive enrichment that leaves survivors strictly better off. Reversing them would require tracking per-merge deltas for no user benefit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a user passes --log-file, the startup warning previously printed LogsDir even though logs were actually going to FilePath. Report the path that was actually used so the warning is actionable.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add TestAppendSourceFilter cases for nil, empty, single, and multi-ID inputs to pin the boundary behavior of the SQL builder. Add TestEngine_Scan_RejectsEmptyAccountSourceIDs to ensure Engine.Scan rejects an unscoped scan (both nil and empty-slice AccountSourceIDs).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Use strings.HasPrefix for the "(?i)" guard in list_identities.go instead of byte slicing (safer on short inputs). Replace fmt.Sscanf with strconv.ParseInt in parseInt64CSV so malformed rows like "123abc" are rejected instead of silently accepted. Flag the SQLite-specific GROUP_CONCAT in ListLikelyIdentities with a comment for the Postgres dialect port.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
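The parseInt64CSV change matters because fmt.Sscanf with %d stops at the first non-digit and reports success, so "123abc" would yield 123 silently. strconv.ParseInt requires the whole token to be numeric. A minimal sketch of the stricter parser (the real function's signature may differ):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseInt64CSV splits a comma-separated list of IDs and rejects any token
// with trailing garbage, which fmt.Sscanf's %d would have accepted.
func parseInt64CSV(s string) ([]int64, error) {
	var out []int64
	for _, part := range strings.Split(s, ",") {
		n, err := strconv.ParseInt(strings.TrimSpace(part), 10, 64)
		if err != nil {
			return nil, fmt.Errorf("invalid id %q: %w", part, err)
		}
		out = append(out, n)
	}
	return out, nil
}

func main() {
	ids, err := parseInt64CSV("1, 2,3")
	fmt.Println(ids, err)
	_, err = parseInt64CSV("123abc")
	fmt.Println(err != nil) // malformed token is rejected
}
```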
Wrap AddSourcesToCollection and RemoveSourcesFromCollection in s.withTx so per-source inserts/deletes are atomic: previously a mid-loop failure could leave the membership table partially updated.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Route datetime('now') through s.dialect.Now() and INSERT OR IGNORE
through s.dialect.InsertOrIgnore(...) so dedup queries port cleanly
to PostgreSQL. Wrap the has_sent_label EXISTS column in CAST(... AS
INTEGER) so int scans are dialect-agnostic. Scan archived_at as
sql.NullTime in GetDuplicateGroupMessages and GetAllRawMIMECandidates
so the survivor tiebreaker no longer depends on a hard-coded
timestamp layout. Also JOIN message_raw in CountMessagesWithoutRFC822ID
so the reported count matches what BackfillRFC822IDs actually
processes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
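The dialect seam this commit routes SQL through might look like the sketch below. The interface name and method shapes are assumptions inferred from s.dialect.Now() and s.dialect.InsertOrIgnore(...); only the two SQL spellings are standard facts: SQLite has datetime('now') and INSERT OR IGNORE, while PostgreSQL uses now() and INSERT ... ON CONFLICT DO NOTHING.

```go
package main

import "fmt"

// dialect abstracts the SQL that differs between engines.
type dialect interface {
	Now() string
	InsertOrIgnore(table, cols, placeholders string) string
}

type sqliteDialect struct{}

func (sqliteDialect) Now() string { return "datetime('now')" }
func (sqliteDialect) InsertOrIgnore(table, cols, ph string) string {
	return fmt.Sprintf("INSERT OR IGNORE INTO %s (%s) VALUES (%s)", table, cols, ph)
}

type postgresDialect struct{}

func (postgresDialect) Now() string { return "now()" }
func (postgresDialect) InsertOrIgnore(table, cols, ph string) string {
	// Postgres has no INSERT OR IGNORE; ON CONFLICT DO NOTHING is the equivalent.
	return fmt.Sprintf("INSERT INTO %s (%s) VALUES (%s) ON CONFLICT DO NOTHING", table, cols, ph)
}

func main() {
	for _, d := range []dialect{sqliteDialect{}, postgresDialect{}} {
		fmt.Println(d.InsertOrIgnore("dedup_batches", "id, created_at", "?, "+d.Now()))
	}
}
```

Queries built against this interface port to PostgreSQL without touching call sites, which is exactly why scanning archived_at as sql.NullTime (rather than parsing an engine-specific timestamp string) fits the same strategy.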
- Report: copy report.SampleGroups instead of aliasing report.Groups to prevent silent mutation via future appends. Add SkippedDecompressionErrors and log a warning per failure in scanNormalizedHashGroups; surface the count in FormatReport. Count empty normalized Message-IDs as failed in BackfillRFC822IDs so updated+failed matches the number of candidates processed.
- Manifest IDs: key remote manifest grouping by (account, source_type) so an account spanning multiple source types gets a per-type manifest with the correct SourceType label. Disambiguate manifest IDs only when an account contributes duplicates from more than one source type (preserves existing single-type IDs). On filename truncation, append a 4-byte hash suffix in SanitizeFilenameComponent so distinct accounts with identical 40-char prefixes produce unique manifest IDs.
- Undo: continue through all pending manifests in Engine.Undo, joining cancellation errors with errors.Join, and document the best-effort semantics in godoc.
- Methodology doc: note in FormatMethodology that content-hash is byte-sensitive below the header boundary (CRLF vs LF body differences will not match) and that merge only backfills raw MIME; point users to repair-encoding / full cache rebuild for missing parsed bodies.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Replace the dedup-backup file copy (main + -wal + -shm) with VACUUM INTO, giving an atomic point-in-time snapshot.
- Accept --undo multiple times (StringArray) and, in per-source mode, print a consolidated footer listing all batch IDs for a single undo command.
- On --undo error, print the restored count and any in-progress manifests before returning the wrapped error so best-effort partial success is visible. Surface cancelErrs from Engine.Undo to stderr instead of hiding them inside the wrapped error.
- Reword the still-running warning to say in-progress manifests "cannot be cancelled" and factor the print block into printStillRunningWarning.
- Append a random run-XXXXXXXX suffix to single-run batch IDs so they can never be a prefix of per-source batch IDs generated in the same second.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
InitSchema returned early from the FTS5 branch when the sqlite3 build lacks the fts5 module, which meant EnsureDefaultCollection never ran. Without the collections tables, GetOrCreateSource's "add to All" insert silently failed and the "All" collection was absent from listings — causing TestCollections_CRUD to fail with list=1, want 2. Fall through instead so collection setup still runs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Implements the identity, collection, and deduplication system from #278.
New commands
- list-identities — auto-discover sent-from addresses across accounts using three signals (From header, OAuth, config); prints likely identities for configuring the [identity].addresses list.
- collections — manage named source groups (create, list, show, add, remove, delete); a default All collection is seeded automatically and tracks every source.
- deduplicate — find and merge duplicates across sources for the same account.

deduplicate behavior

- Groups by Message-ID; optional --content-hash also groups by normalized raw MIME.
- Picks a survivor (--prefer <source-types> and identity-based sent-copy preference), unions labels, soft-deletes the pruned copies.
- Dry run by default (--dry-run). --apply required to write.
- --undo <batch-id> restores hidden rows. --undo is repeatable to undo multiple batches.
- Backup via VACUUM INTO (skip with --no-backup).
- --delete-dups-from-source-server additionally stages pruned copies for remote deletion (destructive, opt-in; remote sources only).

Query layer

- SourceIDs filter propagated through DuckDB and SQLite query paths, plus soft-delete exclusion everywhere.

Closes #278.