1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -7,6 +7,7 @@ For upstream changes, see [UPSTREAM-README.md](UPSTREAM-README.md).

### Added

- Retroactive secret sanitizer for already-indexed data (SESF-42). New `cleanup.py sanitize` CLI subcommand and `sanitize_index` MCP tool remove secrets already persisted in the Milvus `document` field, the FTS5 `content` column, and the embedding vector. Default `--dry-run` reports per-rule counts + affected turns and writes a value-free `0600` audit JSONL (`~/.sessionflow/audit/`) without touching any store; `--apply --yes` redacts in place and re-embeds the redacted text (throttled through the embedding budget, checkpointed/resumable); `--apply --yes --drop` deletes affected turns instead. New primitives: `secret_redaction.scan_spans` (per-occurrence, value-free audit spans, reusing the SESF-41 detector), `rag_engine.upsert_document` (Milvus upsert-by-PK + FTS metadata-preserving rewrite) and `delete_by_doc_id`. Both surfaces refuse to apply without explicit confirmation and never emit a secret value. **Redaction is irreversible and is not a substitute for rotation — rotate any key that was ever indexed.**
- Issue-ID extraction at ingestion: every turn is scanned for issue references (`[A-Z][A-Z0-9]+-\d+`, with a technical-standard prefix denylist) and tagged into a new `issue_ids` Milvus `VARCHAR(4096)` field + FTS5 metadata column (SESF-25)
- Optional `issue_id` filter on the `search_all_sessions` and `search_session` MCP tools — structured exact-token pre-filter, combinable with `provider` / `project_root` / `date_from` / `date_to` (SESF-25)
- `get_issue_timeline` MCP tool and `GET /timeline` HTTP endpoint — cross-harness chronological feed of all turns referencing an issue, merged from the structured field + FTS fallback, deduped by `doc_id`, sorted oldest-first, with `limit` / `provider` / `date_from` / `date_to` (SESF-25, SESF-26)
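
The issue-ID extraction entry above can be illustrated with a minimal sketch. The regex is the one quoted in the changelog; the `\b` word boundaries, the function name `extract_issue_ids`, and the denylist contents are assumptions made for illustration and are not taken from the project's `_extract_issue_ids`.

```python
import re

# Pattern quoted in the changelog entry; the \b anchors are an assumption.
ISSUE_ID_RE = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")

# Hypothetical denylist: prefixes that look like issue keys but name standards.
STANDARD_PREFIXES = frozenset({"UTF", "ISO", "RFC", "SHA"})


def extract_issue_ids(text: str) -> list[str]:
    """Return unique issue IDs in first-seen order, skipping denylisted prefixes."""
    seen: dict[str, None] = {}
    for match in ISSUE_ID_RE.finditer(text):
        issue_id = match.group(0)
        prefix = issue_id.rsplit("-", 1)[0]
        if prefix not in STANDARD_PREFIXES:
            seen.setdefault(issue_id, None)  # dict keeps insertion order
    return list(seen)
```

Deduplicating at extraction time keeps the tagged field short, which matters if the IDs are later joined into a bounded `VARCHAR` column.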
1 change: 1 addition & 0 deletions CLAUDE.md
@@ -37,6 +37,7 @@ Semantic search over Claude Code session transcripts. Independent project, origi
- **FTS5 thread affinity (SESF-13)** — `FTSIndex` keeps per-thread persistent connections (`threading.local`). Server-mode connections opened on the embed executor and request threads are isolated, and cross-thread `close_all()` is a no-op rather than a noisy WARN.
- **OpenCode timestamps (SESF-14)** — `provider_adapters.normalize_timestamp()` coerces int-ms epochs (and any other numeric/datetime input) to ISO-8601 strings before they hit Milvus's `VARCHAR(64)` timestamp field. All four provider adapters route timestamps through it.
- **Secret redaction guard (SESF-41)** — ingestion-time redaction hooked once in `add_turns` (covers the embedding, Milvus `document`, and FTS `content` sinks plus `add_turns_async`, and runs before `_extract_issue_ids`). Pure engine in `secret_redaction.py`: `redact(text, *, mode, allowlist) -> (redacted_text, hits)`. Config: `SESSIONFLOW_REDACT` (on/off, default **on**); `SESSIONFLOW_REDACT_MODE` (`enforce`|`report`, default **report** when unset — detect + count, store raw so operators can size the false-positive rate before enforcing); `SESSIONFLOW_REDACT_ALLOWLIST` (path to an operator regex allowlist, one pattern per line). Per-rule detection counts surface via `get_stats` under the `redaction` key. Tier-3 entropy is a length-gated Shannon scanner (detect-secrets' entropy plugins are unused — `scan_line` ignores their limit); GitHub/GitLab use custom regexes (detect-secrets reports a truncated prefix).
- **Retroactive sanitizer (SESF-42)** — removes secrets *already* indexed (the cleanup half of SESF-35; SESF-41 is the forward guard). Orchestrator in `sanitize.py` (`scan` dry-run / `apply` redact|drop), driven by `cleanup.py sanitize` (CLI) and the `sanitize_index` MCP tool. **Dry-run is the default and writes nothing to any store** (it still emits the value-free audit JSONL); `--apply`/`apply` **refuse without `--yes`/`confirm`**. Reuses the SESF-41 detector via `secret_redaction.scan_spans` (per-occurrence, value-free audit spans — snippets mask every detected span in-window, including advisory Tier-3 whose forcing keyword sits outside the ±24 window). New Milvus primitives `rag_engine.upsert_document` (upsert-by-PK + **metadata-preserving** FTS delete-then-insert; `new_vector=None` keeps the stored vector on FTS-only re-converge) and `delete_by_doc_id` (`DeleteResult(deleted, fts_ok)`). **FTS rewrite/delete failure keeps a turn unfinished/retryable** — FTS healing only hydrates *missing* doc_ids, so a stale row left in place would never be re-redacted. Per-run audit JSONL at `~/.sessionflow/audit/redaction-<runid>.jsonl` + checkpoint `~/.sessionflow/sanitize_state.json` (both `0700` dir / `0600` file, value-free). Sanitization is irreversible — **redaction ≠ rotation; rotate any indexed key**.
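
The value-free snippet behaviour described for `scan_spans` (every detected span inside the ±24 window is masked, not only the one being reported) might look roughly like the sketch below. The function name `masked_snippet`, the `MASK` token, and the `(start, end)` span representation are hypothetical; the real `secret_redaction` internals are not shown in this change.

```python
MASK = "[REDACTED]"  # hypothetical placeholder token
WINDOW = 24          # context characters kept on each side of the focus span


def masked_snippet(text: str, spans: list[tuple[int, int]], focus: tuple[int, int]) -> str:
    """Return the +/-WINDOW context around `focus` with every overlapping span masked."""
    lo = max(0, focus[0] - WINDOW)
    hi = min(len(text), focus[1] + WINDOW)
    parts: list[str] = []
    cursor = lo
    for start, end in sorted(spans):
        if end <= lo or start >= hi:
            continue  # span lies entirely outside the window
        # Clip to the window and to text that has already been emitted.
        start, end = max(start, cursor), min(end, hi)
        if start >= end:
            continue  # fully covered by a previously masked span
        parts.append(text[cursor:start])
        parts.append(MASK)
        cursor = end
    parts.append(text[cursor:hi])
    return "".join(parts)
```

Clipping each span to the window before masking means a secret that straddles the window edge is still replaced, so a partial value cannot leak into the audit snippet.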

## Code Style

39 changes: 39 additions & 0 deletions README.md
@@ -165,6 +165,45 @@ python cleanup.py backfill enqueue --provider antigravity_cli --mode recent
Pause state and queued jobs persist on disk, so a restart (or LaunchAgent
re-launch) resumes the same plan.

## Retroactive secret sanitizer

If a secret was indexed before the ingestion-time redaction guard caught it,
`cleanup.py sanitize` finds and removes it from the Milvus `document` field, the
FTS5 `content` column, and the embedding vector derived from them. Detection
reuses the same engine as the ingestion guard, so what the sanitizer flags is
exactly what live ingestion would now redact.

**Dry-run is the default** — it reports per-rule counts, the affected-turn count,
and an audit path, and writes nothing to the index:

```bash
python cleanup.py sanitize # dry-run over the whole index
python cleanup.py sanitize --provider claude_code_cli # scope by provider
python cleanup.py sanitize --project /path/to/repo --since 2026-05-01
```

`--apply` rewrites the affected turns (redact the text, re-embed, overwrite the
row). With `--drop` it deletes the affected turns instead. Both require an
explicit `--yes` — there is **no interactive prompt**. `--apply` without `--yes`
refuses before any read or write and exits non-zero:

```bash
python cleanup.py sanitize --apply --yes # redact + re-embed in place
python cleanup.py sanitize --apply --yes --drop # delete affected turns
```

Scope flags (`--project`, `--provider`, `--session`, `--since`) apply to both
dry-run and apply. `--drop` is only valid with `--apply`.

Every run writes a **value-free** JSONL audit trail under `~/.sessionflow/audit/`
(0600) — rule names, tiers, integer offsets, and pre-masked snippets only, never
a raw secret value. Output to stdout is likewise counts-only.
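
As a rough illustration of the audit trail described above, the sketch below appends one value-free record to a `0600` JSONL file inside a `0700` directory. The helper name `append_audit_record` and the record field names are assumptions for illustration, not the sanitizer's actual writer, and the permission calls assume a POSIX system.

```python
import json
import os
from pathlib import Path


def append_audit_record(audit_dir: Path, run_id: str, record: dict) -> Path:
    """Append one value-free record to a 0600 JSONL file inside a 0700 directory."""
    audit_dir.mkdir(mode=0o700, parents=True, exist_ok=True)
    os.chmod(audit_dir, 0o700)  # mkdir's mode is ignored when the directory already exists
    path = audit_dir / f"redaction-{run_id}.jsonl"
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    with os.fdopen(fd, "a", encoding="utf-8") as handle:
        os.fchmod(fd, 0o600)  # tighten a pre-existing file as well
        handle.write(json.dumps(record, sort_keys=True) + "\n")
    return path
```

A record would carry only metadata, for example `{"rule": "aws_key", "tier": 1, "start": 6, "end": 16, "snippet": "token=[REDACTED]"}`, so the file stays safe to share with whoever performs the rotation.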

> **Redaction is not safety — rotate the key.** Removing a secret from the index
> does not un-expose it. Once a credential has been written anywhere, treat it as
> compromised and rotate it at the source; the sanitizer warns about this on every
> apply but cannot perform the rotation for you.

## Hosted embeddings — deferred

Hosted/OpenAI embeddings are **deferred and not implemented in SESF-6**.
117 changes: 117 additions & 0 deletions cleanup.py
@@ -381,6 +381,90 @@ def get_manager() -> BackfillManager:
    return 0


def cmd_sanitize(args) -> int:
    """Retroactively scan or remove secrets already persisted in the index.

    Thin CLI adapter over :mod:`sanitize` (SESF-42 Component 4). Default posture
    is **dry-run**: report per-rule counts, the affected-turn count, and the audit
    path, writing nothing to the index. ``--apply`` rewrites (redact + re-embed)
    or, with ``--drop``, deletes the affected turns — but only after an explicit
    ``--yes``.

    The confirmation gate is **refuse-before-writes**: ``--apply`` without ``--yes``
    prints that confirmation is required, makes no read or write (``sanitize.apply``
    is never called), and returns a non-zero exit code. This intentionally avoids
    the interactive ``[y/N]`` prompt that ``reset``/``migrate-schema`` use, matching
    the MCP ``apply && !confirm`` behavior (Requirement 2.4).

    Args:
        args: Parsed argparse namespace with ``apply``, ``drop``, ``yes``, and the
            scope flags (``project``, ``provider``, ``session``, ``since``).

    Returns:
        A process exit code: ``0`` on success, ``1`` when the confirmation gate
        refuses an ``--apply`` run, and ``2`` when ``--drop`` is passed without
        ``--apply``.
    """
    import sanitize

    # Usage error: --drop only modifies an --apply run.
    if args.drop and not args.apply:
        print("--drop is only valid with --apply.", file=sys.stderr)
        return 2

    # Confirmation gate: refuse before touching any store.
    if args.apply and not args.yes:
        print(
            "Refusing to apply: confirmation required. Re-run with --yes to "
            "rewrite or drop the affected turns.",
            file=sys.stderr,
        )
        return 1

    scope = sanitize.Scope(
        project_root=getattr(args, "project", None),
        provider=getattr(args, "provider", None),
        session_id=getattr(args, "session", None),
        since=getattr(args, "since", None),
    )

    if not args.apply:
        report = sanitize.scan(scope)
        _print_sanitize_report(report)
        return 0

    report = sanitize.apply(scope, drop=args.drop, confirmed=True)
    _print_sanitize_report(report)
    print(
        "\nWARNING: redaction is not rotation. The detected secret was already "
        "exposed — rotate the affected key/credential at its source.",
    )
    return 0


def _print_sanitize_report(report) -> None:
    """Print a value-free summary of a sanitize ``report`` (counts only, no values).

    Emits the mode, per-rule detection counts, affected/processed/incomplete-FTS
    tallies, and the audit-file path. It deliberately reads only the report's
    aggregate fields, never any document text or raw secret value.
    """
    print(f"Mode: {report.mode}")
    print(f"Affected turns: {report.affected_count}")
    if report.mode != "dry-run":
        print(f"Processed: {report.processed_count}")
        if report.incomplete_fts:
            print(f"Incomplete FTS: {report.incomplete_fts} (re-run to converge)")
        print(f"Status: {report.status}")

    counts = report.counts or {}
    if counts:
        print("\nDetected secrets by rule:")
        for rule, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
            print(f"  {rule}: {count}")
    else:
        print("\nNo secrets found in scope.")

    if report.audit_path:
        print(f"\nAudit log: {report.audit_path}")


def build_parser():
    """Build the argparse parser for the cleanup CLI subcommands."""
    parser = argparse.ArgumentParser(
@@ -418,6 +502,38 @@ def build_parser():
    )
    p_migrate.add_argument("--yes", "-y", action="store_true", help="Skip confirmation")

    # sanitize
    p_sanitize = subparsers.add_parser(
        "sanitize",
        help="Retroactively scan/remove secrets already indexed (dry-run by default)",
    )
    sanitize_mode = p_sanitize.add_mutually_exclusive_group()
    sanitize_mode.add_argument(
        "--dry-run",
        action="store_true",
        help="Report findings without writing (default)",
    )
    sanitize_mode.add_argument(
        "--apply",
        action="store_true",
        help="Rewrite (redact + re-embed) or drop affected turns; requires --yes",
    )
    p_sanitize.add_argument(
        "--drop",
        action="store_true",
        help="With --apply, delete affected turns instead of redacting them",
    )
    p_sanitize.add_argument(
        "--yes",
        "-y",
        action="store_true",
        help="Confirm an --apply run (required; no interactive prompt)",
    )
    p_sanitize.add_argument("--project", help="Restrict to a project root")
    p_sanitize.add_argument("--provider", help="Restrict to a provider")
    p_sanitize.add_argument("--session", help="Restrict to a session id")
    p_sanitize.add_argument("--since", help="Restrict to turns at/after an ISO date")

    # status
    p_status = subparsers.add_parser("status", help="Show provider/backfill/embedding status")
    p_status.add_argument("--project", help="Filter to a specific project root")
@@ -462,6 +578,7 @@ def main():
        "reset": cmd_reset,
        "stats": cmd_stats,
        "migrate-schema": cmd_migrate_schema,
        "sanitize": cmd_sanitize,
        "status": cmd_status,
        "backfill": cmd_backfill,
    }