feat: import-pst — import Microsoft Outlook PST archives #284

Open
YourEconProf wants to merge 5 commits into wesm:main from YourEconProf:import-pst

@YourEconProf
Contributor

Adds import-pst, a new CLI command to import Microsoft Outlook PST files into msgvault. Complements the existing MBOX, EMLX, and IMAP importers.

What's new

  • msgvault import-pst <identifier> <file.pst> — imports all email messages from a PST archive; calendar items, contacts, tasks, and notes are skipped automatically
  • PST folder structure is preserved as labels (e.g. Inbox, Sent Items)
  • Resumable: interrupt with Ctrl+C and rerun to continue from the last checkpoint; --no-resume to start fresh
  • --skip-folder flag to exclude folders (e.g. --skip-folder "Deleted Items")
  • --no-attachments flag to skip attachment import
  • Content-hash deduplication and cross-folder label merging consistent with other importers
  • MIME reconstruction from PST: uses TransportMessageHeaders verbatim when present (~80% of messages); synthesizes RFC 5322 headers from MAPI properties for drafts and Exchange-native sends

Security fixes included

  • CRLF injection prevention on synthesized headers (addresses MAPI properties written directly into RFC 5322 headers)
  • Path traversal sanitization on attachment filenames
  • ContentID sanitization to prevent MIME structure breakage
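
As a rough illustration of the first two fixes, the sketch below strips CR/LF from a header value and reduces an attachment name to a bare file name. The helper names `sanitizeHeaderValue` and `sanitizeFilename` and the `"attachment"` fallback are hypothetical, chosen for this example; the actual code in `mime.go` may work differently.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// crlfReplacer turns any line break into a single space. "\r\n" is listed
// first so a CRLF pair collapses to one space rather than two.
var crlfReplacer = strings.NewReplacer("\r\n", " ", "\r", " ", "\n", " ")

// sanitizeHeaderValue (hypothetical name) keeps an untrusted MAPI string on
// one line so it cannot start a new RFC 5322 header.
func sanitizeHeaderValue(v string) string {
	return crlfReplacer.Replace(v)
}

// sanitizeFilename (hypothetical name) drops every directory component from
// an untrusted attachment name, treating backslashes as separators too.
func sanitizeFilename(name string) string {
	name = strings.ReplaceAll(name, "\\", "/")
	name = path.Base(name)
	if name == "." || name == ".." || name == "/" {
		return "attachment" // assumed fallback for empty or dot-only names
	}
	return name
}

func main() {
	fmt.Printf("%q\n", sanitizeHeaderValue("Hello\r\nBcc: evil@example.com"))
	fmt.Println(sanitizeFilename("..\\..\\etc\\passwd"))
}
```

Replacing line breaks with a space rather than deleting them keeps adjacent words from fusing; either choice prevents the injected text from being parsed as a separate header.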

Dependencies

Adds github.com/mooijtech/go-pst/v6 (Apache 2.0, pure Go).

Usage

msgvault import-pst you@company.com /path/to/archive.pst
msgvault import-pst you@outlook.com backup.pst --skip-folder "Deleted Items"
msgvault import-pst you@outlook.com backup.pst --no-resume

YourEconProf and others added 4 commits April 21, 2026 14:11
Adds the ability to import Microsoft Outlook PST archives into msgvault,
complementing the existing MBOX, EMLX, and IMAP sources.

New files:
- internal/pst/reader.go: thin wrapper around mooijtech/go-pst v6 with
  folder traversal, message extraction, attachment reading, FILETIME→time.Time
  conversion, and Exchange DN resolution
- internal/pst/mime.go: reconstructs RFC 5322 MIME from PST messages —
  uses TransportMessageHeaders verbatim when present (~80% of messages),
  falls back to synthesizing headers from MAPI properties for drafts and
  Exchange-native sends
- internal/importer/pst_import.go: import orchestration following the MBOX
  importer pattern — batching (200 msg / 32 MiB), checkpoint/resume,
  content-hash dedup, cross-folder label merging
- cmd/msgvault/cmd/import_pst.go: CLI command with --skip-folder,
  --no-resume, --no-attachments flags and graceful Ctrl+C handling

Usage:
  msgvault import-pst you@company.com /path/to/archive.pst
  msgvault import-pst you@outlook.com backup.pst --skip-folder "Deleted Items"

Dependency: github.com/mooijtech/go-pst/v6 (Apache 2.0, pure Go)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copies support.pst (1 MB, 17 emails, real transport headers, attachments)
and 32-bit.pst (64 KB, format coverage) from the mooijtech/go-pst module
cache into internal/pst/testdata/ and adds two test suites:

internal/pst/reader_test.go (10 tests):
- Open/close for both PST variants
- WalkFolders: folder discovery, slash-separated path building
- ExtractMessage: known message properties (subject, sender, timestamps),
  non-email item filtering
- ReadAttachments: 2-attachment message content verification
- BuildRFC5322: MIME round-trip with and without attachments

internal/importer/pst_integration_test.go (7 tests):
- Full import of support.pst (17 messages, correct counts)
- Idempotent re-import (all 17 skipped on second run)
- Cross-folder accounting (added+skipped+updated == processed)
- --skip-folder filtering
- Context cancellation safety
- 32-bit PST handled gracefully

Also fixes two bugs uncovered by the tests:
- pst_import.go: use entry ID as sourceMsgID instead of content hash to
  ensure re-import idempotency (multipart boundaries are random per-build)
- reader.go: swallow GetSubFolders errors so 32-bit PSTs with unreadable
  sub-folder metadata don't abort the walk

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

When importing a PST file the source's display_name was never populated,
causing get_stats / list-accounts to show an empty DisplayName for PST
sources. Now the base filename (e.g. archive.pst) is written as the
display_name immediately after GetOrCreateSource, matching the behaviour
of the IMAP importer which sets the IMAP username as display_name.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Security: sanitize header values (CRLF injection), attachment filenames
(path traversal), and ContentID (MIME structure breakage) in mime.go.

Correctness: encode trailing whitespace in QP output (RFC 2045 §6.7);
treat negative FILETIME as zero; pre-check attachment size before
buffering to avoid exceeding memory limit with a single large attachment;
use bytes.Buffer instead of strings.Builder for binary attachment data;
reset checkpointBlocked after each batch so future batches can checkpoint;
validate resume folder path against saved path to detect ordering changes;
remove unused labelIDs param from flushPending; document os.Exit(130)
bypass of deferred cleanup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@YourEconProf YourEconProf requested a review from wesm as a code owner April 21, 2026 18:13
@roborev-ci

roborev-ci Bot commented Apr 21, 2026

roborev: Combined Review (0eaebb3)

High confidence this PR still has one High and four Medium issues to fix before merge.

High

  • internal/importer/pst_import.go:377 and internal/importer/pst_import.go:318
    Checkpoint cursor tracking uses the outer loop variables (currentFolder, currentMsgIdx) instead of the message currently being ingested inside flushPending. If cancellation or a checkpoint interval happens during a flush, the saved checkpoint can point at the last queued message, or the first message of the next folder, causing earlier unflushed messages to be skipped permanently on resume.
    Fix: Store FolderIndex, FolderPath, and MsgIndex on each pending PST message and use those values when calling saveCp() during the flush loop.

Medium

  • internal/importer/pst_import.go:209
    ImportPst saves an initial checkpoint at folder/message 0 even when resuming an active sync. That overwrites the existing checkpoint before resumed work completes, so a crash before the next checkpoint can make the next run restart from the beginning.
    Fix: Only write the initial checkpoint for a new sync, or preserve the loaded resume position when summary.WasResumed is true.

  • internal/importer/pst_import.go:244
    Resume folder validation is skipped when resume.FolderIndex == 0, so a checkpoint inside the first folder can resume against the wrong folder if folder ordering or paths change. The import may skip the first MsgIndex messages from an unrelated folder.
    Fix: Validate whenever resuming with a non-empty FolderPath, including folder index 0.

  • internal/pst/mime.go:79 and internal/pst/mime.go:85
    att.MIMEType comes from PST data and is written into MIME part headers. If it contains CR/LF, it can inject additional MIME headers or corrupt the message structure.
    Fix: Validate or normalize attachment MIME types before use. Reject control characters, parse with mime.ParseMediaType, and fall back to application/octet-stream on invalid input.

  • internal/pst/reader.go:247
    The attachment memory limit can be bypassed when a corrupted or malicious PST reports size 0. att.WriteTo(&buf) may buffer the full payload into memory before the post-read size check runs, risking OOM.
    Fix: Wrap the buffer in a bounded io.Writer that returns an error as soon as the configured byte limit is exceeded.
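
The MIME-type fix suggested above might look roughly like the sketch below. `mime.ParseMediaType` is the standard-library call the review names; the helper name `normalizeMIMEType` and the decision to drop any parameters and keep only the bare `type/subtype` are assumptions of this example.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

const fallbackMIMEType = "application/octet-stream"

// normalizeMIMEType (hypothetical name) returns a safe media type for use in
// a MIME part header, falling back to application/octet-stream when the
// PST-supplied value is malformed.
func normalizeMIMEType(raw string) string {
	// Reject line breaks outright so nothing can be smuggled into a header.
	if strings.ContainsAny(raw, "\r\n") {
		return fallbackMIMEType
	}
	mediaType, _, err := mime.ParseMediaType(raw)
	// ParseMediaType also accepts a bare type with no subtype, so require
	// the slash explicitly.
	if err != nil || !strings.Contains(mediaType, "/") {
		return fallbackMIMEType
	}
	return mediaType
}

func main() {
	fmt.Println(normalizeMIMEType("IMAGE/PNG"))
	fmt.Println(normalizeMIMEType("text/plain\r\nX-Injected: 1"))
	fmt.Println(normalizeMIMEType(""))
}
```
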


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

High: checkpoint cursor now stored per-message (FolderIndex/FolderPath/MsgIndex
on pendingPstMessage) so saveCp inside flushPending records the position of
the message being flushed, not the outer loop cursor.

Medium:
- Initial checkpoint no longer written when resuming, preventing the existing
  cursor from being overwritten with position 0 before any work completes.
- Resume folder validation now triggers whenever FolderPath is non-empty
  (previously skipped at FolderIndex==0, leaving first-folder path changes
  undetected).
- Attachment MIME types from PST data now validated with mime.ParseMediaType;
  invalid values (including those with CR/LF) fall back to application/octet-stream.
- Attachment reads now go through a limitWriter so a PST reporting size 0
  cannot bypass the per-message memory cap and exhaust memory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@roborev-ci

roborev-ci Bot commented Apr 21, 2026

roborev: Combined Review (7d23b2c)

PST import support looks mostly solid, but there is one Medium issue that can silently ingest incomplete messages.

Medium

  • internal/importer/pst_import.go:478: When ReadAttachments fails, the importer logs the error but still imports the message without attachments. Size-limit truncation from ReadAttachments is also surfaced as success with only the attachments read before the limit, so oversized attachment sets can be silently imported as incomplete messages.
    • Fix: Treat attachment read/limit failures as a skipped message or hard error, or preserve a clear marker that attachments were intentionally omitted. Do not ingest reconstructed MIME that silently drops attachments.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)
