feat: import-pst — import Microsoft Outlook PST archives#284
YourEconProf wants to merge 5 commits into wesm:main
Adds the ability to import Microsoft Outlook PST archives into msgvault, complementing the existing MBOX, EMLX, and IMAP sources.

New files:
- internal/pst/reader.go: thin wrapper around mooijtech/go-pst v6 with folder traversal, message extraction, attachment reading, FILETIME→time.Time conversion, and Exchange DN resolution
- internal/pst/mime.go: reconstructs RFC 5322 MIME from PST messages; uses TransportMessageHeaders verbatim when present (~80% of messages), falls back to synthesizing headers from MAPI properties for drafts and Exchange-native sends
- internal/importer/pst_import.go: import orchestration following the MBOX importer pattern, with batching (200 msg / 32 MiB), checkpoint/resume, content-hash dedup, and cross-folder label merging
- cmd/msgvault/cmd/import_pst.go: CLI command with --skip-folder, --no-resume, --no-attachments flags and graceful Ctrl+C handling

Usage:
  msgvault import-pst you@company.com /path/to/archive.pst
  msgvault import-pst you@outlook.com backup.pst --skip-folder "Deleted Items"

Dependency: github.com/mooijtech/go-pst/v6 (Apache 2.0, pure Go)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
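The FILETIME→time.Time conversion mentioned for reader.go can be sketched as follows. This is a minimal illustration, not the code from this PR: the names filetimeToTime and filetimeEpochDiff are hypothetical, and treating a zero tick count as "no timestamp" is an assumption (the later commit only states that negative values are treated as zero).

```go
package main

import "time"

// filetimeEpochDiff is the number of 100-nanosecond ticks between the
// FILETIME epoch (1601-01-01 UTC) and the Unix epoch (1970-01-01 UTC).
// Hypothetical name, used only for this sketch.
const filetimeEpochDiff int64 = 116444736000000000

// filetimeToTime converts a Windows FILETIME tick count to a UTC time.Time.
// Zero and negative inputs yield the zero time.Time (an assumption of this
// sketch for the zero case).
func filetimeToTime(ft int64) time.Time {
	if ft <= 0 {
		return time.Time{}
	}
	ticks := ft - filetimeEpochDiff    // 100 ns ticks relative to the Unix epoch
	sec := ticks / 10_000_000          // whole seconds
	nsec := (ticks % 10_000_000) * 100 // remaining ticks as nanoseconds
	return time.Unix(sec, nsec).UTC()
}
```

Callers can then use IsZero() on the result to decide whether a message carries a usable timestamp.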
Copies support.pst (1 MB, 17 emails, real transport headers, attachments) and 32-bit.pst (64 KB, format coverage) from the mooijtech/go-pst module cache into internal/pst/testdata/ and adds two test suites:

internal/pst/reader_test.go (10 tests):
- Open/close for both PST variants
- WalkFolders: folder discovery, slash-separated path building
- ExtractMessage: known message properties (subject, sender, timestamps), non-email item filtering
- ReadAttachments: 2-attachment message content verification
- BuildRFC5322: MIME round-trip with and without attachments

internal/importer/pst_integration_test.go (7 tests):
- Full import of support.pst (17 messages, correct counts)
- Idempotent re-import (all 17 skipped on second run)
- Cross-folder accounting (added+skipped+updated == processed)
- --skip-folder filtering
- Context cancellation safety
- 32-bit PST handled gracefully

Also fixes two bugs uncovered by the tests:
- pst_import.go: use entry ID as sourceMsgID instead of content hash to ensure re-import idempotency (multipart boundaries are random per-build)
- reader.go: swallow GetSubFolders errors so 32-bit PSTs with unreadable sub-folder metadata don't abort the walk

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
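The idempotency bug is easy to reproduce in isolation. The sketch below assumes the MIME builder relies on Go's mime/multipart with its default random boundaries, which this PR does not spell out; buildMultipart and contentHash are hypothetical helpers. Two builds of the same body hash differently unless the boundary is pinned, so a content hash cannot serve as a stable message ID.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"mime/multipart"
	"net/textproto"
)

// buildMultipart writes a one-part multipart body. With an empty boundary
// argument, multipart.Writer keeps the random boundary it generated.
func buildMultipart(body, boundary string) ([]byte, error) {
	var buf bytes.Buffer
	w := multipart.NewWriter(&buf)
	if boundary != "" {
		// SetBoundary must be called before any part is created.
		if err := w.SetBoundary(boundary); err != nil {
			return nil, err
		}
	}
	hdr := textproto.MIMEHeader{}
	hdr.Set("Content-Type", "text/plain; charset=utf-8")
	part, err := w.CreatePart(hdr)
	if err != nil {
		return nil, err
	}
	if _, err := part.Write([]byte(body)); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// contentHash returns the SHA-256 digest of the rebuilt message bytes.
func contentHash(b []byte) [32]byte {
	return sha256.Sum256(b)
}
```

Keying on a value that comes from the PST itself, such as the entry ID, sidesteps the problem without having to make the MIME output deterministic.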
When importing a PST file, the source's display_name was never populated, causing get_stats / list-accounts to show an empty DisplayName for PST sources.

Now the base filename (e.g. archive.pst) is written as the display_name immediately after GetOrCreateSource, matching the behaviour of the IMAP importer, which sets the IMAP username as display_name.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Security: sanitize header values (CRLF injection), attachment filenames (path traversal), and ContentID (MIME structure breakage) in mime.go.

Correctness:
- encode trailing whitespace in QP output (RFC 2045 §6.7)
- treat negative FILETIME as zero
- pre-check attachment size before buffering to avoid exceeding memory limit with a single large attachment
- use bytes.Buffer instead of strings.Builder for binary attachment data
- reset checkpointBlocked after each batch so future batches can checkpoint
- validate resume folder path against saved path to detect ordering changes
- remove unused labelIDs param from flushPending
- document os.Exit(130) bypass of deferred cleanup

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
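A rough sketch of the two sanitizers named in the security note, under stated assumptions: sanitizeHeaderValue and sanitizeFilename are hypothetical names, replacing CR/LF with spaces and falling back to "attachment" are choices made for this example, and the actual mime.go may differ.

```go
package main

import (
	"path"
	"strings"
)

// sanitizeHeaderValue replaces CR and LF with spaces so a value taken from
// PST data cannot end the header line and inject additional headers.
func sanitizeHeaderValue(v string) string {
	v = strings.NewReplacer("\r", " ", "\n", " ").Replace(v)
	return strings.TrimSpace(v)
}

// sanitizeFilename reduces an attachment name to its final path element,
// treating both "/" and "\" as separators, and drops control characters.
// Names that reduce to nothing useful become "attachment" (an assumption).
func sanitizeFilename(name string) string {
	name = strings.ReplaceAll(name, "\\", "/")
	name = path.Base(name)
	name = strings.Map(func(r rune) rune {
		if r < 0x20 || r == 0x7f {
			return -1 // drop control characters
		}
		return r
	}, name)
	switch name {
	case "", ".", "..", "/":
		return "attachment"
	}
	return name
}
```

Stripping to the base name means a crafted name such as ../../etc/passwd can no longer point outside the directory an attachment is written to.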
High: checkpoint cursor now stored per-message (FolderIndex/FolderPath/MsgIndex on pendingPstMessage) so saveCp inside flushPending records the position of the message being flushed, not the outer loop cursor.

Medium:
- Initial checkpoint no longer written when resuming, preventing the existing cursor from being overwritten with position 0 before any work completes.
- Resume folder validation now triggers whenever FolderPath is non-empty (previously skipped at FolderIndex==0, leaving first-folder path changes undetected).
- Attachment MIME types from PST data now validated with mime.ParseMediaType; invalid values (including those with CR/LF) fall back to application/octet-stream.
- Attachment reads now go through a limitWriter so a PST reporting size 0 cannot bypass the per-message memory cap and exhaust memory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
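The last two items can be illustrated with a small sketch. The limitWriter type name and mime.ParseMediaType come from the commit message above; the struct fields, newLimitWriter, errLimitExceeded, and safeContentType are assumptions made for this example and may not match the real code.

```go
package main

import (
	"bytes"
	"errors"
	"mime"
	"strings"
)

// errLimitExceeded is a hypothetical sentinel error for this sketch.
var errLimitExceeded = errors.New("attachment exceeds remaining memory budget")

// limitWriter buffers writes until the byte budget is used up, then fails.
// The cap is enforced on bytes actually written, so a size of 0 reported by
// the PST cannot bypass it.
type limitWriter struct {
	buf       bytes.Buffer
	remaining int64
}

// newLimitWriter returns a writer that accepts at most limit bytes in total.
func newLimitWriter(limit int64) *limitWriter {
	return &limitWriter{remaining: limit}
}

func (w *limitWriter) Write(p []byte) (int, error) {
	if int64(len(p)) > w.remaining {
		return 0, errLimitExceeded
	}
	w.remaining -= int64(len(p))
	return w.buf.Write(p)
}

// safeContentType returns the parsed media type (parameters dropped), or
// application/octet-stream when the raw value is malformed or contains CR/LF.
func safeContentType(raw string) string {
	const fallback = "application/octet-stream"
	if strings.ContainsAny(raw, "\r\n") {
		return fallback
	}
	mediaType, _, err := mime.ParseMediaType(raw)
	if err != nil || !strings.Contains(mediaType, "/") {
		return fallback
	}
	return mediaType
}
```

Because limitWriter satisfies io.Writer, it can be handed to whatever copies the attachment stream, and the copy stops with an error as soon as the budget is exhausted.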
Adds import-pst, a new CLI command to import Microsoft Outlook PST files into msgvault. Complements the existing MBOX, EMLX, and IMAP importers.

What's new
- msgvault import-pst <identifier> <file.pst>: imports all email messages from a PST archive; calendar items, contacts, tasks, and notes are skipped automatically
- PST folders become labels (e.g. Inbox, Sent Items)
- Checkpoint/resume; pass --no-resume to start fresh
- --skip-folder flag to exclude folders (e.g. --skip-folder "Deleted Items")
- --no-attachments flag to skip attachment import
- Uses TransportMessageHeaders verbatim when present (~80% of messages); synthesizes RFC 5322 headers from MAPI properties for drafts and Exchange-native sends

Security fixes included

Dependencies
Adds github.com/mooijtech/go-pst/v6 (Apache 2.0, pure Go).

Usage
  msgvault import-pst you@company.com /path/to/archive.pst
  msgvault import-pst you@outlook.com backup.pst --skip-folder "Deleted Items"
  msgvault import-pst you@outlook.com backup.pst --no-resume