Skip to content

feat(format): inline recovery records (Reed-Solomon parity skippable frames) for self-healing .zst #358

@polaz

Description

@polaz

Summary

Add inline recovery records to the zstd stream: Reed-Solomon parity carried in skippable frames, so a .zst file can self-heal on-disk corruption (bit-rot, bad sectors) without an external .par2 sidecar. This is the RAR -rr / par2 idea native to zstd.

Standard zstd decoders (incl. C libzstd) skip these frames per RFC 8878 §3.1.2, so the feature is strictly drop-in: a .zst with recovery records decodes byte-identically on any compliant decoder, and ours additionally repairs corruption before decode.

Gated behind a Cargo feature (default off); the default-build cdylib stays strict drop-in for donor libzstd v1.5.7.

Why in structured-zstd (not at the consumer's layer)

For a standalone, unencrypted .zst, the zstd stream is the final on-disk byte layout, so parity over those bytes is exactly "ECC over the final on-disk bytes" at the only layer present. (For layered storage with encryption above zstd, e.g. lsm-tree, recovery must live at that outer layer over the ciphertext: lsm-tree's Page ECC, not this. This feature does not change or duplicate that; the two target different products and do not stack.)

No current internal consumer drives this; it is a product-surface bet ("zstd + recovery") for the broader CLI/library audience and the project's drop-in-plus positioning. It composes cleanly with the introspection and partial-decode primitives already shipped (#173 FrameEmitInfo, #174 block-precise errors, #175 block-subset partial decode).

Design

Carrier: skippable frame, magic 0x184D2A5F

Allocated from the high end of the skippable range (structured-zstd native), per the updated pointwise allocation policy (see "Registry change" below). The recovery skippable-frame payload is versioned from day one (it becomes a wire contract once shipped).

Payload header (sketch):

  • scheme_id / version
  • shard_size (sector-aligned, e.g. 4 KiB, to match the physical failure unit: bad sector / bit-rot is sector-sized)
  • RS parameters (k data shards, m parity shards); redundancy = m / k
  • coverage byte-range over the preceding data frames
  • a strong per-shard checksum (XXH3/XXH64) so the reader knows which shards are damaged (RS erasure decoding needs erasure positions)
  • the parity bytes

Parity scheme

Repair on read

  1. Verify each shard's checksum → damaged shards are erasures.
  2. If erasures ≤ m per stripe → RS-reconstruct the damaged bytes in place → decode normally.
  3. If erasures > m (unrecoverable) → either a typed error, or degrade gracefully via decode_blocks_partial (perf+feat(decoding): block-subset partial decode (range + recovery) + per-block decompressed byte ranges #175) to return the still-good prefix.

Encode

  • After producing the data frames, compute parity over the concatenated bytes at the configured redundancy and emit the recovery skippable frame(s).
  • Configurable redundancy percent.

RS implementation

  • Reuse a no_std + alloc RS crate (reed-solomon-simd / reed-solomon-erasure) with a scalar fallback; do not hand-roll the field arithmetic.

API (feature-gated)

  • Encode: a wrapper / FrameCompressor option with_recovery(redundancy_pct) that post-processes the frame stream to append parity frames.
  • Decode: a RecoveringDecoder wrapper that verifies + repairs, then delegates to FrameDecoder / StreamingDecoder.
  • CLI: --recovery=N% on encode → inline recovery; our -d auto-repairs; C zstd -d ignores the recovery frames and decodes (when uncorrupted).

Registry change (pointwise, dual-end allocation)

Update docs/SKIPPABLE_MAGIC_ALLOCATIONS.md to drop the "consumer owns a contiguous range" model in favour of pointwise allocation of concrete frame types, growing from both ends:

This frees the magic for RecoveryFrame and removes lsm-tree's blanket 16-variant reservation (it actively uses 2). lsm-tree needs a heads-up that its reservation narrows from 2A50..2A5F to the two variants it uses; nothing breaks (the released/unused variants were never emitted), but the registry is a wire-contract so the change is coordinated.

Acceptance criteria

  • Encode emits recovery skippable frame(s) at a configurable redundancy; a standard zstd decoder (C libzstd) decodes the uncorrupted output byte-identically (drop-in proof).
  • Fault injection: corrupt up to m shards per stripe → reader repairs and decodes to the original bytes; assert repair fired (metric/flag).
  • Corruption beyond budget (> m shards) → typed error, never silent wrong data; optionally returns the good prefix via perf+feat(decoding): block-subset partial decode (range + recovery) + per-block decompressed byte ranges #175.
  • Recovery-frame payload is versioned; an unknown version is rejected cleanly.
  • Happy path (no corruption) adds negligible decode latency; encode overhead matches the configured redundancy percent.
  • Feature-gated, default off; default cdylib build byte-identical and strict drop-in for donor v1.5.7.
  • docs/SKIPPABLE_MAGIC_ALLOCATIONS.md updated to the pointwise dual-end policy with 2A5F RecoveryFrame allocated and 2A52 released.
  • CLI --recovery=N% end-to-end (encode adds records, decode auto-repairs).

Out of scope

  • Interleaved (per-N-frame) recovery placement — follow-up after trailing placement lands.
  • Auto-heal rewrite (repair-on-read → rewrite the file) — separate concern.
  • Any ECC for encrypted/layered storage — that lives at the consumer's outer layer (lsm-tree Page ECC), not here.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2-mediumMedium priority — important improvementenhancementNew feature or requestperformancePerformance optimization

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions