
feat: handle cross-batch schema evolution in ArrowToParquetWriter (#3…#3896

Open
AyushPatel101 wants to merge 2 commits into dlt-hub:devel from AyushPatel101:fix/3895-cross-batch-schema-promotion

Conversation

@AyushPatel101
Contributor

Description

arrow_concat_promote_options currently handles type mismatches only within a single flush batch (via pa.concat_tables). However, pyarrow.ParquetWriter fixes its schema on the first write_table() call, so a mismatch that spans flush batches fails with ArrowInvalid, even for a safe promotion such as float32 -> float64.

This makes correctness depend on data volume: a pipeline that works with 2000 rows per batch crashes with 6000 rows when batches land in separate flushes.

This PR extends ArrowToParquetWriter.write_data() to reconcile schemas across flush batches using pa.unify_schemas() with the same promote_options value already used for within-batch concat:

  • Incoming narrower than writer (e.g. float32 into float64 writer): cast up to match. Lossless, same file.
  • Incoming wider than writer (e.g. float64 into float32 writer): rotate to a new parquet file. Destinations already handle multiple files per table.

Behavior with promote_options="none" (the default) is unchanged.

Related Issues

Ayush Patel added 2 commits April 27, 2026 11:52
…t-hub#3895)

ParquetWriter locks schema on first write_table() call, rejecting
subsequent batches with different types even when arrow_concat_promote_options
is set to handle them. This extends type promotion to work across flush
batch boundaries by casting narrower types up or rotating to a new file
for wider types.
@rudolfix self-assigned this Apr 27, 2026

Development

Successfully merging this pull request may close these issues.

ParquetWriter rejects cross-batch type mismatches that arrow_concat_promote_options should handle

2 participants