[GH-2886] Recognize Box2D columns as GeoParquet bbox covering columns #2921
Conversation
Pull request overview
Note: Copilot was unable to run its full agentic suite in this review.
Adds support for treating Box2D-typed columns as valid GeoParquet 1.1 bbox covering columns by recognizing the UDT in the metadata path, with accompanying tests for both explicit covering configuration and `<geom>_bbox` auto-detection.
Changes:
- Recognize `Box2DUDT` as a valid covering source by mapping it to its underlying struct-shaped `sqlType`.
- Add tests verifying covering metadata is written correctly for explicit `geoparquet.covering.<geom>` and auto-detected `<geom>_bbox` Box2D columns.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/geoparquet/GeoParquetMetaData.scala | Adds Box2DUDT handling so GeoParquet covering metadata can be derived from Box2D columns. |
| spark/common/src/test/scala/org/apache/sedona/sql/geoparquetIOTests.scala | Adds tests covering explicit and auto-detected Box2D bbox covering metadata output. |
```scala
schema(coveringColumnIndex).dataType match {
  case coveringColumnType: StructType =>
    coveringColumnTypeToCovering(coveringColumnName, coveringColumnType)
  case _: Box2DUDT =>
    // Box2DUDT exposes a struct<xmin, ymin, xmax, ymax: double> sqlType, which is the exact
    // shape required by GeoParquet 1.1 bbox covering columns. Treat the underlying struct as
    // the covering struct so users can write a Box2D column and have it referenced as a
    // covering column in GeoParquet metadata without any manual struct construction.
    coveringColumnTypeToCovering(
      coveringColumnName,
      new Box2DUDT().sqlType.asInstanceOf[StructType])
  case _ =>
```
Done in 776c5b9 — bound the case to `udt: Box2DUDT` and use `udt.sqlType` with a match on `StructType` (no allocation, no `asInstanceOf`, clear failure mode if the `sqlType` shape ever changes).

On the second suggestion (generalize to any `UserDefinedType` whose `sqlType` is a `StructType`): I would push back on that. Other UDTs may have struct-shaped `sqlType`s that are not valid bbox covering columns — e.g., a future `ImageUDT` or `RasterMetadataUDT` could have a struct of metadata fields, and the call to `validateField("xmin")` inside `coveringColumnTypeToCovering` would throw a confusing error instead of the clear "not a struct type" one. Keeping the match specific to `Box2DUDT` makes the contract explicit; we can add other UDTs case-by-case as they appear.
```scala
val geoParquetSavePath = geoparquetoutputlocation + "/gp_with_box2d_covering.parquet"
df.write
  .format("geoparquet")
  .option("geoparquet.covering.geometry", "test_cov")
  .mode("overwrite")
  .save(geoParquetSavePath)
```
The hard-coded path under `geoparquetoutputlocation` is the existing convention for the GeoParquet covering tests in this file (line 992 onward, `gp_with_covering_metadata.parquet` etc.). Using a different pattern just for the new tests would be inconsistent. If we want to move the suite to per-test unique dirs, that should be a focused refactor across all of these tests rather than a one-off here.
```scala
val geoParquetSavePath = geoparquetoutputlocation + "/gp_box2d_auto_covering.parquet"
df.write.format("geoparquet").mode("overwrite").save(geoParquetSavePath)
```
Same as the previous comment — sticking with the existing hard-coded geoparquetoutputlocation convention used by the other covering tests in this file. Refactoring to per-test unique dirs would be a separate suite-wide change.
Avoids the per-call allocation of `new Box2DUDT()` and the `asInstanceOf[StructType]` cast. Falls back to a clear `IllegalStateException` if `Box2DUDT.sqlType` ever changes shape.
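The revised bound-pattern shape can be sketched as a small, self-contained example. `DataType`, `StructType`, and `Box2DUDT` below are hypothetical stand-ins for the Spark SQL / Sedona classes (written in plain Java so the snippet runs on its own), not the real types:

```java
import java.util.List;

// Hypothetical stand-ins for the Spark SQL / Sedona types, for illustration only.
interface DataType {}

record StructType(List<String> fieldNames) implements DataType {}

class Box2DUDT implements DataType {
    // Mirrors Box2DUDT.sqlType: struct<xmin, ymin, xmax, ymax: double>
    DataType sqlType() {
        return new StructType(List.of("xmin", "ymin", "xmax", "ymax"));
    }
}

public class CoveringMatch {
    // Bound-pattern shape of the fix: reuse the matched UDT instance's sqlType
    // (no fresh allocation, no blind cast) and fail loudly if its shape changes.
    static StructType coveringStruct(DataType dt) {
        if (dt instanceof StructType st) {
            return st;
        } else if (dt instanceof Box2DUDT udt) {
            if (udt.sqlType() instanceof StructType st) {
                return st;
            }
            throw new IllegalStateException("Box2DUDT.sqlType is not a StructType");
        }
        throw new IllegalArgumentException("not a valid covering source: " + dt);
    }

    public static void main(String[] args) {
        System.out.println(coveringStruct(new Box2DUDT()).fieldNames());
        // prints [xmin, ymin, xmax, ymax]
    }
}
```

Keeping the second `instanceof` specific to `Box2DUDT` (rather than any UDT with a struct-shaped `sqlType`) is what preserves the clear "not a valid covering source" error for unrelated types.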
Did you read the Contributor Guide?
Is this PR related to a ticket?
Yes. Closes #2886 ("Emit Box2D columns as GeoParquet 1.1 bbox covering columns").

What changes were proposed in this PR?
When a user has a `Box2D`-typed column in the schema being written, both the explicit covering option (`geoparquet.covering[.geom]=<col>`) and the auto-detect `<geom>_bbox` path now register it in GeoParquet metadata as the bbox covering column for the associated geometry. `Box2DUDT.sqlType` is `struct<xmin, ymin, xmax, ymax: double>` — exactly the shape required by the GeoParquet 1.1 covering spec — so no data-path change is needed: the existing `UserDefinedType` → struct fallback in `GeoParquetWriteSupport.makeWriter` already serializes correctly. Only the metadata path needed to learn that a UDT-wrapped struct is a valid covering source.

The change is one new case in `GeoParquetMetaData.createCoveringColumnMetadata` that recognizes `Box2DUDT` and dispatches to the existing struct-shaped covering builder.

Why Float32 + conservative rounding is deferred
The original issue scoped Float32 + `Math.nextUp`/`Math.nextDown` outward rounding for size and bit-compatibility with apache/sedona-db's `next_after` writer. Shipping that without a paired reader-side change would create a write/read asymmetry: a written `Box2D` column comes back as a generic `struct<float>` (not a `Box2D`) because the on-disk Float32 schema no longer matches `Box2DUDT.sqlType` (which is Float64), so PySpark's UDT resolver doesn't claim it.

That asymmetry would regress the user-visible type round-trip we just shipped across #2890–#2906. Pairing Float32 with reader auto-materialization is its own follow-up; this PR delivers the metadata side cleanly without committing the read-path regression.
How was this patch tested?
Two new tests in `geoparquetIOTests`:

- Explicit covering path: writes a Box2D column configured via `geoparquet.covering.geometry`; verifies the GeoParquet metadata has matching `covering.bbox.{xmin,ymin,xmax,ymax}` paths.
- "`<geom>_bbox` column" auto-detect path: a Box2D-typed `geometry_bbox` column gets registered without explicit configuration.

Did this PR include necessary documentation updates?