[IcebergIO] Support hash distribution mode when writing rows #38061
ahmedabu98 merged 17 commits into apache:master
Conversation
Summary of Changes (Gemini Code Assist): This pull request enhances the IcebergIO sink by adding an optional feature to group rows by partition before writing them to the destination. This change is designed to optimize performance and reduce the creation of small files in partitioned Iceberg tables. The implementation introduces new transforms and utility classes to handle the grouping and writing logic, while also updating the existing API and test suites to support this new configuration.
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment.
Assigning reviewers: R: @claudevdm for label python.
Note: If you would like to opt out of this review, comment. Available commands:
The PR bot will only process comments in the main thread (not review comments).
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@             Coverage Diff              @@
##           master    #38061       +/-   ##
=============================================
+ Coverage     54.61%   58.51%    +3.89%
- Complexity     1689    15428   +13739
=============================================
  Files          1067     2851    +1784
  Lines        168152   280076  +111924
  Branches       1226    12332   +11106
=============================================
+ Hits          91835   163873   +72038
- Misses        74118   109777   +35659
- Partials       2199     6426    +4227
```

Flags with carried forward coverage won't be shown.
Reminder, please take a look at this PR: @claudevdm @chamikaramj
/gemini review
Code Review
This pull request introduces a new distribution mode for Iceberg writes, allowing rows to be shuffled by partition key before writing to reduce the number of small files. It includes the implementation of AssignDestinationsAndPartitions and WriteToPartitions transforms, along with a BeamRowWrapper for Iceberg's StructLike interface. Feedback focuses on critical serialization issues in AssignDoFn, where non-serializable maps must be marked transient and initialized in @Setup. Additionally, improvements are suggested for resource management in WritePartitionedRowsToFiles using try-finally blocks and optimizing the table cache with double-checked locking.
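The serialization issue flagged above follows a standard Beam pattern: fields that should not (or cannot) travel with the serialized DoFn are marked `transient` and rebuilt in a `@Setup` lifecycle method on the worker. Below is a minimal plain-Java sketch of that pattern (the class names and the serialization round-trip helper are illustrative, not the PR's actual code); in real Beam code, `setup()` would carry the `@Setup` annotation and be invoked by the runner.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of the review's fix: a cache that should not be shipped
// with the DoFn is marked transient and rebuilt in a @Setup-style method.
public class TransientCacheDoFn implements Serializable {
  private static final long serialVersionUID = 1L;

  // transient: this map is never serialized with the DoFn instance.
  private transient Map<String, String> tableCache;

  // In Beam this would be annotated with @Setup and called once per DoFn
  // instance after deserialization on the worker.
  public void setup() {
    tableCache = new HashMap<>();
  }

  public String lookup(String key) {
    // Hypothetical loader; stands in for an expensive catalog call.
    return tableCache.computeIfAbsent(key, k -> "loaded:" + k);
  }

  // Simulates Beam shipping the DoFn to a worker: serialize, deserialize,
  // return the copy (its transient fields come back null until setup()).
  public static TransientCacheDoFn roundTrip(TransientCacheDoFn fn) {
    try {
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
        out.writeObject(fn);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
        return (TransientCacheDoFn) in.readObject();
      }
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}
```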
```java
/**
 * Groups incoming rows by partition before sending to writes, ensuring that a given bundle is
```

Review comment: Probably summarize the three potential values here and mention the default.
```java
        getDirectWriteByteLimit()));

switch (getDistributionMode()) {
  case NONE:
```

Review comment: We don't support "RANGE"?
Reply: Not yet. We can add it in the future if there's demand.
```java
@SchemaFieldDescription(
    "Defines distribution of write data. Supported distributions:"
        + "\n- none: don't shuffle rows (default)"
```

Review comment: Probably cleaner to make this setHashDistributionMode and make it a boolean?
Reply: Would rather keep it this way in case we add support for Range. It's more straightforward to deal with one config option, and it also mirrors how Flink does it.
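The design choice above — a single string-valued mode rather than a boolean — can be sketched as an enum that reserves room for a future "range" mode, mirroring Flink's `write.distribution-mode` table property. The names below are illustrative, not the actual IcebergIO API.

```java
import java.util.Locale;

// Hypothetical sketch of a single distribution-mode config option with
// NONE as the default and RANGE reserved but rejected for now.
public class DistributionModeSketch {
  public enum DistributionMode {
    NONE,  // don't shuffle rows (default)
    HASH,  // shuffle rows by partition key before writing
    RANGE; // reserved for the future; not supported yet

    public static DistributionMode fromString(String value) {
      if (value == null) {
        return NONE; // default when the option is unset
      }
      DistributionMode mode = valueOf(value.toUpperCase(Locale.ROOT));
      if (mode == RANGE) {
        throw new UnsupportedOperationException("RANGE distribution is not supported yet");
      }
      return mode;
    }
  }
}
```

Adding "range" later only requires removing the guard, with no breaking change to the config surface.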
```java
  T get(Row data, int pos);
}

private static @Nullable PositionalGetter<?> buildGetter(FieldType beamType, Type icebergType) {
```

Review comment: I think this also came up in a previous PR. Can you clarify why we need the wrapper class here? Also, is there a way to avoid having to iterate through the set of types in multiple places?
Reply: Added some Javadoc.
```java
import org.apache.iceberg.util.UUIDUtil;
import org.checkerframework.checker.nullness.qual.Nullable;

public class BeamRowWrapper implements StructLike {
```

Review comment: Please add documentation on why this class is needed and how to use it. A BeamRowWrapper sounds like a more general utility that might have to live outside Iceberg?
Reply: This class is specifically for Iceberg (it helps Iceberg fetch data directly from a Beam Row, instead of having to copy the data over to an Iceberg Record).
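The idea behind the wrapper — letting Iceberg read values out of a Beam Row in place rather than copying them into an Iceberg Record — can be sketched with a stand-in for Iceberg's positional `StructLike` interface. Here `StructView` stands in for `org.apache.iceberg.StructLike` and `List<Object>` stands in for a Beam Row; the real class also converts Beam field types to the representations Iceberg expects.

```java
import java.util.List;

// Minimal sketch of the BeamRowWrapper idea: adapt an existing row to a
// positional, type-parameterized view so the consumer reads values in place.
public class RowWrapperSketch {
  // Stand-in for Iceberg's StructLike (positional access, no copying).
  public interface StructView {
    int size();

    <T> T get(int pos, Class<T> javaClass);
  }

  public static StructView wrap(List<Object> row) {
    return new StructView() {
      @Override
      public int size() {
        return row.size();
      }

      @Override
      public <T> T get(int pos, Class<T> javaClass) {
        // No copy: read straight out of the backing row.
        return javaClass.cast(row.get(pos));
      }
    };
  }
}
```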
```java
} else {
  try {
    // see if table already exists with a spec
    // TODO(ahmedabu98): improve this by periodically refreshing the table to fetch updated
```

Review comment: Add a GitHub issue to the TODO?
```java
IcebergDestination destination = dynamicDestinations.instantiateDestination(tableIdentifier);
Table table = getOrCreateTable(destination, dataSchema);

// TODO(ahmedabu98): cache this
```

Reply: Just fixed it instead.
```java
supportsNamespaces.createNamespace(namespace);
LOG.info("Created new namespace '{}'.", namespace);
} catch (AlreadyExistsException ignored) {
  // race condition: another worker already created this namespace
```

Review comment: Let's at least log it.
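The pattern under discussion — idempotent namespace creation where the "already exists" race is logged rather than silently swallowed — might look like the sketch below. The catalog backend and exception type here are simplified stand-ins, not the real Iceberg `SupportsNamespaces` API.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Logger;

// Sketch of create-if-absent namespace creation that tolerates the race
// between workers and, per the review, logs the already-exists case.
public class NamespaceCreator {
  private static final Logger LOG = Logger.getLogger(NamespaceCreator.class.getName());

  public static class AlreadyExistsException extends RuntimeException {}

  // Simulated catalog backend shared by "workers".
  private final Set<String> existing = ConcurrentHashMap.newKeySet();

  public void createNamespace(String namespace) {
    if (!existing.add(namespace)) {
      throw new AlreadyExistsException();
    }
  }

  /** Returns true if this call actually created the namespace. */
  public boolean ensureNamespace(String namespace) {
    try {
      createNamespace(namespace);
      LOG.info("Created new namespace '" + namespace + "'.");
      return true;
    } catch (AlreadyExistsException e) {
      // Race: another worker created it first. Log instead of ignoring.
      LOG.info("Namespace '" + namespace + "' already exists; skipping creation.");
      return false;
    }
  }
}
```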
```java
import org.joda.time.Duration;

class WriteToPartitions extends PTransform<PCollection<KV<Row, Row>>, IcebergWriteResult> {
  private static final long DEFAULT_BYTES_PER_FILE = (1L << 29); // 512mb
```

Review comment: Is this known to be a good value for Iceberg?
Reply: Yes, the ideal Parquet file size is generally 512 MB. It may help to expose a config option for users to provide their own, or to fetch the value from existing table properties. Will defer that to a future PR though.
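The role of `DEFAULT_BYTES_PER_FILE` can be sketched as a rolling writer that tracks bytes written to the current file and rolls over once the target size is reached. The class below only counts record sizes (the real transform writes actual data files), and its names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of byte-limited file rolling: start a new file once the current one
// would exceed the per-file byte budget (512 MB by default, the generally
// recommended Parquet file size).
public class RollingFileWriterSketch {
  static final long DEFAULT_BYTES_PER_FILE = 1L << 29; // 512 MB

  private final long bytesPerFile;
  private final List<Long> closedFileSizes = new ArrayList<>();
  private long currentFileBytes = 0;

  public RollingFileWriterSketch(long bytesPerFile) {
    this.bytesPerFile = bytesPerFile;
  }

  public void write(long recordBytes) {
    // Roll before this record would push the current file past the limit.
    if (currentFileBytes > 0 && currentFileBytes + recordBytes > bytesPerFile) {
      roll();
    }
    currentFileBytes += recordBytes;
  }

  private void roll() {
    closedFileSizes.add(currentFileBytes);
    currentFileBytes = 0;
  }

  public int fileCount() {
    return closedFileSizes.size() + (currentFileBytes > 0 ? 1 : 0);
  }
}
```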
```java
input.apply(Managed.write(ICEBERG).withConfig(config));
pipeline.run().waitUntilFinish();

// Read back and check records are correct
```

Review comment: Is it possible to add a test to confirm that fewer files get created when distribution mode is set to HASH?
Reply: It's hard to test with the direct runner because its autosharding implementation is not smart; sometimes it creates more shards than necessary. I added a test for grouping minus autosharding, though, to validate that the number of files created equals the number of partitions.
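What the "grouping minus autosharding" test validates can be sketched as follows: with HASH distribution, rows sharing a partition key are shuffled into the same group, so one writer per partition yields exactly one file per partition. `Map.Entry<String, String>` below is a simplified stand-in for a keyed Beam row (partition key plus payload), not the actual test code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the invariant behind the test: number of files == number of
// distinct partition keys once rows are hash-grouped by partition.
public class HashGroupingSketch {
  public static int expectedFileCount(List<Map.Entry<String, String>> rows) {
    // Group rows by partition key, as the HASH distribution shuffle does.
    Map<String, Integer> perPartition = new HashMap<>();
    for (Map.Entry<String, String> row : rows) {
      perPartition.merge(row.getKey(), 1, Integer::sum);
    }
    // One writer (and hence one file) per partition.
    return perPartition.size();
  }
}
```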
Thanks @chamikaramj, I addressed your comments.
/gemini review
Code Review
This pull request introduces support for partitioned writes in IcebergIO, adding DistributionMode (NONE and HASH) and autosharding capabilities. Key components include AssignDestinationsAndPartitions for record routing, BeamRowWrapper for adapting Beam rows to Iceberg's StructLike interface, and WriteToPartitions for managing the shuffled write path. Feedback highlights several improvement opportunities: addressing potential staleness in cached partition metadata, reducing expensive catalog calls in the hot path via shared caching, and minimizing thread contention by using more granular locking instead of synchronizing on a static cache. Additionally, debug print statements should be removed from the production code.
```java
// see if table already exists with a spec
// TODO(https://github.com/apache/beam/issues/38337): improve this by periodically
// refreshing the table to fetch updated specs
spec = catalogConfig.catalog().loadTable(TableIdentifier.parse(tableIdentifier)).spec();
```

Review comment: Calling catalog().loadTable() inside processElement can be very expensive, especially in pipelines with many unique table identifiers or high bundle counts. While the results are cached within the DoFn instance's partitionKeys map, the first element for every unique table in every bundle will still trigger a catalog call. Consider using a static cache (similar to WritePartitionedRowsToFiles.LAST_REFRESHED_TABLE_CACHE) to share table metadata across bundles and workers.
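The static, time-bounded cache suggested here might look like the sketch below: table metadata is shared across DoFn instances via a static map, and each entry is reloaded only after a refresh interval elapses. The `Supplier` stands in for `catalog().loadTable(...).spec()`, and all names are illustrative, not the actual IcebergIO implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch of a process-wide cache with periodic refresh: only stale or
// missing entries pay the expensive catalog call.
public class TableSpecCache {
  private static final long REFRESH_INTERVAL_MS = 5 * 60 * 1000; // 5 minutes

  private static class Entry {
    final Object spec;
    final long loadedAtMs;

    Entry(Object spec, long loadedAtMs) {
      this.spec = spec;
      this.loadedAtMs = loadedAtMs;
    }
  }

  private static final Map<String, Entry> CACHE = new ConcurrentHashMap<>();

  public static Object getSpec(String tableIdentifier, Supplier<Object> loader, long nowMs) {
    Entry cached = CACHE.get(tableIdentifier);
    if (cached == null || nowMs - cached.loadedAtMs >= REFRESH_INTERVAL_MS) {
      // compute() keeps concurrent refreshes of the same table from
      // stampeding the catalog; fresh entries placed by a racing thread win.
      cached = CACHE.compute(tableIdentifier, (id, old) ->
          (old != null && nowMs - old.loadedAtMs < REFRESH_INTERVAL_MS)
              ? old
              : new Entry(loader.get(), nowMs));
    }
    return cached.spec;
  }
}
```

Passing `nowMs` explicitly keeps the sketch testable; production code would use the system clock.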
LGTM. Thanks.
retest this please
This PR adds a new sink code path that groups rows by partition before writing, making partitioned writes much more efficient and scalable.