feat: Databricks Zerobus loading #3904
Conversation
Deploying with

| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful! View logs | docs | 01c6b88 | Commit Preview URL · Branch Preview URL | Apr 29 2026, 07:11 AM |
This PR is not ready to merge (it's missing docs, tests, and config options), but before polishing it I want to get feedback on a performance concern with the new `zerobus` insert path. I've done perf tests using NYC taxi data, which identified three main bottlenecks:

1. **Record acknowledgment.** I think this one is out of our control, though we can skip waiting for acknowledgement at the cost of correctness guarantees (I've tested that and it makes things much faster, though still slower than the existing load path).
2. **Load job file reading (client side).** We have to read the JSON/Parquet load job file back into Python memory so we can encode the records and feed them into the Zerobus SDK.
3. **Load job record encoding (client side).** We have to do row-by-row processing to encode the values into a format that the Zerobus SDK accepts.

**Squeezing Zerobus into the wrong mold?**

Currently the resource emits dicts that go through normalize and load job files before reaching Zerobus. I think this doesn't fit the Zerobus streaming paradigm well. Something like "resource emits dicts directly into the Zerobus SDK" would be a more natural fit.

**Zerobus Arrow support**

Databricks is working on Arrow support, which looks like it might land soon. I think this will speed up or completely remove the need for (2) and (3). Together with skipping record acknowledgment, this might push Zerobus performance into competitive territory.

@zilto what are your thoughts?
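The client-side bottlenecks (2) and (3) above can be sketched in isolation. This is a toy illustration, not the Zerobus SDK's actual API: `read_load_job` and `encode_record` are hypothetical stand-ins showing why reading the load job file back and encoding rows one by one costs extra work.

```python
import io
import json

def read_load_job(file_obj):
    """Bottleneck (2): read a JSONL load job file back into Python dicts."""
    return [json.loads(line) for line in file_obj if line.strip()]

def encode_record(row, field_order):
    """Bottleneck (3): row-by-row encoding into a positional format
    (a stand-in for the wire format the Zerobus SDK accepts)."""
    return tuple(row.get(name) for name in field_order)

# Simulated load job file as produced by the normalize step.
load_job = io.StringIO(
    '{"vendor_id": 1, "fare": 12.5}\n'
    '{"vendor_id": 2, "fare": 7.0}\n'
)

rows = read_load_job(load_job)                                      # bottleneck (2)
records = [encode_record(r, ("vendor_id", "fare")) for r in rows]   # bottleneck (3)
# Each record would then be sent on the stream, optionally without
# waiting for per-record acknowledgment (faster, weaker guarantees).
```

With Arrow support on the Zerobus side, the file could in principle be handed over in a columnar batch, removing both per-row steps.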
Thanks for the performance analysis! Code looks good.

**My understanding**

To double-check my understanding: when comparing "databricks" vs "databricks zerobus", the only difference is the insert path used during the load phase?
**Performance improvement avenues**

**Arrow support**

Zerobus Arrow support would be a quick win. We have a solid but incomplete PR (#3477) for Arrow IPC support that would skip a lot of the serialization costs in the extract and normalize phases.

**Parallelize load phase**
AFAIK, parallelizing the load phase should also be possible.

**Disabling normalize and load steps**

I agree with you that those features would be desirable more generally.
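The parallelize-the-load-phase idea could look roughly like this. A minimal sketch with a stand-in `push_load_file` function, not dlt's actual loader: the point is only that load jobs are network-bound, so pushing several files concurrently is a natural fit for a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

def push_load_file(path):
    # Stand-in for sending one load job file to the destination;
    # the real work is network-bound, so threads are a reasonable fit.
    return f"loaded {path}"

load_files = [f"jobs/part_{i}.jsonl" for i in range(4)]

# Push the load job files concurrently instead of sequentially.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(push_load_file, load_files))
```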
Correct! Furthermore, I have tried a few other variations. None of them led to a meaningful speedup.

@zilto I'm happy to polish the PR and finalize it, but the current Zerobus performance makes me doubt whether there is any merit to merging this right now. Let me know what you think.
Description
Adds support for using Zerobus to load data into Databricks Delta tables.
API:
```
databricks_adapter(my_resource, insert_api="zerobus")
```

Omit `insert_api` or set it to `copy_into` for the default behavior.

Notes:
- The API mirrors bigquery's: `bigquery_adapter(my_resource, insert_api="streaming")`
- The `zerobus` insert API is only supported for the `append` write disposition

Related Issues
Closes #3874
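The adapter rules described in the notes above (two insert APIs, `zerobus` restricted to `append`) could be validated with a dispatch like the following. This is a sketch under those assumptions, not the PR's actual implementation; `resolve_insert_api` is a hypothetical name.

```python
def resolve_insert_api(insert_api="copy_into", write_disposition="append"):
    """Validate the adapter arguments described above.

    Sketch only: `copy_into` stages files, `zerobus` streams rows,
    and `zerobus` is limited to the append write disposition.
    """
    if insert_api not in ("copy_into", "zerobus"):
        raise ValueError(f"unknown insert_api: {insert_api!r}")
    if insert_api == "zerobus" and write_disposition != "append":
        raise ValueError(
            "the zerobus insert API only supports the append write disposition"
        )
    return insert_api
```

For example, `resolve_insert_api("zerobus", "merge")` would raise, while omitting both arguments falls back to `copy_into`.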