
feat: Databricks Zerobus loading #3904

Draft

jorritsandbrink wants to merge 4 commits into devel from feat/3874-databricks-zerobus

Conversation

@jorritsandbrink
Collaborator

Description

Adds support for using Zerobus to load data into Databricks Delta tables.

API:

  • use Zerobus loading: databricks_adapter(my_resource, insert_api="zerobus")
  • use "standard" loading: do not set insert_api or set it to copy_into

Notes:

  • the API is consistent with the BigQuery adapter's: bigquery_adapter(my_resource, insert_api="streaming")
  • the zerobus insert API is only supported for the append write disposition
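A minimal self-contained model of the adapter semantics above might look as follows. Only the function name, the `insert_api` values, and the append-only constraint come from this PR; representing a resource as a dict and the validation/hint bookkeeping are illustrative simplifications, not the PR's actual implementation.

```python
# Sketch of the adapter semantics, assuming a resource is modeled as a dict.
def databricks_adapter(resource, insert_api=None):
    allowed = (None, "copy_into", "zerobus")
    if insert_api not in allowed:
        raise ValueError(f"insert_api must be one of {allowed}, got {insert_api!r}")
    # Per the PR notes, zerobus only supports the append write disposition.
    if insert_api == "zerobus" and resource.get("write_disposition", "append") != "append":
        raise ValueError("insert_api='zerobus' requires write_disposition='append'")
    resource["insert_api"] = insert_api or "copy_into"  # default to standard loading
    return resource

adapted = databricks_adapter(
    {"name": "events", "write_disposition": "append"}, insert_api="zerobus"
)
```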

Related Issues

Closes #3874

@jorritsandbrink jorritsandbrink self-assigned this Apr 29, 2026
@jorritsandbrink jorritsandbrink added the ci full Use to trigger CI on a PR for full load tests label Apr 29, 2026

@jorritsandbrink
Collaborator Author

This PR is not ready to merge (it's missing docs, tests, and config options), but before polishing it I want to get feedback on a performance concern: the new zerobus insert API is significantly slower than the existing copy_into insert API.

I've done perf tests using NYC taxi data, which identified three main bottlenecks for zerobus:

  1. record acknowledgement (server side)
  2. load job file reading (client side)
  3. load job record encoding (client side)

1. Record acknowledgement (server side)

I think this one is out of our control, though we can skip waiting for acknowledgement at the cost of correctness guarantees (I've tested that and it makes things much faster, though still not faster than copy_into).

2. load job file reading (client side)

We have to read the JSON/Parquet load job file back into Python memory, so we can encode the records and feed them into the Zerobus SDK.

3. load job record encoding (client side)

We have to do row-by-row processing to encode the values into a format that the Zerobus SDK accepts.
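Bottlenecks (2) and (3) can be sketched together as follows. This is a toy model: the file name is hypothetical, and the real Zerobus SDK expects records in its own encoding (not the JSON bytes used here), so only the read-back-and-encode-per-row shape is the point.

```python
import json
import os
import tempfile

def encode_rows_for_stream(path):
    """Read a jsonl load file back into memory (2) and encode row by row (3)."""
    encoded = []
    with open(path, "r", encoding="utf-8") as f:  # (2) load job file read, client side
        for line in f:  # jsonl: one record per line
            record = json.loads(line)
            # (3) per-record, row-by-row encoding before handing off to the SDK
            encoded.append(json.dumps(record).encode("utf-8"))
    return encoded

# demo with a tiny hypothetical load-package file
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"vendor_id": 1}\n{"vendor_id": 2}\n')
rows = encode_rows_for_stream(f.name)
os.remove(f.name)
```

The per-row Python loop is what makes this expensive at scale compared to a columnar hand-off.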

Squeezing Zerobus into the wrong mold?

Current zerobus pipeline:

resource emits dicts -> extract writes to disk -> normalize reads from disk, does per-record processing and writes to disk -> load reads from disk, does per-record processing, and emits into Zerobus SDK

I think this doesn't fit the Zerobus streaming paradigm well. Something like resource emits dicts into Zerobus SDK would be a more natural fit.
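The "resource emits dicts into Zerobus SDK" idea could be sketched like this. `FakeStream` is a stand-in for a Zerobus SDK stream with illustrative method names; the point is only that records flow from the generator straight into the stream with no disk round-trips.

```python
# Sketch of direct streaming: no extract/normalize/load disk hops.
class FakeStream:
    """Stand-in for a Zerobus SDK stream; method names are illustrative."""
    def __init__(self):
        self.records = []

    def ingest(self, record):
        self.records.append(record)

    def close(self):
        pass  # a real stream would flush and wait for acknowledgements here

def my_resource():
    for i in range(3):
        yield {"id": i}

stream = FakeStream()
for record in my_resource():  # records go straight from resource to stream
    stream.ingest(record)
stream.close()
```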

Zerobus Arrow support

Databricks is working on Arrow support which looks like it might land soon. I think this will speed up or completely remove the need for (2) and (3). Together with skipping record acknowledgment, this might push zerobus perf ahead of copy_into perf.

@zilto what are your thoughts?

@zilto zilto self-requested a review April 29, 2026 13:26
@zilto
Collaborator

zilto commented Apr 29, 2026

Thanks for the performance analysis! Code looks good.

My understanding

To double-check my understanding, when comparing "databricks" vs "databricks zerobus":

  1. extract and normalize steps are identical; only load step differs
  2. we're using Zerobus with: JSON encoding, record acknowledgement (ingest_record_offset()), synchronously

Performance improvement avenues

Arrow support

Zerobus Arrow support would be a quick win. We have a solid but incomplete PR (#3477) for Arrow IPC support that would skip a lot of the serialization costs in extract and normalize phases.

Parallelize load phase

dlt comes from the batch world and zerobus from the streaming world. Typically, dlt scales by batching records and making a big INSERT with transactional guarantees. OTOH, zerobus scales by having more connections push data (as the docs indicate).

AFAIK, dlt will parallelize the load step per load package file (docs). Could we try setting normalize.data_writer.file_max_bytes and see how that compares to copy_into?
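For reference, that setting could be supplied in the pipeline's config.toml like below (the 10 MB value is an arbitrary example, not a recommendation); the equivalent environment variable would follow dlt's double-underscore convention, NORMALIZE__DATA_WRITER__FILE_MAX_BYTES.

```toml
[normalize.data_writer]
# smaller files -> more load-package files -> more parallel load jobs
file_max_bytes = 10000000
```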

Disabling normalize and load steps

I agree with you that those features would be desirable more generally for dlt (a trade-off in features, guarantees, and performance). Though, I think we should explore them in separate PRs.

@jorritsandbrink
Collaborator Author

To double-check my understanding, when comparing "databricks" vs "databricks zerobus"

extract and normalize steps are identical; only load step differs
we're using Zerobus with: JSON encoding, record acknowledgement (ingest_record_offset()), synchronously

Correct!

Furthermore, I have tried:

  • ingest_records_nowait() instead of ingest_records_offset() — this shifts waiting for acknowledgment to stream.close(), and thus doesn't help much (skipping stream.close() does actually save a lot of time, but comes at the expense of correctness guarantees)
  • parallelizing work across load jobs by setting normalize.data_writer.file_max_bytes as you suggested — this definitely reduces load time, but still loses to copy_into
  • tuning different parameters such as number of parallel loads, size of the batch yielded by the dlt resource, size of the batch yielded into the Zerobus stream
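The acknowledgement trade-off in the first bullet can be modeled with a toy stream. The method names mirror the ones discussed above; the wait-unit bookkeeping is illustrative, not real SDK behavior, and it shows why deferring acknowledgements to close() doesn't reduce the total waiting, only moves it.

```python
# Toy model: per-record acks vs acks deferred to close().
class ToyStream:
    ACK_COST = 1  # pretend each acknowledgement costs 1 unit of waiting

    def __init__(self):
        self.records = []
        self.pending_acks = 0
        self.wait_units = 0

    def ingest_record_offset(self, record):
        self.records.append(record)
        self.wait_units += self.ACK_COST  # wait for the ack per record

    def ingest_records_nowait(self, record):
        self.records.append(record)
        self.pending_acks += 1  # ack deferred

    def close(self):
        # all deferred waiting is paid here (skipping close() skips it,
        # at the expense of correctness guarantees)
        self.wait_units += self.pending_acks * self.ACK_COST
        self.pending_acks = 0

sync_stream = ToyStream()
for i in range(100):
    sync_stream.ingest_record_offset({"id": i})
sync_stream.close()

nowait_stream = ToyStream()
for i in range(100):
    nowait_stream.ingest_records_nowait({"id": i})
nowait_stream.close()
```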

None of it led to zerobus being faster than copy_into. Even on workloads that should favor Zerobus, such as a resource yielding Python dictionaries and using the jsonl file format, copy_into was faster. With a resource that yields Arrow data and the parquet file format, copy_into was much faster.

@zilto I'm happy to polish the PR to finalize it, but the current Zerobus performance makes me doubt whether there is any merit to merging this right now. Let me know what you think.


Labels

ci full Use to trigger CI on a PR for full load tests

Development

Successfully merging this pull request may close these issues.

feat(databricks): Support ingestion via Zerobus
