feat: Databricks Zerobus loading #3904
Conversation
This PR is not ready to merge (it's missing docs, tests, and config options), but before polishing it I want to get feedback on a performance concern with the new Zerobus insert API. I've done perf tests using NYC taxi data, which identified three main bottlenecks:

1. **Record acknowledgment.** I think this one is out of our control, though we can skip waiting for acknowledgement at the cost of correctness guarantees (I've tested that and it makes things much faster, though still not faster than the standard `databricks` destination).
2. **Load job file reading (client side).** We have to read the JSON/Parquet load job file back into Python memory so we can encode the records and feed them into the Zerobus SDK.
3. **Load job record encoding (client side).** We have to do row-by-row processing to encode the values into a format that the Zerobus SDK accepts.

**Squeezing Zerobus into the wrong mold?**

Currently the resource emits dicts, dlt writes them to a load job file, and we read that file back to feed the Zerobus SDK. I think this doesn't fit the Zerobus streaming paradigm well. Something like "resource emits dicts straight into the Zerobus SDK" would be a more natural fit.

**Zerobus Arrow support**

Databricks is working on Arrow support, which looks like it might land soon. I think this will speed up or completely remove the need for (2) and (3). Together with skipping record acknowledgment, this might push Zerobus performance into a competitive range.

@zilto what are your thoughts?
Thanks for the performance analysis! Code looks good.

**My understanding**

To double-check my understanding, when comparing `databricks` vs `databricks zerobus` …
**Performance improvement avenues**

*Arrow support*

Zerobus Arrow support would be a quick win. We have a solid but incomplete PR (#3477) for Arrow IPC support that would skip a lot of the serialization costs in the extract and normalize phases.

*Parallelize load phase*
AFAIK, …

*Disabling normalize and load steps*

I agree with you that those features would be desirable more generally for `dlt`.
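The load-phase parallelization mentioned above could be sketched with a thread pool over load job files (hypothetical `load_one_file` worker; dlt's actual loader is structured differently):

```python
from concurrent.futures import ThreadPoolExecutor


def load_files_parallel(paths, load_one_file, max_workers=4):
    # one load job file per task; I/O-bound uploads overlap across threads
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_one_file, paths))
```

Since the per-file work is network-bound, threads (rather than processes) are usually enough to overlap the waits.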
Correct! Furthermore, I have tried a few more tweaks, but none of it led to a meaningful speedup.

@zilto I'm happy to polish the PR to finalize it, but the current Zerobus performance makes me doubt whether there is any merit to merging this right now. Let me know what you think.
Switched Zerobus serialization over from JSON to Arrow, which became available recently in the Zerobus Python SDK. Important to note that Arrow isn't officially supported yet in the Zerobus Python SDK: it's not documented, but there is an example in the repo.

@zilto Can you review this PR?
zilto
left a comment
We need to investigate:

- why `athena` tests are failing. Some of their configs were edited in this PR
- why `hf` tests are failing. We're hitting rate limits. Do you expect any behavior change from this PR (e.g., batching logic)?

If those are unrelated, we can merge. Otherwise, two minor questions / nits.
```python
TDataRecord = dict[str, Any]
"""Table row dictionary. Not guaranteed to be JSON serializable without custom encoding."""
TDataRecordBatch = list[TDataRecord]
"""List of table row dictionaries. Not guaranteed to be JSON serializable without custom encoding."""
```
I like the centralization of batching logic. Though, how do `TDataRecord` and `TDataRecordBatch` differ from `TDataItem` and `TDataItems`?

(it's a bit annoying that `TDataItems` is not guaranteed to be a list...)
I think they roughly represent the same concepts. My issue is that `TDataItem` is just `Any`, which makes development hard.
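To illustrate the "not guaranteed to be JSON serializable" caveat on `TDataRecord`: a row holding a `datetime` fails with plain `json.dumps` and needs a custom encoder (a minimal sketch, not the encoder dlt actually uses):

```python
import json
from datetime import datetime, timezone
from typing import Any

TDataRecord = dict[str, Any]

row: TDataRecord = {"id": 1, "ts": datetime(2026, 5, 13, tzinfo=timezone.utc)}


def _default(o: Any) -> str:
    # datetimes become ISO-8601 strings; anything else still fails loudly
    if isinstance(o, datetime):
        return o.isoformat()
    raise TypeError(f"not JSON serializable: {type(o).__name__}")


encoded = json.dumps(row, default=_default)
```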
```python
default_sql_configs_with_staging = [
    # Athena needs filesystem staging, which will be automatically set; we have to supply a bucket url though.
    cid_configs_by_cid["athena"],
    cid_configs_by_cid["athena-iceberg"],
    cid_configs_by_cid["athena-s3-tables"],
]
```
CI workflows related to these configurations seem to be failing now. I retried them and they failed again.
It's not related to changes in this PR. The failing tests also fail on origin/devel.
zilto
left a comment
The failing CI is also failing on devel branch, so seems unrelated.
Let's merge. Good job!
Description
Adds support for using Zerobus to load data into Databricks Delta tables.
API:

- `databricks_adapter(my_resource, insert_api="zerobus")`
- omit `insert_api` or set it to `copy_into` for the default behavior

Notes:

- Mirrors bigquery's API: `bigquery_adapter(my_resource, insert_api="streaming")`
- The `zerobus` insert API is only supported for the `append` write disposition

Related Issues
Closes #3874
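The append-only restriction noted above could be enforced with a small guard like this (illustrative names, not the actual adapter code from this PR):

```python
SUPPORTED_INSERT_APIS = {"copy_into", "zerobus"}


def validate_insert_api(insert_api: str, write_disposition: str) -> str:
    # reject unknown APIs and non-append use of zerobus
    if insert_api not in SUPPORTED_INSERT_APIS:
        raise ValueError(f"unknown insert_api: {insert_api!r}")
    if insert_api == "zerobus" and write_disposition != "append":
        raise ValueError("insert_api='zerobus' only supports write_disposition='append'")
    return insert_api
```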