
feat: Databricks Zerobus loading #3904

Draft

jorritsandbrink wants to merge 4 commits into devel from feat/3874-databricks-zerobus

Conversation

@jorritsandbrink
Collaborator

Description

Adds support for using Zerobus to load data into Databricks Delta tables.

API:

  • use Zerobus loading: databricks_adapter(my_resource, insert_api="zerobus")
  • use "standard" loading: do not set insert_api or set it to copy_into

Notes:

  • the API is consistent with the BigQuery adapter's: bigquery_adapter(my_resource, insert_api="streaming")
  • the zerobus insert API is only supported for the append write disposition
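A minimal self-contained model of the adapter semantics above might look as follows. Only the function name, the `insert_api` values, and the append-only constraint come from this PR; representing a resource as a dict and the validation/hint bookkeeping are illustrative simplifications, not the PR's actual implementation.

```python
# Sketch of the adapter semantics, assuming a resource is modeled as a dict.
def databricks_adapter(resource, insert_api=None):
    allowed = (None, "copy_into", "zerobus")
    if insert_api not in allowed:
        raise ValueError(f"insert_api must be one of {allowed}, got {insert_api!r}")
    # Per the PR notes, zerobus only supports the append write disposition.
    if insert_api == "zerobus" and resource.get("write_disposition", "append") != "append":
        raise ValueError("insert_api='zerobus' requires write_disposition='append'")
    resource["insert_api"] = insert_api or "copy_into"  # default to standard loading
    return resource

adapted = databricks_adapter(
    {"name": "events", "write_disposition": "append"}, insert_api="zerobus"
)
```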

Related Issues

Closes #3874

@jorritsandbrink jorritsandbrink self-assigned this Apr 29, 2026
@jorritsandbrink jorritsandbrink added the ci full Use to trigger CI on a PR for full load tests label Apr 29, 2026

@jorritsandbrink
Collaborator Author

This PR is not ready to merge (it's missing docs, tests, and config options), but before polishing it I want to get feedback on a performance concern: the new zerobus insert API is significantly slower than the existing copy_into insert API.

I've done perf tests using NYC taxi data, which identified three main bottlenecks for zerobus:

  1. record acknowledgement (server side)
  2. load job file reading (client side)
  3. load job record encoding (client side)

1. Record acknowledgement (server side)

I think this one is out of our control, though we can skip waiting for acknowledgement at the cost of correctness guarantees (I've tested that and it makes things much faster, though still not faster than copy_into).

2. load job file reading (client side)

We have to read the JSON/Parquet load job file back into Python memory, so we can encode the records and feed them into the Zerobus SDK.

3. load job record encoding (client side)

We have to do row-by-row processing to encode the values into a format that the Zerobus SDK accepts.
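Bottlenecks (2) and (3) can be sketched together as follows. This is a toy model: the file name is hypothetical, and the real Zerobus SDK expects records in its own encoding (not the JSON bytes used here), so only the read-back-and-encode-per-row shape is the point.

```python
import json
import os
import tempfile

def encode_rows_for_stream(path):
    """Read a jsonl load file back into memory (2) and encode row by row (3)."""
    encoded = []
    with open(path, "r", encoding="utf-8") as f:  # (2) load job file read, client side
        for line in f:  # jsonl: one record per line
            record = json.loads(line)
            # (3) per-record, row-by-row encoding before handing off to the SDK
            encoded.append(json.dumps(record).encode("utf-8"))
    return encoded

# demo with a tiny hypothetical load-package file
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"vendor_id": 1}\n{"vendor_id": 2}\n')
rows = encode_rows_for_stream(f.name)
os.remove(f.name)
```

The per-row Python loop is what makes this expensive at scale compared to a columnar hand-off.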

Squeezing Zerobus into the wrong mold?

Current zerobus pipeline:

resource emits dicts -> extract writes to disk -> normalize reads from disk, does per-record processing and writes to disk -> load reads from disk, does per-record processing, and emits into Zerobus SDK

I think this doesn't fit the Zerobus streaming paradigm well. Something like resource emits dicts into Zerobus SDK would be a more natural fit.
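The "resource emits dicts into Zerobus SDK" idea could be sketched like this. `FakeStream` is a stand-in for a Zerobus SDK stream with illustrative method names; the point is only that records flow from the generator straight into the stream with no disk round-trips.

```python
# Sketch of direct streaming: no extract/normalize/load disk hops.
class FakeStream:
    """Stand-in for a Zerobus SDK stream; method names are illustrative."""
    def __init__(self):
        self.records = []

    def ingest(self, record):
        self.records.append(record)

    def close(self):
        pass  # a real stream would flush and wait for acknowledgements here

def my_resource():
    for i in range(3):
        yield {"id": i}

stream = FakeStream()
for record in my_resource():  # records go straight from resource to stream
    stream.ingest(record)
stream.close()
```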

Zerobus Arrow support

Databricks is working on Arrow support which looks like it might land soon. I think this will speed up or completely remove the need for (2) and (3). Together with skipping record acknowledgment, this might push zerobus perf ahead of copy_into perf.

@zilto what are your thoughts?

@zilto zilto self-requested a review April 29, 2026 13:26
@zilto
Collaborator

zilto commented Apr 29, 2026

Thanks for the performance analysis! Code looks good.

My understanding

To double-check my understanding, when comparing "databricks" vs "databricks zerobus":

  1. extract and normalize steps are identical; only load step differs
  2. we're using Zerobus with: JSON encoding, record acknowledgement (ingest_record_offset()), synchronously

Performance improvement avenues

Arrow support

Zerobus Arrow support would be a quick win. We have a solid but incomplete PR (#3477) for Arrow IPC support that would skip a lot of the serialization costs in extract and normalize phases.

Parallelize load phase

dlt comes from the batch world and zerobus from the streaming world. Typically, dlt scales by batching records and making a big INSERT with transactional guarantees. OTOH, zerobus scales by having more connections push data (as the docs indicate).

AFAIK, dlt will parallelize the load step per load package file (docs). Could we try setting normalize.data_writer.file_max_bytes and see how that compares to copy_into?
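For reference, that setting could be supplied in the pipeline's config.toml like below (the 10 MB value is an arbitrary example, not a recommendation); the equivalent environment variable would follow dlt's double-underscore convention, NORMALIZE__DATA_WRITER__FILE_MAX_BYTES.

```toml
[normalize.data_writer]
# smaller files -> more load-package files -> more parallel load jobs
file_max_bytes = 10000000
```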

Disabling normalize and load steps

I agree with you that those features would be desirable more generally for dlt (a trade-off in features, guarantees, and performance). Though, I think we should explore them in separate PRs.

@jorritsandbrink
Collaborator Author

To double-check my understanding, when comparing "databricks" vs "databricks zerobus"

extract and normalize steps are identical; only load step differs
we're using Zerobus with: JSON encoding, record acknowledgement (ingest_record_offset()), synchronously

Correct!

Furthermore, I have tried:

  • ingest_records_nowait() instead of ingest_records_offset() — this shifts waiting for acknowledgment to stream.close(), and thus doesn't help much (skipping stream.close() does actually save a lot of time, but comes at the expense of correctness guarantees)
  • parallelizing work across load jobs by setting normalize.data_writer.file_max_bytes as you suggested — this definitely reduces load time, but still loses to copy_into
  • tuning different parameters such as number of parallel loads, size of the batch yielded by the dlt resource, size of the batch yielded into the Zerobus stream
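The acknowledgement trade-off in the first bullet can be modeled with a toy stream. The method names mirror the ones discussed above; the wait-unit bookkeeping is illustrative, not real SDK behavior, and it shows why deferring acknowledgements to close() doesn't reduce the total waiting, only moves it.

```python
# Toy model: per-record acks vs acks deferred to close().
class ToyStream:
    ACK_COST = 1  # pretend each acknowledgement costs 1 unit of waiting

    def __init__(self):
        self.records = []
        self.pending_acks = 0
        self.wait_units = 0

    def ingest_record_offset(self, record):
        self.records.append(record)
        self.wait_units += self.ACK_COST  # wait for the ack per record

    def ingest_records_nowait(self, record):
        self.records.append(record)
        self.pending_acks += 1  # ack deferred

    def close(self):
        # all deferred waiting is paid here (skipping close() skips it,
        # at the expense of correctness guarantees)
        self.wait_units += self.pending_acks * self.ACK_COST
        self.pending_acks = 0

sync_stream = ToyStream()
for i in range(100):
    sync_stream.ingest_record_offset({"id": i})
sync_stream.close()

nowait_stream = ToyStream()
for i in range(100):
    nowait_stream.ingest_records_nowait({"id": i})
nowait_stream.close()
```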

None of it led to zerobus being faster than copy_into. Even on workloads that should favor Zerobus, such as a resource yielding Python dictionaries and using the jsonl file format, copy_into was faster. With a resource that yields Arrow data and the parquet file format, copy_into was much faster.

@zilto I'm happy to polish the PR to finalize it, but the current Zerobus performance makes me doubt whether there is any merit to merging this right now. Let me know what you think.


Labels

ci full Use to trigger CI on a PR for full load tests

Development

Successfully merging this pull request may close these issues.

feat(databricks): Support ingestion via Zerobus
