
[WIP] Explicit joins in Relation.join()#3868

Open
burnash wants to merge 2 commits into devel from feat/3747-explicit-joins

Conversation

@burnash (Collaborator) commented Apr 15, 2026

Closes #3747

@burnash burnash changed the base branch from feat/3403-relation-join to devel April 15, 2026 12:55
@burnash burnash changed the base branch from devel to feat/3403-relation-join April 15, 2026 12:55
@cloudflare-workers-and-pages Bot commented Apr 15, 2026

Deploying with Cloudflare Workers.

✅ Deployment successful: docs, commit 5b2b259 (preview updated Apr 23 2026, 04:07 PM UTC)

@burnash burnash changed the base branch from feat/3403-relation-join to devel April 15, 2026 15:41
table_name: Optional[str],
quote: bool = True,
casefold: bool = True,
dataset_name: Optional[str] = None,
Collaborator
I noticed you allow passing catalogs into the sqlglot schema qualification, so maybe also allow a catalog override here as well?

Collaborator Author

could you elaborate on the use of catalog here? Correct me if I'm wrong: from what I see, create_sqlglot_schema builds {dataset_name: {table: columns}}, so the catalog is always empty.

Collaborator

This function was intended to generate fully qualified table names using the dataset name and catalog name on self. Now we can also use it with an arbitrary dataset name, so why not a catalog name as well? bind_query may need it, but you are right, there's no such case now. I see this only for the filesystem case, where we have many duckdb databases attached to represent foreign datasets; those are visible as catalogs that need to be added during bind_query.

Collaborator

tl;dr: ignore this comment for now

Comment thread dlt/dataset/relation.py
f"'{target_dataset.dataset_name}' vs '{self._dataset.dataset_name}'"
)
# cross-dataset filesystem not supported
if isinstance(self.sql_client, WithSchemas):
Collaborator

this is a good follow-up ticket. if we use duckdb ATTACH we will be able to join datasets in separate physical locations, i.e. joining Lance and HF tables will be possible

Comment thread dlt/dataset/relation.py

# physical destination check
if target_dataset is not self._dataset:
if not self._dataset.is_same_physical_destination(target_dataset):
Collaborator

our physical destination check is currently only half implemented:
#3758

for now this is FYI, but we can add it to our estimates

Comment thread dlt/dataset/relation.py Outdated
kind=kind,
)
else:
if target_dataset is not self._dataset:
Collaborator

hmm, this line basically answers whether the dataset is foreign or not. I'd probably create is_foreign_dataset in Dataset that answers that given the "other" dataset.

  1. if foreign - we do what you do here
  2. if not - we add schemas from "other" to local schemas in "self" - those that are not present in "self"

now, when is a dataset foreign:

  • different dataset name
  • same name but different physical location - but those can't be joined :) so I think we are lucky with the name comparison
    (we could also compare catalogs if the destination client supports them - but IMO that can be deferred until we deal with filesystem foreign joins)

Collaborator Author

Yes, the identity check is wrong here. Dataset name is better, but won't it produce false negatives for case-insensitive destinations?

Collaborator Author

maybe something like (destination_fingerprint, effective_dataset_name), in the spirit of #3758
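The identity tuple proposed above, with the casefolding needed to avoid false negatives on case-insensitive destinations, could look roughly like this. DatasetIdentity, its fields, and is_foreign_dataset are hypothetical names for illustration, not dlt's API:

```python
# Sketch: dataset identity as (destination_fingerprint, effective_dataset_name),
# casefolding the name when the destination is case-insensitive.
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetIdentity:
    destination_fingerprint: str  # stable id of the physical destination
    dataset_name: str
    case_sensitive: bool = True  # whether the destination compares names by case

    def key(self) -> tuple:
        name = self.dataset_name if self.case_sensitive else self.dataset_name.casefold()
        return (self.destination_fingerprint, name)


def is_foreign_dataset(a: DatasetIdentity, b: DatasetIdentity) -> bool:
    # foreign == not the same (destination, dataset) identity
    return a.key() != b.key()


local = DatasetIdentity("duckdb:/data/pipeline.db", "My_Dataset", case_sensitive=False)
other = DatasetIdentity("duckdb:/data/pipeline.db", "my_dataset", case_sensitive=False)
print(is_foreign_dataset(local, other))  # names differ only by case
```

The casefold is only applied when the destination itself is case-insensitive, so case-sensitive destinations still treat differently cased names as distinct datasets.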

Collaborator

I'd just compare dataset names for now. _resolve_join_target already checks whether the physical locations are the same (#3758 addresses joinability explicitly, so this check will become even more sound). IMO this covers all practical cases.

once #3758 is implemented we will be able to use:

  • can_join_with to check if we can join at all
  • location() + dataset_name for the identity check

@burnash burnash force-pushed the feat/3747-explicit-joins branch from 0c6c278 to 390d40a Compare April 16, 2026 11:38


Development

Successfully merging this pull request may close these issues.

(feat) implement explicit joins with local and foreign datasets
