Skip to content

VariantGet RFC#58

Open
AdamGS wants to merge 3 commits into
developfrom
adamg/update-variant-canonical
Open

VariantGet RFC#58
AdamGS wants to merge 3 commits into
developfrom
adamg/update-variant-canonical

Conversation

@AdamGS
Copy link
Copy Markdown
Collaborator

@AdamGS AdamGS commented May 5, 2026

RFC for the VariantGet expression, with lessons and thoughts learned through vortex-data/vortex#7494.

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
@AdamGS AdamGS force-pushed the adamg/update-variant-canonical branch from 09a29d1 to 26f1e73 Compare May 5, 2026 11:43
@AdamGS AdamGS marked this pull request as ready for review May 5, 2026 11:44
Comment thread rfcs/0058-variant-get-expr.md
Comment thread rfcs/0058-variant-get-expr.md Outdated
Comment thread rfcs/0058-variant-get-expr.md Outdated
Comment thread rfcs/0058-variant-get-expr.md
Comment on lines +125 to +128
This makes canonical variants more complex than a single raw child. Any code that transforms a
canonical `VariantArray` must preserve both `core_storage` and the optional `shredded` child, and
must keep them row-aligned through filter, slice, take and mask operations.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is another approach that is less complex?

Comment thread rfcs/0058-variant-get-expr.md
Comment thread rfcs/0058-variant-get-expr.md
Comment thread rfcs/0058-variant-get-expr.md
Comment on lines +159 to +161
- How should implementations validate consistency between the shredded child and raw
`core_storage`? This may be a construction-time invariant, a debug assertion or a checked error
path when merging partial shredding.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likely debug on construction? I guess both in debug, seems slow

Comment thread rfcs/0058-variant-get-expr.md
@robert3005
Copy link
Copy Markdown

I think that I am a bit dumb when reading this - what would be advantage of skipping shredded values in the canonical array?

@AdamGS
Copy link
Copy Markdown
Collaborator Author

AdamGS commented May 5, 2026

That's the current thing, it makes the canonical array a really weird thing that is basically a pass-through to a bunch of things, with delicate rules around it to make sure everything is pushdown down.
Moving the shredded child into it makes it both more useful and allows us to generalize the behavior better, we can theoretically have a JSON core_storage child with some shredded fields, making the behavior bigger than specific encodings.


A new VariantGet expression is required, the expression has two inputs:

1. Path to the required child - similar to JSONPath, but a much stricter subset. Just a combination of names and indexes.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indexes being list offsets? Can we just stick to field paths for now?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would like to keep the flexibility, especially as variant both in Parquet and DuckDB supports that.

A new VariantGet expression is required, the expression has two inputs:

1. Path to the required child - similar to JSONPath, but a much stricter subset. Just a combination of names and indexes.
2. Optional dtype, if None - the return type is `None`, the expression's return type is `Variant`.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why optional? We should assume that Vortex expressions are fully-typed by some surrounding engine. So presumably the output has been coerced into something

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I say that at some other point in the doc, we can make it stricter for now.

`core_storage`, and its rows must stay aligned with the raw variant rows.

Nested shredded paths can be represented by nesting typed arrays inside struct arrays. For example,
if `$.a.b` is shredded but `$.a.c` is not, the shredded child may contain a field for `a`, whose
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if there is $.a.b as a both an int64 and a float64 array?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in cases of these conflicts, Variant semantics are row-wise. DuckDB and arrow have some casting semantics/options around that.


The canonical Variant array will add an additional child, representing optional shredded data, it will now have:

1. Validity
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From your description of execution, it sounds like you want a non-optional "shredded" child that can be a struct array with no fields. That gives you a sensible place for validity.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I follow. Why can't the validity be "inside" the child?

Comment thread rfcs/0058-variant-get-expr.md Outdated
Comment thread rfcs/0058-variant-get-expr.md
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants