VariantGet RFC #58
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
| This makes canonical variants more complex than a single raw child. Any code that transforms a | ||
| canonical `VariantArray` must preserve both `core_storage` and the optional `shredded` child, and | ||
| must keep them row-aligned through filter, slice, take and mask operations. | ||
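A minimal sketch of the row-alignment requirement, assuming a toy stand-in for the canonical array. `ToyVariantArray` and its plain-list children are hypothetical and are not the Vortex API; the point is only that every row-selecting operation applies the same selection to `core_storage` and to the optional `shredded` child.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToyVariantArray:
    """Hypothetical stand-in: Python lists replace the real child arrays."""

    core_storage: list[Any]
    shredded: Optional[list[Any]] = None

    def __post_init__(self) -> None:
        # The shredded child, when present, has exactly one row per raw row.
        if self.shredded is not None and len(self.shredded) != len(self.core_storage):
            raise ValueError("shredded child must match core_storage length")

    def take(self, indices: list[int]) -> "ToyVariantArray":
        # The same indices are applied to both children.
        shredded = None if self.shredded is None else [self.shredded[i] for i in indices]
        return ToyVariantArray([self.core_storage[i] for i in indices], shredded)

    def filter(self, mask: list[bool]) -> "ToyVariantArray":
        if len(mask) != len(self.core_storage):
            raise ValueError("mask must have one entry per row")
        return self.take([i for i, keep in enumerate(mask) if keep])

    def slice(self, start: int, stop: int) -> "ToyVariantArray":
        # The same range is applied to both children.
        shredded = None if self.shredded is None else self.shredded[start:stop]
        return ToyVariantArray(self.core_storage[start:stop], shredded)
```

Dropping the `shredded` argument in any of these methods would still produce a valid-looking array, which is the kind of silent misalignment the paragraph above warns about.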
what is another approach that is less complex?
| - How should implementations validate consistency between the shredded child and raw | ||
| `core_storage`? This may be a construction-time invariant, a debug assertion or a checked error | ||
| path when merging partial shredding. |
likely debug on construction? I guess both in debug, seems slow
I think that I am a bit dumb when reading this - what would be the advantage of skipping shredded values in the canonical array?
That's the current thing, it makes the canonical array a really weird thing that is basically a pass-through to a bunch of things, with delicate rules around it to make sure everything is pushed down.
| A new VariantGet expression is required, the expression has two inputs: | ||
| 1. Path to the required child - similar to JSONPath, but a much stricter subset. Just a combination of names and indexes. |
Indexes being list offsets? Can we just stick to field paths for now?
Would like to keep the flexibility, especially as variant both in Parquet and DuckDB supports that.
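To make the "names and indexes" path concrete, here is a rough sketch over already-decoded values. The `variant_get` function, the `PathStep` alias, and the choice to return `None` for a missing step are all assumptions for illustration; the RFC text quoted here does not specify the miss behaviour.

```python
from typing import Any, Union

# A path step is either an object field name (str) or a list offset (int).
PathStep = Union[str, int]


def variant_get(value: Any, path: list[PathStep]) -> Any:
    """Walk `path` through a decoded variant value.

    Assumption: any step that does not apply yields None.
    """
    current = value
    for step in path:
        if isinstance(step, str) and isinstance(current, dict):
            if step not in current:
                return None
            current = current[step]
        elif isinstance(step, int) and isinstance(current, list):
            if not 0 <= step < len(current):
                return None
            current = current[step]
        else:
            # Field name applied to a non-object, or offset applied to a non-list.
            return None
    return current
```

Restricting the path to field names only, as suggested above, would amount to removing the `int` branch.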
| A new VariantGet expression is required, the expression has two inputs: | ||
| 1. Path to the required child - similar to JSONPath, but a much stricter subset. Just a combination of names and indexes. | ||
| 2. Optional dtype. If it is `None`, the expression's return type is `Variant`. |
Why optional? We should assume that Vortex expressions are fully-typed by some surrounding engine. So presumably the output has been coerced into something
I say that at some other point in the doc, we can make it stricter for now.
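A small sketch of how the optional dtype could drive the return type, under the reading that a missing dtype means the result stays a variant. `VariantGetExpr`, the string dtype names, and `return_dtype` are hypothetical and only illustrate the rule being discussed.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class VariantGetExpr:
    """Hypothetical expression node: a path plus an optional target dtype."""

    path: tuple[Union[str, int], ...]
    dtype: Optional[str] = None  # dtypes are plain strings in this sketch

    def return_dtype(self) -> str:
        # Without a requested dtype the result stays a variant.
        return "variant" if self.dtype is None else self.dtype
```

Making the dtype mandatory, as proposed in this thread, would remove the `None` branch and leave the engine to pass `"variant"` explicitly when it wants an untyped result.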
| `core_storage`, and its rows must stay aligned with the raw variant rows. | ||
| Nested shredded paths can be represented by nesting typed arrays inside struct arrays. For example, | ||
| if `$.a.b` is shredded but `$.a.c` is not, the shredded child may contain a field for `a`, whose |
What if there is `$.a.b` as both an int64 and a float64 array?
In case of such conflicts, Variant semantics are row-wise. DuckDB and Arrow have some casting semantics/options around that.
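One way to picture the row-wise behaviour: each row either matches the shredded type and lands in the typed column, or stays in a residual column as a raw value. The sketch below assumes that model; `shred_rows` is a hypothetical helper, and using `None` for "absent" is a simplification that cannot distinguish a missing value from a variant null.

```python
from typing import Any, Optional


def shred_rows(
    values: list[Any], target: type
) -> tuple[list[Optional[Any]], list[Optional[Any]]]:
    """Split values row by row into (typed, residual) columns of equal length."""
    typed: list[Optional[Any]] = []
    residual: list[Optional[Any]] = []
    for v in values:
        # Exact type match, so a bool is not treated as an int.
        if type(v) is target:
            typed.append(v)
            residual.append(None)
        else:
            typed.append(None)
            residual.append(v)
    return typed, residual
```

For `$.a.b` holding `[1, 2.5, 3]` with an integer shredded type, rows 0 and 2 go to the typed column and row 1 stays raw. Whether a reader then casts `2.5` to the requested type is the casting option mentioned above and is not modelled here.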
| The canonical Variant array will add an additional child, representing optional shredded data, it will now have: | ||
| 1. Validity |
From your description of execution, it sounds like you want a non-optional "shredded" child that can be a struct array with no fields. That gives you a sensible place for validity.
Not sure I follow. Why can't the validity be "inside" the child?
RFC for the `VariantGet` expression, with lessons and thoughts learned through vortex-data/vortex#7494.