Updated Variant array and the new VariantGet expression#7877
Conversation
e6ed40d to
54d4015
Compare
Merging this PR will improve performance by 21.12%
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ⚡ | Simulation | take_search[(0.005, 0.05)] |
168.5 µs | 132 µs | +27.59% |
| ⚡ | Simulation | take_search[(0.005, 0.1)] |
320.4 µs | 247.6 µs | +29.42% |
| ⚡ | Simulation | take_search[(0.005, 0.5)] |
1.5 ms | 1.2 ms | +31.01% |
| ⚡ | Simulation | take_search[(0.005, 1.0)] |
3.1 ms | 2.3 ms | +31.24% |
| ⚡ | Simulation | take_search[(0.01, 0.05)] |
179.6 µs | 143.1 µs | +25.45% |
| ⚡ | Simulation | take_search[(0.01, 0.1)] |
341.3 µs | 268.5 µs | +27.13% |
| ⚡ | Simulation | take_search[(0.01, 0.5)] |
1.6 ms | 1.3 ms | +28.56% |
| ⚡ | Simulation | take_search[(0.01, 1.0)] |
3.3 ms | 2.5 ms | +28.76% |
| ⚡ | Simulation | take_search[(0.1, 0.05)] |
249.2 µs | 212.8 µs | +17.12% |
| ⚡ | Simulation | take_search[(0.1, 0.1)] |
458.8 µs | 385.9 µs | +18.88% |
| ⚡ | Simulation | take_search[(0.1, 0.5)] |
2.2 ms | 1.8 ms | +20.39% |
| ⚡ | Simulation | take_search[(0.1, 1.0)] |
4.3 ms | 3.5 ms | +20.62% |
| ⚡ | Simulation | take_search_chunked[(0.005, 0.05)] |
193.3 µs | 162.3 µs | +19.12% |
| ⚡ | Simulation | take_search_chunked[(0.005, 0.1)] |
369.3 µs | 307.4 µs | +20.16% |
| ⚡ | Simulation | take_search_chunked[(0.005, 0.5)] |
1.8 ms | 1.5 ms | +21.06% |
| ⚡ | Simulation | take_search_chunked[(0.005, 1.0)] |
3.5 ms | 2.9 ms | +21.19% |
| ⚡ | Simulation | take_search_chunked[(0.01, 0.05)] |
206.3 µs | 175.4 µs | +17.67% |
| ⚡ | Simulation | take_search_chunked[(0.01, 0.1)] |
394 µs | 332.1 µs | +18.65% |
| ⚡ | Simulation | take_search_chunked[(0.01, 0.5)] |
1.9 ms | 1.6 ms | +19.48% |
| ⚡ | Simulation | take_search_chunked[(0.01, 1.0)] |
3.8 ms | 3.2 ms | +19.59% |
| ... | ... | ... | ... | ... | ... |
ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.
Tip
Curious why this is faster? Comment @codspeedbot explain why this is faster on this PR, or directly use the CodSpeed MCP with your agent.
Comparing adamg/yet-another-variant-array (0ff0f40) with develop (7349cd6)
Footnotes
-
24 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩
| Canonical::Bool(b) => { | ||
| let validity = child_to_validity(b.slots()[0].as_ref(), b.dtype().nullability()); | ||
| let len = b.len(); | ||
| let BoolDataParts { bits, offset, len } = b.into_data().into_parts(len); | ||
| Ok(RecursiveCanonical(Canonical::Bool( | ||
| BoolArray::try_new_from_handle(bits, offset, len, validity.execute(ctx)?)?, | ||
| ))) | ||
| } | ||
| Canonical::Primitive(p) => { | ||
| let PrimitiveDataParts { | ||
| ptype, | ||
| buffer, | ||
| validity, | ||
| } = p.into_data_parts(); | ||
| Ok(RecursiveCanonical(Canonical::Primitive(unsafe { | ||
| PrimitiveArray::new_unchecked_from_handle(buffer, ptype, validity.execute(ctx)?) | ||
| }))) | ||
| } | ||
| Canonical::Decimal(d) => { | ||
| let DecimalDataParts { | ||
| decimal_dtype, | ||
| values, | ||
| values_type, | ||
| validity, | ||
| } = d.into_data_parts(); | ||
| Ok(RecursiveCanonical(Canonical::Decimal(unsafe { | ||
| DecimalArray::new_unchecked_handle( | ||
| values, | ||
| values_type, | ||
| decimal_dtype, | ||
| validity.execute(ctx)?, | ||
| ) | ||
| }))) | ||
| } | ||
| Canonical::VarBinView(vbv) => { | ||
| let VarBinViewDataParts { | ||
| dtype, | ||
| buffers, | ||
| views, | ||
| validity, | ||
| } = vbv.into_data_parts(); | ||
| Ok(RecursiveCanonical(Canonical::VarBinView(unsafe { | ||
| VarBinViewArray::new_handle_unchecked( | ||
| views, | ||
| buffers, | ||
| dtype, | ||
| validity.execute(ctx)?, | ||
| ) | ||
| }))) |
There was a problem hiding this comment.
want to add a todo for using recursively_canonicalize_slots for all arrays here.
| let mut chunks = Vec::with_capacity(input.len()); | ||
|
|
||
| for idx in 0..input.len() { | ||
| let scalar = input.execute_scalar(idx, ctx)?; | ||
| let output = variant_get_scalar(&scalar, options, &dtype)?; | ||
| chunks.push(ConstantArray::new(output, 1).into_array()); | ||
| } | ||
|
|
||
| let array = ChunkedArray::try_new(chunks, dtype.clone())?.into_array(); | ||
| if dtype.is_variant() { | ||
| VariantArray::try_new(array, None).map(|array| array.into_array()) | ||
| } else { |
There was a problem hiding this comment.
this looks very slow, can you add a comment explaining this?
There was a problem hiding this comment.
This is the most general behavior we have, ideally we push it down into children that can do smarter stuff.
|
|
||
| Ok(options | ||
| .dtype() | ||
| .map_or(DType::Variant(Nullability::Nullable), DType::as_nullable)) |
There was a problem hiding this comment.
Why do you store as option dtype as possibly non-nullable? There is a hidden cast
| pub trait VariantArrayExt: TypedArrayRef<Variant> { | ||
| fn child(&self) -> &ArrayRef { | ||
| self.as_ref().slots()[0] | ||
| /// Returns the raw storage that preserves the full variant value for every row. |
There was a problem hiding this comment.
So this does contain shredded data again
| pub fn try_new(core_storage: ArrayRef, shredded: Option<ArrayRef>) -> VortexResult<Self> { | ||
| let dtype = core_storage.dtype().clone(); | ||
| vortex_ensure!( | ||
| matches!(dtype, DType::Variant(_)), | ||
| "VariantArray core_storage dtype must be Variant, found {dtype}" | ||
| ); | ||
| let len = core_storage.len(); | ||
| let stats = core_storage.statistics().to_owned(); | ||
| Ok(Array::try_from_parts( | ||
| ArrayParts::new(Variant, dtype, len, EmptyArrayData) | ||
| .with_slots(vec![Some(core_storage), shredded].into()), | ||
| )? | ||
| .with_stats_set(stats)) | ||
| } | ||
|
|
||
| /// Creates a new `VariantArray`. | ||
| pub fn new(child: ArrayRef) -> Self { | ||
| let dtype = DType::Variant(child.dtype().nullability()); | ||
| let len = child.len(); | ||
| let stats = child.statistics().to_owned(); | ||
| unsafe { | ||
| Array::from_parts_unchecked( | ||
| ArrayParts::new(Variant, dtype, len, EmptyArrayData) | ||
| .with_slots(smallvec![Some(child)]), | ||
| ) | ||
| pub fn new(core_storage: ArrayRef) -> Self { | ||
| Self::try_new(core_storage, None).vortex_expect("invalid VariantArray core_storage") | ||
| } |
There was a problem hiding this comment.
These methods have odd names and parameters
|
|
||
| impl Array<Variant> { | ||
| /// Creates a new `VariantArray` with raw core storage and optional shredded storage. | ||
| pub fn try_new(core_storage: ArrayRef, shredded: Option<ArrayRef>) -> VortexResult<Self> { |
There was a problem hiding this comment.
here explain or link to what you pass here
|
@claude can you review this? |
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
| @@ -198,7 +223,12 @@ fn scalar_from_shredded_object_scalar( | |||
| if !unshredded.is_null() { | |||
| let unshredded = unshredded.as_struct(); | |||
There was a problem hiding this comment.
as_struct here should not panic here as long as the underlying parquet array is not corrupt right? Any present unshredded data when there is a shredded value has to be a struct
| }) | ||
| .transpose()?; | ||
| let core_storage = core_storage_without_typed_value(&array)?; | ||
| Ok(ExecutionResult::done( |
There was a problem hiding this comment.
I know there must be lots of discussion about this and I think having variant as a dtype is fine, but it is still weird for me to have a fully shredded int array and we still have to canonicalise it into a vortex variant array with empty core_storage, instead of returning a primitive array
| // TODO(joe)[#7674]: iterative execution here too | ||
| Ok(ExecutionResult::done(_canonicalize(array.as_view(), ctx)?)) | ||
| } | ||
| DType::Variant(..) => Ok(ExecutionResult::done(array)), |
There was a problem hiding this comment.
aren't we potentially returning a non canonical array here? the executor would try to call execute repeatedly getting back the same array then bail failing to match the array with canonical?
robert3005
left a comment
There was a problem hiding this comment.
I think everything apart from ChunkedArray(Variant) execute are just nits
| }; | ||
|
|
||
| let typed = shredded.execute_scalar(index, ctx)?; | ||
| let fallback = (typed.is_null() || typed.dtype().is_struct()) |
| chunks.push(ConstantArray::new(output, 1).into_array()); | ||
| } | ||
|
|
||
| let array = ChunkedArray::try_new(chunks, dtype.clone())?.into_array(); |
There was a problem hiding this comment.
this is very weird. How does this become canonical?
| // TODO(joe)[#7674]: iterative execution here too | ||
| Ok(ExecutionResult::done(_canonicalize(array.as_view(), ctx)?)) | ||
| } | ||
| DType::Variant(..) => Ok(ExecutionResult::done(array)), |
There was a problem hiding this comment.
This seems incorrect. Can you add a test for canonicalizing variant_get expression?
| let fallback = make_fallback(ctx)?; | ||
| let typed_mask = typed.is_not_null()?; | ||
| typed_mask | ||
| .zip(typed, fallback)? |
There was a problem hiding this comment.
as a follow up, we are computing the entire fallback array here but only take indices where typed_mask is false, so we can narrow the core_storage to only the false indices and compute from there then zip
e16b60d to
7546876
Compare
Signed-off-by: "Adam Gutglick" <adam@spiraldb.com> Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: "Adam Gutglick" <adam@spiraldb.com> Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Expose canonical VariantArray core_storage and optional shredded children, and surface Parquet typed_value as canonical shredded storage during canonicalization. Signed-off-by: "Adam Gutglick" <adam@spiraldb.com> Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Keep canonical VariantArray core_storage and shredded children row-aligned through slice, filter, take, mask validity, and canonicalization paths. Signed-off-by: "Adam Gutglick" <adam@spiraldb.com> Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: "Adam Gutglick" <adam@spiraldb.com> Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: "Adam Gutglick" <adam@spiraldb.com> Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
I, Adam Gutglick <adam@spiraldb.com>, hereby add my Signed-off-by to this commit: cf8065c I, Adam Gutglick <adam@spiraldb.com>, hereby add my Signed-off-by to this commit: fdc1e44 I, Adam Gutglick <adam@spiraldb.com>, hereby add my Signed-off-by to this commit: 8ef03b2 I, Adam Gutglick <adam@spiraldb.com>, hereby add my Signed-off-by to this commit: 9b6eeb4 I, Adam Gutglick <adam@spiraldb.com>, hereby add my Signed-off-by to this commit: c0764b3 I, Adam Gutglick <adam@spiraldb.com>, hereby add my Signed-off-by to this commit: 5ff2cb8 Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
I, Adam Gutglick <adam@spiraldb.com>, hereby add my Signed-off-by to this commit: be87775 I, Adam Gutglick <adam@spiraldb.com>, hereby add my Signed-off-by to this commit: 175ed3f I, Adam Gutglick <adam@spiraldb.com>, hereby add my Signed-off-by to this commit: c2d25ff I, Adam Gutglick <adam@spiraldb.com>, hereby add my Signed-off-by to this commit: c92afe0 I, Adam Gutglick <adam@spiraldb.com>, hereby add my Signed-off-by to this commit: 67895f8 I, Adam Gutglick <adam@spiraldb.com>, hereby add my Signed-off-by to this commit: 8f61afc Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
7546876 to
0ff0f40
Compare
Summary
This PR includes two big changes as Variant moves closer to readiness.
shreddedchild of a variant array in the canonical VariantArrayVariantGetexpression that can pull extract data out of variant arrays, either in a typed way or as a more opaqueVariant.For reviewers, some relevant context might be:
I think the Parquet spec is also a pretty good read and a very heavy influence of this work -
Shreddingand theVariant type.