Skip to content

[VL] Fix native crash in toVeloxExpr for nested field reference into array/map#12290

Open
felipepessoto wants to merge 3 commits into
apache:mainfrom
felipepessoto:fix-fieldref-nested-array-crash
Open

[VL] Fix native crash in toVeloxExpr for nested field reference into array/map#12290
felipepessoto wants to merge 3 commits into
apache:mainfrom
felipepessoto:fix-fieldref-nested-array-crash

Conversation

@felipepessoto

@felipepessoto felipepessoto commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

What changes are proposed in this pull request?

TL;DR fix this: https://github.com/apache/gluten/actions/runs/27487184961/job/81246596702?pr=12278
image

PR/CI with test only, before the fix: #12291 / https://github.com/apache/gluten/actions/runs/27490867910/job/81255583450?pr=12291

image

SubstraitVeloxExprConverter::toVeloxExpr(const ::substrait::Expression::FieldReference&, const RowTypePtr&) resolves a nested struct_field reference by walking the chain and descending one level at a time with inputColumnType = asRowType(inputColumnType->childAt(idx)).

When the field path traverses a non-struct child — for example a field nested under an array — asRowType() returns nullptr, and the next loop iteration dereferenced that null RowType, crashing the whole (forked) JVM with a SIGSEGV in libvelox.so:

# C  [libvelox.so+0x214951c]  gluten::SubstraitVeloxExprConverter::toVeloxExpr(substrait::Expression_FieldReference const&, std::shared_ptr<facebook::velox::RowType const> const&)+0x9c

Because a SIGSEGV is not a catchable C++ exception, SubstraitToVeloxPlanValidator (which wraps expression conversion in try/catch (VeloxException)) could not catch it and fall back — the process died outright.

This PR replaces the unchecked navigation with VELOX_USER_CHECK guards that throw a VeloxUserError:

  • the field-reference index is within range (the raw RowType::childAt/nameOf use unchecked operator[]), and
  • the child being descended into is actually a struct/row.

With these checks, an unsupported nested reference makes plan validation fail cleanly and the query falls back to vanilla Spark instead of crashing the JVM.

How was this patch tested?

Added cpp/velox/tests/SubstraitVeloxExprConverterTest.cc, a unit test that builds the exact crashing FieldReference — a reference descending into an array column — and asserts the converter now throws a VeloxUserError (via VELOX_ASSERT_THROW) instead of crashing the process; it also covers an out-of-range field index. Before this change the same input aborts the JVM with the SIGSEGV shown above.

This is the same code path hit by Delta Lake's UpdateSuite test "nested data support - analysis error - updating array type" (an UPDATE whose field path descends through an array), which is exercised by the Gluten Delta Spark UT CI: before this change the forked test JVM aborts with the SIGSEGV; after it, validation falls back to vanilla Spark and the query is handled correctly.

clang-format-15 reports no changes.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: GitHub Copilot CLI (claude-opus-4.8)

…array/map

SubstraitVeloxExprConverter::toVeloxExpr(FieldReference) resolves a nested
struct_field chain by calling asRowType() on each child to descend a level.
When the reference traverses a non-struct child -- e.g. a field nested under
an array, as exercised by Delta's
"nested data support - analysis error - updating array type" UpdateSuite test
-- asRowType() returns null and the next loop iteration dereferenced that null
RowType, crashing the whole forked JVM with a SIGSEGV in libvelox.so
(gluten::SubstraitVeloxExprConverter::toVeloxExpr). A SIGSEGV is not catchable,
so plan validation could not fall back and the test process died outright.

Replace the unchecked navigation with VELOX_USER_CHECK guards (field-index
bounds + non-null struct child) that throw a VeloxUserError. The plan validator
already wraps expression conversion in try/catch(VeloxException), so the query
now falls back to vanilla execution cleanly instead of crashing.

Add SubstraitVeloxExprConverterTest with a regression repro: a field reference
descending into an array column, and an out-of-range field index, both of which
must now throw instead of crashing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 14, 2026 06:34
@github-actions github-actions Bot added the VELOX label Jun 14, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds regression coverage and hardens Substrait field-reference conversion to fail gracefully (VeloxUserError) instead of crashing (SIGSEGV) on invalid nested paths and out-of-range indices.

Changes:

  • Add bounds checking for direct struct_field indices during field-access conversion.
  • Reject nested field descent into non-row types with a user error (instead of nullptr deref).
  • Add unit tests covering both failure modes and wire them into the Velox test target.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
cpp/velox/tests/SubstraitVeloxExprConverterTest.cc New regression tests for non-struct nested field references and out-of-range field indices.
cpp/velox/tests/CMakeLists.txt Registers the new test file in the Velox test suite.
cpp/velox/substrait/SubstraitToVeloxExpr.cc Adds range checks and non-row descent checks to prevent SIGSEGV and throw clean user errors.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cpp/velox/substrait/SubstraitToVeloxExpr.cc Outdated
::substrait::Expression::FieldReference fieldReference;
fieldReference.mutable_direct_reference()->mutable_struct_field()->set_field(5);

VELOX_ASSERT_THROW(SubstraitVeloxExprConverter::toVeloxExpr(fieldReference, inputType), "out of range");
@felipepessoto felipepessoto changed the title [VL] Fix native crash in toVeloxExpr for nested field reference into array/map [WIP] [VL] Fix native crash in toVeloxExpr for nested field reference into array/map Jun 14, 2026
@felipepessoto felipepessoto changed the title [WIP] [VL] Fix native crash in toVeloxExpr for nested field reference into array/map [VL] Fix native crash in toVeloxExpr for nested field reference into array/map Jun 14, 2026
@felipepessoto

felipepessoto commented Jun 14, 2026

Copy link
Copy Markdown
Contributor Author

@zhztheplayer @philo-he could you review this PR? I found the issue when running the Delta CI #12278.

I'm not familiar with this code, but with the fix the SIGSEGV are gone. I also created a PR without the fix and I could repro it: #12291 / https://github.com/apache/gluten/actions/runs/27490867910/job/81255583450?pr=12291.

Thanks

@zhztheplayer

Copy link
Copy Markdown
Member

cc @rui-mo

@rui-mo rui-mo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! I assume this change converts a core dump into a user-facing error. Have you looked into the cases where Spark could produce an out-of-range index? I'm wondering why Spark would generate this kind of plan.

felipepessoto added a commit to felipepessoto/gluten that referenced this pull request Jun 16, 2026
…crash

Adds VeloxDeltaNestedFieldArraySuite: a Delta MERGE that updates a field nested
under an array (`value.a` where `value` is `array<struct<a:int>>`). On an
unfixed build this drives SubstraitVeloxExprConverter::toVeloxExpr to descend
into the array, dereference the null RowType returned by asRowType(), and crash
the forked JVM with a SIGSEGV -- the Spark-level companion to the existing
SubstraitVeloxExprConverterTest.cc.

This reaches the native converter (and the crash) under the Spark 4.0/4.1
-Pdelta CI jobs (Delta 4.x). With the fix in apache#12290 the converter throws a
catchable VeloxUserError, Gluten falls back, and Delta raises the expected
AnalysisException.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@felipepessoto

Copy link
Copy Markdown
Contributor Author

Thanks for the fix! I assume this change converts a core dump into a user-facing error. Have you looked into the cases where Spark could produce an out-of-range index? I'm wondering why Spark would generate this kind of plan.

I tried multiple things, but I couldn't repro with simple unit test. The repro is at https://github.com/apache/gluten/actions/runs/27487184961/job/81246596702?pr=12278

image

So, I assume it is the next test:

https://github.com/delta-io/delta/blob/9f600c42dd9bc383c025ee6e43f8199571a2783b/spark/src/test/scala/org/apache/spark/sql/delta/MergeIntoSuiteBase.scala#L1291

Suite MergeIntoNestedDataSQLPathBasedSuite

The problem is this test is not a simple SQL command:

image

… CI

After validating with PR apache#12278 (all 79 CI checks passed including Delta tests),
confirmed that the explicit VELOX_USER_CHECK(idx < size) is redundant - Velox's
RowType::childAt/nameOf already have VELOX_CHECK_LT bounds checks that throw
catchable VeloxUserError, providing sufficient error handling for Gluten.

Kept the essential VELOX_USER_CHECK_NOT_NULL for the rowType, which prevents
SIGSEGV by catching null before dereferencing.

Updated test comment to reflect actual Velox behavior (built-in bounds checks).

Validation: apache@4a9d3d846
@felipepessoto

felipepessoto commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

@rui-mo

For (VELOX_USER_CHECK(idx < size)) -- I have not seen Spark produce this, and I don't think a well-formed plan would: the index comes from Spark's resolved GetStructField.ordinal / bound attribute ordinals, which the analyzer keeps in range. I added it only as defense-in-depth with a clearer message. In fact, in current Velox RowType::childAt/nameOf already
VELOX_CHECK_LT(idx, size), so an out-of-range index would already throw a catchable VeloxRuntimeError and fall back even without this check.

I'm dropping that change to keep the PR focused on the real crash fix.

https://github.com/facebookincubator/velox/blob/main/velox/type/Type.h#L1213C1-L1216C4

The test was checking for 'out of range' in the error message, but Velox's
VELOX_CHECK_LT throws with format 'Expression: idx < children_.size() (5 vs. 2)'.
Updated the test assertion to match the actual Velox error message.
Copilot AI review requested due to automatic review settings June 18, 2026 09:48

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment on lines +65 to +66
// Velox's VELOX_CHECK_LT throws with format "Expression: idx < children_.size() (5 vs. 2)"
VELOX_ASSERT_THROW(SubstraitVeloxExprConverter::toVeloxExpr(fieldReference, inputType), "idx < children_.size()");
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants