Skip to content

[VL][Delta] Delta CI pipeline#12278

Open
felipepessoto wants to merge 21 commits into
apache:mainfrom
felipepessoto:delta_pipeline
Open

[VL][Delta] Delta CI pipeline#12278
felipepessoto wants to merge 21 commits into
apache:mainfrom
felipepessoto:delta_pipeline

Conversation

@felipepessoto

Copy link
Copy Markdown
Contributor

What changes are proposed in this pull request?

Fix #9296

I wanted to create this PR to start discussing this, so we can have an idea of how it would work, if this is worth, etc.

Tests are failing, it can help uncover bugs.

How was this patch tested?

CI

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Copilot

Copilot AI review requested due to automatic review settings June 11, 2026 20:43
@github-actions github-actions Bot added the INFRA label Jun 11, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a GitHub Actions pipeline to build a Gluten Velox bundle and run Delta Lake’s spark module unit tests against it, including automation to patch a Delta checkout and adjust its build config for the CI run.

Changes:

  • Introduces a new delta_spark_ut.yml workflow with native build, bundle assembly, and sharded Delta test execution.
  • Adds a setup-delta.sh helper to clone Delta, inject the bundle jar into the correct sbt project lib/, patch DeltaSQLCommandTest, and disable a scalastyle header check.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
.github/workflows/util/delta-spark-ut/setup-delta.sh Automates cloning/patching Delta and adjusting scalastyle to make Gluten-enabled tests run.
.github/workflows/delta_spark_ut.yml Defines the CI workflow to build Gluten artifacts and execute Delta Spark unit tests in shards.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread .github/workflows/delta_spark_ut.yml
Comment thread .github/workflows/delta_spark_ut.yml
Comment thread .github/workflows/delta_spark_ut.yml
Comment thread .github/workflows/util/delta-spark-ut/setup-delta.sh
@felipepessoto

Copy link
Copy Markdown
Contributor Author

@zhztheplayer

Copy link
Copy Markdown
Member

Can we remove the duplicated tests in Gluten's codebase, if they are covered by the new way?

@felipepessoto

Copy link
Copy Markdown
Contributor Author

I guess so. Idk exactly what are these duplicates, if they are exact copies, and from which versions.

@philo-he philo-he left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felipepessoto, thanks for the PR.

cancel-in-progress: true

jobs:
build-native-lib-centos-7:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this job a duplicate of the one in velox_backend_x86.yml? If so, we might consider moving the Delta tests into that file to reuse the native artifact it builds, since that artifact can't be shared across different workflows.

This is also a consideration for reducing our GHA usage, see #12288

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@philo-he if reducing CI usage is a priority, introducing Delta CI can be a big problem, Delta tests run for many hours and requires 4~8 shards to run a reasonable time (2-3h).

@github-actions github-actions Bot added the DOCS label Jun 14, 2026
Copilot AI review requested due to automatic review settings June 14, 2026 00:20

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Comment thread .github/workflows/util/delta-spark-ut/compare-test-results.py
Comment thread .github/workflows/util/delta-spark-ut/compare-test-results.py
Comment thread .github/workflows/util/delta-spark-ut/setup-delta.sh
Copilot AI review requested due to automatic review settings June 14, 2026 04:44

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Comment thread .github/workflows/delta_spark_ut.yml
Comment thread .github/workflows/util/delta-spark-ut/record-crash.py Outdated
Comment thread .github/workflows/util/delta-spark-ut/compare-test-results.py
Comment thread .github/workflows/delta_spark_ut.yml
Copilot AI review requested due to automatic review settings June 14, 2026 05:57
@github-actions github-actions Bot added the VELOX label Jun 14, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Comment thread .github/workflows/delta_spark_ut.yml
Comment thread .github/workflows/delta_spark_ut.yml
Comment thread .github/workflows/util/delta-spark-ut/setup-delta.sh
Comment thread .github/workflows/util/delta-spark-ut/compare-test-results.py
Comment thread .github/workflows/util/delta-spark-ut/compare-test-results.py
Comment thread cpp/velox/substrait/SubstraitToVeloxExpr.cc

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 8 comments.

Comment thread .github/workflows/util/delta-spark-ut/known-failures.txt Outdated
Comment thread .github/workflows/delta_spark_ut.yml
Comment thread cpp/velox/substrait/SubstraitToVeloxExpr.cc
Comment thread .github/workflows/util/delta-spark-ut/setup-delta.sh

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 5 comments.

Comment thread cpp/velox/substrait/SubstraitToVeloxExpr.cc
Comment thread .github/workflows/util/delta-spark-ut/setup-delta.sh
@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

felipepessoto and others added 4 commits June 14, 2026 22:41
Adds a GitHub Actions workflow `delta_spark_ut.yml` (plus a helper
`util/delta-spark-ut/setup-delta.sh`) that runs delta-io/delta's `spark`
sbt module unit tests with a Gluten Velox bundle built from this
repository's source on the classpath. The pipeline lets us catch
regressions in Gluten against the latest Delta release before they
reach users.

Triggers:
  * `workflow_dispatch` with overridable `delta_ref` (default v4.2.0),
    `spark_version` (default 4.1), and `test_parallelism` (default 1).
  * `pull_request` when the workflow files, `gluten-delta/**`, or the
    reused `backends-velox/.../DeltaSQLCommandTest.scala` change.

Three jobs:

  1. `build-native-lib-centos-7` -- builds the Velox/Gluten native
     libraries in the `apache/gluten:vcpkg-centos-7-gcc13` container
     (x86_64), reusing `dev/ci-velox-buildstatic-centos-7.sh` and the
     ccache layout other pipelines already use. Uploads the cpp/build
     tree and the Arrow jars from the local m2 repo as artifacts.

  2. `build-gluten-bundle` -- in `apache/gluten:centos-9-jdk17`,
     downloads the native artifacts and runs `mvn clean install` with
     `-Pspark-4.1 -Pscala-2.13 -Pjava-17 -Pbackends-velox -Pdelta`
     (install, not package, so the gluten-delta jar lands in m2 before
     the shaded fat jar is built; `-Dmaven.compiler.release=17`
     defeats any user settings.xml pinning release=1.8). Uploads the
     resulting `gluten-velox-bundle-spark4.1_2.13-linux_amd64-*.jar`.

  3. `delta-spark-test` -- in `apache/gluten:centos-9-jdk17`, sharded
     8-ways (matching Delta's upstream `spark_test.yaml`), runs
     `setup-delta.sh` and then `sbt spark/test`. `timeout-minutes: 300`
     to fail faster than the 6h GH job timeout. Test reports are
     uploaded unconditionally; on failure we also upload `hs_err_pid*`
     and `/tmp/*.hprof(.gz)` heap dumps for post-mortem analysis.

Non-obvious design decisions captured in the code (and worth flagging
for reviewers):

* **Bundle goes only in `delta/spark-unified/lib/`, NOT `delta/spark/lib/`.**
  sbt auto-scans `<baseDirectory>/lib` via `unmanagedBase`, and
  `unmanagedJars` are project-scoped (not inherited by dependents). In
  Delta v4.2.0 the layout is
    - `sparkV1 = project in file("spark")`     -> `spark/lib`
    - `spark   = project in file("spark-unified")` -> `spark-unified/lib`
  Placing the bundle only under `spark-unified/lib/` exposes it to the
  unified `spark` project's Compile and Test classpaths (where the
  forked test JVM loads `org.apache.gluten.GlutenPlugin` by name) while
  keeping it off `sparkV1`'s Compile classpath. The latter matters
  because the bundle pulls extra symbols under `org.apache.spark.sql`
  into scope, which would collide with Delta's own
  `MergeOutputGeneration.scala` (it imports both `org.apache.spark.sql._`
  and `org.apache.spark.sql.delta.ClassicColumnConversions._`).

* **`JAVA_TOOL_OPTIONS` carries the Netty + JDK-17 `--add-opens` set,
  but NOT `-Xmx`.** The flag set mirrors `extraJavaTestArgs` from
  Gluten's own root `pom.xml`. The crucial non-Delta-default flag is
  `-Dio.netty.tryReflectionSetAccessible=true`: without it Gluten's
  bundled Arrow allocator throws
  `UnsupportedOperationException: sun.misc.Unsafe or DirectByteBuffer
  ... not available` on first direct-buffer allocation. Delta's own
  `Test/javaOptions` (see `project/CrossSparkVersions.scala`
  `java17TestSettings`) already sets the base `--add-opens` but not
  the Netty property -- their own suites just don't load Arrow this way.
  `-Xmx` is intentionally NOT in `JAVA_TOOL_OPTIONS`: it is processed
  BEFORE the command line, so Delta's explicit `-Xmx1024m` would still
  win (last `-Xmx` wins).

* **Forked test JVM heap is bumped to `-Xmx6G` via
  `set spark / Test / javaOptions ++= Seq("-Xmx6G", ...)`.** `++=`
  appends to Delta's own seq, so our `-Xmx6G` lands AFTER `-Xmx1024m`
  and wins. Delta v4.2.0's 1G cap is far too tight once Gluten +
  Velox + Arrow + JNI are loaded -- a DV+CDC merge suite OOMs inside
  `RoaringBitmapArray.extendBitmaps` at that heap size. We also turn
  on `HeapDumpOnOutOfMemoryError` with `HeapDumpPath=/tmp/` so future
  OOMs come with a dump. `_JAVA_OPTIONS` is avoided because it would
  also override the sbt launcher heap and defeat budget accounting.

* **sbt launcher heap is `-J-Xmx4G`** -- sbt only compiles tests and
  orchestrates the test fork, so 4G is comfortable and leaves more of
  the 16 GB GH runner for the forked test JVM.

* **Single shared sbt/Ivy/Coursier cache across all shards** -- all
  shards have the same dependency tree, so a single cache key (with
  parallel saves resolved first-write-wins by GH) gives a better
  storage / hit-rate tradeoff than 8 isolated caches.

* **`yum install` does NOT include `curl`** -- the
  `apache/gluten:centos-9-jdk17` image ships `curl-minimal`, which
  provides the `curl` command (used by `sbt-launch-lib.bash` to
  download `sbt-launch.jar`) but conflicts with the full `curl`
  package. Installing the full package would fail with
  "package curl-minimal... conflicts with curl".

* **`DeltaSQLCommandTest` is reused, not duplicated.** `setup-delta.sh`
  overwrites Delta's own `DeltaSQLCommandTest.scala` with the existing
  copy under `backends-velox/src-delta40/...`. That file registers the
  Gluten plugin via typed `GlutenConfig` / `VeloxDeltaConfig`
  references, which resolve at test-compile time because the bundle is
  already on the unified `spark` project's Test classpath.

* **Delta's scalastyle `HeaderMatchesChecker` is disabled in the cloned
  Delta's `scalastyle-config.xml`.** Our reused
  `DeltaSQLCommandTest.scala` carries Gluten's ASF-only license header,
  which does not match Delta's expected regex (ASF + Spark-mod block +
  Delta copyright). `HeaderMatchesChecker` is a file-level checker that
  does not honor `// scalastyle:off`, so it has to be disabled at
  config level. The config is wired in via
  `ThisBuild / scalastyleConfig` in `project/Checkstyle.scala`, so a
  single edit covers every sbt sub-project.

Scope is intentionally limited to Velox + x86_64 to keep the matrix
small. ClickHouse and aarch64 can be added later if needed.

Co-authored-by: Copilot
Running delta-io/delta's spark suite against the Gluten Velox bundle
produces many expected failures. This adds a committed known-failures
baseline and a per-shard gate so CI is green when only baseline failures
occur and red on a genuine regression, enabling incremental fixes.

- Inject ScalaTest's -u JUnit XML reporter (Delta only configures the
  console reporter, so no machine-readable per-test results existed).
- Capture sbt's exit so expected test failures don't fail the step; fail
  loudly only when zero reports are produced (compile/launch failure).
- compare-test-results.py classifies each test vs known-failures.txt
  (regression / expected / now-passing) in enforce/seed/aggregate modes.
- Add update_baseline + fail_on_fixed inputs and an aggregate job that
  emits a ready-to-commit baseline artifact.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
At 8-way sharding, ~5 of the 8 delta-spark-test shards consistently exceed
the 300-minute job timeout: Delta's generated merge/CDC/DV suites are very
slow under Gluten. Observed in run 16 (shards 0,1,2,3,6 all hit the 300-min
cap and were cancelled, while 4,5,7 finished in <130 min), and reproduced
on the current run. Cancelled shards never upload their results, so the
known-failures baseline could only ever capture a subset of suites.

Delta's GreedyHashStrategy balances high-duration suites across shards by
estimated duration, so doubling NUM_SHARDS 8 -> 16 roughly halves per-shard
wall time and lets every shard finish (and report) within the timeout.
Bump timeout-minutes 300 -> 350 for extra margin. TEST_PARALLELISM_COUNT
stays 1 (each forked test JVM uses ~8G, so >1 would OOM the runner).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…array/map

SubstraitVeloxExprConverter::toVeloxExpr(FieldReference) resolves a nested
struct_field chain by calling asRowType() on each child to descend a level.
When the reference traverses a non-struct child -- e.g. a field nested under
an array, as exercised by Delta's
"nested data support - analysis error - updating array type" UpdateSuite test
-- asRowType() returns null and the next loop iteration dereferenced that null
RowType, crashing the whole forked JVM with a SIGSEGV in libvelox.so
(gluten::SubstraitVeloxExprConverter::toVeloxExpr). A SIGSEGV is not catchable,
so plan validation could not fall back and the test process died outright.

Replace the unchecked navigation with VELOX_USER_CHECK guards (field-index
bounds + non-null struct child) that throw a VeloxUserError. The plan validator
already wraps expression conversion in try/catch(VeloxException), so the query
now falls back to vanilla execution cleanly instead of crashing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Shard 2 (and intermittently other shards) hangs indefinitely after a suite's
last test -- silent until the 350-min job timeout with zero diagnostics.
ScalaTest's failAfter only wraps individual test BODIES, so a wedge in suite
teardown/afterAll, or in a non-interruptible native Velox/JNI call that ignores
Thread.interrupt(), has no timeout.

Add a background watchdog to the Delta test step: if the test output stays
silent for 15 min, it locates the sbt.ForkMain test JVM (scanning /proc, and
reading sbt's @argfile since recent sbt keeps the main class out of the
cmdline) and captures up to three jstack thread dumps -- to the job log (so they
survive a timeout cancellation) and to a per-shard artifact. This makes the
chronic hang diagnosable instead of an opaque timeout.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 14, 2026 22:58

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 6 comments.

Comment thread cpp/velox/substrait/SubstraitToVeloxExpr.cc
Comment thread .github/workflows/delta_spark_ut.yml
Comment thread .github/workflows/util/delta-spark-ut/setup-delta.sh
felipepessoto and others added 4 commits June 15, 2026 01:25
Run apache#12 shard 2 hung ~63 min in DeletionVectorsSuite with ZERO thread dumps
captured. Root cause: find_fork() probed only /proc cmdline/@argfile for
sbt.ForkMain; recent sbt keeps the main class out of both, so it matched nothing
and the watchdog silently did nothing.

- Locate the fork via `jps -l` (reads the main class from hsperfdata, robust to
  the @argfile launch) with the /proc scan as fallback.
- Safety net: if the fork still can't be pinpointed, jstack EVERY JVM so a hang
  is never left undiagnosed.
- Wrap jstack in `timeout` and log failures, and dump per-pid files for the
  artifact.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e aggregations"

TEMPORARY -- validation only. Per the pipeline philosophy (errors are expected,
only hangs/crashes that block CI progress matter), check whether the ClassCast
fix is still needed after the rebase: apache/main now has apache#12229 (basic
TIMESTAMP_NTZ support), which may make the min/max-over-TIMESTAMP_NTZ stats
aggregation offloadable -- so the ProjectExec->WholeStageTransformer cast no
longer throws, and the leaked single-thread executor that could wedge teardown
no longer allocates. Push without the fix and watch CI: if it progresses (no
hang) the fix is redundant; if shard(s) hang, restore it.

This reverts commit 1dde2b3.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Validation result (run apache#13, ClassCast fix removed): shard 2 HUNG at the
ClassCastException (ProjectExec cannot be cast to WholeStageTransformer) in
DataSkipping TIMESTAMP_NTZ -- it never reached later suites. Per the pipeline
criterion (errors ok, hangs are not), removing the fix blocks CI, so the fix is
required after all (apache#12229 TIMESTAMP_NTZ support did NOT make the stats
aggregation offloadable -- the ClassCastException still occurs without the fix).

This restores commit 1dde2b3 (reverts the temporary validation revert
55de9e0).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eartbeat

The watchdog captured ZERO dumps in runs apache#12 and apache#13 despite shard 2 hanging for
30-60 min. Root cause: the step shell runs with `bash -eo pipefail`, inherited
by the watchdog subshell. When fork detection returned non-zero (jps not listing
sbt.ForkMain, /proc miss), errexit SILENTLY killed the watchdog before it dumped.

- `set +e +o pipefail` inside the subshell -- a diagnostic must never abort on a
  failed probe. (Verified locally: the subshell now survives a no-match probe.)
- Dump via `kill -QUIT` so the JVM prints its thread dump to its own stderr,
  which sbt relays into the job log through the SAME stream as test output (a
  separately spawned jstack child's stdout can be buffered/lost). Verified
  locally that SIGQUIT yields a full thread dump. jstack still written to a file
  for the artifact.
- Dump EVERY JVM when the fork can't be pinpointed (safety net).
- Startup "armed" line + ~5-min heartbeat so the watchdog is self-verifying --
  you can see it is alive and how long output has been silent, before any hang.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 15, 2026 03:23

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.

felipepessoto and others added 2 commits June 15, 2026 05:10
…lush)

The v3 watchdog works (verified in run apache#14: 'armed' + 5-min heartbeats), but a
hung JVM stalls the shard until the 350-min timeout AND keeps the in-progress job
log frozen at the hang point -- so GitHub never flushes the watchdog's SIGQUIT
dump / the artifact only uploads at job end, making the captured dump unreachable
for hours.

After capturing the dump (job log + jstack artifact), SIGKILL the wedged JVM(s).
The suite then fails fast (acceptable -- errors are expected; only an
unrecoverable hang blocks CI), the shard proceeds or ends, and the log +
thread-dump artifacts flush so the hang is actually diagnosable. Turns a 350-min
stall into a fast failure with a thread dump.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…OM-kill

The chronic 'shard 2 hang' is sbt's main process wedging forever in
ScalaTestRunner.done()->Thread.join() after the forked test JVM dies. Run apache#15
showed that death has NO hs_err and NO heap dump -- i.e. a sudden kernel/cgroup
OOM-kill, not a JVM heap OOM. Cause: on the ~16G ubuntu-22.04 runner the budget
was sbt launcher 4G + fork heap 8G + a growing Velox off-heap pool, which tips
over and the kernel kills the fork.

The 8G fork heap was added earlier only to absorb a DV+CDC merge suite's giant
RoaringBitmapArray allocation -- a Gluten bug (garbage native row index) that is
now FIXED UPSTREAM by apache#12269. So 8G is no longer needed and was actively starving
Velox's off-heap. Drop it to 4G (still 4x Delta's 1G default), leaving ~4G more
headroom for off-heap. Also log cgroup memory.peak / memory.events after the run
to confirm/measure the OOM-kill.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 15, 2026 07:02

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.

Comment thread .github/workflows/delta_spark_ut.yml
Comment thread cpp/velox/substrait/SubstraitToVeloxExpr.cc
felipepessoto and others added 2 commits June 15, 2026 09:27
Run apache#16's cgroup diagnostic confirmed the chronic shard-2 hang is a kernel
OOM-kill of the forked test JVM: memory.peak = 15.97G on the ~16G runner. The
JVM HEAPS drive that peak, not off-heap (spark.memory.offHeap.size is already
capped at 2g in the patched DeltaSQLCommandTest). Gluten offloads data to Velox
off-heap, so the fork's JVM heap need is modest -- Delta's own default is 1G.
Drop the fork heap to 2G (the 8G/4G bumps were for a DV-write RoaringBitmap OOM
that apache#12269 fixed) to bring the peak to ~13G with real headroom, so the fork
stops getting OOM-killed and shard 2 can complete. Heap-dump-on-OOM stays to
catch any genuine >2G heap need.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Heap tuning (8G->4G->2G) left the cgroup memory.peak unchanged at ~15.9G, proving
the shard-2 OOM-kill is driven by NATIVE/off-heap memory, not the JVM heap. To
target the real fix, log a per-minute memory profile from the watchdog loop:
cgroup memory.current + each JVM's VmRSS (read from /proc, no ps dependency). The
lines just before a hang reveal which process (sbt launcher vs forked test JVM)
is the actual memory hog.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 15, 2026 14:15

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 5 comments.

statsResultAttrs,
StatisticsInputNode(dataCols))
val projOp = ProjectExec(statsResultAttrs, aggOp)
val offloads = Seq(OffloadOthers()).map(_.toStrcitRule())
statsResultAttrs,
StatisticsInputNode(dataCols))
val projOp = ProjectExec(statsResultAttrs, aggOp)
val offloads = Seq(OffloadOthers()).map(_.toStrcitRule())
Comment on lines +240 to +242
VELOX_USER_CHECK_NOT_NULL(
inputColumnType,
"Nested field reference into a non-struct type (e.g. an array or map element) is not supported.");
Comment on lines +220 to +224
VELOX_USER_CHECK(
idx >= 0 && static_cast<uint32_t>(idx) < inputColumnType->size(),
"Field reference index {} is out of range for the {}-field row type.",
idx,
inputColumnType->size());
- 'backends-velox/src-delta40/**/DeltaSQLCommandTest.scala'

env:
ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION: true
felipepessoto and others added 2 commits June 15, 2026 16:00
The run apache#18 per-minute MEM profiler pinpointed the native OOM-kill driver: the
sbt LAUNCHER JVM grows to a rock-steady 5457M RSS during the test-compile and
then HOLDS ~5.3G for the entire run, even though it is idle during the long test
phase (it only relays the fork's test events). That fixed ~5.3G, stacked on the
forked test JVM's native spike in a heavy suite, is what pushed the cgroup to the
~15.9G OOM-kill that wedged shard 2. G1 never uncommits on its own because the
idle launcher never GCs.

Fix (behaviour-neutral -- touches NO Gluten/Spark runtime config, so it cannot
change the measured pass/fail signal): keep -Xmx4G for the test-compile headroom
but force the idle launcher to return memory -- periodic GC every 10s
(G1PeriodicGCInterval) with the system-load gate disabled (the busy fork would
otherwise suppress it), as a full STW collection (-G1PeriodicGCInvokesConcurrent)
that uncommits to a tight free ratio (Min/MaxHeapFreeRatio 5/15, JEP 346) above a
low -Xms512m floor. The idle launcher drops from ~5.3G to ~1-2G during the test
phase, cutting the cgroup peak ~3.8G (~15.9G -> ~12G) with zero compile-OOM risk.
The MEM profiler stays in to confirm the drop and that shard 2 stops OOM-dying.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eline branch

Removes the GlutenDeltaJobStatsTracker change (originally 1dde2b3, restored in
7a03784) that routed non-offloadable Delta stats aggregations -- e.g. min/max
over TIMESTAMP_NTZ -- to the row-based fallback tracker instead of crashing with
ClassCastException (ProjectExec cannot be cast to WholeStageTransformer).

It was kept here on the theory it was needed to stop the shard-2 hang, but that
hang was since root-caused to a native/off-heap OOM-kill (DeletionVectorsSuite's
'huge table: 2B rows' materializing a ~13G row-index), which the hang watchdog
self-resolves. The ClassCast fix only MOVED the hang; it never prevented it. Per
this pipeline's criterion (test failures are acceptable as long as CI progresses;
only a progress-blocking hang/crash warrants a fix) the change is not required on
this branch, and dropping it keeps the PR focused on the CI infrastructure. The
ClassCastException fix is a legitimate improvement and can land as its own PR.

Reverts the fix in both the Delta 3.x (src-delta33) and 4.x (src-delta40) copies.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 15, 2026 18:24

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 6 comments.

Comment on lines +220 to +224
VELOX_USER_CHECK(
idx >= 0 && static_cast<uint32_t>(idx) < inputColumnType->size(),
"Field reference index {} is out of range for the {}-field row type.",
idx,
inputColumnType->size());
Comment on lines +240 to +242
VELOX_USER_CHECK_NOT_NULL(
inputColumnType,
"Nested field reference into a non-struct type (e.g. an array or map element) is not supported.");
Comment on lines +61 to +68
echo "::group::Cloning delta-io/delta @ ${DELTA_REF}"
# Shallow clone the requested tag/branch. Fall back to full clone when the ref is a SHA.
if ! git clone --depth 1 --branch "$DELTA_REF" https://github.com/delta-io/delta.git "$DELTA_DIR"; then
echo "Shallow clone of ref '${DELTA_REF}' failed, falling back to full clone."
rm -rf "$DELTA_DIR"
git clone https://github.com/delta-io/delta.git "$DELTA_DIR"
git -C "$DELTA_DIR" checkout "$DELTA_REF"
fi
Comment on lines +165 to +167
for pattern in ("**/TEST-*.xml", "**/target/**/*.xml"):
xml_files.extend(glob.glob(os.path.join(reports_dir, pattern), recursive=True))
xml_files = sorted(set(xml_files))
Comment on lines +66 to +68
env:
ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION: true
MVN_CMD: 'build/mvn -ntp'
Comment on lines +199 to +203
# Match the renamed fat jar produced by package/pom.xml's copy-fat-jar
# exec. The version part may bump (e.g. 1.7.0-SNAPSHOT -> 1.8.0-SNAPSHOT),
# so glob the version suffix.
jar=$(ls package/target/gluten-velox-bundle-spark${{ env.GLUTEN_BUNDLE_SPARK_VERSION }}_${{ env.GLUTEN_BUNDLE_SCALA_VERSION }}-linux_amd64-*.jar | head -n 1)
if [ -z "$jar" ] || [ ! -f "$jar" ]; then
felipepessoto and others added 2 commits June 15, 2026 21:25
The baseline was originally seeded from only 15 of 16 shards (run 27490052632)
because shard 2 hung/OOM-crashed and never produced results, so all of shard 2's
expected Delta-on-Gluten failures surfaced as regressions and failed its gate --
the sole red shard in the otherwise-green run.

Now that the hang watchdog and the launcher-heap reclaim let shard 2 run to a
gate result (851 tests, 60 failures), merge those 60 failures into the baseline.
The merge is purely ADDITIVE: all 894 existing entries are kept and the 60 new
shard-2 failures are added (894 -> 954). No entries are dropped -- deliberately
NOT replacing with the aggregate artifact, which would prune 67 baseline entries
that simply did not run this time (e.g. shard 7 ran only 15 tests), reintroducing
them as regressions the next time a shard runs further. now-passing was 0 across
all 16 shards, so nothing in the baseline needs removing.

The 60 added are genuine Gluten incompatibilities (DeltaVariantSuite x20,
DataSkippingDeltaV1Suite x20, CheckpointsSuite x5, ...), not OOM artifacts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…clean OOMs

Experiment: restrict the forked test JVM's native (Velox/Arrow) memory so a
runaway suite throws a clean Velox OOM instead of growing until the kernel
OOM-kills the whole fork. The kernel-kill is the severe failure mode -- it leaves
no hs_err, wedges sbt's main thread, and produces the chronic shard hang; a clean
OOM just fails the test (deterministic + baselineable) and the JVM survives.

In DeltaSQLCommandTest (src-delta40, the file this pipeline patches in):
  - spark.gluten.memory.isolation=true   -> off-heap becomes a HARD per-task cap
    (= spark.memory.offHeap.size / executorCores). Gluten documents this as
    'enable to avoid OOM' precisely because not all native memory is spillable.
  - spark.gluten.memory.overAcquiredMemoryRatio=0  -> drop the 30% over-acquire.
  - spark.memory.offHeap.size 2g -> 4g  -> keeps the per-task cap a reasonable
    ~1-2G (depending on the test master's core count) while the total managed
    off-heap stays <= 4g, so the fork stays well under the ~16G runner.

Expected, to be VERIFIED from the run (not assumed): shard 2's DeletionVectorsSuite
'2B rows' native balloon hits the cap and throws instead of kernel-killing (no
hang); and some other genuinely memory-heavy tests may also start throwing clean
OOMs (a net failure delta to measure). If the DV allocation turns out to be
untracked/unmanaged, it will still grow past the cap -- the run will show that too.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 16, 2026 02:20

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 6 comments.

Comment on lines +220 to +226
VELOX_USER_CHECK(
idx >= 0 && static_cast<uint32_t>(idx) < inputColumnType->size(),
"Field reference index {} is out of range for the {}-field row type.",
idx,
inputColumnType->size());
const TypePtr childType = inputColumnType->childAt(idx);
fieldAccess = makeFieldAccessExpr(inputColumnType->nameOf(idx), childType, fieldAccess);
Comment on lines +232 to +242
// Descending into a nested field is only valid when the current child is
// itself a struct/row. For array/map/primitive children (e.g. a field
// nested under an array, as in Delta's "updating array type" case)
// asRowType() returns null; previously the next loop iteration
// dereferenced that null RowType and crashed the process with a SIGSEGV.
// Throw a user error instead so plan validation fails cleanly and the
// query falls back to vanilla execution.
inputColumnType = asRowType(childType);
VELOX_USER_CHECK_NOT_NULL(
inputColumnType,
"Nested field reference into a non-struct type (e.g. an array or map element) is not supported.");
Comment on lines +51 to +65
// Bound native memory so a runaway suite (e.g. DeletionVectorsSuite's
// "2B rows" read, whose Velox native grows to ~13G) hits a HARD per-task
// cap and throws a clean Velox OOM -- a deterministic, baselineable test
// failure -- instead of growing until the kernel OOM-kills the whole fork
// (which wedges sbt into the chronic shard hang). `memory.isolation=true`
// makes the off-heap pool a hard per-task cap (= offHeap / executorCores);
// `overAcquiredMemoryRatio=0` drops Gluten's 30% over-acquire backup. The
// pool is bumped 2g->4g so the per-task cap stays a reasonable ~1-2G while
// the total managed off-heap stays <= 4g, keeping the fork well under the
// ~16G runner. NOTE: this is a HARD cap, so other genuinely memory-heavy
// tests may also start throwing clean OOMs -- the net failure delta is to
// be measured from the run, not assumed.
.set("spark.memory.offHeap.size", "4g")
.set("spark.gluten.memory.isolation", "true")
.set("spark.gluten.memory.overAcquiredMemoryRatio", "0")
Comment on lines +66 to +68
env:
ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION: true
MVN_CMD: 'build/mvn -ntp'
Comment on lines +165 to +166
for pattern in ("**/TEST-*.xml", "**/target/**/*.xml"):
xml_files.extend(glob.glob(os.path.join(reports_dir, pattern), recursive=True))
Comment on lines +10 to +13
# Full 16-shard baseline. Originally seeded from 15 of 16 shards (run
# 27490052632); shard 2 then hung/OOM-crashed and was excluded. Shard 2's
# 60 failures have now been merged in on top of that seed, so every shard
# enforces clean. 954 known failures total.
Reverts 75f9be1. spark.gluten.memory.isolation=true + overAcquiredMemoryRatio=0
+ offHeap 2g->4g did NOT bound the DeletionVectorsSuite 2B-rows native OOM: in
run apache#22 shard 2's fork still ballooned to ~13G and was OOM-terminated by the
kernel (one cgroup OOM event, memory.peak 14.92G, sbt exit 137) -- identical to
before. Restore the validated baseline-refresh config (offHeap=2g, no isolation)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Enhancement] [Build] Run Delta unit tests during PR validation

4 participants