feat: decode inline Arrow IPC + warehouse-compat fallback #329 (Open)
jamesbroadhead wants to merge 16 commits
Renames the client-side analytics format model from "JSON"/"ARROW" to
"JSON_ARRAY"/"ARROW_STREAM" to match the Statement Execution API enum
verbatim — no more local-name to API-name translation.
Pure mechanical rename. No behavior change. Internal type values only;
the lowercase user-facing values passed to useChartData ("json", "arrow",
"auto") are unchanged.
Carved out of #256 (#327 is layer 1, this is layer 2). The actual
inline-Arrow-IPC + warehouse-fallback fix sits on top of this in layer 3.
Note: this is a breaking change for any direct consumer of
useAnalyticsQuery passing explicit format: "JSON" or "ARROW" — they will
need to update to "JSON_ARRAY" / "ARROW_STREAM". Consumers using
useChartData (lowercase "json"/"arrow"/"auto") are unaffected.
Co-authored-by: Isaac
Widen AnalyticsFormat to also include the pre-rename "JSON" and "ARROW" spellings, both marked @deprecated with a JSDoc note describing the removal condition (no consumer on appkit/appkit-ui < 0.33.0).

Add a normalizeAnalyticsFormat helper and call it at the analytics route handler entry point so all downstream code (cache key, format branching, formatParameters) continues to operate on the canonical "JSON_ARRAY" | "ARROW_STREAM" values. InferResultByFormat is widened to also match "ARROW" so callers passing the legacy spelling still get TypedArrowTable<...> inferred.

This lifts the breaking-change carve-out from the rename, so callers of useAnalyticsQuery({ format: "JSON" | "ARROW" }) keep working with only an IDE deprecation hint.

Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
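A minimal sketch of the widened union and normalizer this commit describes (type and helper names are taken from the commit text; everything around them is an assumption, not the actual appkit source):

```typescript
type CanonicalAnalyticsFormat = "JSON_ARRAY" | "ARROW_STREAM";

/** @deprecated pre-rename spellings; remove once no consumer is on appkit/appkit-ui < 0.33.0 */
type LegacyAnalyticsFormat = "JSON" | "ARROW";

type AnalyticsFormat = CanonicalAnalyticsFormat | LegacyAnalyticsFormat;

// Called once at the route-handler entry point, so all downstream code
// (cache key, format branching) only ever sees canonical values.
function normalizeAnalyticsFormat(format: AnalyticsFormat): CanonicalAnalyticsFormat {
  switch (format) {
    case "JSON":
      return "JSON_ARRAY";
    case "ARROW":
      return "ARROW_STREAM";
    default:
      return format; // already canonical
  }
}

console.log(normalizeAnalyticsFormat("JSON")); // JSON_ARRAY
```

Normalizing at the boundary keeps the deprecation a pure type-level concern: legacy callers compile with an IDE hint, and no branch below the entry point has to handle four spellings.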
Serverless warehouses return ARROW_STREAM + INLINE results as base64 Arrow IPC in result.attachment rather than result.data_array. The previous code path discarded inline data for any ARROW_STREAM response (designed for EXTERNAL_LINKS), so these warehouses silently returned empty results.

This commit makes the analytics plugin work across classic and serverless warehouses by handling both dispositions for ARROW_STREAM, decoding inline Arrow IPC attachments server-side, and falling back to JSON_ARRAY when a warehouse rejects ARROW_STREAM + INLINE.

Changes
- Inline Arrow IPC decoding (new arrow-schema.ts) via apache-arrow's tableFromIPC, producing the same row-object shape as JSON_ARRAY regardless of warehouse backend. apache-arrow@21.1.0 added as a server dep.
- Format fallback: ARROW_STREAM + INLINE requests automatically fall back to JSON_ARRAY if a classic warehouse rejects them. Explicit format requests are respected without fallback.
- Zod-validated SSE wire protocol for /api/analytics/query (shared schema between server and client; malformed payloads surface a clear error instead of silent undefined).
- Default remains JSON_ARRAY for compatibility.

Stack: layer 3 of 3 carved from #256.
- #327 — coverage backfill (layer 1)
- #328 — AnalyticsFormat rename to API enum names (layer 2)
- (this PR) — the actual fix

Fixes #242

Co-authored-by: Isaac
Six issues surfaced by GPT 5.4 xhigh + Gemini 3.1 Pro parallel review followed by an adversarial debate round (reviewer: GPT, critic: Gemini, meta: Claude Opus).

1. Raise SSE event-size cap from 8 MiB to 12 MiB on both server (streamDefaults.maxEventSize) and client (connectSSE.maxBufferSize). The inline Arrow attachment cap (MAX_INLINE_ATTACHMENT_BYTES) stays at 8 MiB *decoded*; base64 encoding + JSON + SSE framing inflate that to ~10.6 MiB on the wire, so 12 MiB leaves enough headroom for legal 8-MiB-decoded payloads to traverse the buffer.
2. Empty `data_array: []` is truthy, so zero-row ARROW_STREAM responses skipped empty-table synthesis and fell through to the JSON row transform — callers requesting Arrow got [] JSON rows. Length-check explicitly.
3. The arrow-fix commit dropped lowercase legacy "json" / "arrow" from DataFormat / resolveFormat(), silently breaking existing useChartData callers passing those spellings. Restore them as @deprecated aliases on the DataFormat union; resolveFormat() normalizes them to the canonical "JSON_ARRAY" / "ARROW_STREAM" return values.
4. The JSON_ARRAY -> ARROW_STREAM retry in DESCRIBE QUERY only fired on thrown exceptions. Some warehouses signal the rejection as `status.state === "FAILED"` instead. Extract the rejection-matcher helper and retry on both paths before degrading the typegen result to `unknown`.
5. analytics.test.ts:946 asserted `format: "JSON"` returns 400, but the route now accepts "JSON" as a legacy alias (normalized to JSON_ARRAY). Use a truly unsupported value ("CSV") so the test still exercises the malformed-format path.
6. Restore `zod: 4.3.6` to @databricks/appkit dependencies. main has it; the rebase conflict-resolution accepted the branch's older deps list, which lacked it. appkit imports `zod` directly from several files (analytics.ts, agent tools, tests).

Co-authored-by: Isaac
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
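The ~10.6 MiB wire figure in item 1 follows from base64's 4-bytes-per-3 inflation; a quick sanity check of the headroom arithmetic (JSON and SSE framing overhead is acknowledged but not modeled here):

```typescript
const MiB = 1024 * 1024;

// MAX_INLINE_ATTACHMENT_BYTES: the cap applies to *decoded* bytes.
const decoded = 8 * MiB;

// base64 emits 4 output bytes for every 3 input bytes (padded).
const base64 = Math.ceil(decoded / 3) * 4;
console.log((base64 / MiB).toFixed(2)); // "10.67" — before JSON + SSE framing

// The new SSE event-size cap.
const cap = 12 * MiB;
console.log(base64 < cap); // true: room left for framing overhead
```

So a maximal legal 8-MiB-decoded attachment occupies roughly 10.67 MiB as base64 alone, which is why the old 8 MiB SSE cap could never carry it and 12 MiB leaves headroom.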
The original commit added zod@3.23.8 to shared for the new SSE wire protocol schema. With zod restored on appkit at 4.3.6 (matching main), the workspace now had two different zod majors resolving in different packages — a latent peer-dep / type-incompatibility foot-gun even though the schema itself was already cross-major-compatible.

Bump shared's zod to 4.3.6 so the whole workspace lands on one major. The schema's two-arg `z.record(z.string(), z.unknown())` form is the zod 4 spelling, so no functional change is needed; drop the now-stale "keeps it valid under either major" comment.

Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>

# Conflicts:
#	packages/appkit-ui/src/react/hooks/types.ts
#	packages/appkit-ui/src/react/hooks/use-chart-data.ts
#	packages/appkit/src/plugins/analytics/analytics.ts
#	packages/appkit/src/plugins/analytics/types.ts
Restoring zod@4.3.6 to appkit and bumping shared's zod from 3.23.8 to 4.3.6 left the lockfile out of sync with package.json, breaking CI's pnpm install --frozen-lockfile step on every job. Regenerate the lockfile so both specifier entries match the manifests. Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
The merge with main left two separate import statements for "./types" — one for the type-only specifiers and a duplicate value import of normalizeAnalyticsFormat. Biome rejected this as both an organize-imports failure and a noRedeclare error. Merge them into a single mixed type/value import. Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
The merge resolution in client.ts dropped the logger.error call from the executeStatement catch block — main has it, our pre-merge branch had it, the resolved version lost it. Without that line the "error log redaction" tests fail because the connector no longer surfaces the failure message to the log spy. Restore the call. Test plan: the two sql-warehouse.test.ts redaction tests pass locally; behavior matches the comment "executeStatement's catch ... is the single point that logs (gated on isAborted)". Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
…e SSE message
Address Mario's design feedback: SSE is for short control messages, not
bulk binary. Inline Arrow IPC payloads from serverless warehouses no
longer ride the SSE channel as base64; they are stashed server-side and
fetched out-of-band through the existing /arrow-result/:jobId endpoint
with the canonical application/vnd.apache.arrow.stream content-type.
Wire protocol
- Discriminated union shrinks from three variants to two: the
arrow_inline message type is gone. Both INLINE and EXTERNAL_LINKS
ARROW_STREAM responses now flow as a single `arrow` message whose
statement_id discriminates dispatch: warehouse-issued ids hit the
warehouse path, synthetic "inline-<uuid>" ids hit the stash. The
client sees one path.
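The two-variant union and the single client path described above can be sketched as follows (a hand-rolled sketch only: the real schema is zod-validated in shared, and field names beyond `type` and `statement_id` are assumptions):

```typescript
// Control messages on the SSE channel after the rework: no arrow_inline
// variant, and no bulk bytes — only short JSON envelopes.
type AnalyticsSseMessage =
  | { type: "result"; data?: Record<string, unknown>[] }
  | { type: "arrow"; statement_id: string };

// The client treats every "arrow" message identically; it cannot tell a
// warehouse-issued statement id from a synthetic "inline-<uuid>" id, and
// does not need to — the server demultiplexes on the id prefix.
function arrowResultUrl(msg: Extract<AnalyticsSseMessage, { type: "arrow" }>): string {
  return `/api/analytics/arrow-result/${encodeURIComponent(msg.statement_id)}`;
}

console.log(arrowResultUrl({ type: "arrow", statement_id: "inline-4f9c" }));
// /api/analytics/arrow-result/inline-4f9c
```

Keeping the discrimination server-side is what lets the client drop its base64 decoder entirely: both dispositions collapse into one fetch path.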
Server
- New InlineArrowStash: TTL'd (10 min), bounded-memory (256 MiB),
drain-on-read, per-user-keyed map of decoded Arrow IPC bytes. Stash
key is the request's user id (or "global" for SP contexts) and is
symmetric between put and take.
- AnalyticsPlugin holds one stash instance and uses it in two places:
- _executeWithFormatFallback decodes result.attachment once, puts the
bytes in the stash, and emits an arrow message with the synthetic
id. Bulk bytes never traverse SSE.
- _handleArrowRoute prefix-dispatches on the jobId: "inline-" drains
the stash and serves with application/vnd.apache.arrow.stream + a
no-store cache header; other ids fall through to the existing
warehouse-fetch path unchanged.
- Connector's MAX_INLINE_ATTACHMENT_BYTES raised from 8 MiB to 25 MiB
(the Databricks API hard cap on INLINE) since the SSE event-size
budget no longer constrains it.
Client
- useAnalyticsQuery loses the arrow_inline branch and the local base64
decoder. Both inline and external-links responses fetch through
/api/analytics/arrow-result/:id; the prefix branch lives server-side.
- The dead client-side MAX_INLINE_ATTACHMENT_BYTES guard goes away.
SSE buffers
- streamDefaults.maxEventSize: 12 MiB -> 1 MiB
- connectSSE.maxBufferSize: 12 MiB -> 1 MiB
SSE now carries only short JSON control messages (result rows, arrow
envelope with statement id, error frames). Multi-MiB caps are no longer
needed and would mask buffer regressions.
Tests
- New InlineArrowStash unit tests (TTL eviction, max-bytes LRU, drain-
on-read, per-user scoping).
- Reworked the route's "emits arrow_inline" test into a stash + arrow-
message assertion: the SSE payload must not contain the base64 bytes
or the arrow_inline type literal, and the decoded bytes must be in
the stash keyed by the same synthetic id.
- New /arrow-result tests cover the inline path: success drain, 410 on
unknown id, 410 on user mismatch.
- Client tests rewritten to assert both warehouse and inline-prefixed
ids fetch through the same /arrow-result URL with no local decoding.
- Shared schema tests assert the retired arrow_inline type no longer
parses.
- The /arrow-result content-type for warehouse hits stays application/
octet-stream (no behavior change there).
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
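The _handleArrowRoute prefix dispatch this commit describes reduces to a one-line branch; a sketch (function names here are assumptions, only the prefix, content-types, and cache header come from the commit text):

```typescript
function dispatchArrowResult(jobId: string): "stash" | "warehouse" {
  // Synthetic "inline-" ids drain the server-side stash; all other ids
  // fall through to the existing warehouse-fetch path unchanged.
  return jobId.startsWith("inline-") ? "stash" : "warehouse";
}

function arrowResponseHeaders(jobId: string): Record<string, string> {
  return dispatchArrowResult(jobId) === "stash"
    ? {
        "content-type": "application/vnd.apache.arrow.stream", // canonical Arrow content-type
        "cache-control": "no-store", // stash drains on read, so a cached copy would 410
      }
    : { "content-type": "application/octet-stream" }; // warehouse hits: no behavior change
}

console.log(dispatchArrowResult("inline-4f9c")); // stash
console.log(arrowResponseHeaders("01ef-statement-id")["content-type"]); // application/octet-stream
```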
Four findings surfaced by the GPT pass on the reworked PR:

1. ARROW_STREAM cache replay returned drained inline-* ids (HIGH). The previous code capped the cache TTL at 10 min for ARROW_STREAM, which made sense for EXTERNAL_LINKS pre-signed URLs that expire in ~15 min but is broken for inline ids: the stash drains on the first /arrow-result fetch, so any cache hit replays an id whose bytes are gone and reliably 410s. Bypass cache entirely for ARROW_STREAM (TTL = 0); JSON_ARRAY responses still cache normally.
2. Stash evict-on-fit invalidated already-issued ids (MEDIUM). The earlier `evictUntilFits` dropped the oldest entries when a new payload would push total bytes past `maxBytes`, but those oldest entries had ids that were already in flight to clients. Replace eviction with rejection: `put()` now returns `string | null` and the caller falls back to EXTERNAL_LINKS when the stash is full. Every id we hand out stays valid until naturally drained or expired.
3. Aborted stream still decoded + stashed (MEDIUM). If the client cancels the SSE between query completion and stash write, we still decoded the base64 attachment and held the bytes until TTL eviction. Re-check `signal.aborted` before decode/put so canceled streams exit cleanly.
4. Empty result message wrote `undefined` to the hook's state (LOW). The wire schema makes `data` optional; an empty result set may omit it. Normalize the missing case to `[]` so consumers can rely on `data` being either `null` (no message yet) or a value of the inferred result type.

Also documents the process-local-memory constraint on the stash in its docstring: a `GET /arrow-result/inline-*` that lands on a different replica than the original SSE request will 410. Multi-replica deployments need sticky sessions or a shared external store, neither in scope for this PR.

Tests:
- `inline-arrow-stash`: replaced the eviction test with a rejection test that asserts `put()` returns null when the stash is full and that previously-issued ids remain takeable.
- `useAnalyticsQuery`: new test asserts an empty result message normalizes to [].

Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
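The empty-result normalization from finding 4 is small enough to sketch in full (the message shape beyond `type` and the optional `data` is an assumption):

```typescript
// The wire schema makes `data` optional, so a zero-row result set may
// arrive as { type: "result" } with no data field at all.
type ResultMessage = { type: "result"; data?: Record<string, unknown>[] };

function normalizeResultData(msg: ResultMessage): Record<string, unknown>[] {
  // Consumers can then rely on hook state being null (no message yet) or a
  // value of the inferred result type — never undefined.
  return msg.data ?? [];
}

console.log(normalizeResultData({ type: "result" })); // []
console.log(normalizeResultData({ type: "result", data: [{ a: 1 }] }).length); // 1
```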
Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
- agents.ts had two unused imports that biome's noUnusedImports rule flags as errors in CI. Drop them; behavior unchanged.
- inline-arrow-stash.test.ts: introduce a mustPut() helper that asserts the non-null contract for successful puts, so the new `put(): string | null` return type does not poison every downstream take() call with a string-vs-string|null TS error.
- Minor formatter touch-ups picked up by biome --write.

Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
Resolves conflict in packages/appkit-ui/src/react/hooks/__tests__/use-analytics-query.test.ts by combining the PR's arrow-IPC SSE message tests with the useStableParams refetch tests from #321 (now on main). The merged file uses a single connectSSE spy that both captures the onMessage handler (for the arrow tests) and counts invocations (for the stable-params tests). All 10 tests pass. Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
When the inline Arrow stash refuses a new entry (put returns null), the route must retry the statement with EXTERNAL_LINKS instead of emitting a useless inline- id. The stash itself was unit-tested already; this adds the integration test through the /query route. Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
…arehouses

Some serverless warehouse variants only support ARROW_STREAM for the INLINE disposition — JSON_ARRAY + INLINE is rejected with 'Inline disposition only supports ARROW_STREAM format.' Before this change, every default useAnalyticsQuery call against such a warehouse failed.

The plugin now classifies inline rejections into 'needs-arrow' vs 'needs-json' signals. For a JSON_ARRAY caller hitting needs-arrow, the plugin retries as ARROW_STREAM + INLINE and decodes the Arrow IPC attachment back to plain row objects server-side, keeping the caller's JSON_ARRAY contract intact (scalar values stringified to match the warehouse's native JSON_ARRAY shape). The existing ARROW_STREAM + INLINE → EXTERNAL_LINKS path now uses the same classifier with the 'needs-json' signal.

Matching is case-insensitive and handles the real warehouse error wordings, rather than the case-sensitive 'INLINE' + 'ARROW_STREAM' substring pair the old heuristic required, which never matched the actual wire errors.

Verified live against three e2-dogfood warehouses: one that refuses JSON_ARRAY + INLINE, one other serverless, and one classic — all three now produce identical JSON row output for the same SELECT.

Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
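A sketch of the rejection classifier this commit describes. The error codes and wordings come from the commit and PR text; the exact matcher in the plugin may differ:

```typescript
type InlineRejection = "needs-arrow" | "needs-json" | null;

function classifyInlineRejection(errorCode: string, message: string): InlineRejection {
  const code = errorCode.toUpperCase();
  // Only disposition-related error codes are considered at all.
  if (code !== "INVALID_PARAMETER_VALUE" && code !== "NOT_IMPLEMENTED") return null;
  const msg = message.toLowerCase(); // matching is case-insensitive
  // Unrelated errors are never reinterpreted as disposition mismatches.
  if (!msg.includes("inline")) return null;
  // "Inline disposition only supports ARROW_STREAM format."
  if (msg.includes("only supports arrow_stream")) return "needs-arrow";
  // "The format field must be JSON_ARRAY when the disposition field is INLINE."
  // "ARROW_STREAM is not supported with INLINE disposition"
  if (msg.includes("must be json_array") || msg.includes("arrow_stream is not supported")) {
    return "needs-json";
  }
  return null;
}

console.log(classifyInlineRejection(
  "INVALID_PARAMETER_VALUE",
  "Inline disposition only supports ARROW_STREAM format.",
)); // needs-arrow
```

needs-arrow drives the JSON_ARRAY → ARROW_STREAM + INLINE retry (with server-side decode back to JSON rows); needs-json drives the ARROW_STREAM + INLINE → EXTERNAL_LINKS retry; null means no fallback fires.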
Summary
Some warehouses return `ARROW_STREAM` + `INLINE` results as base64 Arrow IPC in `result.attachment` rather than `result.data_array`. Some refuse `JSON_ARRAY` + `INLINE` entirely and only support `ARROW_STREAM` for `INLINE`. Others (most classic + some serverless) do the opposite — they refuse `ARROW_STREAM` + `INLINE` and require `EXTERNAL_LINKS`. The previous code path silently returned empty results in all of these cases.

This PR makes the analytics plugin work across all three warehouse shapes by:
- handling both dispositions for `ARROW_STREAM`,
- serving inline Arrow payloads through the `/arrow-result` route, and
- keeping the caller's format contract intact (a `JSON_ARRAY` caller always gets JSON rows; an `ARROW_STREAM` caller always gets Arrow bytes), regardless of which disposition the warehouse actually accepted.

Design
Earlier iterations of this PR sent inline Arrow bytes over SSE (base64 inside JSON inside SSE framing) using an `arrow_inline` message type. That was reworked: inline payloads are now stashed server-side and delivered through the same `/arrow-result/<jobId>` endpoint that already serves EXTERNAL_LINKS results. The SSE control channel only ever carries `{ type: "arrow", statement_id }`; the id is either a real warehouse statement id or a synthetic `inline-<uuid>`, and the server demultiplexes.

Wins from the redesign:
- Arrow bytes are served with their canonical content-type (`application/vnd.apache.arrow.stream`) instead of base64-in-JSON-in-SSE
- the `arrow_inline` wire message is gone — the discriminated union rejects it as schema-invalid

Stash properties (see `inline-arrow-stash.ts`): TTL'd, bounded-memory, drain-on-read, per-user keyed, with symmetric `put`/`take` scoping.

Disposition fallback
The plugin classifies an inline rejection into one of two signals (case-insensitive, requires `INVALID_PARAMETER_VALUE`/`NOT_IMPLEMENTED` plus mention of "inline" — unrelated errors aren't reinterpreted as disposition mismatches):

- `needs-arrow`: warehouse says it only accepts `ARROW_STREAM` for INLINE ("Inline disposition only supports ARROW_STREAM format."). If the caller asked for `JSON_ARRAY`, retry as `ARROW_STREAM` + `INLINE` and decode the Arrow IPC attachment to plain row objects server-side — the caller's JSON contract is preserved. Scalar values are stringified to match the warehouse's native JSON_ARRAY shape.
- `needs-json`: warehouse says it only accepts `JSON_ARRAY` for INLINE ("The format field must be JSON_ARRAY when the disposition field is INLINE.", "ARROW_STREAM is not supported with INLINE disposition"). If the caller asked for `ARROW_STREAM`, retry as `ARROW_STREAM` + `EXTERNAL_LINKS`.

For `ARROW_STREAM` callers, the stash-full case also falls back to `EXTERNAL_LINKS`. Explicit format requests are honored — no auto-downgrade across format boundaries (a JSON caller never gets Arrow bytes; an Arrow caller never gets JSON rows).

Changes
- Inline Arrow IPC decoding (`connectors/sql-warehouse/arrow-schema.ts` + `client.ts`): detects `result.attachment` and decodes via `apache-arrow`'s `tableFromIPC`. `apache-arrow@21.1.0` added as a server dependency.
- Inline Arrow stash (`plugins/analytics/inline-arrow-stash.ts`): the bounded, TTL'd, drain-on-read, per-user keyed store described above.
- Unified `/arrow-result` route (`plugins/analytics/analytics.ts`): single endpoint serves both warehouse statement ids and synthetic `inline-*` ids; differentiation is by id prefix.
- Rejection classifier (`plugins/analytics/analytics.ts`): replaces the prior heuristic, which case-sensitively required both `INLINE` and `ARROW_STREAM` in the error message and never matched the actual wire wordings.
- When a `JSON_ARRAY` caller is served via `ARROW_STREAM`, decode and stringify so the SSE payload still looks like `{type: "result", data: [...rows]}`.
- Zod-validated SSE wire schema (`shared/src/sse/analytics.ts`): single source of truth between server and client. Default format remains `JSON_ARRAY`.

Tests
- `connectors/sql-warehouse/tests/arrow-schema.test.ts` (new, 514 lines)
- `connectors/sql-warehouse/tests/client.test.ts` (new, 383 lines) — real base64 Arrow IPC captured from a live serverless warehouse
- `plugins/analytics/tests/inline-arrow-stash.test.ts` (new, 116 lines)
- `plugins/analytics/tests/analytics.test.ts` (+770) — integration coverage for both fallback directions (warehouse refuses JSON_ARRAY + INLINE → server retries as ARROW_STREAM + INLINE and decodes; warehouse refuses ARROW_STREAM + INLINE → falls back to EXTERNAL_LINKS), plus stash-full fallback and the no-fallback safety net
- `appkit-ui/src/react/hooks/__tests__/use-analytics-query.test.ts` — arrow-IPC SSE tests (this PR) merged with the stable-params tests already on main from "fix(appkit-ui): stabilize useAnalyticsQuery params reference to avoid infinite refetch" #321
- `shared/src/sse/analytics.test.ts` (new, 87 lines)

Full suite: 2,674 tests, all green on the current head.
Stack
This was the third PR in a stack; both predecessors are now on main:
- test: backfill coverage for genie connector, service context, stream registry #327 — coverage backfill (merged)
- feat: rename AnalyticsFormat to API enum names with legacy aliases #328 — `AnalyticsFormat` rename to API enum names (merged)

Test plan
Verified via automated tests on the current head:
- `useAnalyticsQuery` reads `ARROW_STREAM` results via `/arrow-result` for both real warehouse statement ids and synthetic `inline-*` ids
- `JSON_ARRAY` direct path: warehouse accepts INLINE + JSON_ARRAY → row data returned unchanged
- `JSON_ARRAY` fallback path: warehouse refuses INLINE + JSON_ARRAY but accepts INLINE + ARROW_STREAM → server decodes Arrow attachment to JSON rows; SSE wire carries `{type: "result"}`, not `{type: "arrow"}`
- `JSON_ARRAY` no-fallback safety: rejection without a needs-arrow signal does NOT trigger a retry
- `ARROW_STREAM` + INLINE → EXTERNAL_LINKS fallback when the warehouse rejects INLINE
- `ARROW_STREAM` + INLINE → EXTERNAL_LINKS fallback when the inline stash is full
- a stray `arrow_inline` SSE message is rejected as schema-invalid and never triggers a local decode

Live smoke test against three e2-dogfood warehouses (`SELECT 1 AS one, 'hello' UNION ALL SELECT 2, 'world'`):
- `JSON_ARRAY` caller (default)
- `ARROW_STREAM` caller
- observed rows: `[{"one":"1","greeting":"hello"},{"one":"2","greeting":"world"}]`

All three warehouses now return identical row shapes for the same default `useAnalyticsQuery` call.

Fixes #242. Replaces #256.
This pull request was AI-assisted by Isaac.