Skip to content

fix(backend/kernel): comparator parity — async statement retention + intervals_as_string + precision/scale#803

Open
vikrantpuppala wants to merge 1 commit into
mainfrom
fix/kernel-async-statement-retention
Open

fix(backend/kernel): comparator parity — async statement retention + intervals_as_string + precision/scale#803
vikrantpuppala wants to merge 1 commit into
mainfrom
fix/kernel-async-statement-retention

Conversation

@vikrantpuppala
Copy link
Copy Markdown
Contributor

@vikrantpuppala vikrantpuppala commented May 23, 2026

Summary

Comparator parity fixes for the kernel backend. Three issues bundled because they're all small kernel-backend-client changes surfaced by the same audit run.

1. Async Statement retention (original scope)

KernelDatabricksClient.execute_command closed the parent Statement in finally regardless of async_op. The kernel's Statement.close() invalidates child handles (see databricks-sql-kernel/src/statement/validity.rs), so the async handle was being killed before the user could poll it — breaking the entire async surface (execute_asyncis_query_pendingget_async_execution_result).

Fix: when async_op=True, retain the parent Statement in a new _async_statements dict alongside _async_handles, and close it from close_command, close_session, and get_execution_result after the executed handle is done.

Comparator outcome: STATEMENT_ASYNC suite 3/3 match (was 0/3).

2. intervals_as_string wire-through (companion to kernel PR #64)

pyarrow's Python bindings cannot decode Arrow's month_interval type at all (id 21 — raises KeyError from .as_py, to_pylist, cast(string), to_pandas). Every kernel-backend SELECT * over any table with an INTERVAL YEAR TO MONTH column threw ArrowNotImplementedError32 / 88 audit diffs.

Fix: pass intervals_as_string=True to the kernel Session(...) constructor unconditionally. The kernel post-processor then stringifies Interval / Duration columns server-side to Utf8 (see kernel PR #64), so pyarrow never sees the unreadable type.

Comparator outcome: bucket A (ArrowNotImplementedError) drops from 32 → 0 diffs.

3. Decimal precision/scale extraction (new)

description_from_arrow_schema hard-coded None for slots 4 and 5 of the PEP 249 description tuple. For DECIMAL columns the Arrow schema carries precision / scale on Decimal128Type, but we were silently dropping them. The kernel backend returned:

('decimal_column', 'decimal', None, None, None, None, None)

while Thrift returned:

('decimal_column', 'decimal', None, None, 10, 2, None)

Any consumer that reads PEP 249 slots 4/5 (SQLAlchemy, pandas-read-sql, etc.) can't distinguish DECIMAL(10,2) from DECIMAL(38,18) on the kernel backend. 88 comparator diffs (44 precision + 44 scale).

Fix: factor out _precision_scale_for_arrow_type(arrow_type) and call it from the description builder. Today it only handles decimals; future extensions (e.g. fractional-second precision from Time64Type) slot in here without touching the description builder.

Comparator outcome: 88 precision+scale diffs → 0.

Dependencies

Items #2 and #3 require kernel PR #64 (intervals_as_string + empty-result schema fix) to land first. For local testing the comparator harness uses KERNEL_FREEZE=1 against a kernel checkout of the feature branch.

Headline comparator results (full kernel run)

Before After
match / diff 60 / 88 103 / 45
Bucket A (ArrowNotImplementedError) 32 0
decimal_column precision/scale diffs 88 0
Type-code mismatches (slot 1) 21 14
Suites fully clean 12 / 30 17 / 30

Remaining 45 diffs cluster into 4 documented causes (complex types in fetchall_arrow, timestamp tz on Arrow path, VOID surface, METADATA pattern semantics) — tracked in ~/docs/python-kernel/comparator-diff-tasklist.md.

Test plan

  • Manual repro of async path: cur.execute_async("SELECT 1"); while cur.is_query_pending(): time.sleep(0.2); cur.get_async_execution_result(); cur.fetchall() succeeds on use_kernel=True
  • Manual repro of interval path: cur.execute("SELECT ym_interval_column FROM ...") returns string-shaped rows matching the Thrift surface
  • Manual repro of decimal path: cur.description for a DECIMAL(10,2) column now reports precision=10, scale=2 on both backends
  • Comparator audit confirms all three improvements (60→103 match, 88→45 diff)
  • Existing kernel-backend tests — no behavior change on sync path or non-decimal columns

This pull request and its description were written by Isaac.

@vikrantpuppala vikrantpuppala force-pushed the fix/kernel-async-statement-retention branch from 291e18e to 312164a Compare May 23, 2026 16:16
@vikrantpuppala vikrantpuppala changed the title fix(backend/kernel): retain parent Statement for async-submitted commands fix(backend/kernel): comparator parity — async statement retention + intervals_as_string May 23, 2026
…intervals_as_string + precision/scale

Three parity-blocking issues surfaced by the python-comparator audit
(tests/comparator-tests/python/), bundled into one PR since they're all
small kernel-backend client changes.

## 1. Async Statement retention (original scope of this PR)

`KernelDatabricksClient.execute_command` closed the parent `Statement`
in `finally` regardless of `async_op`. The kernel's `Statement.close()`
invalidates child handles (see `databricks-sql-kernel/src/statement/
validity.rs`), so the async handle was being killed before the user
could poll it, breaking the entire async surface
(`execute_async` → `is_query_pending` → `get_async_execution_result`).

Fix: when `async_op=True`, retain the parent Statement in a new
`_async_statements` dict alongside `_async_handles`, and close it from
`close_command`, `close_session`, and `get_execution_result` after the
executed handle is done.

Comparator outcome: STATEMENT_ASYNC suite 3/3 match (was 0/3).

## 2. intervals_as_string wire-through (companion to kernel PR #64)

pyarrow's Python bindings cannot decode Arrow's `month_interval` type
at all (id 21 — raises `KeyError` from `.as_py`, `to_pylist`,
`cast(string)`, `to_pandas`). Every kernel-backend `SELECT *` over any
table with an INTERVAL YEAR TO MONTH column was throwing
`ArrowNotImplementedError` — 32/88 audit diffs.

Fix: pass `intervals_as_string=True` to the kernel `Session(...)`
constructor unconditionally. The kernel post-processor then stringifies
Interval / Duration columns server-side to Utf8 (see kernel PR #64), so
pyarrow never sees the unreadable type.

Comparator outcome: bucket A (ArrowNotImplementedError) drops from
32 → 0 diffs.

## 3. Decimal precision/scale extraction (new)

`description_from_arrow_schema` hard-coded `None` for slots 4 and 5 of
the PEP 249 description tuple. For DECIMAL columns the Arrow schema
carries precision/scale on `Decimal128Type.precision` / `.scale`, but
we were silently dropping them — so kernel-backend
`cursor.description[i]` returned
`('decimal_column', 'decimal', None, None, None, None, None)` while
Thrift returned
`('decimal_column', 'decimal', None, None, 10, 2, None)`.

That diff propagates to any consumer that reads PEP 249 slots 4/5
(SQLAlchemy, pandas-read-sql, etc.) so they can't tell DECIMAL(10,2)
from DECIMAL(38,18) on the kernel backend. 88 comparator diffs (44
precision + 44 scale).

Fix: factor out `_precision_scale_for_arrow_type(arrow_type)` and call
it from the description builder. Today it only handles decimals;
future extensions (e.g. fractional-second precision from `Time64Type`)
slot in here without touching the description builder.

Comparator outcome: 88 precision+scale diffs → 0.

## Dependencies

Both intervals_as_string and the empty-result schema fix in kernel
PR #64 are required for the parity gains to land. The driver-side
fixes here work standalone but the comparator outcome numbers assume
PR #64 is also live.

## Headline comparator results

- Before this PR (and PR #64): 60 match / 88 diff (out of 148)
- After this PR + PR #64: 103 match / 45 diff
- 17 of 30 (connection_config × suite) pairs now fully clean.
- Remaining 45 diffs cluster into 4 known causes documented in
  ~/docs/python-kernel/comparator-diff-tasklist.md (complex types in
  fetchall_arrow path, timestamp tz, VOID, METADATA pattern matching).

Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
@vikrantpuppala vikrantpuppala force-pushed the fix/kernel-async-statement-retention branch from 312164a to 81f065f Compare May 23, 2026 17:18
@vikrantpuppala vikrantpuppala changed the title fix(backend/kernel): comparator parity — async statement retention + intervals_as_string fix(backend/kernel): comparator parity — async statement retention + intervals_as_string + precision/scale May 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant