fix: Respect typed-array byteOffset in bigNumsToStrings by Karakatiza666 · Pull Request #439 · apache/arrow-js

Karakatiza666 · 2026-05-21T10:07:14Z

This PR was co-authored with Claude Code.

Fix incorrect 64-bit JSON IPC output for tables loaded from binary Arrow

The bug

When the JSON IPC writer serializes a column whose values are stored as 64-bit words (offsets or data), it reads the buffer's underlying ArrayBuffer directly without honoring the typed-array's byteOffset / byteLength. If the column happens to be a view into a larger shared buffer rather than a freshly-allocated array, the writer reads from the wrong window of memory and emits garbage for the affected DATA / OFFSET strings.

When it actually triggers

Only on the flow "read data in binary Arrow IPC, write data in JSON IPC." That's the only public path that produces vectors backed by offset views — the binary IPC reader carves every buffer out of a single shared byte array, so anything loaded from a file, stream, or another Arrow implementation arrives at the JSON writer in exactly the at-risk shape.

Tables built directly in JS memory (e.g. via the existing JSON round-trip tests, or tableFromArrays) are not affected, which is why the bug must have gone unnoticed: the RecordBatchJSONWriter test suite synthesizes its inputs in memory and never sees the offset-view state. The flow is mostly used in cross-language integration testing, not by typical end users — which may be the second reason it lingered.

What's affected

Any column whose values are 64-bit words processed through the writer's bignum helper:

Int64, Uint64
DateMillisecond
All four Timestamp units
TimeMicrosecond, TimeNanosecond
All four Duration units
Decimal128
LargeUtf8 and LargeBinary (their OFFSET arrays)

Fixed types unaffected: 32-bit ints/floats, DateDay, TimeSecond/Millisecond, regular Utf8/Binary, anything that doesn't go through the 64-bit serialization path.

Test coverage

A new test/unit/ipc/writer/json-offset-views-tests.ts adds 16 focused unit tests — one per type whose JSON serialization goes through the buggy 64-bit helper. Each test takes a freshly-generated vector from the shared generate-test-data helpers, rewraps every backing typed array at a nonzero byteOffset, and asserts the JSON writer produces the same output for the rewrapped vector as for the original. The list of covered types is a flat table at the top of the file; adding a future affected type is a one-line addition.

The existing RecordBatchJSONWriter suite in test/unit/ipc/writer/json-writer-tests.ts was also extended: it now routes its generated source tables through binary IPC before the JSON round-trip, so the JSON writer sees the offset-view typed arrays that real-world inputs always carry. This widens end-to-end coverage to every type the random/dictionary table generators produce, on the user-facing API path.

The two layers are complementary:

The focused unit tests pinpoint exactly which type fails on regression and stay sound even if the binary IPC reader's memory layout ever changes — they construct offset views directly, without depending on tableFromIPC.
The extended round-trip suite catches the same regression on a much broader type matrix through the actual public flow a user would use.
With the fix reverted, all 16 focused tests fail and over 100 cases in the extended round-trip suite fail. With the fix applied, all tests across 46 suites pass.

The new test file is not strictly necessary right now, but it makes catching a regression more robust.

Copilot

Pull request overview

Fixes incorrect JSON IPC output when vectors’ 64-bit buffers are typed-array views into a larger underlying ArrayBuffer (non-zero byteOffset), which commonly occurs when data is loaded via binary IPC and then re-serialized to JSON.

Changes:

Fix bigNumsToStrings() to respect byteOffset/byteLength when creating the Uint32Array view used for 64-bit/128-bit string conversion.
Strengthen end-to-end JSON writer tests by routing generated tables through binary IPC before JSON serialization.
Add focused unit tests that rewrap affected vectors’ backing buffers into non-zero-offset views and assert JSON output stability across all affected types.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File	Description
`src/visitor/jsonvectorassembler.ts`	Fixes 64-bit word extraction by constructing the `Uint32Array` view with the correct buffer window (`byteOffset`/`byteLength`).
`test/unit/ipc/writer/json-writer-tests.ts`	Makes the JSON writer test exercise realistic IPC-loaded buffer layouts by round-tripping through binary IPC first.
`test/unit/ipc/writer/json-offset-views-tests.ts`	Adds targeted regression tests for offset-view safety across all types that rely on the 64-bit serialization helper.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

domoritz · 2026-06-04T13:58:43Z

+// specific language governing permissions and limitations
+// under the License.
+
+// The bug: bigNumsToStrings() in src/visitor/jsonvectorassembler.ts used to


I don't think we need this much detail in the unit test comment especially on old code.

domoritz · 2026-06-04T13:59:05Z

+    return new RecordBatch(schema, root as Data<Struct>);
+}
+
+// Every affected callsite of bigNumsToStrings, one entry per type. Adding


Can we test bigNumsToStrings directly as well?

Wouldn't it be redundant to the per-type test?

That's okay. I think a single test that exercises the fix you implemented with as little setup as possible would be great (really testing the unit). These tests require a lot of setup.

But I don't feel too strong about this.

I'd need to make bigNumsToStrings an export. Is that acceptable?

To the public API? I think that it could be useful as a feature anyway so fine with me. But if we can keep it internal, that would be best.

Yes, it can be reached through a deep import now. I don't feel it's elegant to expose it for a single redundant unit test

'new Uint32Array(values.buffer)' ignored byteOffset/byteLength, so any 64-bit DATA or OFFSET backed by a subarray view (every binary-IPC-loaded vector, every sliced vector) was serialized from the wrong window. Affects all callers: visitInt (Int64/Uint64), visitDate (ms), visitTimestamp, visitTime (>=us), visitDecimal, visitDuration, visitLargeUtf8, visitLargeBinary, visitLargeList. Add test/unit/ipc/writer/json-offset-views-tests.ts: each type built twice (fresh array vs .subarray() view inside a padded buffer) must serialize to identical JSON. Reverting the fix fails 17/18.

Karakatiza666 · 2026-06-05T05:03:32Z

Rebased on latest main, added bigNumsToStrings unit test making it a module export in the process, revised the comment

Karakatiza666 mentioned this pull request May 21, 2026

feat: Add LargeList support #438

Merged

9 tasks

kou requested a review from Copilot May 22, 2026 01:38

Copilot started reviewing on behalf of kou May 22, 2026 01:38 View session

Copilot AI reviewed May 22, 2026

View reviewed changes

Karakatiza666 force-pushed the fix-json-write branch from 49b3140 to de89204 Compare May 31, 2026 18:57

domoritz requested a review from trxcllnt June 4, 2026 13:57

domoritz approved these changes Jun 4, 2026

View reviewed changes

Karakatiza666 force-pushed the fix-json-write branch from de89204 to e1e4257 Compare June 5, 2026 05:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Respect typed-array byteOffset in bigNumsToStrings#439

fix: Respect typed-array byteOffset in bigNumsToStrings#439
Karakatiza666 wants to merge 1 commit into
apache:mainfrom
Karakatiza666:fix-json-write

Karakatiza666 commented May 21, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

domoritz Jun 4, 2026

Uh oh!

domoritz Jun 4, 2026

Uh oh!

Karakatiza666 Jun 4, 2026

Uh oh!

domoritz Jun 4, 2026

Uh oh!

Karakatiza666 Jun 5, 2026

Uh oh!

domoritz Jun 5, 2026

Uh oh!

Karakatiza666 Jun 6, 2026

Uh oh!

Karakatiza666 commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Karakatiza666 commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fix incorrect 64-bit JSON IPC output for tables loaded from binary Arrow

The bug

When it actually triggers

What's affected

Test coverage

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

domoritz Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

domoritz Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Karakatiza666 Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

domoritz Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Karakatiza666 Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

domoritz Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Karakatiza666 Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

Karakatiza666 commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Karakatiza666 commented May 21, 2026 •

edited

Loading