feat: Add LargeList type support #325
Draft
GeorgeLeePatterson wants to merge 25 commits into
This PR adds read support for BinaryView and Utf8View types (Arrow format 1.4.0+), enabling arrow-js to consume IPC data from systems like InfluxDB 3.0 and DataFusion that use view types for efficient string handling.

- Added BinaryView and Utf8View type classes with view struct layout constants
- Type enum entries: Type.BinaryView = 23, Type.Utf8View = 24
- Data class support for variadic buffer management
- Get visitor: Implements proper view semantics (16-byte structs, inline/out-of-line data)
- Set visitor: Marks as immutable (read-only)
- VectorLoader: Reads from IPC format with variadicBufferCounts
- TypeComparator, TypeCtor: Type system integration
- JSON visitors: Explicitly unsupported (throws error)
- Generated schema files for BinaryView, Utf8View, ListView, LargeListView
- Script to regenerate from Arrow format definitions
- Reading BinaryView/Utf8View columns from Arrow IPC files
- Accessing values with proper inline/out-of-line handling
- Variadic buffer management
- Type checking and comparison
- ✅ Unit tests for BinaryView and Utf8View (test/unit/ipc/view-types-tests.ts)
- ✅ Tests verify both inline (≤12 bytes) and out-of-line data handling
- ✅ TypeScript compiles without errors
- ✅ All existing tests pass
- ✅ Verified with DataFusion 50.0.3 integration (enables native view types, removing need for workarounds)
- Reading query results from DataFusion 50.0+ with view types enabled
- Consuming InfluxDB 3.0 Arrow data with Utf8View/BinaryView columns
- Processing Arrow IPC streams from any system using view types
- Builders for write operations
- ListView/LargeListView type implementation
- Additional test coverage

Closes apache#311
Related to apache#225
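The view semantics mentioned above (16-byte structs, values of at most 12 bytes stored inline, longer values stored in a variadic buffer) can be illustrated with a minimal sketch. This is not the code from the PR: `readViewBytes`, `VIEW_SIZE`, and `INLINE_MAX` are hypothetical names, and the field layout (length, then either inline bytes or a 4-byte prefix followed by a buffer index and an offset) is assumed from the Arrow columnar format description.

```typescript
// Hypothetical sketch of reading one BinaryView/Utf8View value.
// Names and structure are illustrative, not arrow-js's implementation.
const VIEW_SIZE = 16;  // bytes per view struct
const INLINE_MAX = 12; // values up to this length live inside the struct

function readViewBytes(
  views: Uint8Array,
  variadicBuffers: Uint8Array[],
  index: number,
): Uint8Array {
  const base = index * VIEW_SIZE;
  const dv = new DataView(views.buffer, views.byteOffset, views.byteLength);
  const length = dv.getInt32(base, true); // bytes 0..4: little-endian length
  if (length <= INLINE_MAX) {
    // Inline: the value itself starts at byte 4 of the struct.
    return views.subarray(base + 4, base + 4 + length);
  }
  // Out-of-line: bytes 4..8 prefix, 8..12 buffer index, 12..16 offset.
  const bufferIndex = dv.getInt32(base + 8, true);
  const offset = dv.getInt32(base + 12, true);
  return variadicBuffers[bufferIndex].subarray(offset, offset + length);
}

// Build two views by hand: one inline, one out-of-line.
const encoder = new TextEncoder();
const decoder = new TextDecoder();
const views = new Uint8Array(2 * VIEW_SIZE);
const writer = new DataView(views.buffer);

const shortValue = encoder.encode("hello");
writer.setInt32(0, shortValue.length, true);
views.set(shortValue, 4);

const dataBuffer = encoder.encode("a string longer than twelve bytes");
writer.setInt32(16, dataBuffer.length, true);
views.set(dataBuffer.subarray(0, 4), 20); // 4-byte prefix
writer.setInt32(24, 0, true);             // variadic buffer index
writer.setInt32(28, 0, true);             // offset inside that buffer

const first = decoder.decode(readViewBytes(views, [dataBuffer], 0));
const second = decoder.decode(readViewBytes(views, [dataBuffer], 1));
```

Because the reader only follows the buffer index and offset for long values, short strings never touch the variadic buffers at all.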
… from test tsconfig
Add scripts/update_flatbuffers.sh and test/unit/ipc/view-types-tests.ts to RAT (Release Audit Tool) exclusion list. Both files have proper Apache license headers but need to be excluded from license scanning.
This reverts commit dfe9d56.
Remove blank line after shebang to match Apache Arrow JS convention. License header must start on line 2 with '#' as shown in ci/scripts/build.sh
Add BinaryView and Utf8View to main exports in Arrow.ts. These types were implemented but not exported, causing 'BinaryView is not a constructor' errors in ES5 UMD tests.
Add BinaryView and Utf8View to Arrow.dom.ts exports. Arrow.node.ts re-exports from Arrow.dom.ts, so this fixes both entrypoints.
- Simplify variadicBuffers byteLength calculation with reduce
- Remove unsupported type enum entries (only add BinaryView and Utf8View)
- Eliminate type casting by extracting getBinaryViewBytes helper
- Simplify readVariadicBuffers with Array.from
- Remove CompressedVectorLoader override (inherits base implementation)
- Delete SparseTensor.ts (not implementing tensors in this PR)
- Implement BinaryViewBuilder with inline/out-of-line storage logic
- Implement Utf8ViewBuilder with UTF-8 encoding support
- Support random-access writes (not just append-only)
- Proper variadic buffer management (32MB buffers per spec)
- Handle null values correctly
- Register builders in builderctor visitor
- Add comprehensive test suite covering:
  - Inline values (≤12 bytes)
  - Out-of-line values (>12 bytes)
  - Mixed inline/out-of-line
  - Null values
  - Empty values
  - 12-byte boundary cases
  - UTF-8 multibyte characters
  - Large batches (1000 values)
  - Multiple flushes

Fixes:
- Correct buffer allocation for random-access writes
- Proper byteLength calculation (no double-counting)
- Follows FixedWidthBuilder patterns for index-based writes
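The inline/out-of-line storage decision and the 12-byte boundary case listed above can be sketched as follows. This is an illustrative encoder only, not the `BinaryViewBuilder` from this commit: `encodeView` is a hypothetical helper, and it assumes the caller has already appended long values to a variadic buffer and knows the resulting buffer index and offset.

```typescript
// Hypothetical sketch of encoding one 16-byte view struct.
// The real builder also manages nulls, buffer growth, and flushing.
function encodeView(
  value: Uint8Array,
  bufferIndex: number,
  offset: number,
): Uint8Array {
  const view = new Uint8Array(16); // zero-filled, so inline padding is zero
  const dv = new DataView(view.buffer);
  dv.setInt32(0, value.length, true);
  if (value.length <= 12) {
    // Inline: store the bytes directly after the length.
    view.set(value, 4);
  } else {
    // Out-of-line: 4-byte prefix, then where the full value lives.
    view.set(value.subarray(0, 4), 4);
    dv.setInt32(8, bufferIndex, true);
    dv.setInt32(12, offset, true);
  }
  return view;
}

const textEncoder = new TextEncoder();
// Exactly 12 bytes: still inline.
const boundaryView = encodeView(textEncoder.encode("abcdefghijkl"), 0, 0);
// 13 bytes: spills to variadic buffer 2 at offset 100 (arbitrary example values).
const spilledView = encodeView(textEncoder.encode("abcdefghijklm"), 2, 100);
const spilledFields = new DataView(spilledView.buffer);
```

A value of exactly 12 bytes fills the struct completely, which is why the boundary is worth testing on both sides.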
ESLint rule jest/prefer-to-have-length requires using toHaveLength() instead of toBe() for length checks.
Use reduce instead of explicit loops for variadicBuffers byteLength calculation, consistent with changes in Data class.
- Add ListView and LargeListView type classes with child field support
- Add type guard methods isListView and isLargeListView
- Add visitor support in typeassembler and typector
- Add Data interfaces for ListView with offsets and sizes buffers
- Add makeData overloads for ListView and LargeListView
- Update DataProps union type to include ListView types

ListView and LargeListView use offset+size buffers instead of consecutive offsets, allowing out-of-order writes and value sharing.
- Add ListView and LargeListView type classes to src/type.ts
- Add visitor support in src/visitor.ts (inferDType and getVisitFnByTypeId)
- Add visitor support in src/visitor/typector.ts and typeassembler.ts
- Add DataProps interfaces for ListView/LargeListView in src/data.ts
- Implement MakeDataVisitor methods for ListView/LargeListView
- Implement GetVisitor methods for ListView/LargeListView in src/visitor/get.ts
- Add comprehensive test suite in test/unit/ipc/list-view-tests.ts
  - Tests in-order and out-of-order offsets
  - Tests value sharing between list elements
  - Tests null handling and empty lists
  - Tests LargeListView with BigInt64Array offsets
  - Tests type properties

ListView and LargeListView are Arrow 1.4 variable-size list types that use offset+size buffers instead of consecutive offsets, enabling out-of-order writes and value sharing.
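To show why separate offset and size buffers permit out-of-order writes and value sharing, here is a minimal sketch over plain arrays. `listViewAt` is a hypothetical helper for illustration and is not the `GetVisitor` code added in this commit; validity bitmaps and nulls are ignored.

```typescript
// Hypothetical sketch: element i of a ListView is
// values[offsets[i] .. offsets[i] + sizes[i]).
function listViewAt<T>(
  values: T[],
  offsets: Int32Array,
  sizes: Int32Array,
  i: number,
): T[] {
  const start = offsets[i];
  return values.slice(start, start + sizes[i]);
}

const childValues = [1, 2, 3, 4, 5];
// Offsets need not be ascending, and ranges may overlap.
const viewOffsets = new Int32Array([3, 0, 0]);
const viewSizes = new Int32Array([2, 3, 2]);

const row0 = listViewAt(childValues, viewOffsets, viewSizes, 0); // [4, 5]
const row1 = listViewAt(childValues, viewOffsets, viewSizes, 1); // [1, 2, 3]
const row2 = listViewAt(childValues, viewOffsets, viewSizes, 2); // [1, 2]
```

With a regular List, row `i` ends where row `i + 1` begins, so rows 1 and 2 above could not both start at index 0.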
Add type 25 (ListView) and 26 (LargeListView) to the Type enum.
Implements builders for ListView and LargeListView types:
- ListViewBuilder: Uses Int32Array for offsets and sizes
- LargeListViewBuilder: Uses BigInt64Array for offsets and sizes

Key implementation details:
- Both builders extend Builder directly (not VariableWidthBuilder)
- Use DataBufferBuilder for independent offset and size buffers
- Override flush() to pass both valueOffsets and sizes to makeData
- Properly handle null values and empty lists

Includes comprehensive test suite with 11 passing tests:
- Basic value appending
- Null handling
- Empty lists
- Multiple flushes
- Varying list sizes
- BigInt offset verification

This is part of the stacked PR strategy for view types support.
ESLint rule jest/prefer-to-have-length requires using toHaveLength() instead of toBe() for length checks.
- Add LargeList type class and interface to type system
- Implement LargeListBuilder for write support
- Add LargeList visitors for all operations (get, set, indexof, etc.)
- Add LargeList to data props and makeData function
- Update vectorassembler and vectorloader for LargeList
- Add LargeList enum entry (Type.LargeList = 21)
- Use BigInt64Array for LargeList offsets
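The practical difference between List and LargeList is the offset width. The following sketch shows one way a single accessor could serve both, narrowing 64-bit offsets to `number` only at the point where JavaScript array indexing requires it. `toSafeNumber` and `listAt` are hypothetical names chosen for this example and are not taken from the commit.

```typescript
// Hypothetical sketch: one accessor for 32-bit and 64-bit list offsets.
function toSafeNumber(x: number | bigint): number {
  if (typeof x === "number") return x;
  if (x > BigInt(Number.MAX_SAFE_INTEGER) || x < BigInt(Number.MIN_SAFE_INTEGER)) {
    throw new TypeError(`${x} cannot be represented exactly as a number`);
  }
  return Number(x);
}

function listAt<T>(
  values: T[],
  offsets: Int32Array | BigInt64Array,
  i: number,
): T[] {
  // Consecutive offsets: row i spans offsets[i] .. offsets[i + 1].
  return values.slice(toSafeNumber(offsets[i]), toSafeNumber(offsets[i + 1]));
}

const items = ["a", "b", "c", "d", "e"];
const offsets32 = new Int32Array([0, 2, 5]);
const offsets64 = new BigInt64Array([BigInt(0), BigInt(2), BigInt(5)]);

const fromList = listAt(items, offsets32, 1);      // ["c", "d", "e"]
const fromLargeList = listAt(items, offsets64, 1); // ["c", "d", "e"]
```

Storage stays 64-bit throughout; only the slice bounds are converted, and the conversion fails loudly rather than silently losing precision.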
kou pushed a commit that referenced this pull request Jun 5, 2026
This PR was co-authored with [Claude Code](https://claude.com/claude-code).

---

## Summary

This PR builds on an unresolved #299 to implement full support for the `LargeList` data type in Apache Arrow JavaScript bindings. `LargeList` uses 64-bit offsets (`BigInt64Array`) instead of 32-bit offsets, enabling list values larger than 2GB. Where possible, the code size was reduced by distilling helpers used in both `List` and `LargeList`.

## Related Issues

Closes #70

## Implementation Details

### Core Type System

- Added `Type.LargeList = 21` enum value
- Implemented `LargeList<T>` class with `BigInt64Array` offset support
- Added `DataType.isLargeList()` type guard
- Added `LargeListDataProps` interface and `MakeDataVisitor.visitLargeList` (widens 32-bit offsets via `toBigInt64Array`)
- Mapped `LargeList` and `LargeListBuilder` into `TypeToDataType`, `TypeToBuilder`, and `DataTypeToBuilder` in `interfaces.ts`

### Visitor Pattern Implementation

Wired `visitLargeList()` across every visitor, factoring shared helpers where the offset width was the only difference:

- `GetVisitor` / `SetVisitor`: merged `getList` / `setList` into single helpers using `bigIntToNumber` at the offset boundary, so one implementation covers both List and LargeList
- `IteratorVisitor`, `IndexOfVisitor`: register `visitLargeList` (the generic implementations are offset-width agnostic)
- `TypeComparator`: widened `compareList` to `List | LargeList` (structural comparison only)
- `VectorAssembler`: generalized `assembleListVector` to coerce begin/end via `bigIntToNumber`; registers `visitLargeList`
- `VectorLoader`: `visitLargeList` mirrors `visitList`; base `readOffsets` already honors `OffsetArrayType` (`BigInt64Array`)
- `JSONVectorAssembler`: emits `OFFSET` via `bigNumsToStrings`, matching the `LargeUtf8` / `LargeBinary` pattern
- `TypeAssembler` / `JSONTypeAssembler`: `FlatBuffers` + JSON type serialization

### IPC Support

- `ipc/metadata/message.ts`: `decodeFieldType` handles `Type.LargeList`
- Read and write paths both round-trip via the assembler/loader registrations above

### Latent Bug Fix

- `util/buffer.ts`: `rebaseValueOffsets` now coerces its number offset to `BigInt` when the offsets array is `BigInt64Array`. Previously a non-zero offset on a 64-bit offsets array would `TypeError` on bigint += number. The fix is required for `LargeList` IPC writes on sliced data, and also resolves the same latent issue for `LargeUtf8` / `LargeBinary`.

### Builders

- New `src/builder/largelist.ts` (`LargeListBuilder`), mirroring `ListBuilder` with `BigInt()` for offset accumulation and `Number()` coercion when passing the start index to `child.set`
- Widened `VariableWidthBuilder` bound to include `LargeList` in `builder.ts`
- `GetBuilderCtor.visitLargeList` returns `LargeListBuilder`

### Testing

- `test/generate-test-data.ts`:
  - Factored a shared `generateListLike` helper used by both `generateList` (`Int32`) and `generateLargeList` (`BigInt64`)
  - Added `createVariableWidthOffsets64`; truncates `min` / `max` at entry so fractional stride from `childVec.length / (length - nullCount)` doesn't `RangeError` in `BigInt()`
- `test/unit/generated-data-tests.ts`: `LargeList` added to the matrix
- `test/unit/builders/builder-tests.ts`: `LargeListBuilder` entry added alongside `ListBuilder` / `FixedSizeListBuilder` / `MapBuilder`
- `test/unit/visitor-tests.ts`: `visitLargeList` added to `BasicVisitor` / `FeatureVisitor` and to both describe matrices

### Public API

- Exported `LargeList` and `LargeListBuilder` from `src/Arrow.ts` and `src/Arrow.dom.ts`

## Test Plan

All existing tests continue to pass, plus the `LargeList` path is exercised by:

- ✅ Generated-data matrix: `get` / `set` / `iterator` / `indexOf` / `slice` / `concat` / IPC round-trip
- ✅ Builder matrix: no-nulls / with-nulls / length=518
- ✅ Visitor dispatch (`BasicVisitor` + `FeatureVisitor`)
- ✅ IPC stream round-trip (16 IPC suites green, including JSON form via `JSONVectorAssembler` / `JSONVectorLoader`)

All tests across 45 suites pass. The tests were run with:

```bash
npx jest --config jestconfigs/jest.src.config.js
```

## Checklist

- [x] Implementation follows existing code patterns
- [x] All visitor methods implemented (`get` / `set` / `iterator` / `indexOf` / `TypeComparator` / `VectorAssembler` / `VectorLoader` / `JSONVectorAssembler` / `TypeAssembler` / `JSONTypeAssembler`)
- [x] IPC serialization/deserialization support added (binary + JSON form)
- [x] `LargeListBuilder` added and wired through `GetBuilderCtor` + `interfaces.ts`
- [x] Latent `rebaseValueOffsets` bigint bug fixed
- [x] Comprehensive tests added using existing test framework
- [x] All tests passing
- [x] Public API exports added
- [x] No breaking changes

## Notes

- This implementation provides full `LargeList` support: IPC read/write (binary + JSON form), in-memory access and mutation, type comparison, and construction via `LargeListBuilder`. It is parallel to the existing `List` type, just with 64-bit offsets.
- Storage and wire format are honest 64-bit (`BigInt64Array` end-to-end). The only narrowing happens at JS-runtime boundaries where `Data.slice` accepts number, identical to the `LargeUtf8` / `LargeBinary` policy upstream.
- Helpers were merged across `List`/`LargeList` only where the offset width was the sole difference and `bigIntToNumber` coercion at the boundary made the merge non-confusing; `LargeListBuilder` stays separate because the `BigInt()` / `Number()` coercions in `_flushPending` would obscure a merged version
- Another relevant PR with a subset of changes here, but with a different scope (includes changes relevant to BinaryView, Utf8View, ListView, LargeListView): #325

---------

Signed-off-by: Karakatiza666 <bulakh.96@gmail.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
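The latent `rebaseValueOffsets` issue described above comes from JavaScript refusing to mix `bigint` and `number` in arithmetic. The sketch below illustrates the kind of coercion the description refers to; `rebaseOffsets` is a hypothetical stand-in written for this example and is not the function body from `util/buffer.ts`.

```typescript
// Hypothetical sketch of rebasing offsets for either offset width.
// Adding a number to a bigint element throws a TypeError at runtime, so the
// delta is converted once, up front, when the offsets are 64-bit.
function rebaseOffsets(
  delta: number,
  length: number,
  offsets: Int32Array | BigInt64Array,
): Int32Array | BigInt64Array {
  if (offsets instanceof BigInt64Array) {
    const out = offsets.slice(0, length); // copy so the input is not mutated
    const bigDelta = BigInt(delta);       // delta must be an integer here
    for (let i = 0; i < out.length; i++) out[i] += bigDelta;
    return out;
  }
  const out = offsets.slice(0, length);
  for (let i = 0; i < out.length; i++) out[i] += delta;
  return out;
}

// Offsets of a sliced column that starts 2 values into the child data.
const sliced64 = new BigInt64Array([BigInt(2), BigInt(4), BigInt(7)]);
const sliced32 = new Int32Array([2, 4, 7]);

const rebased64 = rebaseOffsets(-2, 3, sliced64);
const rebased32 = rebaseOffsets(-2, 3, sliced32);
```

Note that `BigInt()` rejects non-integer numbers with a `RangeError`, which is the same reason the test-data generator described above truncates its bounds before converting them.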
Add support for LargeList type (Type 21) to bring arrow-js closer to Arrow format 1.4.0+ compliance.
Changes:
Follows the same patterns as existing List, LargeBinary, and LargeUtf8 types.
Related to #324