gemv: coalesce batched DMA into a single iterated BD per column #127
Open
atassis wants to merge 2 commits into
The batched GEMV unrolled num_batches host DMA descriptors per column (one fill + one drain + a task-group wait per batch). Express the batch as a single iterated BD instead, the same B-unroll -> BD-iteration idiom GEMM already uses for tiling: place num_batches in the size-uncapped descriptor dim and split the contiguous run across the two wrap dims (each <= 1023), per the AIE shim BD limits (AIEXDialect.cpp verifyStridesWraps). The run split keeps the inner size and strides 4-byte-aligned (granularity-aware), and the per-batch drain wait is dropped in favour of ObjectFifo backpressure (fifo depth >= 2 asserted; the A-fill and C-drain run on separate shim channels, paced by the fifo locks). num_batches == 1 and any config that cannot be coalesced (run with no aligned split under the wrap cap, or a batch stride that is too large or unaligned) fall back to the existing per-batch path and are byte-identical. The coalesced descriptor accesses the exact same DRAM elements in the same order as the unroll (access-equivalent), so this is a descriptor-count / build-time and correctness change, not a runtime change.
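The run split and fallback decision described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `split_run`, `can_coalesce`, and the `max_stride_elems` parameter are hypothetical names, the 1023 wrap cap and 4-byte granularity are taken from the description, and applying the cap to element counts (rather than 4-byte words) is an assumption.

```python
from typing import Optional, Tuple

WRAP_CAP = 1023        # per-dimension wrap limit quoted in the description
GRANULARITY_BYTES = 4  # shim address granularity quoted in the description


def split_run(run_elems: int, elem_bytes: int) -> Optional[Tuple[int, int]]:
    """Return (outer, inner) with outer * inner == run_elems, or None.

    Both factors must fit under WRAP_CAP, and the inner size (which is also
    the stride of the outer dimension) must span whole 4-byte words.
    """
    # Walk candidate inner sizes from largest to smallest.
    for inner in range(min(run_elems, WRAP_CAP), 0, -1):
        if run_elems % inner:
            continue
        outer = run_elems // inner
        if outer > WRAP_CAP:
            break  # any smaller inner would make outer even larger
        if (inner * elem_bytes) % GRANULARITY_BYTES == 0:
            return outer, inner
    return None


def can_coalesce(num_batches: int, run_elems: int, batch_stride_elems: int,
                 elem_bytes: int, max_stride_elems: int) -> bool:
    """Decide between the coalesced path and the per-batch fallback.

    max_stride_elems is a hypothetical parameter: the description does not
    state the numeric stride limit.
    """
    if num_batches == 1:
        return False  # single batch keeps the existing path
    if batch_stride_elems > max_stride_elems:
        return False  # batch stride too large
    if (batch_stride_elems * elem_bytes) % GRANULARITY_BYTES:
        return False  # batch stride not 4-byte aligned
    return split_run(run_elems, elem_bytes) is not None
```

With 2-byte elements, `split_run(1026, 2)` skips the odd inner size 513 (1026 bytes, not a multiple of 4) and returns `(3, 342)`, while a prime run such as 1021 has no aligned split and yields `None`, which corresponds to the fallback.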
The test suite had no num_batches > 1 GEMV coverage. Add a batched golden reference (num_batches independent matrix-vector products, stacked contiguously) and a parametrized device test covering: the coalesced path with large num_batches (the size-uncapped dim) and a multi-dimension run split; a run that requires an aligned (even) inner split; and the per-batch fallback (batch stride over the limit).
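As a rough sketch of what such a batched golden reference computes, assuming each batch has its own matrix and its own vector and that all buffers are flat and stacked batch after batch: the function name `golden_gemv_batched` and its signature are hypothetical and are not taken from the test suite. Plain Python lists keep the example dependency-free.

```python
from typing import List


def golden_gemv_batched(a: List[float], b: List[float],
                        num_batches: int, m: int, k: int) -> List[float]:
    """Compute num_batches independent matrix-vector products.

    a holds num_batches row-major (m x k) matrices back to back,
    b holds num_batches vectors of length k back to back,
    the result holds num_batches output vectors of length m back to back.
    """
    assert len(a) == num_batches * m * k
    assert len(b) == num_batches * k
    c: List[float] = []
    for n in range(num_batches):
        a_off = n * m * k  # start of this batch's matrix
        b_off = n * k      # start of this batch's vector
        for row in range(m):
            acc = 0.0
            for col in range(k):
                acc += a[a_off + row * k + col] * b[b_off + col]
            c.append(acc)
    return c
```

For two 2x2 batches, `golden_gemv_batched([1, 2, 3, 4, 5, 6, 7, 8], [1, 1, 2, 0], 2, 2, 2)` gives `[3.0, 7.0, 10.0, 14.0]`.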
Replaces the GEMV operator's per-batch host DMA descriptor unroll with a single iterated descriptor per column.
GEMM already folds many transfers into one multi-dimensional iterated descriptor (via `TensorTiler2D`); GEMV did not, so a batched GEMV emitted `num_batches` separate fill/drain descriptors (plus a per-batch task-group `wait`) per AIE column, linear in the batch count. The test suite also had no `num_batches > 1` GEMV coverage, so the batched path shipped untested.

Added
- `num_batches > 1` golden reference (`generate_golden_reference_batched`) and a parametrized device test (`test_gemv_batched`) covering: the coalesced path with large `num_batches` (the size-uncapped descriptor dimension) and a multi-dimension run split; a run that requires an alignment-aware (even) inner split; the per-batch fallback (a batch stride over the limit); and an attention-style shape with `tile_size_input > 1` and `num_batches` equal to a head count.

Changed
- `num_batches` is placed in the size-uncapped descriptor dimension and the contiguous per-batch run is split across the two wrap dimensions (each <= 1023), per the AIE shim BD limits (`AIEXDialect.cpp` `verifyStridesWraps`). The run split keeps the innermost size and the strides 4-byte-aligned (the shim's address granularity, which is enforced even for linear transfers).
- The per-batch `wait` is replaced by ObjectFifo backpressure. Depth 2 is sufficient (rather than depth `num_batches`): the A-fill (MM2S) and C-drain (S2MM) run on separate shim channels and the pipeline is paced by the fifo locks, so the producer cannot overrun the consumer; this is a streaming pipeline, not a store-all buffer. The depth invariant is asserted.
- `num_batches == 1`, and any configuration that cannot be coalesced (a run with no aligned split under the wrap cap, or a batch stride that is too large or not aligned), fall back to the existing per-batch path and are byte-identical. The coalesced path therefore introduces no new failure modes relative to the unroll.

Removed
Motivation and measurements
This is a descriptor-count / build-time and correctness change, not a runtime speedup: the transfer is access-equivalent (same bytes, same order), so it does not change data-movement time.
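The access-equivalence claim can be checked on paper by enumerating element offsets for both forms. The sketch below is illustrative only: the function names are hypothetical, the sizes and strides are in elements with the outermost dimension first, and the shape values are example numbers rather than measurements from this PR.

```python
from itertools import product
from typing import List, Sequence


def unrolled_offsets(num_batches: int, run_elems: int,
                     batch_stride: int) -> List[int]:
    """Offsets touched by num_batches separate linear transfers, in order."""
    return [b * batch_stride + i
            for b in range(num_batches)
            for i in range(run_elems)]


def iterated_offsets(sizes: Sequence[int], strides: Sequence[int]) -> List[int]:
    """Offsets touched by one multi-dimensional descriptor.

    sizes and strides are listed outermost dimension first, so the last
    dimension varies fastest (product() iterates in that order).
    """
    return [sum(i * s for i, s in zip(idx, strides))
            for idx in product(*(range(n) for n in sizes))]


# Example values: 4 batches, a contiguous run of 1026 elements per batch,
# batches 4096 elements apart, run split as 3 x 342.
num_batches, run_elems, batch_stride = 4, 1026, 4096
outer, inner = 3, 342

coalesced = iterated_offsets(
    sizes=(num_batches, outer, inner),
    strides=(batch_stride, inner, 1),
)
assert coalesced == unrolled_offsets(num_batches, run_elems, batch_stride)
```

Because the outer wrap dimension uses the inner size as its stride, the two wrap dimensions together walk the contiguous run exactly once, and the batch dimension repeats that walk at each batch stride.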
The descriptor count for a batched GEMV drops from O(num_batches) to O(1) per column, which shrinks the descriptor-bound lowering phase of the build. Measured cold compile of a single GEMV (shape M=448, K=64, columns=8, build only, no device run):
The coalesced build is roughly constant in `num_batches` while the unrolled build grows linearly, so the saving scales with the batch count.

Validation
- `num_batches == 1` generates byte-identical lowered output to the previous code.
- The batched device tests cover `num_batches` of 4, 8, 32, 100, and 192, plus the aligned-split shape (M=1026), the fallback shape (batch stride > limit), and an attention-style shape (`tile_size_input=4`); the existing `num_batches == 1` tests are unchanged.
- The `llama_3.2_1b` application, which calls GEMV with `num_batches = n_heads`, builds and generates coherent output on NPU2 with the coalesced path.
- `swiglu_decode` uses `num_batches=1` (byte-identical, unaffected); the `llama_3.2_1b` application calls GEMV with `num_batches = n_heads` for the attention scores/context GEMVs. Those now coalesce, and the coalesced descriptor is legal and access-equivalent at those shapes, so behavior is unchanged.

PR Merge Checklist
- `devel` commit and pointing to `devel`.