gemv: coalesce batched DMA into a single iterated BD per column #127

Open
atassis wants to merge 2 commits into amd:devel from atassis:gemv-coalesce

atassis commented Jun 26, 2026

Contributor

Replaces the GEMV operator's per-batch host DMA descriptor unroll with a single iterated descriptor per column.

GEMM already folds many transfers into one multi-dimensional iterated descriptor (via TensorTiler2D). GEMV did not: a batched GEMV emitted num_batches separate fill/drain descriptors (plus a per-batch task-group wait) per AIE column, so the descriptor count grew linearly with the batch count. The test suite also had no num_batches > 1 GEMV coverage, so the batched path shipped untested.

Added

  • A num_batches > 1 golden reference (generate_golden_reference_batched) and a parametrized device test (test_gemv_batched) covering: the coalesced path with large num_batches (the size-uncapped descriptor dimension) and a multi-dimension run split; a run that requires an alignment-aware (even) inner split; the per-batch fallback (a batch stride over the limit); and an attention-style shape with tile_size_input > 1 and num_batches = a head count.
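
For readers unfamiliar with the batched layout, a batched golden reference can be sketched as below. This is a minimal illustration, not the PR's generate_golden_reference_batched: the function name batched_gemv_reference, the flat row-major layout, and the assumption that each batch has its own matrix and its own input vector are all hypothetical choices made for the example.

```python
def batched_gemv_reference(a_flat, x_flat, num_batches, m, k):
    """Compute num_batches independent matrix-vector products.

    Assumed layout (hypothetical, for illustration only):
      a_flat: num_batches matrices of shape (m, k), row-major, stacked contiguously
      x_flat: num_batches vectors of length k, stacked contiguously
    Returns num_batches output vectors of length m, stacked contiguously.
    """
    assert len(a_flat) == num_batches * m * k
    assert len(x_flat) == num_batches * k
    out = []
    for b in range(num_batches):
        a_base = b * m * k  # start of this batch's matrix
        x_base = b * k      # start of this batch's vector
        for row in range(m):
            row_base = a_base + row * k
            out.append(
                sum(a_flat[row_base + j] * x_flat[x_base + j] for j in range(k))
            )
    return out


# Two batches of a 2x2 matrix times a length-2 vector.
example = batched_gemv_reference(
    a_flat=[1, 2, 3, 4, 5, 6, 7, 8],
    x_flat=[1, 1, 2, 0],
    num_batches=2,
    m=2,
    k=2,
)
print(example)
```

Because the batches are independent and stacked contiguously, a device result can be compared against this output one batch slice at a time.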

Changed

  • The GEMV runtime now coalesces the batched A-fill / C-drain into one iterated descriptor per column by default. num_batches is placed in the size-uncapped descriptor dimension and the contiguous per-batch run is split across the two wrap dimensions (each <= 1023), per the AIE shim BD limits (AIEXDialect.cpp verifyStridesWraps). The run split keeps the innermost size and the strides 4-byte-aligned (the shim's address granularity, which is enforced even for linear transfers).
  • The per-batch drain wait is replaced by ObjectFifo backpressure. Depth 2 is sufficient (rather than depth num_batches): the A-fill (MM2S) and C-drain (S2MM) run on separate shim channels and the pipeline is paced by the fifo locks, so the producer cannot overrun the consumer; this is a streaming pipeline, not a store-all buffer. The depth invariant is asserted.
  • The coalesced descriptor accesses the exact same DRAM elements in the same order as the previous unroll. Both num_batches == 1 and any configuration that cannot be coalesced (a run with no aligned split under the wrap cap, or a batch stride that is too large or not aligned) fall back to the existing per-batch path and produce byte-identical output. The coalesced path therefore introduces no new failure modes relative to the unroll.
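
The alignment-aware run split described above can be sketched as a search over factor pairs. This is a hedged illustration, not the code in this PR: split_run is a hypothetical name, the element size is passed in by the caller, and only the two constraints quoted in the text are modelled (each wrap dimension <= 1023, innermost size a multiple of 4 bytes). The batch-stride limit is deliberately left out because its value is not stated here.

```python
WRAP_CAP = 1023      # per-dimension wrap limit quoted in the PR text
GRANULE_BYTES = 4    # shim address granularity quoted in the PR text


def split_run(run_elems, elem_bytes):
    """Split a contiguous run into (outer, inner) wrap sizes.

    Returns the pair with the largest aligned inner size such that
    outer * inner == run_elems and both are <= WRAP_CAP, or None when no
    such pair exists (the caller would then use the per-batch fallback).
    """
    for inner in range(min(run_elems, WRAP_CAP), 0, -1):
        if run_elems % inner:
            continue  # not a factor
        outer = run_elems // inner
        if outer > WRAP_CAP:
            break  # any smaller inner only makes outer larger
        if (inner * elem_bytes) % GRANULE_BYTES:
            continue  # inner size (and hence the outer stride) is misaligned
        return outer, inner
    return None


# With 2-byte elements, 1026 = 2 * 513 is rejected because 513 is odd.
print(split_run(1026, 2))
print(split_run(9, 2))
```

With 2-byte elements the alignment rule reduces to "inner must be even", which matches the "(even) inner split" wording used for the M=1026 test shape. The outer stride equals the inner size in elements, so aligning the inner size also aligns that stride.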

Removed

  • Nothing (behavior for existing callers is unchanged).

Motivation and measurements

This is a descriptor-count / build-time and correctness change, not a runtime speedup: the transfer is access-equivalent (same bytes, same order), so it does not change data-movement time.

The descriptor count for a batched GEMV drops from O(num_batches) to O(1) per column, which shrinks the descriptor-bound lowering phase of the build. Measured cold compile of a single GEMV (shape M=448, K=64, columns=8, build only, no device run):

| num_batches | unrolled | coalesced |
|-------------|----------|-----------|
| 192         | 5.99s    | 4.72s     |
| 1536        | 52.37s   | 3.54s     |

The coalesced build is roughly constant in num_batches while the unrolled build grows linearly, so the saving scales with the batch count.
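
The scaling claim can be checked directly from the two measured rows. The snippet below is only arithmetic on the numbers in the table above; it adds no new measurements.

```python
# (num_batches, unrolled_seconds, coalesced_seconds), copied from the table
ROWS = [(192, 5.99, 4.72), (1536, 52.37, 3.54)]

small, large = ROWS

# How much each quantity grows between the two rows.
batch_growth = large[0] / small[0]
unrolled_growth = large[1] / small[1]
coalesced_growth = large[2] / small[2]

# Build-time ratio (unrolled / coalesced) at each batch count.
speedups = {n: unrolled / coalesced for n, unrolled, coalesced in ROWS}

print(f"batches grow {batch_growth:.0f}x")
print(f"unrolled build grows {unrolled_growth:.2f}x")
print(f"coalesced build grows {coalesced_growth:.2f}x")
for n, ratio in speedups.items():
    print(f"num_batches={n}: unrolled/coalesced = {ratio:.2f}")
```

An 8x increase in num_batches gives a similar increase in unrolled build time, while the coalesced build time does not grow, which is consistent with a descriptor-bound lowering phase.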

Validation

  • Offline access-equivalence of the coalesced vs unrolled descriptor for the tested and fallback configurations, including the alignment-aware split.
  • num_batches == 1 generates byte-identical lowered output to the previous code.
  • The new device tests pass on NPU2 at num_batches of 4, 8, 32, 100, and 192, plus the aligned-split shape (M=1026), the fallback shape (batch stride > limit), and an attention-style shape (tile_size_input=4); the existing num_batches == 1 tests are unchanged.
  • Exercised end to end: the in-repo llama_3.2_1b application, which calls GEMV with num_batches = n_heads, builds and generates coherent output on NPU2 with the coalesced path.
  • In-repo GEMV consumers were checked: swiglu_decode uses num_batches=1 (byte-identical, unaffected); the llama_3.2_1b application calls GEMV with num_batches = n_heads for the attention scores/context GEMVs. Those calls now coalesce, and the coalesced descriptor is legal and access-equivalent at those shapes, so behavior is unchanged.
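
An offline access-equivalence check of the kind listed above can be sketched by enumerating element offsets for both forms and comparing the sequences. This is an assumption-laden illustration rather than the validation script used for this PR: unrolled_offsets and iterated_offsets are hypothetical helpers, offsets are in elements, and the sizes/strides tuples are ordered outermost first.

```python
from itertools import product


def unrolled_offsets(num_batches, batch_stride, run):
    """Offsets touched by one linear transfer of `run` elements per batch."""
    return [b * batch_stride + i for b in range(num_batches) for i in range(run)]


def iterated_offsets(sizes, strides):
    """Offsets touched by one N-D descriptor (outermost dimension first).

    itertools.product varies the last index fastest, which matches an
    innermost-fastest hardware iteration order.
    """
    return [
        sum(i * s for i, s in zip(idx, strides))
        for idx in product(*(range(n) for n in sizes))
    ]


# Example: 4 batches, a 1026-element run split as 3 x 342, batch stride 2048.
reference = unrolled_offsets(num_batches=4, batch_stride=2048, run=1026)
coalesced = iterated_offsets(sizes=(4, 3, 342), strides=(2048, 342, 1))
print(coalesced == reference)
```

Comparing full ordered sequences, rather than sets of offsets, checks both "same elements" and "same order", which is the property the description calls access-equivalent.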

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and pointing to devel.
  2. Your PR has been reviewed and approved.
  3. All checks are passing.

atassis added 2 commits June 26, 2026 20:15
The batched GEMV unrolled num_batches host DMA descriptors per column (one fill +
one drain + a task-group wait per batch). Express the batch as a single iterated BD
instead, the same B-unroll -> BD-iteration idiom GEMM already uses for tiling: place
num_batches in the size-uncapped descriptor dim and split the contiguous run across
the two wrap dims (each <= 1023), per the AIE shim BD limits (AIEXDialect.cpp
verifyStridesWraps). The run split keeps the inner size and strides 4-byte-aligned
(granularity-aware), and the per-batch drain wait is dropped in favour of ObjectFifo
backpressure (fifo depth >= 2 asserted; the A-fill and C-drain run on separate shim
channels, paced by the fifo locks).

num_batches == 1 and any config that cannot be coalesced (run with no aligned split
under the wrap cap, or a batch stride that is too large or unaligned) fall back to
the existing per-batch path and are byte-identical. The coalesced descriptor accesses
the exact same DRAM elements in the same order as the unroll (access-equivalent), so
this is a descriptor-count / build-time and correctness change, not a runtime change.

The test suite had no num_batches > 1 GEMV coverage. Add a batched golden reference
(num_batches independent matrix-vector products, stacked contiguously) and a
parametrized device test covering: the coalesced path with large num_batches (the
size-uncapped dim) and a multi-dimension run split; a run that requires an aligned
(even) inner split; and the per-batch fallback (batch stride over the limit).