gemv: coalesce batched DMA into a single iterated BD per column by atassis · Pull Request #127 · amd/IRON

atassis · 2026-06-26T18:04:56Z

Replaces the GEMV operator's per-batch host DMA descriptor unroll with a single iterated descriptor per column.

GEMM already folds many transfers into one multi-dimensional iterated descriptor (via TensorTiler2D); GEMV did not, so a batched GEMV emitted num_batches separate fill/drain descriptors (plus a per-batch task-group wait) per AIE column, linear in the batch count. The test suite also had no num_batches > 1 GEMV coverage, so the batched path shipped untested.

Added

A num_batches > 1 golden reference (generate_golden_reference_batched) and a parametrized device test (test_gemv_batched) covering: the coalesced path with large num_batches (the size-uncapped descriptor dimension) and a multi-dimension run split; a run that requires an alignment-aware (even) inner split; the per-batch fallback (a batch stride over the limit); and an attention-style shape with tile_size_input > 1 and num_batches = a head count.
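As an illustration of what a batched golden reference computes, here is a minimal pure-Python sketch. It is not the PR's generate_golden_reference_batched; the function name, the flat row-major layout, and the assumption that every batch has its own input vector are all hypothetical choices made for this example.

```python
def batched_gemv_reference(a, b, num_batches, m, k):
    """Reference for num_batches independent matrix-vector products.

    Assumed layout (hypothetical, for illustration only):
      a: flat list, num_batches * m * k values, row-major, batches stacked contiguously
      b: flat list, num_batches * k values, one vector per batch
    Returns a flat list of num_batches * m outputs, stacked contiguously.
    """
    assert len(a) == num_batches * m * k
    assert len(b) == num_batches * k
    c = []
    for batch in range(num_batches):
        a_base = batch * m * k   # start of this batch's matrix
        b_base = batch * k       # start of this batch's vector
        for row in range(m):
            row_base = a_base + row * k
            c.append(sum(a[row_base + col] * b[b_base + col] for col in range(k)))
    return c


# Two batches of a 2x2 matrix times a length-2 vector.
result = batched_gemv_reference([1, 2, 3, 4, 5, 6, 7, 8], [1, 1, 2, 0], 2, 2, 2)
```

A device test would compare the NPU output against such a reference within a dtype-appropriate tolerance.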

Changed

The GEMV runtime now coalesces the batched A-fill / C-drain into one iterated descriptor per column by default. num_batches is placed in the size-uncapped descriptor dimension and the contiguous per-batch run is split across the two wrap dimensions (each <= 1023), per the AIE shim BD limits (AIEXDialect.cpp verifyStridesWraps). The run split keeps the innermost size and the strides 4-byte-aligned (the shim's address granularity, which is enforced even for linear transfers).
The per-batch drain wait is replaced by ObjectFifo backpressure. Depth 2 is sufficient (rather than depth num_batches): the A-fill (MM2S) and C-drain (S2MM) run on separate shim channels and the pipeline is paced by the fifo locks, so the producer cannot overrun the consumer; this is a streaming pipeline, not a store-all buffer. The depth invariant is asserted.
The coalesced descriptor accesses exactly the same DRAM elements, in the same order, as the previous unroll. Two cases fall back to the existing per-batch path and remain byte-identical: num_batches == 1, and any configuration that cannot be coalesced (a run with no aligned split under the wrap cap, or a batch stride that is too large or not aligned). The coalesced path therefore introduces no new failure modes relative to the unroll.
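The alignment-aware run split described above can be sketched as a small search over factor pairs. This is an assumption-laden illustration, not the PR's code: split_run and its parameters are hypothetical names, and the only constraints modeled are the ones quoted in the text (each wrap dimension <= 1023, inner size and resulting stride 4-byte-aligned).

```python
WRAP_CAP = 1023    # per-dimension size cap quoted in the PR text
ALIGN_BYTES = 4    # shim address granularity quoted in the PR text


def split_run(run_elems, elem_bytes, wrap_cap=WRAP_CAP, align_bytes=ALIGN_BYTES):
    """Split a contiguous run into (outer, inner) with outer * inner == run_elems.

    Both factors must fit under wrap_cap, and the inner size in bytes (which is
    also the outer dimension's stride) must be a multiple of align_bytes.
    Returns None when no such split exists, i.e. the caller should fall back.
    """
    for inner in range(min(run_elems, wrap_cap), 0, -1):
        if run_elems % inner:
            continue
        outer = run_elems // inner
        if outer > wrap_cap:
            break  # a smaller inner only makes outer larger
        if (inner * elem_bytes) % align_bytes == 0:
            return outer, inner
    return None


# A run of 1026 two-byte elements (echoing the M=1026 test shape; whether that
# is the actual run length in the PR is an assumption).
aligned = split_run(1026, 2)
short_run = split_run(64, 2)
no_split = split_run(1031, 4)   # 1031 is prime and exceeds the cap
odd_run = split_run(1025, 2)    # no even divisor, so no aligned inner size
```

For the 1026-element case the naive split (2, 513) fits under the cap but leaves an odd inner size, which is 1026 bytes and not 4-byte-aligned; the search skips it and lands on an even inner size instead.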

Removed

Nothing (behavior for existing callers is unchanged).

Motivation and measurements

This is a descriptor-count / build-time and correctness change, not a runtime speedup: the transfer is access-equivalent (same bytes, same order), so it does not change data-movement time.

The descriptor count for a batched GEMV drops from O(num_batches) to O(1) per column, which shrinks the descriptor-bound lowering phase of the build. Measured cold compile of a single GEMV (shape M=448, K=64, columns=8, build only, no device run):

num_batches	unrolled build time	coalesced build time
192	5.99s	4.72s
1536	52.37s	3.54s

The coalesced build is roughly constant in num_batches while the unrolled build grows linearly, so the saving scales with the batch count.
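The scaling argument can be made concrete with a little arithmetic. The sketch below assumes one fill and one drain descriptor per batch per column in the unrolled case, and one of each per column in the coalesced case; task-group waits are not counted, and the function name is hypothetical. The build times are the ones from the table above.

```python
def host_descriptors(num_batches, columns, coalesced):
    """Count fill + drain descriptors under the stated assumptions."""
    per_column = 2 if coalesced else 2 * num_batches
    return per_column * columns


unrolled_count = host_descriptors(1536, 8, coalesced=False)
coalesced_count = host_descriptors(1536, 8, coalesced=True)

# Measured cold-compile times from the table: (unrolled, coalesced) in seconds.
build_times = {192: (5.99, 4.72), 1536: (52.37, 3.54)}
build_ratio = {n: round(u / c, 1) for n, (u, c) in build_times.items()}
```

Under these assumptions the 1536-batch, 8-column case goes from 24576 descriptors to 16, and the measured build-time ratio grows from about 1.3x at 192 batches to about 14.8x at 1536.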

Validation

Offline access-equivalence of the coalesced vs. unrolled descriptor was checked for the tested and fallback configurations, including the alignment-aware split.
num_batches == 1 generates byte-identical lowered output to the previous code.
The new device tests pass on NPU2 at num_batches of 4, 8, 32, 100, and 192, plus the aligned-split shape (M=1026), the fallback shape (batch stride > limit), and an attention-style shape (tile_size_input=4); the existing num_batches == 1 tests are unchanged.
Exercised end to end: the in-repo llama_3.2_1b application, which calls GEMV with num_batches = n_heads, builds and generates coherent output on NPU2 with the coalesced path.
In-repo GEMV consumers were checked: swiglu_decode uses num_batches=1 (byte-identical, unaffected); the llama_3.2_1b application calls GEMV with num_batches = n_heads for the attention scores/context GEMVs - those now coalesce, and the coalesced descriptor is legal and access-equivalent at those shapes, so behavior is unchanged.
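An offline access-equivalence check of the kind listed above could look like the following sketch. It is not the PR's validation script: the helper names are hypothetical, offsets are in elements, and the descriptor is modeled simply as sizes and strides listed outermost first.

```python
from itertools import product


def unrolled_offsets(num_batches, batch_stride, run_elems):
    """Element offsets touched by one linear transfer per batch, in order."""
    return [b * batch_stride + i
            for b in range(num_batches)
            for i in range(run_elems)]


def iterated_offsets(sizes, strides):
    """Element offsets touched by one multi-dimensional descriptor.

    sizes and strides are listed outermost first; itertools.product varies
    the last index fastest, which matches that nesting order.
    """
    return [sum(i * s for i, s in zip(idx, strides))
            for idx in product(*(range(n) for n in sizes))]


# 4 batches, each a contiguous run of 1026 elements, 2048 elements apart,
# with the run split as 3 x 342 (illustrative numbers).
reference = unrolled_offsets(num_batches=4, batch_stride=2048, run_elems=1026)
coalesced = iterated_offsets(sizes=(4, 3, 342), strides=(2048, 342, 1))
equivalent = coalesced == reference

# A deliberately wrong middle stride must be detected as non-equivalent.
broken = iterated_offsets(sizes=(4, 3, 342), strides=(2048, 341, 1))
```

Comparing the full ordered lists, rather than sets, checks both that the same elements are accessed and that they are accessed in the same order.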

PR Merge Checklist

The PR is rebased on the latest devel commit and pointing to devel.
Your PR has been reviewed and approved.
All checks are passing.

The batched GEMV unrolled num_batches host DMA descriptors per column (one fill + one drain + a task-group wait per batch). Express the batch as a single iterated BD instead, the same B-unroll -> BD-iteration idiom GEMM already uses for tiling: place num_batches in the size-uncapped descriptor dim and split the contiguous run across the two wrap dims (each <= 1023), per the AIE shim BD limits (AIEXDialect.cpp verifyStridesWraps). The run split keeps the inner size and strides 4-byte-aligned (granularity-aware), and the per-batch drain wait is dropped in favour of ObjectFifo backpressure (fifo depth >= 2 asserted; the A-fill and C-drain run on separate shim channels, paced by the fifo locks). num_batches == 1 and any config that cannot be coalesced (run with no aligned split under the wrap cap, or a batch stride that is too large or unaligned) fall back to the existing per-batch path and are byte-identical. The coalesced descriptor accesses the exact same DRAM elements in the same order as the unroll (access-equivalent), so this is a descriptor-count / build-time and correctness change, not a runtime change.
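The backpressure argument can be illustrated with a software analogy: a bounded queue of depth 2 between a producer and a consumer thread. This is only an analogy using Python's standard library, not the ObjectFifo API or the shim hardware; stream_batches and its parameters are hypothetical.

```python
import queue
import threading


def stream_batches(num_batches, depth=2):
    """Stream num_batches items through a bounded FIFO of the given depth."""
    assert depth >= 2, "mirrors the depth invariant asserted in the PR"
    fifo = queue.Queue(maxsize=depth)
    drained = []
    max_occupancy = 0

    def fill():
        nonlocal max_occupancy
        for batch in range(num_batches):
            fifo.put(batch)  # blocks while the FIFO is full (backpressure)
            max_occupancy = max(max_occupancy, fifo.qsize())

    def drain():
        for _ in range(num_batches):
            drained.append(fifo.get())  # blocks while the FIFO is empty

    threads = [threading.Thread(target=fill), threading.Thread(target=drain)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return drained, max_occupancy


drained, max_occupancy = stream_batches(192)
```

All 192 batches arrive in order while the buffer never holds more than two entries, which is the sense in which a streaming pipeline does not need depth num_batches.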

The test suite had no num_batches > 1 GEMV coverage. Add a batched golden reference (num_batches independent matrix-vector products, stacked contiguously) and a parametrized device test covering: the coalesced path with large num_batches (the size-uncapped dim) and a multi-dimension run split; a run that requires an aligned (even) inner split; and the per-batch fallback (batch stride over the limit).

atassis added 2 commits June 26, 2026 20:15

atassis requested review from andrej, hunhoffe and jgmelber as code owners June 26, 2026 18:04

atassis wants to merge 2 commits into amd:devel from atassis:gemv-coalesce
