Fix optional-output guard in DecoderAttention/MultiHeadAttention shape inference#29243
Fix optional-output guard in DecoderAttention/MultiHeadAttention shape inference#29243titaiwangms wants to merge 36 commits into
Conversation
DecoderAttention and MultiHeadAttention shape-inference functions guarded population of present_key (output 1) and present_value (output 2) with getNumOutputs() > 1, but write output index 2. present_key and present_value are produced as a both-or-neither pair, so require all three outputs (> 2) before populating them, matching BaseGroupQueryAttention (>= 3) and the EmbedLayerNorm guard. Also add a bounds check in InferenceContextImpl::getOutputType so an out-of-range output index fails inference cleanly instead of indexing past the end, mirroring DataPropagationContextImpl and getInputType. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…outputs omitted Cover DecoderAttention, MultiHeadAttention and DecoderMaskedMultiHeadAttention nodes declared with exactly two outputs (present_key kept, present_value omitted). Each test builds the node and asserts Graph::Resolve() shape inference completes cleanly. Tests are execution-provider independent and throw-free, so they run on the default CPU build and in no-exception (ORT_NO_EXCEPTIONS) builds. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extend the optional-present-output regression suite with cases that declare all three outputs for DecoderAttention, MultiHeadAttention and DecoderMaskedMultiHeadAttention and assert the present_key/present_value branch still runs and infers their element types. Together with the two-output cases this pins the output-count guard to exactly three. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use each op's actual output names (DecoderAttention: new_key_cache / new_value_cache; MultiHeadAttention: present_key / present_value), align the three guards to a consistent '// has <names> outputs' phrasing, and note that the two optional cache outputs are produced as a pair, so they are present only when the node declares more than two outputs. Comment-only; no logic change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The MultiHeadAttention and DecoderMaskedMultiHeadAttention two-output cases only passed a query input, so the present-output branch (which references output index 2) was never entered and the tests could not detect a regression there. Supply shaped past_key / past_value (and past_sequence_length for MHA, buffer sharing for DMMHA) so the branch is exercised while only two outputs are declared, matching the DecoderAttention case which already reached that path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
tianleiwu
left a comment
There was a problem hiding this comment.
Correct, minimal, and well-tested fix. The present/KV-cache branch previously entered under getNumOutputs() > 1 but unconditionally wrote both output index 1 and index 2, so a node declaring exactly two outputs hit an out-of-bounds write to index 2. Raising the three guards to > 2 matches the both-or-neither convention already used by BaseGroupQueryAttention (>= 3) and PagedAttention (which keeps > 1 but immediately enforces != 3, so it is already safe and was correctly left untouched). The getOutputType bounds check is good defense-in-depth.
Tests are thorough: the three "omitted" cases pin the safe lower bound and the three "all-present" cases prove the > 2 branch still performs real KV-cache shape inference.
One housekeeping item: CI lintrunner / CLANGFORMAT already flagged formatting on the new test file — please run lintrunner -a before merge.
Non-blocking nit: the PR text says the new getOutputType check is "consistent with the checked access already used by getInputType", but getInputType uses vector::at() (throws std::out_of_range) while the new code uses fail_type_inference. The fail_type_inference variant is actually the nicer choice (clean type-inference error), so no change needed — just noting the mechanisms differ.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ng-convention note Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Residual hardening for MaxpoolWithMask::Compute: the 1D/2D/3D dispatch is selected by kernel_shape.size() and reads x_shape[2..4] / output_dims[2..4] accordingly, but the only rank guard was x_shape.NumDimensions() >= 3. A kernel_shape rank that exceeds the input spatial rank (e.g. rank-3 X with a 2D kernel_shape) caused out-of-bounds reads of x_shape[3]/x_shape[4] and output_dims[3]/output_dims[4]. ONNX shape inference catches this for shaped ONNX models, but the kernel must validate at Compute time for defense in depth (ORT-format / shapeless models bypass shape inference). Add ORT_RETURN_IF_NOT(kernel rank == input spatial rank) alongside the existing mask/input guards, and a kExpectFailure regression test that bypasses ONNX shape inference via AddShapeToTensorData(false) to exercise the Compute-time guard directly. Agent-signed-off: Developer (0b207188) [claude-opus-4.8 via copilot] Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… check
Follow-up polish on the kernel-rank guard. The 1D/2D/3D dispatch switch has a
default case that already returns INVALID_ARGUMENT, so an oversized kernel rank
was never a silent/uninitialized-output bug, but the default produced a vague
'Unsupported pooling size :' message late in Compute after allocating the output.
Add an explicit early guard requiring the input spatial rank (== pooling kernel
rank) to be in {1, 2, 3}, with a clear message, alongside the existing guards and
before output allocation. Extract the repeated x_shape.NumDimensions() - 2 into a
single input_spatial_rank local (defined after the rank >= 3 guard so it cannot
underflow) and reuse it. Add a MaxPoolWithMask_KernelRankTooLarge regression test
(rank-6 X + 4D kernel_shape) that passes the equality guard but trips the new
supported-rank guard, asserting the specific message substring.
Agent-signed-off: Developer (0b207188) [claude-opus-4.8 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…er shapes The recurrence zero-point (R_zero_point) and scale (R_scale) inputs were not run through the kernel's shape validation: R_zp_shape was bound to w_zp->Shape() instead of r_zp->Shape(), and the R_scale WeightCheck was passed W_scale_shape instead of R_scale_shape. As a result a malformed r_zp/r_scale (e.g. a shape inconsistent with R) was never rejected and downstream code iterated using the input parameter's element count over the recurrence parameter's buffer. Bind R_zp_shape to r_zp->Shape() and validate R_scale against R_scale_shape so the recurrence parameters are checked symmetrically with the input (W) ones. Add two expect-failure tests that supply R_zero_point / R_scale shapes whose first dimension does not match num_directions and assert the specific INVALID_ARGUMENT diagnostics. Status-based validation, so the tests need no ORT_NO_EXCEPTIONS guard. CPU-only op; no CUDA implementation exists. Agent-signed-off: Developer (0b207188) [claude-opus-4.8 via copilot] Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Port the host-side shape validations from the CPU implementation into the CUDA ComputeInternal: data/scales rank and per-dimension equality, scales/zero_points rank and per-dimension equality, and block_size > 0 (the constructor permits the unset default of 0, which is an invalid divisor for block mapping). Reuse a single packing-components value across the output-shape and quantization-parameter logic. Validate the gathered index value on the device: the kernel sets a device flag when an index falls outside [-gather_axis_dim, gather_axis_dim) and normalizes negative indices to match the CPU kernel; the host reads the flag back after the launch and returns a clean status. Index arithmetic stays in int64. Add CUDA-scoped expect-failure tests for mismatched scales shape, out-of-range index, and block_size == 0. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Harden EmbedLayerNormalization input handling so malformed inputs are rejected with a clear INVALID_ARGUMENT instead of reading past the embedding tables. - Shared CheckInputs helper: when position_ids are not supplied, require position_embedding to have at least sequence_length rows. - CUDA kernel: validate word/position/segment ids against their embedding table sizes, surfacing an out-of-range id via a device error flag instead of silently clamping; widen the embedding offset arithmetic to int64 to avoid overflow. - CUDA op: read back the device error flag and return INVALID_ARGUMENT. - Tests: enable EmbedLayerNormNegativePositionIds on CUDA and add EmbedLayerNormWordIdOutOfRange and EmbedLayerNormPositionEmbeddingTooFewRows expect-failure cases covering CPU and CUDA. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GridSample sizes its output from the grid, so a zero-size input spatial dimension skips the empty-output early return and reaches the clamp/reflect index computation with an invalid extent. Add an ORT_RETURN_IF_NOT check that every input spatial dimension is greater than zero in both the CPU Compute and the CUDA ComputeInternal (covering 2D and 3D-spatial inputs, NCHW and NHWC layouts), with a consistent diagnostic message. Add an expect-failure OpTester case that fans across the registered CPU and CUDA execution providers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Agent-signed-off: Developer (cc69a935) [claude-opus-4.8 via copilot] Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…UDA kernels) Add ORT_RETURN_IF_NOT guards that every input spatial dimension is greater than zero in GridSample::Compute (CPU) and GridSample::ComputeInternal (CUDA), covering 2D and 3D-spatial inputs in both NCHW and NHWC layouts. The output is sized by the grid, so a zero-size input spatial dimension would otherwise skip the empty-output early return and reach interpolation with an invalid extent. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… inference The GatherBlockQuantized type/shape inference uses block_size as a divisor when relating the data and scales shapes, but only rejected negative values. A block_size of 0 reached the division and raised an integer divide-by-zero during Graph::Resolve. Reject block_size <= 0 with a clear status instead. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Validate the bits attribute in the CUDA constructor before it is used as the divisor in 8 / bits to derive the uint8 packing factor: uint8 data accepts bits == 4 or 8 and int4/uint4 accepts bits == 4. Wrap the kernel launch line to stay within the column limit and document the positional parameters of the Test_Fail_WithoutZeroPoints helper. Add CUDA-scoped expect-failure coverage for an unsupported bits value. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…h capture The device kernel already validates word/position/segment ids and skips the embedding reads for any out-of-range id, so input handling stays in range regardless of the host-side readback. The readback only upgrades a skipped row into a clean INVALID_ARGUMENT status. cudaStreamSynchronize and device-to-host copies are not allowed on a stream that is capturing a CUDA graph, so make the readback conditional: query cudaStreamIsCapturing and perform the copy + synchronize + status check only when the stream is not capturing. This restores CUDA graph capture support for static-shape models (where EmbedLayerNormalization is typically the first node) while keeping ids in range device-side; error surfacing resumes on normal non-capturing runs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use inline parameter-name comments for the positional arguments passed to Test_Fail_WithoutZeroPoints in the CUDA-scoped tests so the gather_axis, quantize_axis, block_size, and bits values are self-documenting at the call site. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… EPs The zero-spatial-dimension validation guard currently lives only in the CPU and CUDA GridSample kernels. GetExecutionProviders() also adds the WebGPU and CoreML kernels on builds that enable them, and those separate kernels do not carry the guard, so the kExpectFailure substring assertion would falsely fail there. Add a CPU+CUDA-only provider helper and use it for the expect-failure case; positive tests are unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Agent-signed-off: Developer (cc69a935) [claude-opus-4.8 via copilot] Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…pture The device-side index bounds check and kernel early-return already protect memory safety, so the host readback of the error flag only upgrades a silently incorrect result into a clean error. That readback performs a device-to-host copy and a stream synchronize, both of which are illegal while the stream is being captured for a CUDA graph. Query the capture status with cudaStreamIsCapturing and perform the readback only when the stream is not capturing, matching the idiom used by EmbedLayerNormalization. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Source-only follow-up to the EmbedLayerNormalization input-validation changes: - Widen the output offset to int64 to match the word/position/segment offsets; narrow explicitly at the LayerNorm call, which takes an int offset. - Reword the kernel and test comments to neutral "out of range of the embedding table" phrasing. - Trim the segment_embedding_length parameter comment to fit the column limit. - Document that the CUDA NHWC EP is intentionally left enabled for the negative-position-id test (it shares the same validated kernel and is skipped automatically if not registered). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
For large batch * sequence_length * hidden_size (which can exceed 2^31 elements on high-memory GPUs), the EmbedLayerNormalization output write index computed as sequence_position * hidden_size overflowed 32-bit arithmetic and wrapped negative, producing an out-of-range write. Make the whole output write-index path 64-bit end-to-end: - Pass the already-widened int64 output offset straight to LayerNorm instead of narrowing it back to int. - Widen the shared LayerNorm helper's offset parameter and its internal index from int to int64_t so the offset is not truncated again. This is source-compatible with the only other caller (skip layer norm passes an int offset, which promotes cleanly). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…his change The output write-index 64-bit widening also requires touching the shared LayerNorm helper in layer_norm.cuh, which is used by other BERT kernels (skip layer norm) and needs its own cross-kernel review and regression testing. Keep that complete change as a separate follow-up so it does not gate this single-op input-validation change. Restore the output offset to its original 32-bit form and fully revert the shared layer_norm.cuh helper. The embedding-index validation and the read-side word/position/segment offset arithmetic are unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Support large tensors in the CUDA BERT LayerNorm device helpers by computing the global output element offset in 64-bit. For big batch * sequence_length * hidden_size (which exceeds 2^31 elements on high-memory GPUs), the previous 32-bit row offset arithmetic could not represent the index. - layer_norm.cuh: widen the global write offset / index of LayerNorm, SimplifiedLayerNorm, LayerNormSmall and SimplifiedLayerNormSmall to int64_t. The gamma/beta/bias indices (ld, i, threadIdx.x * ILP) stay int since they are bounded by hidden_size. - embed_layer_norm_impl.cu: compute output_offset in int64 and pass it to LayerNorm without narrowing. - skip_layer_norm_impl.cu: compute the per-block offset/index in int64 in both SkipLayerNormKernel and SkipLayerNormKernelSmall. Numerics are unchanged for normal tensor sizes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ange The previous commit re-applied two logically separate changes together: the EmbedLayerNormalization output write-index widening (LayerNorm() offset/idx and output_offset) and the SkipLayerNormalization write-index widening. Split them so each lands as its own single-op change. This reverts only the SkipLayerNorm portion (SimplifiedLayerNorm, LayerNormSmall and SimplifiedLayerNormSmall offset/idx in layer_norm.cuh, plus both kernels in skip_layer_norm_impl.cu) so it can be re-applied as its own commit. The EmbedLayerNorm write-index widening (LayerNorm() and output_offset) is kept. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Support large tensors in SkipLayerNormalization by computing the global output element offset in 64-bit. For big rows * hidden_size (rows = batch * sequence length), which can exceed 2^31 elements on high-memory GPUs, the previous 32-bit row offset arithmetic could not represent the index. - skip_layer_norm_impl.cu: compute the per-block offset/index in int64 in both SkipLayerNormKernel and SkipLayerNormKernelSmall. - layer_norm.cuh: widen the global write offset/index of SimplifiedLayerNorm, LayerNormSmall and SimplifiedLayerNormSmall (the helpers used only by SkipLayerNorm) to int64_t. The gamma/beta/bias indices (ld, i, threadIdx.x * ILP) stay int since they are bounded by hidden_size. LayerNorm() (used by EmbedLayerNorm) is widened separately in the EmbedLayerNorm write-index change. Numerics are unchanged for normal sizes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ions builds The CudaShapeMismatch, CudaInvalidBlockSizeZero, and CudaUnsupportedBits cases reject their inputs by throwing: the first two via fail_shape_inference during Graph::Resolve and the last via an ORT_ENFORCE in the kernel constructor. A throw aborts under ORT_NO_EXCEPTIONS instead of surfacing as the failure that OpTester::ExpectResult::kExpectFailure expects, so wrap them in !defined(ORT_NO_EXCEPTIONS). CudaInvalidIndices stays unguarded because it fails through a device error flag turned into a Status, which does not throw. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…o-exceptions builds" This reverts commit e843cde. The GatherBlockQuantized test file has seven pre-existing throw-based expect-failure tests and no ORT_NO_EXCEPTIONS guards, so the provider test is not built with exceptions disabled. Guarding only the three new CUDA tests is inconsistent with that convention, so restore them to unguarded. The accurate comment on CudaInvalidBlockSizeZero (it is rejected by shape inference during Graph::Resolve, not by ComputeInternal) is kept. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…no-exceptions builds" This reverts commit 0b432bf.
…ow-up change Revert the write-index 64-bit widening currently spread across follow-up commits so it can re-land as one coherent change. This reverts only the write-offset int64 in the LayerNorm device helpers, the embed output_offset, and the SkipLayerNorm kernels; the EmbedLayerNormalization read-side offsets (word/position/segment) stay int64 as they are part of the input validation. No net behavior change once the consolidated widening is re-applied. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…sts for no-exceptions builds"" This reverts commit 87139d3.
Support large tensors in the CUDA BERT LayerNorm write path by computing the global output element offset in 64-bit. For big rows * hidden_size (rows = batch * sequence_length), which can exceed 2^31 elements on high-memory GPUs, the previous 32-bit row offset arithmetic could not represent the index. Three files, write-offset/index only (gamma/beta/bias indices ld, i, threadIdx.x * ILP stay int since they are bounded by hidden_size): - layer_norm.cuh: widen the global write offset/index of LayerNorm, SimplifiedLayerNorm, LayerNormSmall and SimplifiedLayerNormSmall to int64_t. All four have callers (EmbedLayerNorm via LayerNorm; SkipLayerNorm via all four) whose offset = row_index * hidden_size can exceed 2^31. - embed_layer_norm_impl.cu: compute output_offset in int64 and pass it to LayerNorm without narrowing. - skip_layer_norm_impl.cu: compute the per-block offset/index in int64 in both SkipLayerNormKernel and SkipLayerNormKernelSmall. Numerics are unchanged for normal tensor sizes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add brief explanatory comments at each int64-widened global write offset/index in the CUDA BERT LayerNorm helpers and SkipLayerNorm kernels, noting that the offset (row_index * hidden_size = batch * sequence_length * hidden_size) can exceed 2^31 for large tensors and that gamma/beta indices intentionally stay int (bounded by hidden_size). This prevents a future reader from narrowing the index back to int and reintroducing the large-tensor limitation. Comment-only; no behavior change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e inference GetFirstElement<T> read sizeof(T) bytes from a graph initializer's raw_data guarded only by the presence bit. Add a length check before the cast and fail shape inference cleanly when raw_data is shorter than the element size. Also clamp the computed element count to >= 0 in CalcRangeDim so shape inference matches the CPU kernel for empty/backward ranges. Add model-load regression tests in range_test.cc (guarded by #ifndef DISABLE_CONTRIB_OPS, Status-only assertions for no-exception builds). Agent-signed-off: Developer (b2fe149b) [claude-opus-4.8 via copilot] Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- GetFirstElement<T>: read raw_data via std::memcpy into an aligned local after the length check, avoiding an unaligned access on strict-alignment targets. - CalcRangeDim<T> and the CPU kernel ComputeRange: guard the element-count cast with std::isfinite so a non-finite computed count fails cleanly instead of an out-of-range cast (fail_shape_inference in the schema, a non-OK Status in the kernel). - range_test.cc: generalize the model-building helper over element type, dimensions and per-input bytes. Truncated raw_data is now modeled as dims=[0] with empty raw_data so the cases are tight regressions for the length check; add coverage for start/limit/delta positions, float and int64 element types, a zero-delta failure, and an exact-size success boundary. Throwing tests are additionally guarded by !defined(ORT_NO_EXCEPTIONS). Agent-signed-off: Developer (b2fe149b) [claude-opus-4.8 via copilot] Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…cast The std::isfinite guard catches inf/NaN but not a finite count larger than the int64 range, where static_cast<int64_t>(count) is out of range. Handle the non-positive case before the cast (so a large-magnitude negative count cannot overflow it) and reject counts at or above 2^63 with a neutral message. Applied identically in the schema CalcRangeDim (fail_shape_inference) and the CPU kernel ComputeRange (non-OK Status), keeping the two sites consistent. Add a guarded regression test (start=0, limit=1e19, delta=1) asserting a clean non-OK status; verified it fails without the new guard. Agent-signed-off: Developer (b2fe149b) [claude-opus-4.8 via copilot] Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
tianleiwu
left a comment
There was a problem hiding this comment.
Review of the broader hardening pass (head f7f73c1)
This PR has grown well beyond its title: 27 files bundling ~10 independent contrib-op hardening changes plus a new SKILL doc. The attention optional-output guard fix (bert_defs.cc > 1 → > 2, plus the getOutputType bounds check in graph.cc) was reviewed in a prior round and that thread is resolved — it remains correct.
Genuine bug fix: the DynamicQuantizeLSTM change (R_zp_shape = r_zp->Shape(), WeightCheck(R_scale_shape, R_scale)) fixes a real copy-paste bug where the recurrence zero-point/scale were validated against the input weight shapes. Good catch, and it has a test.
Scope / process (non-blocking): per repo guidance (≤10 files, separate unrelated functional changes), this is a strong candidate to split — particularly the independent DynamicQuantizeLSTM correctness fix, which would be far easier to cherry-pick or revert on its own than buried among the defensive guards.
Performance (non-blocking, see inline notes): GatherBlockQuantized and EmbedLayerNormalization now add a device→host readback + cudaStreamSynchronize on every non-capturing run. Memory safety is already guaranteed device-side, so the sync only surfaces a clean error. For GatherBlockQuantized (dequant gather path) a per-call stream sync can hurt latency; worth confirming the tradeoff or gating the readback.
Everything else — Range overflow/alignment hardening, GridSample empty-spatial guards, GatherBlockQuantized shape/bits validation, EmbedLayerNorm device-side id validation, 64-bit LayerNorm offsets, MaxpoolWithMask rank checks — looks correct and consistent with existing patterns.
|
Superseding this with #29268, a focused PR containing only the optional-output shape-inference guard fix and its test/doc. This PR's branch had been reused as the shared integration branch and accumulated a number of unrelated commits, so its diff no longer reflects the intended change. Please review #29268 instead. Thanks! |
Summary
DecoderAttentionTypeAndShapeInferenceandMultiHeadAttentionTypeAndShapeInference(the latter shared byMultiHeadAttentionandDecoderMaskedMultiHeadAttention) guarded population of the optional present/cache outputs withgetNumOutputs() > 1, but they populate output index 2. These outputs are produced as a pair (both-or-neither), so the guard should require more than 2 outputs before populating outputs 1 and 2.This changes the three guards from
> 1to> 2, matching the established pattern already used inBaseGroupQueryAttentionTypeAndShapeInference(>= 3),PagedAttentionTypeAndShapeInference, and theEmbedLayerNormalizationfix in #28176.Defensive bounds check
Also adds a bounds check in
InferenceContextImpl::getOutputType(core/graph/graph.cc), consistent with the checked access already used bygetInputType, so an out-of-range output index yields a clean type-inference failure instead of undefined behavior.Tests
New
onnxruntime/test/contrib_ops/attention_optional_outputs_shape_inference_test.ccwith 6 cases:> 2branch still performs legitimate KV-cache shape inference.The tests run through graph resolution and are execution-provider-independent. They are throw-free and safe under
ORT_NO_EXCEPTIONSbuilds.Reference
Mirrors #28176.