Skip to content

Fix optional-output guard in DecoderAttention/MultiHeadAttention shape inference#29243

Closed
titaiwangms wants to merge 36 commits into
microsoft:mainfrom
titaiwangms:fix/contrib-attn-shapeinf-oob
Closed

Fix optional-output guard in DecoderAttention/MultiHeadAttention shape inference#29243
titaiwangms wants to merge 36 commits into
microsoft:mainfrom
titaiwangms:fix/contrib-attn-shapeinf-oob

Conversation

@titaiwangms

@titaiwangms titaiwangms commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Summary

DecoderAttentionTypeAndShapeInference and MultiHeadAttentionTypeAndShapeInference (the latter shared by MultiHeadAttention and DecoderMaskedMultiHeadAttention) guarded population of the optional present/cache outputs with getNumOutputs() > 1, but they populate output index 2. These outputs are produced as a pair (both-or-neither), so the guard should require more than 2 outputs before populating outputs 1 and 2.

This changes the three guards from > 1 to > 2, matching the established pattern already used in BaseGroupQueryAttentionTypeAndShapeInference (>= 3), PagedAttentionTypeAndShapeInference, and the EmbedLayerNormalization fix in #28176.

Defensive bounds check

Also adds a bounds check in InferenceContextImpl::getOutputType (core/graph/graph.cc), consistent with the checked access already used by getInputType, so an out-of-range output index yields a clean type-inference failure instead of undefined behavior.

Tests

New onnxruntime/test/contrib_ops/attention_optional_outputs_shape_inference_test.cc with 6 cases:

  • 3 cases that declare exactly 2 outputs (omitting the paired optional output) and confirm shape inference completes cleanly.
  • 3 cases that declare all present outputs and assert their element types are still inferred, proving the > 2 branch still performs legitimate KV-cache shape inference.

The tests run through graph resolution and are execution-provider-independent. They are throw-free and safe under ORT_NO_EXCEPTIONS builds.

Reference

Mirrors #28176.

titaiwangms and others added 5 commits June 23, 2026 23:49
DecoderAttention and MultiHeadAttention shape-inference functions guarded
population of present_key (output 1) and present_value (output 2) with
getNumOutputs() > 1, but write output index 2. present_key and present_value
are produced as a both-or-neither pair, so require all three outputs (> 2)
before populating them, matching BaseGroupQueryAttention (>= 3) and the
EmbedLayerNorm guard. Also add a bounds check in
InferenceContextImpl::getOutputType so an out-of-range output index fails
inference cleanly instead of indexing past the end, mirroring
DataPropagationContextImpl and getInputType.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…outputs omitted

Cover DecoderAttention, MultiHeadAttention and DecoderMaskedMultiHeadAttention
nodes declared with exactly two outputs (present_key kept, present_value
omitted). Each test builds the node and asserts Graph::Resolve() shape inference
completes cleanly. Tests are execution-provider independent and throw-free, so
they run on the default CPU build and in no-exception (ORT_NO_EXCEPTIONS) builds.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extend the optional-present-output regression suite with cases that declare all
three outputs for DecoderAttention, MultiHeadAttention and
DecoderMaskedMultiHeadAttention and assert the present_key/present_value branch
still runs and infers their element types. Together with the two-output cases
this pins the output-count guard to exactly three.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use each op's actual output names (DecoderAttention: new_key_cache /
new_value_cache; MultiHeadAttention: present_key / present_value), align the
three guards to a consistent '// has <names> outputs' phrasing, and note that
the two optional cache outputs are produced as a pair, so they are present only
when the node declares more than two outputs. Comment-only; no logic change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The MultiHeadAttention and DecoderMaskedMultiHeadAttention two-output cases only
passed a query input, so the present-output branch (which references output index
2) was never entered and the tests could not detect a regression there. Supply
shaped past_key / past_value (and past_sequence_length for MHA, buffer sharing for
DMMHA) so the branch is exercised while only two outputs are declared, matching the
DecoderAttention case which already reached that path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@tianleiwu tianleiwu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, minimal, and well-tested fix. The present/KV-cache branch previously entered under getNumOutputs() > 1 but unconditionally wrote both output index 1 and index 2, so a node declaring exactly two outputs hit an out-of-bounds write to index 2. Raising the three guards to > 2 matches the both-or-neither convention already used by BaseGroupQueryAttention (>= 3) and PagedAttention (which keeps > 1 but immediately enforces != 3, so it is already safe and was correctly left untouched). The getOutputType bounds check is good defense-in-depth.

Tests are thorough: the three "omitted" cases pin the safe lower bound and the three "all-present" cases prove the > 2 branch still performs real KV-cache shape inference.

One housekeeping item: CI lintrunner / CLANGFORMAT already flagged formatting on the new test file — please run lintrunner -a before merge.

Non-blocking nit: the PR text says the new getOutputType check is "consistent with the checked access already used by getInputType", but getInputType uses vector::at() (throws std::out_of_range) while the new code uses fail_type_inference. The fail_type_inference variant is actually the nicer choice (clean type-inference error), so no change needed — just noting the mechanisms differ.

Comment thread onnxruntime/core/graph/contrib_ops/bert_defs.cc
tianleiwu
tianleiwu previously approved these changes Jun 24, 2026
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
titaiwangms and others added 4 commits June 24, 2026 20:42
…ng-convention note

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Residual hardening for MaxpoolWithMask::Compute: the 1D/2D/3D dispatch is
selected by kernel_shape.size() and reads x_shape[2..4] / output_dims[2..4]
accordingly, but the only rank guard was x_shape.NumDimensions() >= 3. A
kernel_shape rank that exceeds the input spatial rank (e.g. rank-3 X with a
2D kernel_shape) caused out-of-bounds reads of x_shape[3]/x_shape[4] and
output_dims[3]/output_dims[4]. ONNX shape inference catches this for shaped
ONNX models, but the kernel must validate at Compute time for defense in
depth (ORT-format / shapeless models bypass shape inference).

Add ORT_RETURN_IF_NOT(kernel rank == input spatial rank) alongside the
existing mask/input guards, and a kExpectFailure regression test that
bypasses ONNX shape inference via AddShapeToTensorData(false) to exercise
the Compute-time guard directly.

Agent-signed-off: Developer (0b207188) [claude-opus-4.8 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… check

Follow-up polish on the kernel-rank guard. The 1D/2D/3D dispatch switch has a
default case that already returns INVALID_ARGUMENT, so an oversized kernel rank
was never a silent/uninitialized-output bug, but the default produced a vague
'Unsupported pooling size :' message late in Compute after allocating the output.

Add an explicit early guard requiring the input spatial rank (== pooling kernel
rank) to be in {1, 2, 3}, with a clear message, alongside the existing guards and
before output allocation. Extract the repeated x_shape.NumDimensions() - 2 into a
single input_spatial_rank local (defined after the rank >= 3 guard so it cannot
underflow) and reuse it. Add a MaxPoolWithMask_KernelRankTooLarge regression test
(rank-6 X + 4D kernel_shape) that passes the equality guard but trips the new
supported-rank guard, asserting the specific message substring.

Agent-signed-off: Developer (0b207188) [claude-opus-4.8 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…er shapes

The recurrence zero-point (R_zero_point) and scale (R_scale) inputs were not run
through the kernel's shape validation: R_zp_shape was bound to w_zp->Shape()
instead of r_zp->Shape(), and the R_scale WeightCheck was passed W_scale_shape
instead of R_scale_shape. As a result a malformed r_zp/r_scale (e.g. a shape
inconsistent with R) was never rejected and downstream code iterated using the
input parameter's element count over the recurrence parameter's buffer.

Bind R_zp_shape to r_zp->Shape() and validate R_scale against R_scale_shape so
the recurrence parameters are checked symmetrically with the input (W) ones.
Add two expect-failure tests that supply R_zero_point / R_scale shapes whose
first dimension does not match num_directions and assert the specific
INVALID_ARGUMENT diagnostics. Status-based validation, so the tests need no
ORT_NO_EXCEPTIONS guard. CPU-only op; no CUDA implementation exists.

Agent-signed-off: Developer (0b207188) [claude-opus-4.8 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@titaiwangms titaiwangms enabled auto-merge (squash) June 25, 2026 00:59
titaiwangms and others added 14 commits June 25, 2026 01:01
Port the host-side shape validations from the CPU implementation into the
CUDA ComputeInternal: data/scales rank and per-dimension equality,
scales/zero_points rank and per-dimension equality, and block_size > 0
(the constructor permits the unset default of 0, which is an invalid
divisor for block mapping). Reuse a single packing-components value across
the output-shape and quantization-parameter logic.

Validate the gathered index value on the device: the kernel sets a device
flag when an index falls outside [-gather_axis_dim, gather_axis_dim) and
normalizes negative indices to match the CPU kernel; the host reads the
flag back after the launch and returns a clean status. Index arithmetic
stays in int64.

Add CUDA-scoped expect-failure tests for mismatched scales shape,
out-of-range index, and block_size == 0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Harden EmbedLayerNormalization input handling so malformed inputs are
rejected with a clear INVALID_ARGUMENT instead of reading past the
embedding tables.

- Shared CheckInputs helper: when position_ids are not supplied, require
  position_embedding to have at least sequence_length rows.
- CUDA kernel: validate word/position/segment ids against their embedding
  table sizes, surfacing an out-of-range id via a device error flag
  instead of silently clamping; widen the embedding offset arithmetic to
  int64 to avoid overflow.
- CUDA op: read back the device error flag and return INVALID_ARGUMENT.
- Tests: enable EmbedLayerNormNegativePositionIds on CUDA and add
  EmbedLayerNormWordIdOutOfRange and EmbedLayerNormPositionEmbeddingTooFewRows
  expect-failure cases covering CPU and CUDA.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GridSample sizes its output from the grid, so a zero-size input spatial
dimension skips the empty-output early return and reaches the
clamp/reflect index computation with an invalid extent. Add an
ORT_RETURN_IF_NOT check that every input spatial dimension is greater
than zero in both the CPU Compute and the CUDA ComputeInternal (covering
2D and 3D-spatial inputs, NCHW and NHWC layouts), with a consistent
diagnostic message. Add an expect-failure OpTester case that fans across
the registered CPU and CUDA execution providers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Agent-signed-off: Developer (cc69a935) [claude-opus-4.8 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…UDA kernels)

Add ORT_RETURN_IF_NOT guards that every input spatial dimension is greater
than zero in GridSample::Compute (CPU) and GridSample::ComputeInternal
(CUDA), covering 2D and 3D-spatial inputs in both NCHW and NHWC layouts.
The output is sized by the grid, so a zero-size input spatial dimension
would otherwise skip the empty-output early return and reach interpolation
with an invalid extent.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… inference

The GatherBlockQuantized type/shape inference uses block_size as a divisor
when relating the data and scales shapes, but only rejected negative values.
A block_size of 0 reached the division and raised an integer divide-by-zero
during Graph::Resolve. Reject block_size <= 0 with a clear status instead.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Validate the bits attribute in the CUDA constructor before it is used as the
divisor in 8 / bits to derive the uint8 packing factor: uint8 data accepts
bits == 4 or 8 and int4/uint4 accepts bits == 4. Wrap the kernel launch line
to stay within the column limit and document the positional parameters of the
Test_Fail_WithoutZeroPoints helper. Add CUDA-scoped expect-failure coverage
for an unsupported bits value.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…h capture

The device kernel already validates word/position/segment ids and skips the
embedding reads for any out-of-range id, so input handling stays in range
regardless of the host-side readback. The readback only upgrades a skipped
row into a clean INVALID_ARGUMENT status.

cudaStreamSynchronize and device-to-host copies are not allowed on a stream
that is capturing a CUDA graph, so make the readback conditional: query
cudaStreamIsCapturing and perform the copy + synchronize + status check only
when the stream is not capturing. This restores CUDA graph capture support
for static-shape models (where EmbedLayerNormalization is typically the first
node) while keeping ids in range device-side; error surfacing resumes on
normal non-capturing runs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use inline parameter-name comments for the positional arguments passed to
Test_Fail_WithoutZeroPoints in the CUDA-scoped tests so the gather_axis,
quantize_axis, block_size, and bits values are self-documenting at the call
site.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… EPs

The zero-spatial-dimension validation guard currently lives only in the
CPU and CUDA GridSample kernels. GetExecutionProviders() also adds the
WebGPU and CoreML kernels on builds that enable them, and those separate
kernels do not carry the guard, so the kExpectFailure substring
assertion would falsely fail there. Add a CPU+CUDA-only provider helper
and use it for the expect-failure case; positive tests are unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Agent-signed-off: Developer (cc69a935) [claude-opus-4.8 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…pture

The device-side index bounds check and kernel early-return already protect
memory safety, so the host readback of the error flag only upgrades a silently
incorrect result into a clean error. That readback performs a device-to-host
copy and a stream synchronize, both of which are illegal while the stream is
being captured for a CUDA graph. Query the capture status with
cudaStreamIsCapturing and perform the readback only when the stream is not
capturing, matching the idiom used by EmbedLayerNormalization.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Source-only follow-up to the EmbedLayerNormalization input-validation
changes:

- Widen the output offset to int64 to match the word/position/segment
  offsets; narrow explicitly at the LayerNorm call, which takes an int
  offset.
- Reword the kernel and test comments to neutral "out of range of the
  embedding table" phrasing.
- Trim the segment_embedding_length parameter comment to fit the column
  limit.
- Document that the CUDA NHWC EP is intentionally left enabled for the
  negative-position-id test (it shares the same validated kernel and is
  skipped automatically if not registered).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
For large batch * sequence_length * hidden_size (which can exceed 2^31
elements on high-memory GPUs), the EmbedLayerNormalization output write
index computed as sequence_position * hidden_size overflowed 32-bit
arithmetic and wrapped negative, producing an out-of-range write.

Make the whole output write-index path 64-bit end-to-end:
- Pass the already-widened int64 output offset straight to LayerNorm
  instead of narrowing it back to int.
- Widen the shared LayerNorm helper's offset parameter and its internal
  index from int to int64_t so the offset is not truncated again. This is
  source-compatible with the only other caller (skip layer norm passes an
  int offset, which promotes cleanly).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…his change

The output write-index 64-bit widening also requires touching the shared
LayerNorm helper in layer_norm.cuh, which is used by other BERT kernels
(skip layer norm) and needs its own cross-kernel review and regression
testing. Keep that complete change as a separate follow-up so it does not
gate this single-op input-validation change.

Restore the output offset to its original 32-bit form and fully revert the
shared layer_norm.cuh helper. The embedding-index validation and the
read-side word/position/segment offset arithmetic are unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Support large tensors in the CUDA BERT LayerNorm device helpers by computing
the global output element offset in 64-bit. For big batch * sequence_length *
hidden_size (which exceeds 2^31 elements on high-memory GPUs), the previous
32-bit row offset arithmetic could not represent the index.

- layer_norm.cuh: widen the global write offset / index of LayerNorm,
  SimplifiedLayerNorm, LayerNormSmall and SimplifiedLayerNormSmall to int64_t.
  The gamma/beta/bias indices (ld, i, threadIdx.x * ILP) stay int since they
  are bounded by hidden_size.
- embed_layer_norm_impl.cu: compute output_offset in int64 and pass it to
  LayerNorm without narrowing.
- skip_layer_norm_impl.cu: compute the per-block offset/index in int64 in both
  SkipLayerNormKernel and SkipLayerNormKernelSmall.

Numerics are unchanged for normal tensor sizes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
titaiwangms and others added 12 commits June 25, 2026 01:34
…ange

The previous commit re-applied two logically separate changes together: the
EmbedLayerNormalization output write-index widening (LayerNorm() offset/idx and
output_offset) and the SkipLayerNormalization write-index widening. Split them
so each lands as its own single-op change.

This reverts only the SkipLayerNorm portion (SimplifiedLayerNorm, LayerNormSmall
and SimplifiedLayerNormSmall offset/idx in layer_norm.cuh, plus both kernels in
skip_layer_norm_impl.cu) so it can be re-applied as its own commit. The
EmbedLayerNorm write-index widening (LayerNorm() and output_offset) is kept.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Support large tensors in SkipLayerNormalization by computing the global output
element offset in 64-bit. For big rows * hidden_size (rows = batch * sequence
length), which can exceed 2^31 elements on high-memory GPUs, the previous 32-bit
row offset arithmetic could not represent the index.

- skip_layer_norm_impl.cu: compute the per-block offset/index in int64 in both
  SkipLayerNormKernel and SkipLayerNormKernelSmall.
- layer_norm.cuh: widen the global write offset/index of SimplifiedLayerNorm,
  LayerNormSmall and SimplifiedLayerNormSmall (the helpers used only by
  SkipLayerNorm) to int64_t. The gamma/beta/bias indices (ld, i,
  threadIdx.x * ILP) stay int since they are bounded by hidden_size.

LayerNorm() (used by EmbedLayerNorm) is widened separately in the
EmbedLayerNorm write-index change. Numerics are unchanged for normal sizes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ions builds

The CudaShapeMismatch, CudaInvalidBlockSizeZero, and CudaUnsupportedBits cases
reject their inputs by throwing: the first two via fail_shape_inference during
Graph::Resolve and the last via an ORT_ENFORCE in the kernel constructor. A
throw aborts under ORT_NO_EXCEPTIONS instead of surfacing as the failure that
OpTester::ExpectResult::kExpectFailure expects, so wrap them in
!defined(ORT_NO_EXCEPTIONS). CudaInvalidIndices stays unguarded because it fails
through a device error flag turned into a Status, which does not throw.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…o-exceptions builds"

This reverts commit e843cde.

The GatherBlockQuantized test file has seven pre-existing throw-based
expect-failure tests and no ORT_NO_EXCEPTIONS guards, so the provider test is
not built with exceptions disabled. Guarding only the three new CUDA tests is
inconsistent with that convention, so restore them to unguarded. The accurate
comment on CudaInvalidBlockSizeZero (it is rejected by shape inference during
Graph::Resolve, not by ComputeInternal) is kept.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ow-up change

Revert the write-index 64-bit widening currently spread across follow-up commits
so it can re-land as one coherent change. This reverts only the write-offset
int64 in the LayerNorm device helpers, the embed output_offset, and the
SkipLayerNorm kernels; the EmbedLayerNormalization read-side offsets
(word/position/segment) stay int64 as they are part of the input validation.

No net behavior change once the consolidated widening is re-applied.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…sts for no-exceptions builds""

This reverts commit 87139d3.
Support large tensors in the CUDA BERT LayerNorm write path by computing the
global output element offset in 64-bit. For big rows * hidden_size (rows =
batch * sequence_length), which can exceed 2^31 elements on high-memory GPUs,
the previous 32-bit row offset arithmetic could not represent the index.

Three files, write-offset/index only (gamma/beta/bias indices ld, i,
threadIdx.x * ILP stay int since they are bounded by hidden_size):

- layer_norm.cuh: widen the global write offset/index of LayerNorm,
  SimplifiedLayerNorm, LayerNormSmall and SimplifiedLayerNormSmall to int64_t.
  All four have callers (EmbedLayerNorm via LayerNorm; SkipLayerNorm via all
  four) whose offset = row_index * hidden_size can exceed 2^31.
- embed_layer_norm_impl.cu: compute output_offset in int64 and pass it to
  LayerNorm without narrowing.
- skip_layer_norm_impl.cu: compute the per-block offset/index in int64 in both
  SkipLayerNormKernel and SkipLayerNormKernelSmall.

Numerics are unchanged for normal tensor sizes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add brief explanatory comments at each int64-widened global write offset/index
in the CUDA BERT LayerNorm helpers and SkipLayerNorm kernels, noting that the
offset (row_index * hidden_size = batch * sequence_length * hidden_size) can
exceed 2^31 for large tensors and that gamma/beta indices intentionally stay
int (bounded by hidden_size). This prevents a future reader from narrowing the
index back to int and reintroducing the large-tensor limitation.

Comment-only; no behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e inference

GetFirstElement<T> read sizeof(T) bytes from a graph initializer's raw_data guarded only by the presence bit. Add a length check before the cast and fail shape inference cleanly when raw_data is shorter than the element size. Also clamp the computed element count to >= 0 in CalcRangeDim so shape inference matches the CPU kernel for empty/backward ranges. Add model-load regression tests in range_test.cc (guarded by #ifndef DISABLE_CONTRIB_OPS, Status-only assertions for no-exception builds).

Agent-signed-off: Developer (b2fe149b) [claude-opus-4.8 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- GetFirstElement<T>: read raw_data via std::memcpy into an aligned local after the
  length check, avoiding an unaligned access on strict-alignment targets.
- CalcRangeDim<T> and the CPU kernel ComputeRange: guard the element-count cast with
  std::isfinite so a non-finite computed count fails cleanly instead of an out-of-range
  cast (fail_shape_inference in the schema, a non-OK Status in the kernel).
- range_test.cc: generalize the model-building helper over element type, dimensions and
  per-input bytes. Truncated raw_data is now modeled as dims=[0] with empty raw_data so
  the cases are tight regressions for the length check; add coverage for start/limit/delta
  positions, float and int64 element types, a zero-delta failure, and an exact-size success
  boundary. Throwing tests are additionally guarded by !defined(ORT_NO_EXCEPTIONS).

Agent-signed-off: Developer (b2fe149b) [claude-opus-4.8 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…cast

The std::isfinite guard catches inf/NaN but not a finite count larger than the
int64 range, where static_cast<int64_t>(count) is out of range. Handle the
non-positive case before the cast (so a large-magnitude negative count cannot
overflow it) and reject counts at or above 2^63 with a neutral message. Applied
identically in the schema CalcRangeDim (fail_shape_inference) and the CPU kernel
ComputeRange (non-OK Status), keeping the two sites consistent.

Add a guarded regression test (start=0, limit=1e19, delta=1) asserting a clean
non-OK status; verified it fails without the new guard.

Agent-signed-off: Developer (b2fe149b) [claude-opus-4.8 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@tianleiwu tianleiwu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of the broader hardening pass (head f7f73c1)

This PR has grown well beyond its title: 27 files bundling ~10 independent contrib-op hardening changes plus a new SKILL doc. The attention optional-output guard fix (bert_defs.cc > 1> 2, plus the getOutputType bounds check in graph.cc) was reviewed in a prior round and that thread is resolved — it remains correct.

Genuine bug fix: the DynamicQuantizeLSTM change (R_zp_shape = r_zp->Shape(), WeightCheck(R_scale_shape, R_scale)) fixes a real copy-paste bug where the recurrence zero-point/scale were validated against the input weight shapes. Good catch, and it has a test.

Scope / process (non-blocking): per repo guidance (≤10 files, separate unrelated functional changes), this is a strong candidate to split — particularly the independent DynamicQuantizeLSTM correctness fix, which would be far easier to cherry-pick or revert on its own than buried among the defensive guards.

Performance (non-blocking, see inline notes): GatherBlockQuantized and EmbedLayerNormalization now add a device→host readback + cudaStreamSynchronize on every non-capturing run. Memory safety is already guaranteed device-side, so the sync only surfaces a clean error. For GatherBlockQuantized (dequant gather path) a per-call stream sync can hurt latency; worth confirming the tradeoff or gating the readback.

Everything else — Range overflow/alignment hardening, GridSample empty-spatial guards, GatherBlockQuantized shape/bits validation, EmbedLayerNorm device-side id validation, 64-bit LayerNorm offsets, MaxpoolWithMask rank checks — looks correct and consistent with existing patterns.

Comment thread onnxruntime/contrib_ops/cuda/quantization/gather_block_quantized.cc
Comment thread onnxruntime/contrib_ops/cuda/bert/embed_layer_norm.cc
@titaiwangms

Copy link
Copy Markdown
Contributor Author

Superseding this with #29268, a focused PR containing only the optional-output shape-inference guard fix and its test/doc. This PR's branch had been reused as the shared integration branch and accumulated a number of unrelated commits, so its diff no longer reflects the intended change. Please review #29268 instead. Thanks!

auto-merge was automatically disabled June 25, 2026 22:37

Pull request was closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants