Skip to content

Fix outputs and Handle parallel loading and warmup of cached and uncached models#243

Merged
TedThemistokleous merged 14 commits into
rocm7.14_internal_testingfrom
parallel_load_local
Jul 1, 2026
Merged

Fix outputs and Handle parallel loading and warmup of cached and uncached models#243
TedThemistokleous merged 14 commits into
rocm7.14_internal_testingfrom
parallel_load_local

Conversation

@TedThemistokleous

Copy link
Copy Markdown
Collaborator

Description

Changes used to fix concurrency spikes seen with parallel warmups of models.

Found a few issues when we get a thundering heard of warmups/compiles via MIGraphX in the MIGraphX EP

  • Parallel compile does funny things when we
  • Parallel warmup captures invalid streams (multi -node)
  • Parallel startup can run into issues with parallel loads/compiles triggering odd allocation cases due to non parallel compile in MIGraphX

Motivation and Context

Sometimes when running multi-node setups (Multi GPU) We run into cases where all GPUs and sessions are enumerated at once (1 session per GPU) which can trigger multiple compiles of an MIGraphX model. In some cases we should actually only do 1 compile and then N-1 loads for N GPUs to ensure that we don't do the following

  1. Waste compute recompiling a model that's been already
  2. Trigger a bunch of parallel compiles

The changes here add stuff like device pinning for the compile/run to ensure each item is infact doing the hip graph warmup, compile and loads on the same GPU node instance.

TedThemistokleous and others added 14 commits June 12, 2026 23:47
… on fallback

The hip_graph_cache is keyed only by shape_hash and was shared between the
direct-bind and pinned-copy capture paths. These paths partition the program
outputs differently (pre-allocated #output_N params vs. "extra" run_async
results), so replaying an entry captured in one mode from the other mode leaves
some ORT outputs unwritten -- surfacing as "Unsupported OrtValue type" at the
fetch boundary (or stale-address reads).

This happens in practice after the direct-bind pointer-drift fallback flips
use_direct_hip_graph to false while leaving a direct-captured entry in the
cache; the next request takes the pinned path and replays it.

Tag each captured entry with the mode that produced it (direct_bind) and only
replay an entry from the matching dispatcher. When the pinned path finds a stale
direct-captured entry, destroy it and re-capture cleanly in pinned-copy mode.
…pGraph runs

An output OrtValue is only "produced" once some path calls ctx.GetOutput() for
its index. The pinned hipGraph path splits that work between
copy_pinned_outputs_to_ort (pre-allocated #output_N params) and
materialize_extra_outputs (extra run_async results); if an output index falls
through both, ORT fails the fetch with "Unsupported OrtValue type".

Add ensure_all_outputs_allocated() and call it after copy_pinned_outputs_to_ort
on every pinned-copy execution path. GetOutput is idempotent, so this is a no-op
for already-produced outputs and only backstops a missed one.
Batch-1 models with many small inputs (e.g. feed-gen-rec, ~190 inputs)
are dominated by per-input host->device launch overhead. This adds an
opt-in coalesced-input path (ORT_MIGRAPHX_COALESCE_IO) that lays out all
inputs in one device arena fed by a single pinned-host staging buffer,
collapsing ~N hipMemcpyAsync launches into one and binding each input as
a sub-view of the arena.

Details:
- PinnedIOSet gains a single-arena backing (in_arena_dev / in_staging_host
  / input_offsets); inputs[i].data points into the arena so bind and the
  hipGraph zeroing/capture paths are unchanged.
- copy_inputs_to_pinned gains a fast path: when the arena is active, there
  is no padding, and every input is host-resident, gather into pinned
  staging and issue one H2D for the whole arena. All other cases fall back
  to the existing per-input copy (still correct since pin.data is in the
  arena).
- free_pinned_io frees the arena once instead of per-input.
- The flag forces the pinned (non-direct-bind) hipGraph path, since
  coalescing consumes host-resident inputs that direct-bind cannot use.
  Warns if set without hipGraph (the path that allocates pinned I/O).

Gated off by default; AMD-only (no NVIDIA/CUDA paths touched). Outputs are
left per-buffer: at the EP layer each output has a distinct ORT
destination, so there is no copy-count win to coalesce them here.
…ng reuse

Under alternating dynamic batch sizes (e.g. an exact batch-4 request
followed by a batch-3 request that pads to the batch-4 program), the
ultra-fast path could reuse a binding built by the previous run. An
exact-batch run binds via direct-bind (ORT output pointers at the
compiled batch); reusing it for a padded request ran the model at the
compiled batch and handed ORT the compiled output shape instead of the
request's actual shape, producing:

  execution_frame.cc:160 GetOrCreateNodeOutputMLValue
  OrtValue shape verification failed. Current shape:{4,64} Requested shape:{3,64}

This surfaced as request failures when concurrency was raised (the
dynamic batcher then forms mixed batch sizes back-to-back).

Track how cached_prog_params was last bound (binding mode: direct-bind
vs pinned-copy, and the actual batch size it was built for) and only
reuse it in the ultra-fast path when both match this request. On any
mismatch, bail to the fast path, which rebinds correctly: pinned staging
+ output slice-back for padded requests, direct-bind for exact ones.

Steady-state exact-batch reuse (the common case) is unaffected; only
mode/batch transitions take the rebinding fast path.

Note: this fixes the deterministic alternating-batch reuse. True
concurrent Run() calls into a single shared func state (single pinned
arena) would still need per-request buffers or compute-level locking;
tracked separately.
Introduce the locking building blocks used to make multi-instance compile
and disk-cache access safe:

- g_migraphx_compile_mutex: process-wide mutex serializing all MIGraphX
  compiles (parse_onnx_buffer + compile_program are not thread-safe).
- mutex_for_cache_key(): per-cache-file mutex registry for compile-once
  herd control within the process.
- CacheFileLock: cross-process advisory flock on "<cache_file>.lock" for
  shared cache volumes (no-op on Windows / when no cache file is set).

These are wired into the compile and load paths in following commits.

Co-authored-by: Cursor <cursoragent@cursor.com>
save_compiled_model now writes to a per-writer temp file and atomically
renames it over the target .mxr. rename(2) is atomic on a single
filesystem, so concurrent readers and crashes mid-write can never observe
a partially written cache file (a prior cause of "no kernel image is
available for execution" on load). On rename failure the temp file is
cleaned up and the existing cache is left intact.

Co-authored-by: Cursor <cursoragent@cursor.com>
Acquire g_migraphx_compile_mutex at the top of CompileProgramWithBatch,
the single chokepoint all compile paths flow through (parse_onnx_buffer +
calibrate_and_quantize + compile_program). This guarantees at most one
MIGraphX compile per process at a time across every EP instance, which is
required because the parse/codegen path is not thread-safe. The lock is
always acquired after any per-key cache lock, keeping the ordering one-way
and deadlock-free.

Co-authored-by: Cursor <cursoragent@cursor.com>
Wrap the cache-miss path in per-key locks so a given .mxr is compiled and
published exactly once under concurrency:

- in-process mutex_for_cache_key() serializes threads on the same cache key
- cross-process CacheFileLock serializes processes sharing a cache volume
- a double-checked load_precompiled_model() after acquiring the lock lets
  late waiters pick up the freshly published file instead of recompiling

The empty-cache path returns early and compiles directly (still serialized
by the global compile mutex). Lock order is per-key -> global compile mutex,
never reversed.

Co-authored-by: Cursor <cursoragent@cursor.com>
…model

Phase 2 of precompile_all_dynamic_batch_models previously called
CompileProgramWithBatch + save_compiled_model directly, bypassing the
per-key cache lock and atomic-save guarantees. Route it through
load_or_compile_model so it shares the same per-key/cross-process locking,
double-checked load and atomic publish as all other paths, while
compilation stays serialized by the global compile mutex.

Co-authored-by: Cursor <cursoragent@cursor.com>
The direct-bind hipGraph path bakes ORT input/output/scratch device
pointers into the captured graph and re-validates them every run. On a
mismatch it increments direct_recapture_count and, once it exceeds
kMaxDirectRecaptures (3), permanently falls back to eager execution.

The counter was monotonic and never reset, so rare, recoverable drift
(e.g. ORT allocator recycling addresses across batch-size transitions)
accumulated over a long-lived session until it crossed the threshold and
disabled the fast hipGraphLaunch path for good -- showing up as a large
drop in inference throughput.

Reset the counter to 0 on every successful pointer match so only
*consecutive* mismatches can trip the fallback. Transient drift that
recovers no longer counts against the limit.

Co-authored-by: Cursor <cursoragent@cursor.com>
@TedThemistokleous TedThemistokleous force-pushed the rocm7.14_internal_testing branch from 7fe9dad to d01c3c9 Compare June 26, 2026 07:00
@TedThemistokleous TedThemistokleous marked this pull request as ready for review July 1, 2026 05:16
@TedThemistokleous

Copy link
Copy Markdown
Collaborator Author

Expediting this - tested with customerworkload and saw no regression currently with this changeset. Getting this in to get QA cycle. Can backout and patch if there's issues in upcoming week

@TedThemistokleous TedThemistokleous merged commit bc88817 into rocm7.14_internal_testing Jul 1, 2026
5 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant