Fix outputs and Handle parallel loading and warmup of cached and uncached models by TedThemistokleous · Pull Request #243 · ROCm/onnxruntime

TedThemistokleous · 2026-06-23T03:40:35Z

Description

Changes used to fix concurrency spikes seen with parallel warmups of models.

Found a few issues when we get a thundering heard of warmups/compiles via MIGraphX in the MIGraphX EP

Parallel compile does funny things when we
Parallel warmup captures invalid streams (multi -node)
Parallel startup can run into issues with parallel loads/compiles triggering odd allocation cases due to non parallel compile in MIGraphX

Motivation and Context

Sometimes when running multi-node setups (Multi GPU) We run into cases where all GPUs and sessions are enumerated at once (1 session per GPU) which can trigger multiple compiles of an MIGraphX model. In some cases we should actually only do 1 compile and then N-1 loads for N GPUs to ensure that we don't do the following

Waste compute recompiling a model that's been already
Trigger a bunch of parallel compiles

The changes here add stuff like device pinning for the compile/run to ensure each item is infact doing the hip graph warmup, compile and loads on the same GPU node instance.

… on fallback The hip_graph_cache is keyed only by shape_hash and was shared between the direct-bind and pinned-copy capture paths. These paths partition the program outputs differently (pre-allocated #output_N params vs. "extra" run_async results), so replaying an entry captured in one mode from the other mode leaves some ORT outputs unwritten -- surfacing as "Unsupported OrtValue type" at the fetch boundary (or stale-address reads). This happens in practice after the direct-bind pointer-drift fallback flips use_direct_hip_graph to false while leaving a direct-captured entry in the cache; the next request takes the pinned path and replays it. Tag each captured entry with the mode that produced it (direct_bind) and only replay an entry from the matching dispatcher. When the pinned path finds a stale direct-captured entry, destroy it and re-capture cleanly in pinned-copy mode.

…pGraph runs An output OrtValue is only "produced" once some path calls ctx.GetOutput() for its index. The pinned hipGraph path splits that work between copy_pinned_outputs_to_ort (pre-allocated #output_N params) and materialize_extra_outputs (extra run_async results); if an output index falls through both, ORT fails the fetch with "Unsupported OrtValue type". Add ensure_all_outputs_allocated() and call it after copy_pinned_outputs_to_ort on every pinned-copy execution path. GetOutput is idempotent, so this is a no-op for already-produced outputs and only backstops a missed one.

Batch-1 models with many small inputs (e.g. feed-gen-rec, ~190 inputs) are dominated by per-input host->device launch overhead. This adds an opt-in coalesced-input path (ORT_MIGRAPHX_COALESCE_IO) that lays out all inputs in one device arena fed by a single pinned-host staging buffer, collapsing ~N hipMemcpyAsync launches into one and binding each input as a sub-view of the arena. Details: - PinnedIOSet gains a single-arena backing (in_arena_dev / in_staging_host / input_offsets); inputs[i].data points into the arena so bind and the hipGraph zeroing/capture paths are unchanged. - copy_inputs_to_pinned gains a fast path: when the arena is active, there is no padding, and every input is host-resident, gather into pinned staging and issue one H2D for the whole arena. All other cases fall back to the existing per-input copy (still correct since pin.data is in the arena). - free_pinned_io frees the arena once instead of per-input. - The flag forces the pinned (non-direct-bind) hipGraph path, since coalescing consumes host-resident inputs that direct-bind cannot use. Warns if set without hipGraph (the path that allocates pinned I/O). Gated off by default; AMD-only (no NVIDIA/CUDA paths touched). Outputs are left per-buffer: at the EP layer each output has a distinct ORT destination, so there is no copy-count win to coalesce them here.

…ng reuse Under alternating dynamic batch sizes (e.g. an exact batch-4 request followed by a batch-3 request that pads to the batch-4 program), the ultra-fast path could reuse a binding built by the previous run. An exact-batch run binds via direct-bind (ORT output pointers at the compiled batch); reusing it for a padded request ran the model at the compiled batch and handed ORT the compiled output shape instead of the request's actual shape, producing: execution_frame.cc:160 GetOrCreateNodeOutputMLValue OrtValue shape verification failed. Current shape:{4,64} Requested shape:{3,64} This surfaced as request failures when concurrency was raised (the dynamic batcher then forms mixed batch sizes back-to-back). Track how cached_prog_params was last bound (binding mode: direct-bind vs pinned-copy, and the actual batch size it was built for) and only reuse it in the ultra-fast path when both match this request. On any mismatch, bail to the fast path, which rebinds correctly: pinned staging + output slice-back for padded requests, direct-bind for exact ones. Steady-state exact-batch reuse (the common case) is unaffected; only mode/batch transitions take the rebinding fast path. Note: this fixes the deterministic alternating-batch reuse. True concurrent Run() calls into a single shared func state (single pinned arena) would still need per-request buffers or compute-level locking; tracked separately.

Introduce the locking building blocks used to make multi-instance compile and disk-cache access safe: - g_migraphx_compile_mutex: process-wide mutex serializing all MIGraphX compiles (parse_onnx_buffer + compile_program are not thread-safe). - mutex_for_cache_key(): per-cache-file mutex registry for compile-once herd control within the process. - CacheFileLock: cross-process advisory flock on "<cache_file>.lock" for shared cache volumes (no-op on Windows / when no cache file is set). These are wired into the compile and load paths in following commits. Co-authored-by: Cursor <cursoragent@cursor.com>

save_compiled_model now writes to a per-writer temp file and atomically renames it over the target .mxr. rename(2) is atomic on a single filesystem, so concurrent readers and crashes mid-write can never observe a partially written cache file (a prior cause of "no kernel image is available for execution" on load). On rename failure the temp file is cleaned up and the existing cache is left intact. Co-authored-by: Cursor <cursoragent@cursor.com>

Acquire g_migraphx_compile_mutex at the top of CompileProgramWithBatch, the single chokepoint all compile paths flow through (parse_onnx_buffer + calibrate_and_quantize + compile_program). This guarantees at most one MIGraphX compile per process at a time across every EP instance, which is required because the parse/codegen path is not thread-safe. The lock is always acquired after any per-key cache lock, keeping the ordering one-way and deadlock-free. Co-authored-by: Cursor <cursoragent@cursor.com>

Wrap the cache-miss path in per-key locks so a given .mxr is compiled and published exactly once under concurrency: - in-process mutex_for_cache_key() serializes threads on the same cache key - cross-process CacheFileLock serializes processes sharing a cache volume - a double-checked load_precompiled_model() after acquiring the lock lets late waiters pick up the freshly published file instead of recompiling The empty-cache path returns early and compiles directly (still serialized by the global compile mutex). Lock order is per-key -> global compile mutex, never reversed. Co-authored-by: Cursor <cursoragent@cursor.com>

…model Phase 2 of precompile_all_dynamic_batch_models previously called CompileProgramWithBatch + save_compiled_model directly, bypassing the per-key cache lock and atomic-save guarantees. Route it through load_or_compile_model so it shares the same per-key/cross-process locking, double-checked load and atomic publish as all other paths, while compilation stays serialized by the global compile mutex. Co-authored-by: Cursor <cursoragent@cursor.com>

The direct-bind hipGraph path bakes ORT input/output/scratch device pointers into the captured graph and re-validates them every run. On a mismatch it increments direct_recapture_count and, once it exceeds kMaxDirectRecaptures (3), permanently falls back to eager execution. The counter was monotonic and never reset, so rare, recoverable drift (e.g. ORT allocator recycling addresses across batch-size transitions) accumulated over a long-lived session until it crossed the threshold and disabled the fast hipGraphLaunch path for good -- showing up as a large drop in inference throughput. Reset the counter to 0 on every successful pointer match so only *consecutive* mismatches can trip the fallback. Transient drift that recovers no longer counts against the limit. Co-authored-by: Cursor <cursoragent@cursor.com>

TedThemistokleous · 2026-07-01T05:17:51Z

Expediting this - tested with customerworkload and saw no regression currently with this changeset. Getting this in to get QA cycle. Can backout and patch if there's issues in upcoming week

TedThemistokleous and others added 14 commits June 12, 2026 23:47

Add padding to extra outputs

7bfc288

Pin hip device ID to loading of the models

b54660a

Add in device id guard for multi gpuloads

1bc44dc

Move to local vs global thread capture

ee3338a

TedThemistokleous force-pushed the rocm7.14_internal_testing branch from 7fe9dad to d01c3c9 Compare June 26, 2026 07:00

TedThemistokleous marked this pull request as ready for review July 1, 2026 05:16

TedThemistokleous merged commit bc88817 into rocm7.14_internal_testing Jul 1, 2026
5 of 7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix outputs and Handle parallel loading and warmup of cached and uncached models#243

Fix outputs and Handle parallel loading and warmup of cached and uncached models#243
TedThemistokleous merged 14 commits into
rocm7.14_internal_testingfrom
parallel_load_local

TedThemistokleous commented Jun 23, 2026

Uh oh!

TedThemistokleous commented Jul 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

TedThemistokleous commented Jun 23, 2026

Description

Motivation and Context

Uh oh!

TedThemistokleous commented Jul 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant