Fix outputs and Handle parallel loading and warmup of cached and uncached models#243
Merged
TedThemistokleous merged 14 commits intoJul 1, 2026
Merged
Conversation
… on fallback The hip_graph_cache is keyed only by shape_hash and was shared between the direct-bind and pinned-copy capture paths. These paths partition the program outputs differently (pre-allocated #output_N params vs. "extra" run_async results), so replaying an entry captured in one mode from the other mode leaves some ORT outputs unwritten -- surfacing as "Unsupported OrtValue type" at the fetch boundary (or stale-address reads). This happens in practice after the direct-bind pointer-drift fallback flips use_direct_hip_graph to false while leaving a direct-captured entry in the cache; the next request takes the pinned path and replays it. Tag each captured entry with the mode that produced it (direct_bind) and only replay an entry from the matching dispatcher. When the pinned path finds a stale direct-captured entry, destroy it and re-capture cleanly in pinned-copy mode.
…pGraph runs An output OrtValue is only "produced" once some path calls ctx.GetOutput() for its index. The pinned hipGraph path splits that work between copy_pinned_outputs_to_ort (pre-allocated #output_N params) and materialize_extra_outputs (extra run_async results); if an output index falls through both, ORT fails the fetch with "Unsupported OrtValue type". Add ensure_all_outputs_allocated() and call it after copy_pinned_outputs_to_ort on every pinned-copy execution path. GetOutput is idempotent, so this is a no-op for already-produced outputs and only backstops a missed one.
Batch-1 models with many small inputs (e.g. feed-gen-rec, ~190 inputs) are dominated by per-input host->device launch overhead. This adds an opt-in coalesced-input path (ORT_MIGRAPHX_COALESCE_IO) that lays out all inputs in one device arena fed by a single pinned-host staging buffer, collapsing ~N hipMemcpyAsync launches into one and binding each input as a sub-view of the arena. Details: - PinnedIOSet gains a single-arena backing (in_arena_dev / in_staging_host / input_offsets); inputs[i].data points into the arena so bind and the hipGraph zeroing/capture paths are unchanged. - copy_inputs_to_pinned gains a fast path: when the arena is active, there is no padding, and every input is host-resident, gather into pinned staging and issue one H2D for the whole arena. All other cases fall back to the existing per-input copy (still correct since pin.data is in the arena). - free_pinned_io frees the arena once instead of per-input. - The flag forces the pinned (non-direct-bind) hipGraph path, since coalescing consumes host-resident inputs that direct-bind cannot use. Warns if set without hipGraph (the path that allocates pinned I/O). Gated off by default; AMD-only (no NVIDIA/CUDA paths touched). Outputs are left per-buffer: at the EP layer each output has a distinct ORT destination, so there is no copy-count win to coalesce them here.
…ng reuse
Under alternating dynamic batch sizes (e.g. an exact batch-4 request
followed by a batch-3 request that pads to the batch-4 program), the
ultra-fast path could reuse a binding built by the previous run. An
exact-batch run binds via direct-bind (ORT output pointers at the
compiled batch); reusing it for a padded request ran the model at the
compiled batch and handed ORT the compiled output shape instead of the
request's actual shape, producing:
execution_frame.cc:160 GetOrCreateNodeOutputMLValue
OrtValue shape verification failed. Current shape:{4,64} Requested shape:{3,64}
This surfaced as request failures when concurrency was raised (the
dynamic batcher then forms mixed batch sizes back-to-back).
Track how cached_prog_params was last bound (binding mode: direct-bind
vs pinned-copy, and the actual batch size it was built for) and only
reuse it in the ultra-fast path when both match this request. On any
mismatch, bail to the fast path, which rebinds correctly: pinned staging
+ output slice-back for padded requests, direct-bind for exact ones.
Steady-state exact-batch reuse (the common case) is unaffected; only
mode/batch transitions take the rebinding fast path.
Note: this fixes the deterministic alternating-batch reuse. True
concurrent Run() calls into a single shared func state (single pinned
arena) would still need per-request buffers or compute-level locking;
tracked separately.
Introduce the locking building blocks used to make multi-instance compile and disk-cache access safe: - g_migraphx_compile_mutex: process-wide mutex serializing all MIGraphX compiles (parse_onnx_buffer + compile_program are not thread-safe). - mutex_for_cache_key(): per-cache-file mutex registry for compile-once herd control within the process. - CacheFileLock: cross-process advisory flock on "<cache_file>.lock" for shared cache volumes (no-op on Windows / when no cache file is set). These are wired into the compile and load paths in following commits. Co-authored-by: Cursor <cursoragent@cursor.com>
save_compiled_model now writes to a per-writer temp file and atomically renames it over the target .mxr. rename(2) is atomic on a single filesystem, so concurrent readers and crashes mid-write can never observe a partially written cache file (a prior cause of "no kernel image is available for execution" on load). On rename failure the temp file is cleaned up and the existing cache is left intact. Co-authored-by: Cursor <cursoragent@cursor.com>
Acquire g_migraphx_compile_mutex at the top of CompileProgramWithBatch, the single chokepoint all compile paths flow through (parse_onnx_buffer + calibrate_and_quantize + compile_program). This guarantees at most one MIGraphX compile per process at a time across every EP instance, which is required because the parse/codegen path is not thread-safe. The lock is always acquired after any per-key cache lock, keeping the ordering one-way and deadlock-free. Co-authored-by: Cursor <cursoragent@cursor.com>
Wrap the cache-miss path in per-key locks so a given .mxr is compiled and published exactly once under concurrency: - in-process mutex_for_cache_key() serializes threads on the same cache key - cross-process CacheFileLock serializes processes sharing a cache volume - a double-checked load_precompiled_model() after acquiring the lock lets late waiters pick up the freshly published file instead of recompiling The empty-cache path returns early and compiles directly (still serialized by the global compile mutex). Lock order is per-key -> global compile mutex, never reversed. Co-authored-by: Cursor <cursoragent@cursor.com>
…model Phase 2 of precompile_all_dynamic_batch_models previously called CompileProgramWithBatch + save_compiled_model directly, bypassing the per-key cache lock and atomic-save guarantees. Route it through load_or_compile_model so it shares the same per-key/cross-process locking, double-checked load and atomic publish as all other paths, while compilation stays serialized by the global compile mutex. Co-authored-by: Cursor <cursoragent@cursor.com>
The direct-bind hipGraph path bakes ORT input/output/scratch device pointers into the captured graph and re-validates them every run. On a mismatch it increments direct_recapture_count and, once it exceeds kMaxDirectRecaptures (3), permanently falls back to eager execution. The counter was monotonic and never reset, so rare, recoverable drift (e.g. ORT allocator recycling addresses across batch-size transitions) accumulated over a long-lived session until it crossed the threshold and disabled the fast hipGraphLaunch path for good -- showing up as a large drop in inference throughput. Reset the counter to 0 on every successful pointer match so only *consecutive* mismatches can trip the fallback. Transient drift that recovers no longer counts against the limit. Co-authored-by: Cursor <cursoragent@cursor.com>
7fe9dad to
d01c3c9
Compare
Collaborator
Author
|
Expediting this - tested with customerworkload and saw no regression currently with this changeset. Getting this in to get QA cycle. Can backout and patch if there's issues in upcoming week |
bc88817
into
rocm7.14_internal_testing
5 of 7 checks passed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Changes used to fix concurrency spikes seen with parallel warmups of models.
Found a few issues when we get a thundering heard of warmups/compiles via MIGraphX in the MIGraphX EP
Motivation and Context
Sometimes when running multi-node setups (Multi GPU) We run into cases where all GPUs and sessions are enumerated at once (1 session per GPU) which can trigger multiple compiles of an MIGraphX model. In some cases we should actually only do 1 compile and then N-1 loads for N GPUs to ensure that we don't do the following
The changes here add stuff like device pinning for the compile/run to ensure each item is infact doing the hip graph warmup, compile and loads on the same GPU node instance.