fix(ci): pin Lemonade to 10.2.0 — v10.7.0 backend can't load nomic embedding model by kovtcharov · Pull Request #1788 · amd/gaia

kovtcharov · 2026-06-19T21:09:13Z

Why this matters

Every embedding-dependent CI job — and every real user's RAG / code-index / agent-memory embedding path — broke on main when Lemonade was bumped to 10.7.0 (#1571, 2026-06-17). v10.7.0's bundled llama.cpp (build b9585) crashes loading nomic-embed-text-v2-moe-GGUF with a generic 500 model_load_error: llama-server failed to start. Regular LLM GGUFs load fine, which is why only the embedding jobs went red.

This pins Lemonade back to 10.2.0 (the proven last-known-good; the 06-17 main run loaded the embedder on device=gpu and passed all six tests) and fixes the deeper reason a version pin alone wasn't enough on the runners.

The real root cause: the backend doesn't follow the version

A Lemonade "version" is two things: the MSI LemonadeServer.exe and the separately-downloaded llama.cpp backend (the actual llama-server binary). The CI reconcile only swapped the MSI exe — it left the previous backend in place. So a downgraded 10.2.0 server kept spawning the stale b9585 backend and hit the exact same crash. The existing backend-wipe only looked at ~/.cache/lemonade/bin and silently found nothing on the strix (SYSTEM-account) runners, where the backend actually lives elsewhere.

The fix makes the reconcile-gated wipe path-agnostic: it searches every known cache and install-tree root, logs each llama-server binary found (with its version, so the layout is visible even on success), and removes only the llamacpp backend dirs — never the MSI's own server exe. A version-matched runner skips it entirely (no per-run cost, no model re-download — GGUF weights live in the separate HF hub cache).
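The wipe described above can be sketched in Python. This is an illustration only: the real reconcile step lives in the install-lemonade action and is not shown in this PR text, so the function names, the binary names, and the `llamacpp` directory constant are assumptions made for the example.

```python
import shutil
from pathlib import Path

# Hypothetical names: the real reconcile step is not reproduced here.
BACKEND_DIR_NAME = "llamacpp"
BACKEND_BINARIES = ("llama-server", "llama-server.exe")


def find_backend_binaries(roots):
    """Return every llama-server binary found under the given roots."""
    found = []
    for root in roots:
        root = Path(root)
        if not root.is_dir():
            continue  # a missing root is normal; runner layouts differ
        for name in BACKEND_BINARIES:
            found.extend(p for p in root.rglob(name) if p.is_file())
    return sorted(set(found))


def wipe_stale_backends(roots, installed_version, pinned_version):
    """Remove llamacpp backend dirs, but only when a reconcile happened."""
    if installed_version == pinned_version:
        return []  # version-matched runner: no scan, no wipe
    removed = []
    for binary in find_backend_binaries(roots):
        if not binary.exists():
            continue  # already deleted along with an earlier backend dir
        print(f"found backend binary: {binary} ({binary.stat().st_size} bytes)")
        # Walk up to the enclosing 'llamacpp' dir; anything outside one
        # (such as the MSI's own server exe) is never touched.
        backend_dir = next(
            (p for p in binary.parents if p.name == BACKEND_DIR_NAME), None
        )
        if backend_dir is not None:
            shutil.rmtree(backend_dir)
            removed.append(backend_dir)
    return removed
```

The sketch logs each binary before deciding whether to delete, so the layout stays visible in the log even when nothing is removed.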

Also: version changes now actually run the embedding tests

The embedding/RAG/lemonade-server workflows watched src/gaia/llm/**, setup.py, etc. — but not src/gaia/version.py, the single source of truth for the backend version. That's why #1571's bump shipped a broken embedder with zero embedding tests firing. version.py is now in all three path filters, so any future bump runs the tests it affects.
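The gap can be shown with a small Python approximation of a workflow `paths:` filter. The filter lists below come from the description above and may not match the real workflow files exactly; `fnmatch` is also looser than GitHub's glob rules, so treat this as a sketch rather than the actual matching logic.

```python
from fnmatch import fnmatch

# Path filters as described in the PR text; the exact lists in the real
# workflow files may differ.
OLD_FILTERS = ["src/gaia/llm/**", "setup.py"]
NEW_FILTERS = OLD_FILTERS + ["src/gaia/version.py"]


def workflow_triggers(changed_files, path_filters):
    """Approximate a GitHub Actions `paths:` filter with fnmatch.

    fnmatch lets `*` cross `/`, so this is looser than the real glob
    rules, but it is close enough to show the gap.
    """
    return any(
        fnmatch(changed, pattern)
        for changed in changed_files
        for pattern in path_filters
    )


bump = ["src/gaia/version.py"]  # the only file #1571 touched
print(workflow_triggers(bump, OLD_FILTERS))  # False
print(workflow_triggers(bump, NEW_FILTERS))  # True
```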

Why 10.2.0 (not 10.6.0)

CI bisection: 10.2.0 (pre-b8766) loads the model; 10.6.0 (b9253) and 10.7.0 (b9585) both crash. version.py carries a loud "do not bump past this until verified" note.

Test plan

Test Lemonade Embeddings green on stx — loads nomic-embed-text-v2-moe-GGUF on device=gpu, 6 tests pass
Test RAG (unit + integration) green
Lemonade Server Smoke Test (stx) green
On a runner that reconciles from 10.7.0, logs show the backend scan + llamacpp dir wipe (or a clean "fetch on first load")

Notes for reviewers

Do not merge the weekly Lemonade auto-bump PR until embeddings are verified on the new backend — version.py says so.
Follow-up (separate PR, on feat(npu): NPU-native FLM embedder (embed-gemma-300m-FLM) for the NPU profile (#1744) #1761): switch GPU/CPU profiles off the MoE embedder to a dense model (embeddinggemma-300M-GGUF or curated Qwen3-Embedding) so Lemonade can move forward — needs a RAG eval. Tracked alongside [Bug]: macOS installer (darwin-arm64) fails on first launch — bundled uv never shipped for mac-arm64 #941.

Tracking

Workaround only — this pins/reverts the Lemonade backend to dodge the crash. The root cause is a llama.cpp/llama-server bug (upstream lemonade-sdk/lemonade#612). Tracked in #1831, which stays open until upstream ships a backend that loads nomic-embed-text-v2-moe and we un-pin.

…bedding model

Lemonade v10.7.0 (bumped on main in #1571) ships llama.cpp build b9585, whose llama-server crashes loading nomic-embed-text-v2-moe-GGUF, GAIA's default embedding model for RAG, code-index, and agent memory. Loads return a generic 500 "llama-server failed to start" (matches the upstream GGML_ASSERT vocab crash in ggml-org/llama.cpp#13534). Regular LLM GGUFs load fine on b9585, which is why only the embedding CI jobs (Test RAG, Test Lemonade Embeddings) fail while API / chat / code / unit jobs stay green, and why the break is environment-wide across every branch, not any one PR.

#1571 changed only version.py, so pinning back to 10.2.0 (the proven last-known-good: the last main run before the bump was green) is a clean, fully reversible functional revert. The install-lemonade action reconciles the runner down to the pinned version and wipes the stale backend cache automatically.

Also surface the llama-server child stderr on embedding load failure in test_embeddings.yml and test_rag.yml. Lemonade swallows it behind the 500, so the actual backend crash was invisible; the server job already runs with 2>&1, so a Receive-Job dump in the load-failure path exposes the real assert and lets a future bump confirm the model loads before landing.

github-actions · 2026-06-19T21:12:27Z

Code Review — PR #1788

Summary

Solid, well-evidenced revert: pinning Lemonade back to 10.2.0 is the right call to unbreak the embedding-dependent CI jobs (and the real RAG/memory path for users on 10.7.0), and the Receive-Job stderr dump is a genuinely useful debuggability win that follows the existing pattern in these workflows. However, the PR as written will trade one red CI job for another: changing src/gaia/version.py triggers docs.yml, whose check_doc_versions.py gate fails because three doc files still hardcode 10.7.0. The "docs are out of scope" note in the description is the one thing to reconsider — for these specific files it isn't optional, it's a hard gate that this change activates.

Issues

🟡 Doc version check will fail CI on this PR (src/gaia/version.py:65)

.github/workflows/docs.yml runs python util/check_doc_versions.py as a non-tolerant step, and it triggers on any PR that touches src/gaia/version.py (it's in the workflow's paths:). The check reads LEMONADE_VERSION and flags any doc that references a different Lemonade version. With the pin now at 10.2.0, I ran it locally and it exits 1 with 8 mismatches across 3 files:

docs/cpp/setup.mdx — lines 83, 89, 179, 222
docs/guides/npu.mdx — line 15
cpp/README.md — lines 31, 36, 47

So the PR fixes the embedding jobs but turns the Docs workflow red. Since the whole point is green CI, this is blocking. Two ways to resolve, in order of preference:

Update those three files to 10.2.0 in this PR. They describe the version users actually install, so they should track the pin — this keeps docs correct rather than just quieting the linter. It's a small, mechanical change (release URLs, the tag/v… links, the version-table cell, and the v10.7.0+ text).
If you'd rather not churn the cpp/setup docs for a version you expect to re-bump soon, add them to EXCLUDE_PATTERNS in util/check_doc_versions.py — but that silences the guardrail, so I'd only do it with a comment pointing at the re-bump follow-up.

Either way it needs to happen in this PR, not a follow-up, or the docs job blocks the merge.
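For readers unfamiliar with this kind of gate, a minimal sketch of a doc version check follows. It is not the contents of util/check_doc_versions.py, which this review does not show. The regex, the "line mentions Lemonade" heuristic, and the function name are assumptions made for illustration.

```python
import re

# Hypothetical pattern: the real util/check_doc_versions.py may match
# version references differently.
VERSION_REF = re.compile(r"\bv?(\d+\.\d+\.\d+)\b")


def find_mismatches(doc_text, pinned_version):
    """Return (line_number, found_version) for each ref that differs from the pin."""
    mismatches = []
    for lineno, line in enumerate(doc_text.splitlines(), start=1):
        if "lemonade" not in line.lower():
            continue  # only inspect lines that mention Lemonade
        for match in VERSION_REF.finditer(line):
            if match.group(1) != pinned_version:
                mismatches.append((lineno, match.group(1)))
    return mismatches
```

A CI step built on a check like this would exit non-zero whenever the returned list is non-empty, which is the behavior that turns the Docs workflow red after a pin change.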

🟢 reset-lemonade.ps1 default fallback still 10.7.0 (installer/scripts/reset-lemonade.ps1:70) — if (-not $Version) { $Version = "10.7.0" } is the fallback when run outside a GAIA checkout; it now disagrees with the pin. You already called this out as follow-up and it's not a CI gate (the script prefers the parsed version.py value when available), so deferring is reasonable — flagging only so it doesn't get lost.

🟢 version.py comment length (src/gaia/version.py:55-64) — CLAUDE.md leans toward one-line WHY comments, and this is a 10-line block. I'd leave it as-is: it's a deliberate "do not bump" guardrail aimed at the weekly lemonade-version-bump.yml reviewer, names the unblock condition, and the value clearly justifies the length. Noting it only for completeness, not asking for a change.

Strengths

Exemplary root-cause writeup — discriminator (LLM GGUFs load, only the embedding model fails), timeline (last green pre-bump run vs. first failure), mechanism (b9585 / upstream GGML_ASSERT), and linked failing runs. This is exactly how a revert should be justified.
The diagnostic dump is well-placed and correct — it reuses the $env:LEMONADE_JOB_ID set when the server job starts (test_embeddings.yml:74), lives in the same step so the var is in scope, mirrors the existing Receive-Job usage already in the file, and guards on the var before reading. Identical block in both workflows keeps them in sync.
No unit-test fallout — tests/unit/test_init_command.py reads LEMONADE_VERSION dynamically and its hardcoded "10.2.0" cases pre-date the 10.7.0 bump, so the revert actually re-aligns the suite rather than breaking it.

Verdict

Request changes — the change itself is correct and well-argued, but it can't ship green until the three 10.7.0 doc references are reconciled (update to 10.2.0, or exclude with a tracking comment). Everything else is approve-quality. Once the docs job passes, this is good to merge.

…d to load nomic embed

10.7.0 (llama.cpp b9585) is the proven-bad version; 10.2.0 the proven-good one. The embedding-load regression entered the build range b8766..b9585, so 10.6.0 (b9253) is the most recent release likely to still load nomic-embed-text-v2-moe-GGUF while retaining most of the 10.x feature set. The embedding CI jobs now surface the llama-server stderr, so this pin is verified empirically rather than assumed: if 10.6.0 also fails, the logs pinpoint the bad build and we drop to the 10.2.0 floor.

Also drops the overclaimed link to llama.cpp#13534 (that crash was on build b5372, long since fixed and far older than any 10.x backend).

…doc version refs

CI on this PR confirmed 10.6.0 (llama.cpp b9253) hits the same embedding-load 500 as 10.7.0, so the regression entered the build range b8766..b9253, earlier than 10.7's bumps. 10.2.0 is the proven last-known-good; pin there. The embedding jobs on this PR now verify that empirically (green = 10.2.0 loads the model).

Sync the hardcoded Lemonade version refs in docs/cpp/setup.mdx, docs/guides/npu.mdx, and cpp/README.md to match the pin: the Doc Version Consistency check (util/check_doc_versions.py, [10/10] in Code Quality) enforces docs == LEMONADE_VERSION and was failing on the 10.7.0 leftovers.

…d-failing)

tests/test_lemonade_embeddings.py built a real LemonadeClient and called embeddings() with no skip guard, so it hard-failed instead of skipping when no Lemonade server was running locally. Gate the client fixture on require_lemonade so it skips like every other real-server test; CI (test_embeddings.yml) still exercises it against a live server.

Most Lemonade capabilities GAIA depends on were only mock-tested, and mocks prove "we called it", not "the call is valid" — they cannot catch contract bugs like #1655, where the model-pull sent recipe= for a built-in model and Lemonade 400s only on a real request. Adds tests/test_lemonade_client_integration.py: require_lemonade-gated real-server tests covering chat/completions (streaming + non-streaming), text completions, the #1655 built-in-pull contract (asserts the outgoing /pull payload carries no recipe), load/_ensure_model_loaded with explicit ctx, scoped/global/no-op unload, non-destructive delete error path, catalog + health/status/system-info/stats introspection, tool-calling round-trip + is_tool_calling_model mapping, an embeddings smoke check, and auth-header behavior. All tests SKIP cleanly when no server is up. Adds .github/workflows/test_lemonade_client.yml to run the suite against a live server (Qwen3-0.6B-GGUF pulled/loaded, GGML_VK_DISABLE_COOPMAT workaround, port 13305), mirroring test_embeddings.yml.

…failure

The nomic embedding load 500 reproduces on every Lemonade version (10.2/10.6/10.7) while LLMs load fine and the upstream GGUF is unchanged since 2025, pointing at a stale/corrupt cached GGUF in the HF hub cache (never wiped by version reconcile), not a server-version regression.

Wipe the cached nomic dir before pull to force a clean re-download, and on load failure dump the cached file sizes + tail Lemonade's log files (the prior Receive-Job dump came back empty: Lemonade logs to a file, not the serve job's stdout). Decisive: clean re-download loads => corrupt cache; still fails => runner env/llama.cpp issue. gaia#941.

… MoE crash

Root cause (gaia#941): the nomic-embed-text-v2-moe (MoE) model crashes llama-server on the stx runners' Vulkan backend. Proven not a version/model-file/cache issue: it loads + runs fine on Metal locally (Lemonade 10.3.0) and LLMs load on Vulkan here, yet the embedder fails identically across Lemonade 10.2/10.6/10.7.

Lemonade launches the embedder with -ngl 99 (full GPU offload), building the crashing Vulkan MoE kernels; loading with llamacpp_args='-ngl 0' keeps it on CPU (embeddings are cheap) and sidesteps the broken path. Also waits for the async pull to finish downloading before load. If this run goes green, root cause + fix are confirmed.

github-actions · 2026-06-21T17:33:15Z

🟡 .github/workflows/test_lemonade_client.yml:277-288 — embedding model pulled but not loaded with -ngl 0, so TestLemonadeEmbeddingsSmoke.test_embeddings_768_dim will likely trigger the same Vulkan MoE crash that was fixed in test_embeddings.yml.

The embeddings workflow was specifically updated (in a prior commit on this PR) to:

Poll until the GGUF file is fully downloaded before proceeding.
Call /api/v1/load with llamacpp_args = "-ngl 0" to force CPU offload and dodge the crash.

This workflow does neither — it fires-and-forgets the pull, waits a static 30 s, then runs the suite. When client.embeddings(model=EMBED_MODEL) triggers an auto-load, Lemonade will use the default -ngl 99, building the crashing Vulkan MoE kernels.

Minimal fix: mirror the embeddings workflow's pull-wait loop and load step here, or skip the 768-dim smoke test in this workflow entirely and rely on test_embeddings.yml for full coverage.
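A pull-wait loop of the kind described can be sketched as follows. How test_embeddings.yml actually detects a finished download is not shown in this thread; using "file size stopped changing" as the completion signal is an assumption of this example, as are the function name and default timings.

```python
import time
from pathlib import Path


def wait_for_stable_file(path, timeout=600.0, interval=5.0):
    """Block until `path` exists and its size stops changing between polls.

    Size stability is a heuristic stand-in for "download finished"; it is
    an assumption of this sketch, not necessarily what the workflow checks.
    """
    path = Path(path)
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        size = path.stat().st_size if path.exists() else -1
        if size > 0 and size == last_size:
            return size  # two consecutive polls saw the same non-empty size
        last_size = size
        time.sleep(interval)
    raise TimeoutError(f"{path} did not finish downloading within {timeout}s")
```

Compared with a static 30 s sleep, a bounded poll fails loudly when the download stalls instead of letting the test suite start against a partial file.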

…tch) on CPU

A prior cache-inspect step deleted the runner's pre-seeded nomic GGUF, and the bare REST /pull is registration-only on these runners (won't re-download), so the load test couldn't run. Switch to LemonadeClient.load_model(auto_download=True, llamacpp_args='-ngl 0'): a real HF download that restores the model AND exercises GAIA's actual load path, loading the MoE embedder on CPU to dodge the Vulkan crash (gaia#941).

github-actions · 2026-06-21T17:56:03Z

🟡 test_lemonade_client.yml:265-270 — embedding model pulled but not loaded before smoke test

The workflow calls /api/v1/pull for nomic-embed-text-v2-moe-GGUF but never calls /api/v1/load. TestLemonadeEmbeddingsSmoke.test_embeddings_768_dim (line 879) then calls client.embeddings(..., timeout=120) expecting the server to serve it. The 4× timeout vs. test_lemonade_embeddings.py's timeout=30 suggests the author expected an auto-load, but the other two CI workflows that exercise this model (test_embeddings.yml, test_rag.yml) both explicitly load it first — test_embeddings.yml even uses LemonadeClient.load_model(..., llamacpp_args='-ngl 0') to do so. The test docstring also says "pulled/loaded" when the workflow only pulls.

If Lemonade 10.2.0 doesn't auto-load on embeddings requests (which is the pattern this repo has always assumed), test_embeddings_768_dim will fail every CI run of the new workflow.

Fix: add a python -c "... c.load_model('nomic-embed-text-v2-moe-GGUF', auto_download=True, prompt=False, timeout=600)" step (matching the pattern already in test_embeddings.yml) immediately after the pull step, and correct the docstring to say "pulled and loaded".

The -ngl 0 CPU workaround in test_embeddings.yml never worked: Lemonade reserves -ngl as a managed argument and 400s any custom override, so the embeddings job failed on argument validation rather than loading at all. It was also built on a false premise -- the 06-17 main run proved nomic-embed-text-v2-moe loads on device=gpu under 10.2.0 and passes all six embedding tests, so no CPU fallback is needed (and CPU would only degrade real-user embedding throughput). Revert test_embeddings.yml / test_rag.yml / test_lemonade_embeddings.py to the proven-green GPU-load path and drop the unrelated real-server client integration suite. The fix is now solely the 10.2.0 pin (already in version.py) plus the doc version syncs.

github-actions · 2026-06-23T16:19:07Z

🟡 The weekly lemonade-version-bump.yml workflow will automatically open a bump PR to 10.7.0 (or newer) every Thursday — since the new pin (10.2.0) looks "behind" the latest release, Guard 1 in that workflow passes. If a maintainer merges that auto-bump PR without testing embeddings, the regression silently comes back. The comment in version.py won't stop the workflow. A guard needs to be added to the workflow (or the workflow temporarily disabled) until the regression is confirmed fixed.

Also: docs/guides/npu.mdx now reads v10.2.0+, which implies any version ≥ 10.2.0 works — but 10.6.0 and 10.7.0 break RAG embeddings. A user who manually downloads "the latest" Lemonade after reading that page will hit the regression.

🔍 Technical details

Auto-bump problem — .github/workflows/lemonade-version-bump.yml:80-88:

newest=$(printf '%s\n%s\n' "$current" "$target" | sort -V | tail -1)
if [ "$target" = "$current" ] || [ "$newest" = "$current" ]; then
  echo "Already on the latest …"
  exit 0
fi

With current=10.2.0 and latest=10.7.0 (or 10.8.0+ over time), Guard 1 will always pass and open a fresh bump PR for each new Lemonade release. Guard 3 stops re-opening for the same version, but not for a newer one. Add a SKIP_LEMONADE_BUMP environment variable or a sentinel in version.py that the workflow checks before proceeding, so the regression hold is machine-enforced rather than relying on reviewers reading a Python comment.
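The Guard 1 logic, plus the suggested machine-enforced hold, can be expressed in Python as a rough sketch. The `hold` parameter is hypothetical and stands in for whatever sentinel or SKIP_LEMONADE_BUMP-style variable the workflow might read. The tuple comparison only approximates `sort -V` for plain dotted numeric versions.

```python
def parse_version(text):
    """Turn '10.2.0' into (10, 2, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def should_open_bump_pr(current, target, hold=False):
    """Mirror Guard 1, plus a hypothetical machine-enforced hold flag."""
    if hold:
        return False  # regression hold: skip regardless of versions
    # Guard 1: nothing to do when the pin is already at or ahead of target.
    return parse_version(target) > parse_version(current)


print(should_open_bump_pr("10.2.0", "10.7.0"))             # True
print(should_open_bump_pr("10.2.0", "10.7.0", hold=True))  # False
```

Without the hold, every release newer than the pin yields `True`, which is the behavior the review flags.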

npu.mdx — docs/guides/npu.mdx:15: change v10.2.0+ to v10.2.0 (exact) and add a note that versions 10.6.0–10.7.0 have a known embedding regression (tracked in #941). Otherwise users following the guide will install the latest broken release.

The embedding regression (#1571) shipped because bumping LEMONADE_VERSION in src/gaia/version.py changed the Lemonade backend but triggered none of the embedding/RAG/lemonade-server workflows -- their path filters watched src/gaia/llm/**, setup.py, and the test files, but not version.py, the single source of truth for the backend version. Add version.py to all three path filters so a future bump (or this pin) actually runs the tests it affects.

github-actions · 2026-06-23T16:22:27Z

🟡 The 9-line comment block above LEMONADE_VERSION in version.py violates the project's "one short line max" comment rule — CLAUDE.md explicitly calls this pattern out as the anti-pattern to avoid. The history and reasoning belong in the PR description and commit message (where they already live), not inline in the source.

Condense to one line that names the constraint and the tracker:

# Pinned at 10.2.0: newer builds crash loading nomic-embed-text-v2-moe-GGUF. See #941.
LEMONADE_VERSION = "10.2.0"

🔍 Technical details

src/gaia/version.py:11-22 — the added block is 9 lines of historical narrative. CLAUDE.md says:

Keep WHY comments to one short line. Multi-paragraph "history of how we got here" blocks are noise — the diff, commit message, and linked issue carry the history.

and

Never write multi-paragraph docstrings or multi-line comment blocks — one short line max.

The canonical CLAUDE.md bad-example is nearly identical in shape (multi-line # Pre-#1030 follow-up… block). The suggested single-line replacement above names the invariant (don't bump past 10.2.0), the symptom (crash on the specific model), and where to find the full context (#941) — everything a future author needs without the wall of text.

…oad failure

The 10.2.0 pin did not green CI: nomic-embed-text-v2-moe still fails with the generic 'llama-server failed to start' 500, which hides the backend's own stderr. Add a diagnostic block to the load-failure path that prints the actual Lemonade/llama-server logs, every llama-server binary found (path + size + embedded version), and the lemonade cache/install dir layout -- so we can tell whether a stale newer backend survived the MSI downgrade or a fresh 10.2.0 backend itself crashes on the runner. No behavior change on success.

The reconcile-gated backend wipe only checked <root>\.cache\lemonade\bin and silently found nothing on the strix runners, so an MSI version change swapped LemonadeServer.exe but left the previous backend (the actual llama-server binary) in place -- a downgraded 10.2.0 server then spawned the stale newer backend and the load 500'd with 'llama-server failed to start'. Make the wipe path-agnostic: search every known cache and install-tree root (.cache\lemonade and AppData\Local\lemonade_server, SYSTEM-profile variants included), log each llama-server binary found with its version, and remove only the 'llamacpp' backend dirs -- never the MSI's own server exe. Stays reconcile-gated, so a version-matched runner does nothing. Models live in the separate HF hub cache and are untouched (no model re-download).

kovtcharov · 2026-06-26T16:42:58Z

Closing — the premise (pin to 10.2.0 because v10.7.0 can't load the nomic embedding model) is invalidated by deeper investigation:

The llama-server failed to start failure for nomic-embed-text-v2-moe is transient and version-independent, not a clean 10.7.0 regression. 10.2.0 also fails when the fault is active, and it reproduced on llama.cpp b6510 (pre-b6524). So a downgrade does not make CI reliable.
Main has since moved to 10.8.1 (feat(lemonade): upgrade pinned Lemonade Server to 10.8.1 #1869), so this would also be a multi-month downgrade.
The real fix is a bounded client-side retry on the transient fault — feat(lemonade): retry transient "llama-server failed to start" on model load #1876 (validated on real hardware) — which makes GAIA's RAG/chat/VLM loads resilient. Findings + upstream asks are tracked in Known Issue: llama-server >= b6524 doesnt work with nomic-embed-text-v2-moe.Q8_0.gguf lemonade-sdk/lemonade#612.

Superseded by #1876 (same reason the equivalent revert PR #1872 was closed).
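For context, a bounded retry on this kind of transient fault could look roughly like the sketch below. It is not the code from #1876, which is not shown here. The function name, the use of RuntimeError, and the attempt and delay defaults are assumptions; only the error string comes from this thread.

```python
import time

TRANSIENT_MARKER = "llama-server failed to start"


def load_with_retry(load, attempts=3, delay=2.0):
    """Call `load()` and retry only when the error looks like the transient fault.

    `load` is any zero-argument callable; the attempt count and delay are
    illustrative defaults, not values taken from #1876.
    """
    for attempt in range(1, attempts + 1):
        try:
            return load()
        except RuntimeError as exc:
            transient = TRANSIENT_MARKER in str(exc)
            if not transient or attempt == attempts:
                raise  # non-transient errors and the final failure propagate
            time.sleep(delay)
```

Matching on the specific message keeps genuine failures, such as a 400 on argument validation, from being retried.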

kovtcharov requested a review from kovtcharov-amd as a code owner June 19, 2026 21:09

github-actions Bot added the devops DevOps/infrastructure changes label Jun 19, 2026

kovtcharov changed the title ~~fix(ci): pin Lemonade to 10.2.0 — v10.7.0 backend can't load nomic embedding model~~ fix(ci): pin Lemonade to 10.6.0 — v10.7.0 backend can't load nomic embedding model Jun 20, 2026

github-actions Bot added documentation Documentation changes cpp labels Jun 20, 2026

kovtcharov added 3 commits June 20, 2026 12:00

github-actions Bot added the tests Test changes label Jun 21, 2026

kovtcharov-amd changed the title ~~fix(ci): pin Lemonade to 10.6.0 — v10.7.0 backend can't load nomic embedding model~~ fix(ci): pin Lemonade to 10.2.0 — v10.7.0 backend can't load nomic embedding model Jun 23, 2026

Ovtcharov added 2 commits June 23, 2026 09:40

kovtcharov mentioned this pull request Jun 23, 2026

CI: Lemonade v10.7.0 breaks embedding-model jobs (nomic-embed-text-v2-moe / llama-server ≥ b6524) #1831

Open

kovtcharov closed this Jun 26, 2026

kovtcharov wants to merge 12 commits into main from fix/lemonade-pin-10.2.0-embed-regression
