fix(ci): pin Lemonade to 10.2.0 — v10.7.0 backend can't load nomic embedding model#1788

Closed
kovtcharov wants to merge 12 commits into main from fix/lemonade-pin-10.2.0-embed-regression

Conversation

@kovtcharov kovtcharov commented Jun 19, 2026

Collaborator

Why this matters

Every embedding-dependent CI job — and every real user's RAG / code-index / agent-memory embedding path — broke on main when Lemonade was bumped to 10.7.0 (#1571, 2026-06-17). v10.7.0's bundled llama.cpp (build b9585) crashes loading nomic-embed-text-v2-moe-GGUF with a generic 500 model_load_error: llama-server failed to start. Regular LLM GGUFs load fine, which is why only the embedding jobs went red.

This pins Lemonade back to 10.2.0 (the proven last-known-good; the 06-17 main run loaded the embedder on device=gpu and passed all six tests) and fixes the deeper reason a version pin alone wasn't enough on the runners.

The real root cause: the backend doesn't follow the version

A Lemonade "version" is two things: the MSI LemonadeServer.exe and the separately-downloaded llama.cpp backend (the actual llama-server binary). The CI reconcile only swapped the MSI exe — it left the previous backend in place. So a downgraded 10.2.0 server kept spawning the stale b9585 backend and hit the exact same crash. The existing backend-wipe only looked at ~/.cache/lemonade/bin and silently found nothing on the strix (SYSTEM-account) runners, where the backend actually lives elsewhere.

The fix makes the reconcile-gated wipe path-agnostic: it searches every known cache and install-tree root, logs each llama-server binary found (with its version, so the layout is visible even on success), and removes only the llamacpp backend dirs — never the MSI's own server exe. A version-matched runner skips it entirely (no per-run cost, no model re-download — GGUF weights live in the separate HF hub cache).
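The wipe described above can be sketched roughly as follows. This is an illustration in Python, not the action's actual script: the function names, the `llamacpp` directory name constant, and the binary names are assumptions taken from the wording of this description, and the version probe is left out (only path and size are logged).

```python
import shutil
from pathlib import Path

# Hypothetical names; the real action's roots and binary names may differ.
BACKEND_DIR_NAME = "llamacpp"
SERVER_BINARIES = ("llama-server", "llama-server.exe")


def find_backend_dirs(roots):
    """Return every 'llamacpp' directory found under the given roots."""
    found = []
    for root in roots:
        root = Path(root)
        if not root.is_dir():
            continue  # a missing root is normal; runner layouts differ
        for path in root.rglob(BACKEND_DIR_NAME):
            if path.is_dir():
                found.append(path)
    return found


def wipe_backends(roots, log=print):
    """Log each llama-server binary, then delete only the backend dirs."""
    removed = []
    for backend in find_backend_dirs(roots):
        if not backend.exists():
            continue  # already removed together with a parent backend dir
        for name in SERVER_BINARIES:
            for binary in backend.rglob(name):
                log(f"found {binary} ({binary.stat().st_size} bytes)")
        shutil.rmtree(backend)
        removed.append(backend)
    return removed
```

Because only directories named `llamacpp` are removed, anything else under the same roots, including the MSI's server exe, is left in place.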

Also: version changes now actually run the embedding tests

The embedding/RAG/lemonade-server workflows watched src/gaia/llm/**, setup.py, etc. — but not src/gaia/version.py, the single source of truth for the backend version. That's why #1571's bump shipped a broken embedder with zero embedding tests firing. version.py is now in all three path filters, so any future bump runs the tests it affects.
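To make the gap concrete, here is a small sketch of why the bump ran no embedding tests. It approximates a workflow `paths:` filter with Python's `fnmatch`, which is not identical to GitHub's glob matching, and the filter lists are abbreviated from the description above rather than copied from the workflows.

```python
from fnmatch import fnmatch

# Filters as described in this PR; the full workflow lists are not shown here.
OLD_PATHS = ["src/gaia/llm/**", "setup.py"]
NEW_PATHS = OLD_PATHS + ["src/gaia/version.py"]


def workflow_triggers(changed_files, path_filters):
    """Approximate a `paths:` filter: true if any changed file matches any pattern."""
    return any(fnmatch(f, p) for f in changed_files for p in path_filters)


bump_pr = ["src/gaia/version.py"]  # the only file #1571 changed

print(workflow_triggers(bump_pr, OLD_PATHS))  # old filters: bump runs nothing
print(workflow_triggers(bump_pr, NEW_PATHS))  # new filters: bump runs the tests
```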

Why 10.2.0 (not 10.6.0)

CI bisection: 10.2.0 (pre-b8766) loads the model; 10.6.0 (b9253) and 10.7.0 (b9585) both crash. version.py carries a loud "do not bump past this until verified" note.

Test plan

  • Test Lemonade Embeddings green on stx — loads nomic-embed-text-v2-moe-GGUF on device=gpu, 6 tests pass
  • Test RAG (unit + integration) green
  • Lemonade Server Smoke Test (stx) green
  • On a runner that reconciles from 10.7.0, logs show the backend scan + llamacpp dir wipe (or a clean "fetch on first load")

Notes for reviewers

Tracking

Workaround only — this pins/reverts the Lemonade backend to dodge the crash. The root cause is a llama.cpp/llama-server bug (upstream lemonade-sdk/lemonade#612). Tracked in #1831, which stays open until upstream ships a backend that loads nomic-embed-text-v2-moe and we un-pin.

…bedding model

Lemonade v10.7.0 (bumped on main in #1571) ships llama.cpp build b9585, whose
llama-server crashes loading nomic-embed-text-v2-moe-GGUF — GAIA's default
embedding model for RAG, code-index, and agent memory. Loads return a generic
500 "llama-server failed to start" (matches the upstream GGML_ASSERT vocab crash
in ggml-org/llama.cpp#13534). Regular LLM GGUFs load fine on b9585, which is why
only the embedding CI jobs (Test RAG, Test Lemonade Embeddings) fail while API /
chat / code / unit jobs stay green — and why the break is environment-wide across
every branch, not any one PR.

#1571 changed only version.py, so pinning back to 10.2.0 (the proven
last-known-good — the last main run before the bump was green) is a clean, fully
reversible functional revert. The install-lemonade action reconciles the runner
down to the pinned version and wipes the stale backend cache automatically.

Also surface the llama-server child stderr on embedding load failure in
test_embeddings.yml and test_rag.yml. Lemonade swallows it behind the 500, so the
actual backend crash was invisible; the server job already runs with 2>&1, so a
Receive-Job dump in the load-failure path exposes the real assert and lets a
future bump confirm the model loads before landing.
@github-actions github-actions Bot added the devops DevOps/infrastructure changes label Jun 19, 2026
@github-actions

Contributor

Code Review — PR #1788

Summary

Solid, well-evidenced revert: pinning Lemonade back to 10.2.0 is the right call to unbreak the embedding-dependent CI jobs (and the real RAG/memory path for users on 10.7.0), and the Receive-Job stderr dump is a genuinely useful debuggability win that follows the existing pattern in these workflows. However, the PR as written will trade one red CI job for another: changing src/gaia/version.py triggers docs.yml, whose check_doc_versions.py gate fails because three doc files still hardcode 10.7.0. The "docs are out of scope" note in the description is the one thing to reconsider — for these specific files it isn't optional, it's a hard gate that this change activates.

Issues

🟡 Doc version check will fail CI on this PR (src/gaia/version.py:65)

.github/workflows/docs.yml runs python util/check_doc_versions.py as a non-tolerant step, and it triggers on any PR that touches src/gaia/version.py (it's in the workflow's paths:). The check reads LEMONADE_VERSION and flags any doc that references a different Lemonade version. With the pin now at 10.2.0, I ran it locally and it exits 1 with 8 mismatches across 3 files:

  • docs/cpp/setup.mdx — lines 83, 89, 179, 222
  • docs/guides/npu.mdx — line 15
  • cpp/README.md — lines 31, 36, 47

So the PR fixes the embedding jobs but turns the Docs workflow red. Since the whole point is green CI, this is blocking. Two ways to resolve, in order of preference:

  1. Update those three files to 10.2.0 in this PR. They describe the version users actually install, so they should track the pin — this keeps docs correct rather than just quieting the linter. It's a small, mechanical change (release URLs, the tag/v… links, the version-table cell, and the v10.7.0+ text).
  2. If you'd rather not churn the cpp/setup docs for a version you expect to re-bump soon, add them to EXCLUDE_PATTERNS in util/check_doc_versions.py — but that silences the guardrail, so I'd only do it with a comment pointing at the re-bump follow-up.

Either way it needs to happen in this PR, not a follow-up, or the docs job blocks the merge.
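For readers unfamiliar with the gate, the kind of check involved can be sketched like this. It is not the code in util/check_doc_versions.py; the regex, the "line must mention Lemonade" heuristic, and the function name are assumptions made for illustration.

```python
import re

# Assumed pattern: a Lemonade version appears as X.Y.Z, optionally prefixed with "v".
VERSION_RE = re.compile(r"\bv?(\d+\.\d+\.\d+)\b")


def find_mismatches(doc_text, pinned):
    """Return (line_number, version) for each Lemonade version that differs from the pin."""
    mismatches = []
    for lineno, line in enumerate(doc_text.splitlines(), start=1):
        if "lemonade" not in line.lower():
            continue  # only lines that mention Lemonade are checked
        for version in VERSION_RE.findall(line):
            if version != pinned:
                mismatches.append((lineno, version))
    return mismatches
```

A non-empty result would correspond to the non-zero exit that turns the Docs workflow red.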

🟢 reset-lemonade.ps1 default fallback still 10.7.0 (installer/scripts/reset-lemonade.ps1:70): if (-not $Version) { $Version = "10.7.0" } is the fallback when run outside a GAIA checkout; it now disagrees with the pin. You already called this out as follow-up and it's not a CI gate (the script prefers the parsed version.py value when available), so deferring is reasonable. Flagging only so it doesn't get lost.

🟢 version.py comment length (src/gaia/version.py:55-64) — CLAUDE.md leans toward one-line WHY comments, and this is a 10-line block. I'd leave it as-is: it's a deliberate "do not bump" guardrail aimed at the weekly lemonade-version-bump.yml reviewer, names the unblock condition, and the value clearly justifies the length. Noting it only for completeness, not asking for a change.

Strengths

  • Exemplary root-cause writeup — discriminator (LLM GGUFs load, only the embedding model fails), timeline (last green pre-bump run vs. first failure), mechanism (b9585 / upstream GGML_ASSERT), and linked failing runs. This is exactly how a revert should be justified.
  • The diagnostic dump is well-placed and correct — it reuses the $env:LEMONADE_JOB_ID set when the server job starts (test_embeddings.yml:74), lives in the same step so the var is in scope, mirrors the existing Receive-Job usage already in the file, and guards on the var before reading. Identical block in both workflows keeps them in sync.
  • No unit-test fallout: tests/unit/test_init_command.py reads LEMONADE_VERSION dynamically and its hardcoded "10.2.0" cases pre-date the 10.7.0 bump, so the revert actually re-aligns the suite rather than breaking it.

Verdict

Request changes — the change itself is correct and well-argued, but it can't ship green until the three 10.7.0 doc references are reconciled (update to 10.2.0, or exclude with a tracking comment). Everything else is approve-quality. Once the docs job passes, this is good to merge.

…d to load nomic embed

10.7.0 (llama.cpp b9585) is the proven-bad version; 10.2.0 the proven-good one.
The embedding-load regression entered the build range b8766..b9585, so 10.6.0
(b9253) is the most recent release likely to still load nomic-embed-text-v2-moe-GGUF
while retaining most of the 10.x feature set. The embedding CI jobs now surface the
llama-server stderr, so this pin is verified empirically rather than assumed — if
10.6.0 also fails, the logs pinpoint the bad build and we drop to the 10.2.0 floor.

Also drops the overclaimed link to llama.cpp#13534 (that crash was on build b5372,
long since fixed and far older than any 10.x backend).
@kovtcharov kovtcharov changed the title fix(ci): pin Lemonade to 10.2.0 — v10.7.0 backend can't load nomic embedding model fix(ci): pin Lemonade to 10.6.0 — v10.7.0 backend can't load nomic embedding model Jun 20, 2026
…doc version refs

CI on this PR confirmed 10.6.0 (llama.cpp b9253) hits the same embedding-load 500
as 10.7.0, so the regression entered the build range b8766..b9253 — earlier than
10.7's bumps. 10.2.0 is the proven last-known-good; pin there. The embedding jobs
on this PR now verify that empirically (green = 10.2.0 loads the model).

Sync the hardcoded Lemonade version refs in docs/cpp/setup.mdx, docs/guides/npu.mdx,
and cpp/README.md to match the pin — the Doc Version Consistency check (util/
check_doc_versions.py, [10/10] in Code Quality) enforces docs == LEMONADE_VERSION and
was failing on the 10.7.0 leftovers.
@github-actions github-actions Bot added documentation Documentation changes cpp labels Jun 20, 2026
…d-failing)

tests/test_lemonade_embeddings.py built a real LemonadeClient and called
embeddings() with no skip guard, so it hard-failed instead of skipping when no
Lemonade server was running locally. Gate the client fixture on require_lemonade
so it skips like every other real-server test; CI (test_embeddings.yml) still
exercises it against a live server.

Most Lemonade capabilities GAIA depends on were only mock-tested, and
mocks prove "we called it", not "the call is valid" — they cannot catch
contract bugs like #1655, where the model-pull sent recipe= for a
built-in model and Lemonade 400s only on a real request.

Adds tests/test_lemonade_client_integration.py: require_lemonade-gated
real-server tests covering chat/completions (streaming + non-streaming),
text completions, the #1655 built-in-pull contract (asserts the outgoing
/pull payload carries no recipe), load/_ensure_model_loaded with explicit
ctx, scoped/global/no-op unload, non-destructive delete error path,
catalog + health/status/system-info/stats introspection, tool-calling
round-trip + is_tool_calling_model mapping, an embeddings smoke check,
and auth-header behavior. All tests SKIP cleanly when no server is up.

Adds .github/workflows/test_lemonade_client.yml to run the suite against
a live server (Qwen3-0.6B-GGUF pulled/loaded, GGML_VK_DISABLE_COOPMAT
workaround, port 13305), mirroring test_embeddings.yml.
…failure

The nomic embedding load 500 reproduces on every Lemonade version (10.2/10.6/10.7)
while LLMs load fine and the upstream GGUF is unchanged since 2025 — pointing at a
stale/corrupt cached GGUF in the HF hub cache (never wiped by version reconcile),
not a server-version regression. Wipe the cached nomic dir before pull to force a
clean re-download, and on load failure dump the cached file sizes + tail Lemonade's
log files (the prior Receive-Job dump came back empty — Lemonade logs to a file,
not the serve job's stdout). Decisive: clean re-download loads => corrupt cache;
still fails => runner env/llama.cpp issue. gaia#941.
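The two diagnostics named in this commit message (cached file sizes and a tail of the log files) could look roughly like the sketch below. The helper names are hypothetical and the cache and log locations are deliberately left as parameters, since this page does not state them.

```python
from pathlib import Path


def cached_file_sizes(cache_dir):
    """Return {relative_path: size_in_bytes} for every file under cache_dir."""
    root = Path(cache_dir)
    if not root.is_dir():
        return {}
    return {
        p.relative_to(root).as_posix(): p.stat().st_size
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def tail_log(path, max_lines=50):
    """Return the last max_lines lines of a log file, tolerating bad bytes."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return text.splitlines()[-max_lines:]
```

A truncated or zero-byte GGUF would show up directly in the size map, which is the signal the "corrupt cache" hypothesis needs.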
@github-actions github-actions Bot added the tests Test changes label Jun 21, 2026
… MoE crash

Root cause (gaia#941): the nomic-embed-text-v2-moe (MoE) model crashes llama-server
on the stx runners' Vulkan backend. Proven not a version/model-file/cache issue —
it loads + runs fine on Metal locally (Lemonade 10.3.0) and LLMs load on Vulkan
here, yet the embedder fails identically across Lemonade 10.2/10.6/10.7. Lemonade
launches the embedder with -ngl 99 (full GPU offload), building the crashing Vulkan
MoE kernels; loading with llamacpp_args='-ngl 0' keeps it on CPU (embeddings are
cheap) and sidesteps the broken path. Also waits for the async pull to finish
downloading before load. If this run goes green, root cause + fix are confirmed.
@github-actions

Contributor

🟡 .github/workflows/test_lemonade_client.yml:277-288 — embedding model pulled but not loaded with -ngl 0, so TestLemonadeEmbeddingsSmoke.test_embeddings_768_dim will likely trigger the same Vulkan MoE crash that was fixed in test_embeddings.yml.

The embeddings workflow was specifically updated (in a prior commit on this PR) to:

  1. Poll until the GGUF file is fully downloaded before proceeding.
  2. Call /api/v1/load with llamacpp_args = "-ngl 0" to keep every layer on CPU (no GPU offload) and dodge the crash.

This workflow does neither — it fires-and-forgets the pull, waits a static 30 s, then runs the suite. When client.embeddings(model=EMBED_MODEL) triggers an auto-load, Lemonade will use the default -ngl 99, building the crashing Vulkan MoE kernels.

Minimal fix: mirror the embeddings workflow's pull-wait loop and load step here, or skip the 768-dim smoke test in this workflow entirely and rely on test_embeddings.yml for full coverage.
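As a sketch of the difference between a static 30 s wait and a pull-wait loop, a generic polling helper might look like this. The helper name and its parameters are hypothetical; the probe is whatever condition the workflow checks (for example, "the GGUF file exists and its size has stopped growing").

```python
import time


def wait_until(probe, timeout_s=600.0, interval_s=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True  # condition met; safe to proceed to the load step
        if clock() >= deadline:
            return False  # give up and let the caller fail loudly
        sleep(interval_s)
```

Unlike a fixed sleep, this returns as soon as the download is done and fails explicitly when it never finishes.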

…tch) on CPU

A prior cache-inspect step deleted the runner's pre-seeded nomic GGUF, and the bare
REST /pull is registration-only on these runners (won't re-download), so the load
test couldn't run. Switch to LemonadeClient.load_model(auto_download=True,
llamacpp_args='-ngl 0'): a real HF download that restores the model AND exercises
GAIA's actual load path, loading the MoE embedder on CPU to dodge the Vulkan crash
(gaia#941).
@github-actions

Contributor

🟡 test_lemonade_client.yml:265-270 — embedding model pulled but not loaded before smoke test

The workflow calls /api/v1/pull for nomic-embed-text-v2-moe-GGUF but never calls /api/v1/load. TestLemonadeEmbeddingsSmoke.test_embeddings_768_dim (line 879) then calls client.embeddings(..., timeout=120) expecting the server to serve it. The 4× timeout vs. test_lemonade_embeddings.py's timeout=30 suggests the author expected an auto-load, but the other two CI workflows that exercise this model (test_embeddings.yml, test_rag.yml) both explicitly load it first — test_embeddings.yml even uses LemonadeClient.load_model(..., llamacpp_args='-ngl 0') to do so. The test docstring also says "pulled/loaded" when the workflow only pulls.

If Lemonade 10.2.0 doesn't auto-load on embeddings requests (which is the pattern this repo has always assumed), test_embeddings_768_dim will fail every CI run of the new workflow.

Fix: add a python -c "... c.load_model('nomic-embed-text-v2-moe-GGUF', auto_download=True, prompt=False, timeout=600)" step (matching the pattern already in test_embeddings.yml) immediately after the pull step, and correct the docstring to say "pulled and loaded".

The -ngl 0 CPU workaround in test_embeddings.yml never worked: Lemonade
reserves -ngl as a managed argument and 400s any custom override, so the
embeddings job failed on argument validation rather than loading at all.
It was also built on a false premise -- the 06-17 main run proved
nomic-embed-text-v2-moe loads on device=gpu under 10.2.0 and passes all
six embedding tests, so no CPU fallback is needed (and CPU would only
degrade real-user embedding throughput).

Revert test_embeddings.yml / test_rag.yml / test_lemonade_embeddings.py
to the proven-green GPU-load path and drop the unrelated real-server
client integration suite. The fix is now solely the 10.2.0 pin (already
in version.py) plus the doc version syncs.
@kovtcharov-amd kovtcharov-amd changed the title fix(ci): pin Lemonade to 10.6.0 — v10.7.0 backend can't load nomic embedding model fix(ci): pin Lemonade to 10.2.0 — v10.7.0 backend can't load nomic embedding model Jun 23, 2026
@github-actions

Contributor

🟡 The weekly lemonade-version-bump.yml workflow will automatically open a bump PR to 10.7.0 (or newer) every Thursday — since the new pin (10.2.0) looks "behind" the latest release, Guard 1 in that workflow passes. If a maintainer merges that auto-bump PR without testing embeddings, the regression silently comes back. The comment in version.py won't stop the workflow. A guard needs to be added to the workflow (or the workflow temporarily disabled) until the regression is confirmed fixed.

Also: docs/guides/npu.mdx now reads v10.2.0+, which implies any version ≥ 10.2.0 works — but 10.6.0 and 10.7.0 break RAG embeddings. A user who manually downloads "the latest" Lemonade after reading that page will hit the regression.

🔍 Technical details

Auto-bump problem: .github/workflows/lemonade-version-bump.yml:80-88:

newest=$(printf '%s\n%s\n' "$current" "$target" | sort -V | tail -1)
if [ "$target" = "$current" ] || [ "$newest" = "$current" ]; then
  echo "Already on the latest …"
  exit 0
fi

With current=10.2.0 and latest=10.7.0 (or 10.8.0+ over time), Guard 1 will always pass and open a fresh bump PR for each new Lemonade release. Guard 3 stops re-opening for the same version, but not for a newer one. Add a SKIP_LEMONADE_BUMP environment variable or a sentinel in version.py that the workflow checks before proceeding, so the regression hold is machine-enforced rather than relying on reviewers reading a Python comment.
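The guard logic plus the proposed hold can be sketched in Python as follows. This mirrors the shell comparison above (proceed only when the target is strictly newer than the current pin) and adds a hold flag; the function names and the `hold` parameter are hypothetical, standing in for whatever sentinel or environment variable the workflow would actually read.

```python
def parse_version(text):
    """Turn '10.2.0' into (10, 2, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def should_open_bump_pr(current, target, hold=False):
    """Mirror Guard 1, plus a hypothetical hold flag that blocks any bump."""
    if hold:
        return False  # regression hold enforced by the workflow, not by a comment
    return parse_version(target) > parse_version(current)


print(should_open_bump_pr("10.2.0", "10.7.0"))             # today: a bump PR opens
print(should_open_bump_pr("10.2.0", "10.7.0", hold=True))  # with the hold: it does not
```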

npu.mdx: docs/guides/npu.mdx:15: change v10.2.0+ to v10.2.0 (exact) and add a note that versions 10.6.0–10.7.0 have a known embedding regression (tracked in #941). Otherwise users following the guide will install the latest broken release.

The embedding regression (#1571) shipped because bumping LEMONADE_VERSION
in src/gaia/version.py changed the Lemonade backend but triggered none of
the embedding/RAG/lemonade-server workflows -- their path filters watched
src/gaia/llm/**, setup.py, and the test files, but not version.py, the
single source of truth for the backend version. Add version.py to all
three path filters so a future bump (or this pin) actually runs the tests
it affects.
@github-actions

Contributor

🟡 The 9-line comment block above LEMONADE_VERSION in version.py violates the project's "one short line max" comment rule — CLAUDE.md explicitly calls this pattern out as the anti-pattern to avoid. The history and reasoning belong in the PR description and commit message (where they already live), not inline in the source.

Condense to one line that names the constraint and the tracker:

# Pinned at 10.2.0: newer builds crash loading nomic-embed-text-v2-moe-GGUF. See #941.
LEMONADE_VERSION = "10.2.0"
🔍 Technical details

src/gaia/version.py:11-22 — the added block is 9 lines of historical narrative. CLAUDE.md says:

Keep WHY comments to one short line. Multi-paragraph "history of how we got here" blocks are noise — the diff, commit message, and linked issue carry the history.

and

Never write multi-paragraph docstrings or multi-line comment blocks — one short line max.

The canonical CLAUDE.md bad-example is nearly identical in shape (multi-line # Pre-#1030 follow-up… block). The suggested single-line replacement above names the invariant (don't bump past 10.2.0), the symptom (crash on the specific model), and where to find the full context (#941) — everything a future author needs without the wall of text.

Ovtcharov added 2 commits June 23, 2026 09:40
…oad failure

The 10.2.0 pin did not green CI: nomic-embed-text-v2-moe still fails with
the generic 'llama-server failed to start' 500, which hides the backend's
own stderr. Add a diagnostic block to the load-failure path that prints the
actual Lemonade/llama-server logs, every llama-server binary found (path +
size + embedded version), and the lemonade cache/install dir layout -- so we
can tell whether a stale newer backend survived the MSI downgrade or a fresh
10.2.0 backend itself crashes on the runner. No behavior change on success.

The reconcile-gated backend wipe only checked <root>\.cache\lemonade\bin
and silently found nothing on the strix runners, so an MSI version change
swapped LemonadeServer.exe but left the previous backend (the actual
llama-server binary) in place -- a downgraded 10.2.0 server then spawned the
stale newer backend and the load 500'd with 'llama-server failed to start'.

Make the wipe path-agnostic: search every known cache and install-tree root
(.cache\lemonade and AppData\Local\lemonade_server, SYSTEM-profile variants
included), log each llama-server binary found with its version, and remove
only the 'llamacpp' backend dirs -- never the MSI's own server exe. Stays
reconcile-gated, so a version-matched runner does nothing. Models live in the
separate HF hub cache and are untouched (no model re-download).
@kovtcharov

Collaborator Author

Closing — the premise (pin to 10.2.0 because v10.7.0 can't load the nomic embedding model) is invalidated by deeper investigation:

Superseded by #1876 (same reason the equivalent revert PR #1872 was closed).

@kovtcharov kovtcharov closed this Jun 26, 2026
Labels

cpp devops DevOps/infrastructure changes documentation Documentation changes tests Test changes
