fix(ci): pin Lemonade to 10.2.0 — v10.7.0 backend can't load nomic embedding model#1788
kovtcharov wants to merge 12 commits into
…bedding model
Lemonade v10.7.0 (bumped on main in #1571) ships llama.cpp build b9585, whose llama-server crashes loading nomic-embed-text-v2-moe-GGUF, GAIA's default embedding model for RAG, code-index, and agent memory. Loads return a generic 500 "llama-server failed to start" (matches the upstream GGML_ASSERT vocab crash in ggml-org/llama.cpp#13534). Regular LLM GGUFs load fine on b9585, which is why only the embedding CI jobs (Test RAG, Test Lemonade Embeddings) fail while API / chat / code / unit jobs stay green, and why the break is environment-wide across every branch, not any one PR.
#1571 changed only version.py, so pinning back to 10.2.0 (the proven last-known-good: the last main run before the bump was green) is a clean, fully reversible functional revert. The install-lemonade action reconciles the runner down to the pinned version and wipes the stale backend cache automatically.
Also surface the llama-server child stderr on embedding load failure in test_embeddings.yml and test_rag.yml. Lemonade swallows it behind the 500, so the actual backend crash was invisible; the server job already runs with 2>&1, so a Receive-Job dump in the load-failure path exposes the real assert and lets a future bump confirm the model loads before landing.
Code Review: PR #1788
Summary
Solid, well-evidenced revert: pinning Lemonade back to 10.2.0 is the right call to unbreak the embedding-dependent CI jobs (and the real RAG/memory path for users on 10.7.0), and the
Issues
🟡 Doc version check will fail CI on this PR (
So the PR fixes the embedding jobs but turns the Docs workflow red. Since the whole point is green CI, this is blocking. Two ways to resolve, in order of preference:
Either way it needs to happen in this PR, not a follow-up, or the docs job blocks the merge.
🟢 🟢 Strengths
Verdict
Request changes: the change itself is correct and well-argued, but it can't ship green until the three
…d to load nomic embed
10.7.0 (llama.cpp b9585) is the proven-bad version; 10.2.0 the proven-good one. The embedding-load regression entered the build range b8766..b9585, so 10.6.0 (b9253) is the most recent release likely to still load nomic-embed-text-v2-moe-GGUF while retaining most of the 10.x feature set. The embedding CI jobs now surface the llama-server stderr, so this pin is verified empirically rather than assumed: if 10.6.0 also fails, the logs pinpoint the bad build and we drop to the 10.2.0 floor.
Also drops the overclaimed link to llama.cpp#13534 (that crash was on build b5372, long since fixed and far older than any 10.x backend).
…doc version refs
CI on this PR confirmed 10.6.0 (llama.cpp b9253) hits the same embedding-load 500 as 10.7.0, so the regression entered the build range b8766..b9253, earlier than 10.7's bumps. 10.2.0 is the proven last-known-good; pin there. The embedding jobs on this PR now verify that empirically (green = 10.2.0 loads the model).
Sync the hardcoded Lemonade version refs in docs/cpp/setup.mdx, docs/guides/npu.mdx, and cpp/README.md to match the pin. The Doc Version Consistency check (util/check_doc_versions.py, [10/10] in Code Quality) enforces docs == LEMONADE_VERSION and was failing on the 10.7.0 leftovers.
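The docs == LEMONADE_VERSION rule can be sketched roughly as follows. This is an illustration only, not the contents of util/check_doc_versions.py: the regex, the function name `find_version_mismatches`, and the sample document text are all assumptions.

```python
import re

# Assumed pattern: "Lemonade", optional "v", then a dotted x.y.z number.
# The real checker may match version references differently.
VERSION_REF = re.compile(r"Lemonade\s+v?(\d+\.\d+\.\d+)", re.IGNORECASE)


def find_version_mismatches(docs, pinned):
    """Return (doc_name, line_number, found_version) for every stale reference."""
    mismatches = []
    for name, text in docs.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in VERSION_REF.finditer(line):
                if match.group(1) != pinned:
                    mismatches.append((name, lineno, match.group(1)))
    return mismatches


if __name__ == "__main__":
    pinned = "10.2.0"  # the LEMONADE_VERSION value this PR pins
    docs = {
        # Hypothetical file contents, for illustration only.
        "docs/cpp/setup.mdx": "Install Lemonade v10.7.0 first.\n",
        "cpp/README.md": "Requires Lemonade 10.2.0.\n",
    }
    for name, lineno, found in find_version_mismatches(docs, pinned):
        print(f"{name}:{lineno}: found {found}, expected {pinned}")
```

A check of this shape fails the build whenever the returned list is non-empty, which is why the docs have to be synced in the same change as the pin.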
…d-failing)
tests/test_lemonade_embeddings.py built a real LemonadeClient and called embeddings() with no skip guard, so it hard-failed instead of skipping when no Lemonade server was running locally. Gate the client fixture on require_lemonade so it skips like every other real-server test; CI (test_embeddings.yml) still exercises it against a live server.
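A skip guard of the kind described might look like the sketch below. It is a hypothetical stand-in, not the repo's require_lemonade: the TCP probe, the names `server_reachable` and `require_server`, and the default port are assumptions.

```python
import socket
import unittest


def server_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def require_server(host="127.0.0.1", port=13305, probe=server_reachable):
    """Raise unittest.SkipTest when nothing is listening on host:port.

    Hypothetical guard: the default port is an assumption, and probe is
    injectable so the guard itself can be tested without a network.
    """
    if not probe(host, port):
        raise unittest.SkipTest(f"no server listening on {host}:{port}")


class EmbeddingSmokeTest(unittest.TestCase):
    def setUp(self):
        # Skips (rather than fails) when no live server is available.
        require_server()

    def test_placeholder(self):
        # A real test would build the client here and call embeddings().
        self.assertTrue(True)
```

Raising the skip from the shared setup step means every test that depends on the client inherits the same behaviour, so a missing server shows up as skipped tests instead of connection errors.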
Most Lemonade capabilities GAIA depends on were only mock-tested, and mocks prove "we called it", not "the call is valid" — they cannot catch contract bugs like #1655, where the model-pull sent recipe= for a built-in model and Lemonade 400s only on a real request. Adds tests/test_lemonade_client_integration.py: require_lemonade-gated real-server tests covering chat/completions (streaming + non-streaming), text completions, the #1655 built-in-pull contract (asserts the outgoing /pull payload carries no recipe), load/_ensure_model_loaded with explicit ctx, scoped/global/no-op unload, non-destructive delete error path, catalog + health/status/system-info/stats introspection, tool-calling round-trip + is_tool_calling_model mapping, an embeddings smoke check, and auth-header behavior. All tests SKIP cleanly when no server is up. Adds .github/workflows/test_lemonade_client.yml to run the suite against a live server (Qwen3-0.6B-GGUF pulled/loaded, GGML_VK_DISABLE_COOPMAT workaround, port 13305), mirroring test_embeddings.yml.
…failure
The nomic embedding load 500 reproduces on every Lemonade version (10.2/10.6/10.7) while LLMs load fine and the upstream GGUF is unchanged since 2025, pointing at a stale/corrupt cached GGUF in the HF hub cache (never wiped by version reconcile), not a server-version regression. Wipe the cached nomic dir before pull to force a clean re-download, and on load failure dump the cached file sizes and tail Lemonade's log files (the prior Receive-Job dump came back empty because Lemonade logs to a file, not the serve job's stdout). Decisive: clean re-download loads => corrupt cache; still fails => runner env/llama.cpp issue. gaia#941.
… MoE crash
Root cause (gaia#941): the nomic-embed-text-v2-moe (MoE) model crashes llama-server on the stx runners' Vulkan backend. Proven not a version/model-file/cache issue: it loads and runs fine on Metal locally (Lemonade 10.3.0) and LLMs load on Vulkan here, yet the embedder fails identically across Lemonade 10.2/10.6/10.7. Lemonade launches the embedder with -ngl 99 (full GPU offload), building the crashing Vulkan MoE kernels; loading with llamacpp_args='-ngl 0' keeps it on CPU (embeddings are cheap) and sidesteps the broken path. Also waits for the async pull to finish downloading before load. If this run goes green, root cause + fix are confirmed.
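Waiting for an asynchronous pull to finish, rather than sleeping for a fixed interval, can be sketched as a poll-with-deadline loop. The helper name `wait_until` and the `is_downloaded` predicate are hypothetical; how download completion is actually detected in the workflow is not shown here.

```python
import time


def wait_until(predicate, timeout_s=600.0, interval_s=5.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns True or timeout_s elapses.

    Returns True on success and False on timeout. clock and sleep are
    injectable so the loop can be tested without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Hypothetical predicate: pretend the download completes on the third poll.
    polls = {"count": 0}

    def is_downloaded():
        polls["count"] += 1
        return polls["count"] >= 3

    ok = wait_until(is_downloaded, timeout_s=10.0, interval_s=0.01)
    print("model ready" if ok else "timed out waiting for download")
```

Compared with a static wait, the loop returns as soon as the model is present and fails with a clear timeout when it never arrives, so a slow download is not mistaken for a load failure.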
🟡 The embeddings workflow was specifically updated (in a prior commit on this PR) to:
This workflow does neither: it fires-and-forgets the pull, waits a static 30 s, then runs the suite. When
Minimal fix: mirror the embeddings workflow's pull-wait loop and load step here, or skip the 768-dim smoke test in this workflow entirely and rely on
…tch) on CPU
A prior cache-inspect step deleted the runner's pre-seeded nomic GGUF, and the bare REST /pull is registration-only on these runners (won't re-download), so the load test couldn't run. Switch to LemonadeClient.load_model(auto_download=True, llamacpp_args='-ngl 0'): a real HF download that restores the model AND exercises GAIA's actual load path, loading the MoE embedder on CPU to dodge the Vulkan crash (gaia#941).
🟡 The workflow calls
If Lemonade 10.2.0 doesn't auto-load on embeddings requests (which is the pattern this repo has always assumed),
Fix: add a
The -ngl 0 CPU workaround in test_embeddings.yml never worked: Lemonade reserves -ngl as a managed argument and 400s any custom override, so the embeddings job failed on argument validation rather than loading at all. It was also built on a false premise -- the 06-17 main run proved nomic-embed-text-v2-moe loads on device=gpu under 10.2.0 and passes all six embedding tests, so no CPU fallback is needed (and CPU would only degrade real-user embedding throughput). Revert test_embeddings.yml / test_rag.yml / test_lemonade_embeddings.py to the proven-green GPU-load path and drop the unrelated real-server client integration suite. The fix is now solely the 10.2.0 pin (already in version.py) plus the doc version syncs.
🟡 The weekly
Also:
🔍 Technical details
Auto-bump problem: With
npu.mdx:
The embedding regression (#1571) shipped because bumping LEMONADE_VERSION in src/gaia/version.py changed the Lemonade backend but triggered none of the embedding/RAG/lemonade-server workflows -- their path filters watched src/gaia/llm/**, setup.py, and the test files, but not version.py, the single source of truth for the backend version. Add version.py to all three path filters so a future bump (or this pin) actually runs the tests it affects.
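The gap can be illustrated by checking which changed files a set of path filters would match. This sketch approximates glob matching with Python's `fnmatchcase`, whose semantics differ from GitHub Actions path filters in edge cases; the filter lists are abbreviated from the message above, and the `tests/**` entry is a placeholder for "the test files".

```python
from fnmatch import fnmatchcase

# Filters as described in the commit message (abbreviated; exact form assumed).
OLD_FILTERS = ["src/gaia/llm/**", "setup.py", "tests/**"]
NEW_FILTERS = OLD_FILTERS + ["src/gaia/version.py"]


def workflow_triggers(changed_files, filters):
    """Return True if any changed file matches any path filter."""
    return any(fnmatchcase(path, pattern)
               for path in changed_files
               for pattern in filters)


if __name__ == "__main__":
    bump = ["src/gaia/version.py"]  # the only file a version bump touches
    print("old filters trigger:", workflow_triggers(bump, OLD_FILTERS))
    print("new filters trigger:", workflow_triggers(bump, NEW_FILTERS))
```

Under the old list, a change that touches only version.py matches nothing, so none of the embedding workflows run, which is the failure mode described for #1571.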
🟡 The 9-line comment block above
Condense to one line that names the constraint and the tracker:
# Pinned at 10.2.0: newer builds crash loading nomic-embed-text-v2-moe-GGUF. See #941.
LEMONADE_VERSION = "10.2.0"
🔍 Technical details
and
The canonical CLAUDE.md bad-example is nearly identical in shape (multi-line
…oad failure
The 10.2.0 pin did not green CI: nomic-embed-text-v2-moe still fails with the generic 'llama-server failed to start' 500, which hides the backend's own stderr. Add a diagnostic block to the load-failure path that prints the actual Lemonade/llama-server logs, every llama-server binary found (path + size + embedded version), and the lemonade cache/install dir layout, so we can tell whether a stale newer backend survived the MSI downgrade or a fresh 10.2.0 backend itself crashes on the runner. No behavior change on success.
The reconcile-gated backend wipe only checked <root>\.cache\lemonade\bin and silently found nothing on the strix runners, so an MSI version change swapped LemonadeServer.exe but left the previous backend (the actual llama-server binary) in place -- a downgraded 10.2.0 server then spawned the stale newer backend and the load 500'd with 'llama-server failed to start'. Make the wipe path-agnostic: search every known cache and install-tree root (.cache\lemonade and AppData\Local\lemonade_server, SYSTEM-profile variants included), log each llama-server binary found with its version, and remove only the 'llamacpp' backend dirs -- never the MSI's own server exe. Stays reconcile-gated, so a version-matched runner does nothing. Models live in the separate HF hub cache and are untouched (no model re-download).
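The path-agnostic search can be sketched as below. It is written in Python for illustration only (the action's actual implementation is not shown here), it is a dry run that only reports directories, and the function name, the root list passed in, and the assumption that backend directories are literally named `llamacpp` are all hypothetical.

```python
from pathlib import Path


def find_backend_dirs(roots, backend_dir_name="llamacpp",
                      binary_stem="llama-server"):
    """Return sorted backend dirs under roots that contain a llama-server binary.

    Only directories named backend_dir_name are reported, so a caller that
    deletes the results never touches the installer's own server executable.
    """
    found = set()
    for root in roots:
        root = Path(root)
        if not root.is_dir():
            continue  # a root missing on this runner is not an error
        for binary in root.rglob(binary_stem + "*"):
            if not binary.is_file():
                continue
            # Walk up from the binary to the enclosing backend directory.
            for parent in binary.parents:
                if parent.name == backend_dir_name:
                    print(f"found {binary} ({binary.stat().st_size} bytes)")
                    found.add(parent)
                    break
                if parent == root:
                    break
    return sorted(found)
```

A caller would remove the returned directories only when the reconcile step has detected a version change, which keeps a version-matched runner free of any per-run cost.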
Closing — the premise (pin to 10.2.0 because v10.7.0 can't load the nomic embedding model) is invalidated by deeper investigation:
Superseded by #1876 (same reason the equivalent revert PR #1872 was closed).
Why this matters
Every embedding-dependent CI job, and every real user's RAG / code-index / agent-memory embedding path, broke on main when Lemonade was bumped to 10.7.0 (#1571, 2026-06-17). v10.7.0's bundled llama.cpp (build b9585) crashes loading nomic-embed-text-v2-moe-GGUF with a generic 500 model_load_error: llama-server failed to start. Regular LLM GGUFs load fine, which is why only the embedding jobs went red.
This pins Lemonade back to 10.2.0 (the proven last-known-good; the 06-17 main run loaded the embedder on device=gpu and passed all six tests) and fixes the deeper reason a version pin alone wasn't enough on the runners.
The real root cause: the backend doesn't follow the version
A Lemonade "version" is two things: the MSI LemonadeServer.exe and the separately-downloaded llama.cpp backend (the actual llama-server binary). The CI reconcile only swapped the MSI exe; it left the previous backend in place. So a downgraded 10.2.0 server kept spawning the stale b9585 backend and hit the exact same crash. The existing backend-wipe only looked at ~/.cache/lemonade/bin and silently found nothing on the strix (SYSTEM-account) runners, where the backend actually lives elsewhere.
The fix makes the reconcile-gated wipe path-agnostic: it searches every known cache and install-tree root, logs each llama-server binary found (with its version, so the layout is visible even on success), and removes only the llamacpp backend dirs, never the MSI's own server exe. A version-matched runner skips it entirely (no per-run cost, no model re-download; GGUF weights live in the separate HF hub cache).
Also: version changes now actually run the embedding tests
The embedding/RAG/lemonade-server workflows watched src/gaia/llm/**, setup.py, etc., but not src/gaia/version.py, the single source of truth for the backend version. That's why #1571's bump shipped a broken embedder with zero embedding tests firing. version.py is now in all three path filters, so any future bump runs the tests it affects.
Why 10.2.0 (not 10.6.0)
CI bisection: 10.2.0 (pre-b8766) loads the model; 10.6.0 (b9253) and 10.7.0 (b9585) both crash. version.py carries a loud "do not bump past this until verified" note.
Test plan
stx: loads nomic-embed-text-v2-moe-GGUF on device=gpu, 6 tests pass
llamacpp dir wipe (or a clean "fetch on first load")
Notes for reviewers
version.py says so.
embeddinggemma-300M-GGUF or curated Qwen3-Embedding) so Lemonade can move forward; needs a RAG eval. Tracked alongside [Bug]: macOS installer (darwin-arm64) fails on first launch - bundled uv never shipped for mac-arm64 #941.
Tracking
Workaround only: this pins/reverts the Lemonade backend to dodge the crash. The root cause is a llama.cpp/llama-server bug (upstream lemonade-sdk/lemonade#612). Tracked in #1831, which stays open until upstream ships a backend that loads nomic-embed-text-v2-moe and we un-pin.