
feat(perf): content-keyed embedding cache to skip redundant per-turn embeds #1748

Open
github-actions[bot] wants to merge 2 commits into main from autofix/issue-1743


Conversation

@github-actions

Contributor

Every chat turn re-embedded the query from scratch, so identical text — the same recall(query=…) across turns, or hybrid search re-embedding input a tool call already embedded that turn — paid the Lemonade embed cost twice, adding latency and avoidable backend calls. This adds a content-keyed LRU cache so an identical embed is served from memory and makes zero backend calls.

The cache key is the content — (model_id, dim, sha256(text)) — so a hit is never stale and swapping the embedding model invalidates by construction. It's wired into the two per-turn embed sites (MemoryMixin._embed_text and RAG query encoding); stored memories and doc chunks already persist their vectors, so this targets repeated query embeds only and leaves indexing untouched.
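A minimal sketch of the keying scheme described above, not the code in this PR. The class name `EmbeddingCache`, the method `get_or_embed`, the `max_entries` default, and the `embed_fn` callback are hypothetical names chosen for illustration; only the key shape `(model_id, dim, sha256(text))` and the LRU policy come from the description.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, List, Tuple

# Key shape taken from the PR description: (model_id, dim, sha256(text)).
CacheKey = Tuple[str, int, str]


class EmbeddingCache:
    """Hypothetical content-keyed LRU cache for query embeddings."""

    def __init__(self, max_entries: int = 256) -> None:
        # max_entries is an assumed default, not a value from the PR.
        self._max_entries = max_entries
        self._entries: "OrderedDict[CacheKey, List[float]]" = OrderedDict()

    @staticmethod
    def make_key(model_id: str, dim: int, text: str) -> CacheKey:
        # Hash the text so the key stays small regardless of input length.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return (model_id, dim, digest)

    def get_or_embed(
        self,
        model_id: str,
        dim: int,
        text: str,
        embed_fn: Callable[[str], List[float]],
    ) -> List[float]:
        key = self.make_key(model_id, dim, text)
        if key in self._entries:
            # Hit: mark as most recently used and skip the backend entirely.
            self._entries.move_to_end(key)
            return self._entries[key]
        # Miss: call the backend once, then store the vector under the key.
        vector = embed_fn(text)
        self._entries[key] = vector
        if len(self._entries) > self._max_entries:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return vector
```

Because `model_id` and `dim` are part of the key, a different model or dimension simply misses; no explicit invalidation step is needed in this sketch.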

Note: this does not by itself fix the NPU load loop (#1746) — a genuinely new query still embeds once.

Closes #1743

Test plan

  • python -m pytest tests/unit/test_embedding_cache.py tests/unit/test_memory_mixin.py tests/unit/rag/ -q passes
  • python util/lint.py --all passes (Black, isort, Pylint, Flake8 clean on the changed files)
  • New tests assert a second identical embed makes zero backend calls and that a model/dim change invalidates the entry

…embeds

Adds an LRU cache keyed on (model_id, dim, sha256(text)) so identical query
text re-embedded across turns pays the Lemonade embed cost once. The key is
the content, so a hit is never stale and a model swap invalidates by
construction. Wired into MemoryMixin._embed_text and RAGSDK query encoding;
stored memories and doc chunks already persist their vectors, so this targets
repeated query embeds only.

Closes #1743
github-actions bot added the agents, llm (LLM backend changes), performance (Performance-critical changes), rag (RAG system changes), and tests (Test changes) labels on Jun 23, 2026

Labels

agents, llm (LLM backend changes), performance (Performance-critical changes), rag (RAG system changes), tests (Test changes)


Development

Successfully merging this pull request may close these issues.

feat(perf): content-keyed embedding cache to skip redundant per-turn embeds

1 participant