
feat(agent-email): add GET /v1/email/init readiness preflight (#1795)#1813

Open: kovtcharov wants to merge 23 commits into main from claudia/task-4a1065f9

Conversation

@kovtcharov

Collaborator

Why this matters

A fresh host passed /health and /version green even with Lemonade down and no model downloaded — then 502'd on the very first triage call. Every readiness signal said "ready" while the stack couldn't actually triage. GET /v1/email/init closes that trap: it probes the whole triage stack (Lemonade reachable + the triage model downloaded) and returns a structured status — 200 when ready, 503 when not, with an actionable hint — so an integrator (and the npm package's startSidecar) can verify "ready to triage," not just "process up."

// GET /v1/email/init  → 200 ready / 503 not ready
{ "ready": false,
  "lemonade": { "reachable": true, "base_url": "http://localhost:8000/api/v1" },
  "model": { "id": "Gemma-4-E4B-it-GGUF", "present": false, "loadable": null },
  "hint": "Model `Gemma-4-E4B-it-GGUF` not downloaded — run `gaia init` (or pull it via Lemonade), then retry." }

Read-only by design: probes only, no model pull or provisioning (a deferred follow-up). loadable is null in v1 — forcing a load is heavy, so present (a cheap model-list lookup) is the readiness signal. The reachability probe reuses the existing short-timeout /health logic (#1677) via a shared base-URL resolver, so "Lemonade down" fails fast instead of hanging on the OS SYN timeout. Failures are loud: even when Lemonade answers /health but its model list can't be read, the endpoint returns 503 with a specific hint rather than silently reporting "absent."
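For integrators, the 200/503 contract above can be folded into a one-line verdict. A minimal sketch, assuming the response body has the shape shown in the sample; `summarize_readiness` is a hypothetical helper, not part of the sidecar or the npm client, and the hint text in the sample is shortened for illustration.

```python
import json
from typing import Any


def summarize_readiness(status_code: int, body: dict[str, Any]) -> str:
    """Turn a GET /v1/email/init response into a one-line verdict (hypothetical helper)."""
    if status_code == 200 and body.get("ready") is True:
        return "ready"
    if status_code == 503:
        # The hint is meant to be actionable, so surface it verbatim.
        return "not ready: " + body.get("hint", "no hint provided")
    # Anything else (e.g. 404 from a sidecar that predates the endpoint)
    # is not a readiness answer at all.
    return f"unexpected status {status_code}"


# Sample body modelled on the example above (hint shortened for illustration).
sample = json.loads(
    '{"ready": false,'
    ' "lemonade": {"reachable": true, "base_url": "http://localhost:8000/api/v1"},'
    ' "model": {"id": "Gemma-4-E4B-it-GGUF", "present": false, "loadable": null},'
    ' "hint": "Model not downloaded"}'
)
print(summarize_readiness(503, sample))
```

Treating any status other than 200/503 as "unexpected" keeps an older sidecar from being misread as "not ready."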

Scope: Python sidecar + its tests/docs only — the npm client wrapper and playground are handled separately.

Test plan

  • pytest tests/unit/agents/email/test_init_endpoint.py — 16 tests: probe call-shape at the boundary (URL suffix, short timeout, auth header), Lemonade-down → 503 + hint, model-missing → 503 + hint, model-list-unreadable → 503 + hint, ready → 200, sidecar mount via packaging/server.py build_app().
  • pytest tests/test_email_openapi_conformance.py hub/agents/python/email/tests/test_rest_contract.py — running-server conformance (200/503) + committed openapi.email.json is drift-free.
  • pytest tests/unit/agents/email/test_spec_html.py — runtime /v1/email/spec page documents the new endpoint.
  • python -m gaia_agent_email.export_openapi --check: /v1/email/init present, artifact up to date.
  • python util/lint.py --black --isort --flake8 on the changed files — clean.

Docs synced: runtime /v1/email/spec page (spec_html.py), the hand-maintained specification.html (new #ep-init block + 503 row), and the regenerated openapi.email.json.

Ovtcharov and others added 15 commits June 4, 2026 14:56
DocumentQAAgent and RoutingAgent were the last two agents left in the
core source tree under src/gaia/agents/. They now ship as standalone
gaia-agent-docqa / gaia-agent-routing wheels under hub/agents/python/,
completing the "strip src/gaia/agents/ to framework only" goal for #1102
(only base/, tools/, registry.py, builder/ — plus the chat family and
email — remain in core).

docqa is a building-block RAG agent: it registers via the gaia.agent
entry point as a hidden agent (mirroring fileio), default model
Qwen3.5-35B-A3B-GGUF. routing is infrastructure — a meta-agent loaded by
class path from the OpenAI API server, not a registry agent — so it ships
without a gaia.agent entry point; gaia.api.agent_registry now resolves it
at gaia_agent_routing.agent.RoutingAgent and fails loudly with an install
hint when the wheel is absent.
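
The "resolve by class path, fail loudly with an install hint" behaviour described above could be sketched roughly as follows. This is an illustration only: `load_by_class_path` and its `install_hint` parameter are hypothetical names, and the real logic in gaia.api.agent_registry may be structured differently.

```python
import importlib


def load_by_class_path(class_path: str, install_hint: str):
    """Import `pkg.module.ClassName`; raise a loud, actionable error if the package is absent."""
    module_name, _, class_name = class_path.rpartition(".")
    try:
        module = importlib.import_module(module_name)
    except ImportError as e:
        # `from e` keeps the original traceback attached to the new error.
        raise ValueError(f"Cannot load {class_path}: {install_hint}") from e
    return getattr(module, class_name)


# Usage with a stdlib class standing in for an installed wheel:
ordered_dict_cls = load_by_class_path("collections.OrderedDict", "n/a")
print(ordered_dict_cls.__name__)
```

There is no fallback branch: a missing wheel surfaces as an error with the hint rather than as a degraded agent.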
Self-review follow-up to the docqa/routing migration: the gaia-agent-code
CLI imported RoutingAgent from the old in-tree path
(gaia.agents.routing.agent), which the migration broke. Repoint it at
gaia_agent_routing.agent and declare gaia-agent-routing as a dependency of
gaia-agent-code, since the `gaia-code` query path routes through
RoutingAgent for language/project-type detection. No reverse dependency
(routing → code) — routing resolves CodeAgent through the registry at
runtime, avoiding a cycle.

Also clears the now-dead RoutingAgent allowance in the agent-conventions
checker (it only applied while routing lived under src/gaia/agents/).
# Conflicts:
#	hub/agents/python/docqa/tests/test_docqa_agent.py
# Conflicts:
#	.github/workflows/test_gaia_cli.yml
#	setup.py
Merging main surfaced three stale references the migration missed:

- test_default_max_steps imported the now-migrated gaia.agents.docqa;
  repoint it at the core BuilderAgentConfig, which exercises the same
  field(default_factory=default_max_steps) inheritance.
- test_agent_pypi_publish asserted every published wheel declares a
  gaia.agent entry point, but routing is infrastructure loaded by
  class-path and intentionally ships without one. Exempt it explicitly.
- Routing module path + source links in the docs still pointed at
  src/gaia/agents/routing; repoint to the gaia_agent_routing wheel.

Also preserve the original traceback on the gaia-code ImportError
re-raise (raise ... from e) now that the block is being edited.
gaia-agent-code now depends on gaia-agent-routing>=0.1.0, which isn't
published to PyPI. The Test Code Agent workflow installed code straight
from the hub dir, so uv tried to resolve routing from the registry and
failed. Install the local routing package first so the dep resolves
locally. End users are unaffected — both wheels publish together on tag.
The API streaming tests target the 'gaia-code' model, which routes
through RoutingAgent. Pre-migration routing lived in core, so it
resolved automatically; now it ships as the gaia-agent-routing wheel
that the API Tests job didn't install — so 3 streaming tests hit the
(correct) missing-wheel error instead of a real agent. Install the
local routing+code hub packages, and re-run API tests when either
hub package changes.
CLAUDE.md still pointed DocumentQAAgent/RoutingAgent at the old
src/gaia/agents/{docqa,routing} locations and listed docqa in the source
tree — stale after the hub migration and misleading since CLAUDE.md loads
as context on every session. Point both at their hub wheels and drop the
docqa tree entry.

errors.py FRAMEWORK_PATHS carried a dead 'gaia/agents/routing' entry; the
wheel's frames are already filtered by 'site-packages/'. Remove it and
update the test that asserted its presence.
A fresh host passed /health and /version green even with Lemonade down and no
model downloaded, then 502'd on the first triage call. /v1/email/init probes
the whole triage stack — Lemonade reachable + the triage model present — and
returns a structured status (200 when ready, 503 when not, with an actionable
hint), so integrators can verify "ready to triage," not just "process up."

Read-only: probes only, no model pull or provisioning. `loadable` is reported
null in v1 (forcing a load is heavy); `present` is the readiness signal. The
Lemonade reachability probe reuses the existing short-timeout /health logic
(#1677) via a shared base-URL resolver.

Docs kept in sync: the runtime /v1/email/spec page, the hand-maintained
specification.html, and the regenerated openapi.email.json. Verified the route
also mounts via the frozen sidecar's build_app().
@github-actions github-actions Bot added the documentation, dependencies, devops, tests, and agents labels Jun 22, 2026
@github-actions

Contributor

Review: PR #1813

Summary

The code here is solid and well-tested, but the PR is mistitled and under-described: it's titled feat(agent-email): add GET /v1/email/init … (#1795), yet the actual substance — and the bulk of the commits — is refactor(agents): migrate docqa + routing to hub (#1102). The email-init endpoint it advertises already landed on main (commit 2c9f0cea), so most of that part of the diff is a carry-along that will no-op on rebase. The migration itself is clean and consistent: no dangling gaia.agents.{docqa,routing} references remain anywhere, and setup.py, errors.py, the lint import lists, and the conventions checker were all updated in lockstep. The single most important thing to fix is the title + description so a reviewer (and the squashed commit / changelog) reflects what actually changed.

Issues Found

🟡 Title and description describe #1795, not the change being reviewed
The commit list is dominated by the #1102 docqa/routing hub migration (new gaia-agent-docqa / gaia-agent-routing wheels, two new CI workflows, setup.py extras, registry/error/lint updates). The PR description says "Scope: Python sidecar + its tests/docs only" and never mentions the migration at all — a reviewer reading it would miss the entire point of the PR. Per CLAUDE.md, the title is the technical handle and the description sells the merge; both currently point at the wrong change. Please:

🟢 FRAMEWORK_PATHS now inconsistent with editable hub installs (src/gaia/agents/base/errors.py:14)
This drops gaia/agents/routing, but gaia/agents/code — also migrated to a hub wheel — is still listed. More to the point, in an editable install (pip install -e hub/agents/python/routing) routing frames live at hub/agents/python/routing/…, which matches neither site-packages/ nor any remaining entry, so they'd leak into user-facing error traces. Dev-only impact and consistent with how code is already handled, so non-blocking — but worth a follow-up to filter hub/agents/python/ uniformly.

🟢 gaia-agent-code hard-depends on an unpublished gaia-agent-routing>=0.1.0 (hub/agents/python/code/pyproject.toml:13)
A PyPI install of gaia-agent-code will fail to resolve until gaia-agent-routing is published. CI works around it by installing the local wheel first and the README documents it, so this is a release-ordering note, not a defect — just make sure routing publishes before/with code.

🟢 docs/guides/email.mdx not updated for the readiness preflight (optional)
The new endpoint is documented in the agent's own specification.html / spec_html.py / openapi.email.json, which is the right contract surface. If the email-init work stays in this PR, a one-line mention of /v1/email/init in the user guide would help discoverability. Skip if you rebase the email bits out.
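
For the FRAMEWORK_PATHS follow-up above, filtering hub frames uniformly might look something like this. A sketch under stated assumptions: the fragment list and both function names are hypothetical, and the real FRAMEWORK_PATHS in errors.py may hold different entries and be matched differently.

```python
# Hypothetical fragments; the real FRAMEWORK_PATHS in errors.py may differ.
FRAMEWORK_FRAGMENTS = ("site-packages/", "gaia/agents/base/", "hub/agents/python/")


def is_framework_frame(filename: str) -> bool:
    """True when a traceback frame belongs to framework or hub-agent code."""
    normalized = filename.replace("\\", "/")  # tolerate Windows separators
    return any(fragment in normalized for fragment in FRAMEWORK_FRAGMENTS)


def user_frames(filenames: list[str]) -> list[str]:
    """Keep only the frames a user should see in an error trace."""
    return [name for name in filenames if not is_framework_frame(name)]


frames = [
    "/repo/hub/agents/python/routing/gaia_agent_routing/agent.py",  # editable install
    "/venv/lib/site-packages/gaia_agent_code/cli.py",               # wheel install
    "/home/user/my_script.py",
]
print(user_frames(frames))
```

A single `hub/agents/python/` entry would cover every migrated agent in an editable install, so no per-agent entries are needed.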

Strengths

  • Test quality on the init endpoint is exemplary (tests/unit/agents/email/test_init_endpoint.py): it asserts the shape of the outgoing probe — /api/v1/health and /models suffixes, the short connect/read timeout pair, and the Authorization: Bearer header — plus 200/503 status mapping, strict no-extra-fields serialization, and the sidecar mount via packaging/server.py. This is exactly the "verify call validity at boundaries" rule from CLAUDE.md, not just "we called it."
  • The migration is genuinely complete. Grep confirms zero remaining gaia.agents.docqa / gaia.agents.routing references; the conventions checker, FRAMEWORK_PATHS test, and both lint import lists were all updated, and the routing package correctly ships without a gaia.agent entry point (it's class-path infrastructure) while docqa ships with one.
  • Fails loudly, no silent fallback (src/gaia/api/agent_registry.py): the gaia-code load path raises ValueError(... ) from e with a concrete pip install hint instead of degrading — matches the No-Silent-Fallbacks rule.
  • Thoughtful CI wiring: path triggers include agent_registry.py for routing (loaded by class path) and tools/** for docqa (uses RAG/file mixins), and the routing-before-code install ordering is documented inline.
  • Running probes through asyncio.to_thread keeps the sync requests calls off the event loop.

Verdict

Request changes — no code defect blocks merge; the blocking item is purely the 🟡 title/description mismatch (and the rebase to drop the already-merged #1795 diff). Fix those and this is an approve. The two remaining 🟢 items are follow-ups, not gates.

build_app()'s app.routes mixes APIRoutes with mounted _IncludedRouter
objects, which have no .path attribute — iterating it raised
AttributeError. Read .path defensively and additionally prove the route
is reachable through the sidecar app with a real request (503, not 404).
Newer FastAPI keeps included routes under a mounted sub-router rather than
flattening them into app.routes, so a .path scan can't find
/v1/email/init even though it is served. Assert reachability with a real
request (503, not 404) — version-robust proof the sidecar app mounts it.
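
The defensive `.path` read from the two commits above can be illustrated without FastAPI. A minimal sketch using stand-in classes: `FakeRoute` and `FakeIncludedRouter` are hypothetical stubs, not FastAPI types.

```python
from dataclasses import dataclass


@dataclass
class FakeRoute:
    """Stand-in for a plain route object that exposes .path."""
    path: str


class FakeIncludedRouter:
    """Stand-in for a mounted router object that exposes no .path."""


def visible_paths(routes) -> list[str]:
    """Collect paths defensively; objects without .path are skipped, not fatal."""
    candidates = (getattr(route, "path", None) for route in routes)
    return [path for path in candidates if path is not None]


routes = [FakeRoute("/health"), FakeIncludedRouter(), FakeRoute("/version")]
print(visible_paths(routes))
```

A route served through the included router never shows up in this scan, which is why the commits assert reachability with a real request instead.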
itomek previously approved these changes Jun 22, 2026
kovtcharov added a commit that referenced this pull request Jun 22, 2026
…layground

Install & setup now drives provisioning via the API instead of copy-paste:
a 'Run gaia init' button POSTs /v1/email/init and streams the output into a
terminal panel (line by line, tolerant of SSE or plain-text framing), with a
running/ok/failed status and an auto health-recheck on success. Built to the
contract the /init PR (#1813) will serve — GET = readiness, POST = provision;
until it lands the button reports the endpoint as unavailable. The manual
steps remain below, and the CLI hint is now 'gaia init --profile email'.
#1795)

GET /v1/email/init tells you the triage stack isn't ready, but a frozen-binary
sidecar had no way to *fix* it. POST /v1/email/init is the provisioning
companion: it tells a running local Lemonade to download the configured email
model and streams newline-delimited (text/plain) progress so a consumer (the
#1814 playground) can render it terminal-style, line by line. A ✓-prefixed
final line means success, ✗ means failure.

Scope is the frozen-binary reality: the sidecar can't run the full `gaia init`
or install Lemonade itself (chicken-and-egg). If Lemonade is unreachable the
verb returns a real 503 with an actionable line and pulls nothing; once a pull
starts the response is a committed 200 (HTTP status can't change mid-stream),
so the trailing ✓/✗ line is the authoritative outcome. The pull posts only
`model_name` (no `recipe`) for the built-in email model — the #1655 trap.

GET behavior is unchanged. POST is a streaming operational verb (like
GET /spec), so it's kept out of the JSON OpenAPI and documented in the HTML
spec (spec_html.py + specification.html) instead.
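
On the consumer side, the "trailing ✓/✗ line is authoritative" rule could be applied roughly like this. A sketch only: `provisioning_outcome` is a hypothetical helper, and the progress lines in the usage example are invented for illustration.

```python
from typing import Iterable, Optional


def provisioning_outcome(lines: Iterable[str]) -> Optional[bool]:
    """Return True/False from the last non-empty line, or None if it is inconclusive."""
    last = None
    for raw in lines:
        line = raw.strip()
        if line:
            last = line
    if last is None:
        return None  # empty stream: no verdict
    if last.startswith("✓"):
        return True
    if last.startswith("✗"):
        return False
    return None  # stream ended without a verdict line (e.g. connection dropped)


# Invented progress lines, for illustration only.
print(provisioning_outcome(["Pulling model\n", "50%\n", "✓ model ready\n"]))
```

Because the HTTP status is already committed as 200 once streaming starts, a `None` result should be treated as a failure to confirm, not as success.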
return StreamingResponse(_unreachable(), media_type=media_type, status_code=503)

return StreamingResponse(
_provision_progress(probe_base, model_id),
… presence (#1795)

GET /v1/email/init said "ready" as long as Lemonade was up and the model was
downloaded — even against a Lemonade too old to run the triage stack, which then
fails at request time. Readiness now also checks the server VERSION: it reads
Lemonade's self-reported version from /health and compares it to the agent's
required minimum, so "ready" means "ready to triage," version included.

The lemonade block gains found-vs-required fields the playground renders:
  lemonade: { reachable, base_url, version, min_version, compatible }
A too-old server → ready=false (503) with an actionable upgrade hint
("Lemonade x.y.z is older than the required a.b.c — upgrade …"). An
unadvertised/unparseable version is reported compatible=null and does NOT block
(mirrors gaia init's don't-block-on-unparseable policy).
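
The found-vs-required comparison, including the "unparseable does not block" rule, can be sketched as below. Assumptions are labelled: `parse_version` and `is_compatible` are hypothetical names (the real helper mirrors InitCommand._parse_version, whose exact parsing rules aren't shown here), and the version numbers are made up.

```python
import re
from typing import Optional, Tuple


def parse_version(text: Optional[str]) -> Optional[Tuple[int, ...]]:
    """Parse 'x.y.z' (optionally prefixed with 'v') into a tuple of ints, else None."""
    if not text:
        return None
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_compatible(found: Optional[str], required: Optional[str]) -> Optional[bool]:
    """True/False when both versions parse; None (non-blocking) otherwise."""
    found_v, required_v = parse_version(found), parse_version(required)
    if found_v is None or required_v is None:
        return None
    # Tuple comparison is numeric per component, so 9.x < 10.x as expected.
    return found_v >= required_v


def blocks_readiness(compatible: Optional[bool]) -> bool:
    """Only an explicit False blocks; None does not."""
    return compatible is False


print(is_compatible("9.1.4", "10.0.0"), is_compatible("nightly", "10.0.0"))
```

Comparing integer tuples rather than strings avoids the classic trap where "9.1.4" sorts after "10.0.0" lexicographically.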

Single source of truth: min_lemonade_version lives in gaia-agent.yaml (the
manifest `gaia init` reads) AND as gaia_agent_email.version.MIN_LEMONADE_VERSION
(the RUNTIME value — the frozen sidecar bundles neither gaia.installer nor the
yaml, so the check can't read them at run time). A lock-step test fails if the
two drift. The version-parse helper mirrors InitCommand._parse_version locally
for the same frozen-binary reason.

/health stays liveness-only; POST /v1/email/init (provisioning) is unchanged.
…e triage model)

The frozen email sidecar can't run the full installer, so `gaia init` is the
host-side path that downloads and version-checks the email triage model. Adds an
"email" init profile (Gemma-4-E4B-it-GGUF) and exposes `gaia init --profile
email` as a CLI choice.

Its min_lemonade_version is held in lock-step with the email agent's runtime
MIN_LEMONADE_VERSION (the same minimum GET /v1/email/init enforces), so the
installer and readiness can't disagree on what 'compatible' means — a test
asserts they match.
# Conflicts:
#	CLAUDE.md
#	hub/agents/python/email/gaia_agent_email/spec_html.py
@github-actions github-actions Bot added the cli CLI changes label Jun 22, 2026
@github-actions

Contributor

🟡 hub/agents/python/email/gaia_agent_email/api_routes.py — blocking I/O in a sync generator freezes the event loop

email_provision() wraps the initial _probe_lemonade_reachable() in asyncio.to_thread (correct), but then hands a synchronous generator to StreamingResponse. When Starlette iterates that generator it calls next() on the event loop thread — so every blocking requests.* call inside _provision_progress (including _pull_model with its 30-minute (5.0, 1800.0) timeout) stalls the entire server.

Every other blocking route in the file already uses asyncio.to_thread (e.g. send_email, email_triage, email_init). The fix is the same pattern: run the generator body in a thread, or use an async generator with await asyncio.to_thread(...) for each blocking step:

# Option A (simplest): iterate the sync generator in Starlette's threadpool
import asyncio
from starlette.concurrency import iterate_in_threadpool

@router.post("/init", include_in_schema=False)
async def email_provision() -> StreamingResponse:
    ...
    return StreamingResponse(
        iterate_in_threadpool(_provision_progress(probe_base, model_id)),
        media_type=media_type,
        status_code=200,
    )

iterate_in_threadpool is already available in Starlette (used internally by FastAPI for sync response bodies) and is the canonical approach here.
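
The other option mentioned above (an async generator that awaits `asyncio.to_thread(...)` for each blocking step) could look roughly like this. A minimal sketch: `iterate_off_loop` and `fake_progress` are hypothetical names, and `fake_progress` only stands in for `_provision_progress`, whose real body isn't shown here.

```python
import asyncio
import time
from typing import AsyncIterator, Iterator

# Sentinel instead of StopIteration: StopIteration cannot be raised into an
# asyncio Future, so exhaustion is signalled by a returned marker object.
_DONE = object()


async def iterate_off_loop(sync_gen: Iterator[str]) -> AsyncIterator[str]:
    """Pull each item from a blocking generator in a worker thread."""
    while True:
        item = await asyncio.to_thread(next, sync_gen, _DONE)
        if item is _DONE:
            return
        yield item


def fake_progress() -> Iterator[str]:
    """Stand-in for a generator whose steps block (e.g. on network I/O)."""
    for step in ("connecting", "downloading", "✓ done"):
        time.sleep(0.01)  # simulated blocking call
        yield step + "\n"


async def collect() -> list[str]:
    return [line async for line in iterate_off_loop(fake_progress())]


print(asyncio.run(collect()))
```

Passing `iterate_off_loop(...)` as the `StreamingResponse` body would play the same role as `iterate_in_threadpool` in Option A; which one reads better is a style choice.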

alexey-tyurin pushed a commit to alexey-tyurin/gaia that referenced this pull request Jun 22, 2026
…d#1796) (amd#1814)

A developer evaluating the email agent has no zero-setup way to see it
work — clone the repo, build the package, run a CLI. This adds a
GAIA-styled page the **sidecar serves itself** at
`http://127.0.0.1:8131/v1/email/playground`: visit it and you get a
**stack-health check** (sidecar up + a plain-language Lemonade/model
diagnosis — *"Lemonade not found"*, *"model not downloaded"*), **live
triage and draft** against the running sidecar, a button that
**exercises the `/v1/init` readiness endpoint**, and copy-paste install
shortcuts.

![email-agent
playground](https://raw.githubusercontent.com/amd/gaia/feat/email-agent-playground/docs/assets/img/email-playground.webp)

**Localhost-only is structural, not a promise.** The page is served
same-origin (no CORS, no remote-controlled JS) and the route ships
`Content-Security-Policy: connect-src 'self'`, so the browser *refuses*
any non-local fetch — email content can't leave the machine. Inference
stays on local Lemonade.
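
A test for the egress pin described above might check the header along these lines. A rough sketch, not the playground's actual test: `connect_sources` and `pins_egress_to_self` are hypothetical helpers, and the parsing is deliberately simplified.

```python
def connect_sources(csp_header: str) -> list[str]:
    """Return the source list of the connect-src directive, or [] if it is absent."""
    for directive in csp_header.split(";"):
        parts = directive.split()
        if parts and parts[0].lower() == "connect-src":
            return parts[1:]
    return []


def pins_egress_to_self(csp_header: str) -> bool:
    """True only when connect-src is present and allows exactly 'self'."""
    return connect_sources(csp_header) == ["'self'"]


print(pins_egress_to_self("connect-src 'self'"))
```

This sketch treats a missing `connect-src` as "not pinned" even though browsers would fall back to `default-src`; that is stricter than the browser, which seems the safer direction for a test.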

The `/init` button consumes the readiness endpoint from **amd#1795**,
implemented in **PR amd#1813** (branch `claudia/task-4a1065f9`). This
branch predates it, so `/v1/email/init` returns 404 here — the button
**fails loudly with a clear message** ("update the sidecar — ships with
amd#1795") rather than breaking, and lights up once amd#1795 merges. The
endpoint is **not** duplicated here; the playground only consumes it.

Closes amd#1796.

### Also in this PR
- **Added a "Playground" section + screenshot to the email agent
README** (`hub/agents/python/email/README.md`), mirroring the npm
package's architecture-diagram embed.
- **Brought the sibling `/v1/email/spec` page on-brand.** It used an
off-brand orange/blue/green palette; restyled to the GAIA dark+gold
tokens (matching the website + playground), self-contained (system
fonts, no webfont), and added a "Convenience pages" section listing
`/spec` and `/playground`.

## Test plan
- [ ] `PYTHONPATH=hub/agents/python/email python -m pytest
tests/unit/agents/email/test_playground.py
tests/unit/agents/email/test_spec_html.py
tests/test_email_openapi_conformance.py -q` — 56 pass (route 200, CSP
pins egress to `'self'`, no external resources, `/playground` excluded
from `/openapi.json`, spec page still self-contained).
- [ ] Start the sidecar (`python
hub/agents/python/email/packaging/server.py --host 127.0.0.1 --port
8131`), open `http://127.0.0.1:8131/v1/email/playground`:
- Stack health shows ✓ Sidecar; the Lemonade/model row diagnoses
correctly (start/stop `lemonade-server serve` to see both states).
- Triage + Draft run live (with Lemonade up); "Run readiness check ·
/v1/init" shows the graceful 404 message on this branch.
- Response header includes `Content-Security-Policy: connect-src
'self'`.
- [ ] Open `http://127.0.0.1:8131/v1/email/spec` — renders in GAIA
dark+gold, lists the playground under "Convenience pages".
pull Bot pushed a commit to bhardwajRahul/gaia that referenced this pull request Jun 23, 2026
…eck (bump 0.2.0) (amd#1822)

## Why this matters

The email package's version lived in eight files of six different types
(Python, YAML, TOML, JSON, Markdown, HTML) with no tool to keep them in
sync, so references drifted silently. On `main` right now,
`binaries.lock.json` still pins both `agentVersion` and `baseUrl` to
`…/agents/email/0.1.0` while every other file already says `0.2.0` — a
static pointer to a *prior* hub deployment that no test caught. After
this change, `AGENT_VERSION` in `version.py` is the one source of truth,
a stamp script syncs every other reference from it, and a `--check` mode
fails the build loudly on any mismatch — so a stale version reference
can never ship again.

Mirrors the Agent UI's existing pattern
(`installer/version/bump-ui-version.mjs`: one source → stamps dependents
→ `--check` gated in CI).

**Stamped file types** (all driven from `AGENT_VERSION`): the YAML
manifest, `pyproject.toml`, npm `package.json`, the lock's
`agentVersion` + `baseUrl`, the two README image URLs, and the
`architecture.html` version badge. `API_VERSION` (the REST/contract
version) is deliberately **not** touched — it's the contract version,
independent of the package build version.

**Cross-branch skip-with-warning:** three npm-side targets (README image
URLs, `assets/architecture.html`) don't exist on `main` yet — they live
on in-flight branches (amd#1776, amd#1814). The script **skips them with a
warning** rather than failing, so it works across the partial state
today and will stamp them correctly once those branches merge. This PR
only touches version strings + the new script + the two workflows + the
new test; it does not touch the playground HTML, the `/v1/email/init`
endpoint, or the npm client (owned by amd#1814/amd#1813/amd#1776).

This PR also fixes the existing `binaries.lock.json` drift (0.1.0 →
0.2.0) as the first run of the new stamper.
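
The `--check` behaviour described above (fail on mismatch, skip missing targets with a warning) can be sketched as follows. This is not the real stamp_version.py: `check_targets` is a hypothetical function, and "contains the version string" is a simplified stand-in for the per-file-type matching the real script would need.

```python
import tempfile
from pathlib import Path


def check_targets(version: str, targets: list[Path]) -> tuple[list[str], list[str]]:
    """Return (mismatches, skipped). A target passes if it contains the version string."""
    mismatches, skipped = [], []
    for path in targets:
        if not path.exists():
            # Target lives on an in-flight branch: warn and skip, don't fail.
            print(f"warning: {path.name} not found, skipping")
            skipped.append(path.name)
            continue
        if version not in path.read_text(encoding="utf-8"):
            mismatches.append(path.name)
    return mismatches, skipped


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pyproject.toml").write_text('version = "0.2.0"\n', encoding="utf-8")
    (root / "binaries.lock.json").write_text('{"agentVersion": "0.1.0"}\n', encoding="utf-8")
    targets = [root / "pyproject.toml", root / "binaries.lock.json", root / "architecture.html"]
    mismatches, skipped = check_targets("0.2.0", targets)

exit_code = 1 if mismatches else 0  # non-zero fails the build
print(mismatches, skipped, exit_code)
```

The stale lock file reproduces the drift described above: one mismatch, one skipped target, and a non-zero exit.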

## Test plan

- [x] `python hub/agents/python/email/packaging/stamp_version.py
--check` passes on the post-bump tree (exit 0)
- [x] Mutating any target to a wrong version makes `--check` exit
non-zero (covered by `test_stamp_version.py`)
- [x] `python -m pytest
hub/agents/python/email/tests/test_stamp_version.py` — 10 passed
(hermetic, no network)
- [x] Version-contract tests green with the bump:
`test_agent_version_matches_package_export` +
`test_agent_version_matches_package_metadata` (pyproject + in-code
`AGENT_VERSION` both 0.2.0)
- [x] `black` + `isort` clean on the new files
- [x] `--check` wired into `release_agent_email.yml` (before publish)
and `test_email_agent_unit.yml` (early PR drift gate; npm-side paths
added to its triggers)