feat(rag): index Microsoft Word (.docx) documents #1866
DocumentQAAgent and RoutingAgent were the last two agents left in the core source tree under src/gaia/agents/. They now ship as standalone gaia-agent-docqa / gaia-agent-routing wheels under hub/agents/python/, completing the "strip src/gaia/agents/ to framework only" goal for #1102 (only base/, tools/, registry.py, builder/ — plus the chat family and email — remain in core). docqa is a building-block RAG agent: it registers via the gaia.agent entry point as a hidden agent (mirroring fileio), default model Qwen3.5-35B-A3B-GGUF. routing is infrastructure — a meta-agent loaded by class path from the OpenAI API server, not a registry agent — so it ships without a gaia.agent entry point; gaia.api.agent_registry now resolves it at gaia_agent_routing.agent.RoutingAgent and fails loudly with an install hint when the wheel is absent.
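The class-path resolution described above can be sketched roughly as follows. This is a minimal illustration, not the actual `gaia.api.agent_registry` code: the function name `resolve_agent_class`, its parameters, and the wording of the hint are assumptions made for the example.

```python
import importlib


def resolve_agent_class(class_path: str, wheel_name: str):
    """Import 'pkg.module.ClassName' and return the class.

    Hypothetical helper: raises ImportError with an install hint when the
    module that should provide the class is not installed.
    """
    module_path, _, class_name = class_path.rpartition(".")
    if not module_path:
        raise ValueError(f"Expected a dotted class path, got {class_path!r}")
    try:
        module = importlib.import_module(module_path)
    except ImportError as e:
        # 'from e' keeps the original traceback attached to the new error.
        raise ImportError(
            f"Cannot load {class_path!r}. Install the wheel that provides it: "
            f"pip install {wheel_name}"
        ) from e
    try:
        return getattr(module, class_name)
    except AttributeError as e:
        raise ImportError(
            f"Module {module_path!r} has no attribute {class_name!r}"
        ) from e
```

Under these assumptions, a call such as `resolve_agent_class("gaia_agent_routing.agent.RoutingAgent", "gaia-agent-routing")` would either return the class or stop with a message naming the missing wheel, rather than falling through to a generic failure later.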
Self-review follow-up to the docqa/routing migration: the gaia-agent-code CLI imported RoutingAgent from the old in-tree path (gaia.agents.routing.agent), which the migration broke. Repoint it at gaia_agent_routing.agent and declare gaia-agent-routing as a dependency of gaia-agent-code, since the `gaia-code` query path routes through RoutingAgent for language/project-type detection. No reverse dependency (routing → code) — routing resolves CodeAgent through the registry at runtime, avoiding a cycle. Also clears the now-dead RoutingAgent allowance in the agent-conventions checker (it only applied while routing lived under src/gaia/agents/).
# Conflicts: # hub/agents/python/docqa/tests/test_docqa_agent.py
# Conflicts: # .github/workflows/test_gaia_cli.yml # setup.py
Merging main surfaced three stale references the migration missed:
- test_default_max_steps imported the now-migrated gaia.agents.docqa; repoint it at the core BuilderAgentConfig, which exercises the same field(default_factory=default_max_steps) inheritance.
- test_agent_pypi_publish asserted every published wheel declares a gaia.agent entry point, but routing is infrastructure loaded by class-path and intentionally ships without one. Exempt it explicitly.
- Routing module path + source links in the docs still pointed at src/gaia/agents/routing; repoint to the gaia_agent_routing wheel.

Also preserve the original traceback on the gaia-code ImportError re-raise (raise ... from e) now that the block is being edited.
gaia-agent-code now depends on gaia-agent-routing>=0.1.0, which isn't published to PyPI. The Test Code Agent workflow installed code straight from the hub dir, so uv tried to resolve routing from the registry and failed. Install the local routing package first so the dep resolves locally. End users are unaffected — both wheels publish together on tag.
The API streaming tests target the 'gaia-code' model, which routes through RoutingAgent. Pre-migration routing lived in core, so it resolved automatically; now it ships as the gaia-agent-routing wheel that the API Tests job didn't install — so 3 streaming tests hit the (correct) missing-wheel error instead of a real agent. Install the local routing+code hub packages, and re-run API tests when either hub package changes.
CLAUDE.md still pointed DocumentQAAgent/RoutingAgent at the old
src/gaia/agents/{docqa,routing} locations and listed docqa in the source
tree — stale after the hub migration and misleading since CLAUDE.md loads
as context on every session. Point both at their hub wheels and drop the
docqa tree entry.
errors.py FRAMEWORK_PATHS carried a dead 'gaia/agents/routing' entry; the
wheel's frames are already filtered by 'site-packages/'. Remove it and
update the test that asserted its presence.
Word documents previously could not be indexed for RAG — the UI rejected .docx with a "not supported, save as PDF" message and the SDK had no extractor. Users with handbooks, contracts, and reports in .docx had to convert to PDF first. Now .docx indexes directly like PDF/PPTX/XLSX. Extraction walks the document body in order, capturing paragraph text, table cells (including tables nested in a cell), and — importantly for form/template docs — text inside content controls (w:sdt) and hyperlinks, which Word stores outside the direct runs that Paragraph.text exposes. Corrupt/non-.docx files and a missing python-docx install fail loudly with actionable errors. Allow-lists and rejection messaging across the UI backend and frontend are updated so .docx flows end-to-end. Closes #1072
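The in-order walk described above can be sketched against the raw WordprocessingML tree. This is a simplified illustration using only the standard library, not the extractor shipped in this PR: the helper names `q`, `paragraph_text`, and `walk_blocks` are hypothetical, and the sketch ignores textboxes, tabs, and breaks.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def q(tag: str) -> str:
    """Return the namespace-qualified name for a w:* tag."""
    return f"{{{W}}}{tag}"


def paragraph_text(p) -> str:
    # iter() descends into w:hyperlink and inline w:sdt wrappers, so text
    # that lives outside the paragraph's direct runs is still collected.
    return "".join(t.text or "" for t in p.iter(q("t")))


def walk_blocks(container):
    """Yield text blocks of a w:body, w:tc, or w:sdtContent in document order."""
    for child in container:
        if child.tag == q("p"):
            text = paragraph_text(child)
            if text.strip():
                yield text
        elif child.tag == q("tbl"):
            for row in child.findall(q("tr")):
                for cell in row.findall(q("tc")):
                    # A cell is itself a block container, so nested tables
                    # are handled by the recursion.
                    yield from walk_blocks(cell)
        elif child.tag == q("sdt"):
            # Block-level content control: the content sits in w:sdtContent.
            content = child.find(q("sdtContent"))
            if content is not None:
                yield from walk_blocks(content)
```

Walking the children of each container, rather than querying all paragraphs and all tables separately, is what keeps paragraphs and table cells interleaved in the order they appear in the document.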
Verdict: Approve with suggestions

Word ( Two things to be aware of, neither blocking:

No correctness or security issues found.

🔍 Technical details

🟡 Important
Bundled agent migration is a breaking SDK change not surfaced in the PR description

🟢 Minor
Strengths
Adversarial review of the XML-walk extractor surfaced three cases that silently degraded exactly the form/template/report documents the feature targets:
- Textboxes/shapes (mc:AlternateContent) were emitted twice (once from the DrawingML mc:Choice and once from the VML mc:Fallback twin) and glued onto the host paragraph. Skip mc:Fallback so shape text is captured once.
- Tabs and line/page breaks (w:tab/w:br/w:cr) were dropped, gluing adjacent words into unsearchable tokens (e.g. "Column1Column2"). Translate them to whitespace while leaving intra-word run splits untouched.
- Rows/cells wrapped in repeating-section content controls (w:sdt around w:tr/w:tc) were skipped by the direct-child findall. Descend through the wrappers.

Also wrap missing-file / directory / permission OSErrors in the same actionable message as the corrupt-file path instead of a raw traceback.

Adds regression tests for each case (textbox single-capture, tab/break whitespace, intra-word integrity, sdt-wrapped rows).
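The three fixes can be illustrated with a small standard-library sketch. It is an approximation under stated assumptions, not the code from this commit: `collect_text`, `paragraph_text`, and `iter_rows` are hypothetical names, and mapping w:tab to a tab character and w:br/w:cr to a newline is one possible choice of whitespace.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
MC = "http://schemas.openxmlformats.org/markup-compatibility/2006"

# Assumed mapping: any whitespace character would keep the words separable.
WHITESPACE = {f"{{{W}}}tab": "\t", f"{{{W}}}br": "\n", f"{{{W}}}cr": "\n"}


def collect_text(elem, out):
    """Append text found under elem to out, in document order."""
    for child in elem:
        if child.tag == f"{{{MC}}}Fallback":
            continue  # VML twin of mc:Choice; skipping avoids double capture
        if child.tag == f"{{{W}}}t":
            out.append(child.text or "")
        elif child.tag in WHITESPACE:
            out.append(WHITESPACE[child.tag])
        else:
            collect_text(child, out)


def paragraph_text(p) -> str:
    out = []
    collect_text(p, out)
    # Adjacent runs are joined with no separator, so a word split across
    # runs stays intact.
    return "".join(out)


def iter_rows(tbl):
    """Yield w:tr elements, descending through w:sdt wrappers around rows."""
    for child in tbl:
        if child.tag == f"{{{W}}}tr":
            yield child
        elif child.tag == f"{{{W}}}sdt":
            content = child.find(f"{{{W}}}sdtContent")
            if content is not None:
                yield from iter_rows(content)
```

In this sketch textbox text is still concatenated into the host paragraph, just once instead of twice; a real extractor might prefer to emit it as a separate block.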
Thanks for the review — addressed:
Also merged latest |
Why this matters
Word documents couldn't be indexed for RAG: the UI rejected `.docx` with a "not supported, save as PDF first" message and the SDK shipped no extractor. Anyone with a handbook, contract, or report in `.docx` had to convert to PDF before GAIA could answer questions about it. After this change `.docx` indexes directly, the same as PDF / PPTX / XLSX, with no conversion step.

Extraction walks the document body in order and captures paragraph text, table cells (including tables nested inside a cell and rows/cells wrapped in repeating-section content controls), and, critically for form/template documents, the text inside content controls (`w:sdt`), hyperlinks, and textboxes, which Word stores outside the direct runs that `Paragraph.text` exposes. So filled-in form values get indexed, not just the labels. Tabs and line breaks are preserved as whitespace (so `Column1`/`Column2` don't glue into an unsearchable token) and the VML `mc:Fallback` twin of a textbox is skipped so shape text isn't double-counted. Corrupt / non-`.docx` files and a missing `python-docx` install fail loudly with actionable, file-named errors (no silent skip). Allow-lists and rejection messaging across the UI backend and React frontend are flipped so `.docx` flows end-to-end; legacy binary `.doc` / `.ppt` / `.xls` remain intentionally rejected.

Closes #1072
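The "fail loudly with a file-named error" behavior can be sketched as follows. This is an assumption-laden illustration: it uses the standard-library `zipfile` module as a stand-in for `python-docx`, and `DocxExtractionError` and `read_document_xml` are hypothetical names that do not come from the PR.

```python
import zipfile


class DocxExtractionError(RuntimeError):
    """Hypothetical error type carrying an actionable, file-named message."""


def read_document_xml(path: str) -> bytes:
    """Return the raw word/document.xml of a .docx, or raise a clear error."""
    try:
        with zipfile.ZipFile(path) as zf:
            return zf.read("word/document.xml")
    except (zipfile.BadZipFile, KeyError) as e:
        # Not a ZIP container, or a ZIP without the main document part.
        raise DocxExtractionError(
            f"'{path}' is not a valid .docx file. Re-save it from Word as "
            f".docx or convert it to PDF."
        ) from e
    except OSError as e:
        # Missing file, directory, or permission problem.
        raise DocxExtractionError(f"Cannot read '{path}': {e}") from e
```

Routing both failure families through one exception type lets the caller show a single kind of message that always names the file, instead of surfacing a raw traceback for some cases.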
Heads-up: bundled agent migration (not in the title)
This branch also carries the RoutingAgent / DocumentQAAgent migration to standalone hub wheels (`gaia-agent-routing` / `gaia-agent-docqa`, following the #1102 pattern), which arrived via the base branch this work was cut from. It is intentional and the docs/CI are updated in lockstep, but it is a breaking import-path change: `from gaia.agents.routing.agent import RoutingAgent` / `from gaia.agents.docqa.agent import DocumentQAAgent` move under `hub/agents/python/{routing,docqa}/`. Flagging it here for release notes so the import break isn't a surprise; no deprecation shim is included given the migration's scope.

Test plan
- `pytest tests/unit/rag/test_docx_extraction.py`: paragraphs, table cells, document order, content controls (inline + block), nested tables, repeating-section (sdt-wrapped) rows, hyperlinks, textbox single-capture, tab/break whitespace + intra-word integrity, corrupt/missing-file errors, dispatcher routing (15 tests)
- `pytest tests/unit/rag/ tests/unit/chat/ui/test_server.py tests/integration/test_files_router.py`: 226 passed, 2 skipped (allow-list + legacy-office rejection tests updated)
- `.docx` (heading + paragraphs + table), confirmed full `RAGSDK.index_document()` indexes it (planted fact + table cell retrievable) and corrupt/missing `.docx` is rejected with an actionable error naming the file
- `python util/lint.py --all`: Black, isort, Pylint, Flake8 green
- `UnsupportedFeature.test.tsx` updated to assert `.docx` is now supported (vitest)
- `.docx` via `gaia chat --index <file>.docx` and ask a question about its contents