feat(rag): index Microsoft Word (.docx) documents by kovtcharov · Pull Request #1866 · amd/gaia

kovtcharov · 2026-06-25T21:22:13Z

Why this matters

Word documents couldn't be indexed for RAG — the UI rejected .docx with a "not supported, save as PDF first" message and the SDK shipped no extractor. Anyone with a handbook, contract, or report in .docx had to convert to PDF before GAIA could answer questions about it. After this change .docx indexes directly, the same as PDF / PPTX / XLSX, with no conversion step.

Extraction walks the document body in order and captures paragraph text, table cells (including tables nested inside a cell and rows/cells wrapped in repeating-section content controls), and — critically for form/template documents — the text inside content controls (w:sdt), hyperlinks, and textboxes, which Word stores outside the direct runs that Paragraph.text exposes. So filled-in form values get indexed, not just the labels. Tabs and line breaks are preserved as whitespace (so Column1/Column2 don't glue into an unsearchable token) and the VML mc:Fallback twin of a textbox is skipped so shape text isn't double-counted. Corrupt / non-.docx files and a missing python-docx install fail loudly with actionable, file-named errors (no silent skip). Allow-lists and rejection messaging across the UI backend and React frontend are flipped so .docx flows end-to-end; legacy binary .doc/.ppt/.xls remain intentionally rejected.

Closes #1072

Heads-up: bundled agent migration (not in the title)

This branch also carries the RoutingAgent / DocumentQAAgent migration to standalone hub wheels (gaia-agent-routing / gaia-agent-docqa, following the #1102 pattern), which arrived via the base branch this work was cut from. It is intentional and the docs/CI are updated in lockstep, but it is a breaking import-path change — from gaia.agents.routing.agent import RoutingAgent / from gaia.agents.docqa.agent import DocumentQAAgent move under hub/agents/python/{routing,docqa}/. Flagging it here for release notes so the import break isn't a surprise; no deprecation shim is included given the migration's scope.

Test plan

pytest tests/unit/rag/test_docx_extraction.py — paragraphs, table cells, document order, content controls (inline + block), nested tables, repeating-section (sdt-wrapped) rows, hyperlinks, textbox single-capture, tab/break whitespace + intra-word integrity, corrupt/missing-file errors, dispatcher routing (15 tests)
pytest tests/unit/rag/ tests/unit/chat/ui/test_server.py tests/integration/test_files_router.py — 226 passed, 2 skipped (allow-list + legacy-office rejection tests updated)
End-to-end local run: generated a real .docx (heading + paragraphs + table), confirmed full RAGSDK.index_document() indexes it (planted fact + table cell retrievable) and corrupt/missing .docx is rejected with an actionable error naming the file
python util/lint.py --all — Black, isort, Pylint, Flake8 green
Frontend UnsupportedFeature.test.tsx updated to assert .docx is now supported (vitest)
Reviewer: index a .docx via gaia chat --index <file>.docx and ask a question about its contents

DocumentQAAgent and RoutingAgent were the last two agents left in the core source tree under src/gaia/agents/. They now ship as standalone gaia-agent-docqa / gaia-agent-routing wheels under hub/agents/python/, completing the "strip src/gaia/agents/ to framework only" goal for #1102 (only base/, tools/, registry.py, builder/ — plus the chat family and email — remain in core). docqa is a building-block RAG agent: it registers via the gaia.agent entry point as a hidden agent (mirroring fileio), default model Qwen3.5-35B-A3B-GGUF. routing is infrastructure — a meta-agent loaded by class path from the OpenAI API server, not a registry agent — so it ships without a gaia.agent entry point; gaia.api.agent_registry now resolves it at gaia_agent_routing.agent.RoutingAgent and fails loudly with an install hint when the wheel is absent.

Self-review follow-up to the docqa/routing migration: the gaia-agent-code CLI imported RoutingAgent from the old in-tree path (gaia.agents.routing.agent), which the migration broke. Repoint it at gaia_agent_routing.agent and declare gaia-agent-routing as a dependency of gaia-agent-code, since the `gaia-code` query path routes through RoutingAgent for language/project-type detection. No reverse dependency (routing → code) — routing resolves CodeAgent through the registry at runtime, avoiding a cycle. Also clears the now-dead RoutingAgent allowance in the agent-conventions checker (it only applied while routing lived under src/gaia/agents/).

# Conflicts: # hub/agents/python/docqa/tests/test_docqa_agent.py

# Conflicts: # .github/workflows/test_gaia_cli.yml # setup.py

Merging main surfaced three stale references the migration missed: - test_default_max_steps imported the now-migrated gaia.agents.docqa; repoint it at the core BuilderAgentConfig, which exercises the same field(default_factory=default_max_steps) inheritance. - test_agent_pypi_publish asserted every published wheel declares a gaia.agent entry point, but routing is infrastructure loaded by class-path and intentionally ships without one. Exempt it explicitly. - Routing module path + source links in the docs still pointed at src/gaia/agents/routing; repoint to the gaia_agent_routing wheel. Also preserve the original traceback on the gaia-code ImportError re-raise (raise ... from e) now that the block is being edited.

gaia-agent-code now depends on gaia-agent-routing>=0.1.0, which isn't published to PyPI. The Test Code Agent workflow installed code straight from the hub dir, so uv tried to resolve routing from the registry and failed. Install the local routing package first so the dep resolves locally. End users are unaffected — both wheels publish together on tag.

The API streaming tests target the 'gaia-code' model, which routes through RoutingAgent. Pre-migration routing lived in core, so it resolved automatically; now it ships as the gaia-agent-routing wheel that the API Tests job didn't install — so 3 streaming tests hit the (correct) missing-wheel error instead of a real agent. Install the local routing+code hub packages, and re-run API tests when either hub package changes.

CLAUDE.md still pointed DocumentQAAgent/RoutingAgent at the old src/gaia/agents/{docqa,routing} locations and listed docqa in the source tree — stale after the hub migration and misleading since CLAUDE.md loads as context on every session. Point both at their hub wheels and drop the docqa tree entry. errors.py FRAMEWORK_PATHS carried a dead 'gaia/agents/routing' entry; the wheel's frames are already filtered by 'site-packages/'. Remove it and update the test that asserted its presence.

Word documents previously could not be indexed for RAG — the UI rejected .docx with a "not supported, save as PDF" message and the SDK had no extractor. Users with handbooks, contracts, and reports in .docx had to convert to PDF first. Now .docx indexes directly like PDF/PPTX/XLSX. Extraction walks the document body in order, capturing paragraph text, table cells (including tables nested in a cell), and — importantly for form/template docs — text inside content controls (w:sdt) and hyperlinks, which Word stores outside the direct runs that Paragraph.text exposes. Corrupt/non-.docx files and a missing python-docx install fail loudly with actionable errors. Allow-lists and rejection messaging across the UI backend and frontend are updated so .docx flows end-to-end. Closes #1072

github-actions · 2026-06-25T21:25:35Z

Verdict: Approve with suggestions

Word (.docx) documents now index directly for RAG instead of being rejected with a "save as PDF first" message — the extractor walks the document body in order and, importantly, pulls text out of content controls and hyperlinks that Word hides from the simple paragraph API, so filled-in form values get indexed rather than just the labels. The allow-lists and rejection messaging are flipped consistently across the SDK, UI backend, and React frontend, with a zip-bomb guard mirroring the existing .pptx path. Test coverage is genuinely thorough (paragraphs, tables, nested tables, inline + block content controls, hyperlinks, corrupt-file errors, dispatcher routing).

Two things to be aware of, neither blocking:

The PR bundles a second, larger change the title/description don't mention: RoutingAgent and DocumentQAAgent are migrated out of the core package into standalone gaia-agent-routing / gaia-agent-docqa hub wheels (Agent Hub: Restructure — move production agents to hub/agents/ #1102). This changes their public import paths — anyone doing from gaia.agents.routing.agent import RoutingAgent will break. The migration itself is clean and the docs are updated in lockstep, but it should be called out in the description / release notes so the breaking import change isn't a surprise.
A couple of minor nits below.

No correctness or security issues found.

🔍 Technical details

🟡 Important

Bundled agent migration is a breaking SDK change not surfaced in the PR description — src/gaia/agents/routing/ and src/gaia/agents/docqa/ move to hub/agents/python/{routing,docqa}/, changing the documented import path (docs/sdk/agents/routing.mdx previously showed from gaia.agents.routing.agent import RoutingAgent). This follows the established #1102 pattern (code/jira/docker/blender/sd already migrated), the docs are all updated, and there's no in-core dangling reference (verified grep over src/gaia/ is clean), so it's intentional and well-executed — not a code defect. The only gap is visibility: the description is "feat(rag): index .docx" and says nothing about the migration or the import-path break. Add a line to the description / changelog so release notes capture it. No deprecation shim is needed given the migration's scope.

🟢 Minor

# TODO(#1072) lives inside the docstring (src/gaia/rag/sdk.py:1621) — it renders as literal docstring text with a stray #. Move it below the closing """ as a normal comment, or drop the #:

        Known omissions: header/footer text (separate XML parts, usually
        repeated boilerplate) and embedded images (TODO #1072: VLM
        extraction for images embedded in .docx files).

gaia-code install hint references a doc path (src/gaia/api/agent_registry.py:1029) — the error string points users to docs/spec/agent-hub-restructure.mdx; worth a quick confirm that page exists on the rendered site, otherwise the actionable error sends users to a 404.

Strengths

The content-control / hyperlink handling (_paragraph_text joining every w:t descendant + recursive _emit for block-level w:sdt and nested tables) is the right call — it's exactly the text that form/template .docx files hide from Paragraph.text, and the tests prove it (test_inline_content_control_captured, test_nested_table_in_cell_captured).
Fail-loud error handling done well: corrupt/non-.docx and missing python-docx raise actionable, file-named ValueErrors with raise ... from e, and the zip-bomb guard reuses the proven .pptx pattern.
Migration hygiene is strong — INFRA_ONLY_AGENT_IDS exemption for routing's missing gaia.agent entry point, the pytest.importorskip guards for framework-only envs, the loud install hint in agent_registry.py, and synchronized CI workflows all show the breaking move was threaded through carefully.
Allow-list changes are consistent across all four surfaces (ui/utils.py, ui/routers/files.py, UnsupportedFeature.tsx, frontend tests), so .docx flows end-to-end with no contradicting "unsupported" message left behind.

Adversarial review of the XML-walk extractor surfaced three cases that silently degraded exactly the form/template/report documents the feature targets: - Textboxes/shapes (mc:AlternateContent) were emitted twice — once from the DrawingML mc:Choice and once from the VML mc:Fallback twin — and glued onto the host paragraph. Skip mc:Fallback so shape text is captured once. - Tabs and line/page breaks (w:tab/w:br/w:cr) were dropped, gluing adjacent words into unsearchable tokens (e.g. "Column1Column2"). Translate them to whitespace while leaving intra-word run splits untouched. - Rows/cells wrapped in repeating-section content controls (w:sdt around w:tr/w:tc) were skipped by the direct-child findall. Descend through the wrappers. Also wrap missing-file / directory / permission OSErrors in the same actionable message as the corrupt-file path instead of a raw traceback. Adds regression tests for each case (textbox single-capture, tab/break whitespace, intra-word integrity, sdt-wrapped rows).

# Conflicts: # CLAUDE.md

kovtcharov · 2026-06-25T23:51:41Z

Thanks for the review — addressed:

Bundled routing/docqa migration (Important): surfaced in the PR description under "Heads-up: bundled agent migration," including the breaking import-path change for release notes. The migration rides in from the base branch this was cut from; per the review it's clean/intentional, so it stays — now it's documented rather than silent.
# TODO(#1072) rendering inside the docstring (Minor): moved into the "Known omissions" sentence so there's no stray # in the rendered docstring (fa01b9d3).
gaia-code install-hint doc path (Minor): verified — docs/spec/agent-hub-restructure.mdx exists, so the actionable error doesn't 404. No change needed.

Also merged latest main (only conflict was CLAUDE.md, resolved to main's authoritative version since this PR doesn't touch it). Full suite green after the merge: 226 passed, 2 skipped; lint clean.

Ovtcharov and others added 15 commits June 4, 2026 14:56

Merge remote-tracking branch 'origin/main' into claudia/task-8fa7ecef

42d3220

# Conflicts: # hub/agents/python/docqa/tests/test_docqa_agent.py

Merge remote-tracking branch 'origin/main' into claudia/task-8fa7ecef

7b7379f

# Conflicts: # .github/workflows/test_gaia_cli.yml # setup.py

Merge branch 'main' into claudia/task-8fa7ecef

1674258

Merge branch 'main' into claudia/task-8fa7ecef

3b81417

Merge remote-tracking branch 'origin/main' into claudia/task-8fa7ecef

7fc433d

Merge remote-tracking branch 'origin/main' into pr1455-update

28a643a

Merge remote-tracking branch 'origin/main' into pr1455-update

e8f5f45

Merge remote-tracking branch 'origin/main' into pr1455-update

17221cc

kovtcharov requested a review from kovtcharov-amd as a code owner June 25, 2026 21:22

github-actions Bot added documentation Documentation changes dependencies Dependency updates devops DevOps/infrastructure changes rag RAG system changes tests Test changes performance Performance-critical changes agents labels Jun 25, 2026

kovtcharov requested a review from itomek June 25, 2026 23:48

kovtcharov added 2 commits June 25, 2026 16:49

Merge remote-tracking branch 'origin/main' into feat/rag-docx-indexing

86855ff

# Conflicts: # CLAUDE.md

docs(rag): move .docx TODO out of docstring body (review nit)

fa01b9d

Merge branch 'main' into feat/rag-docx-indexing

0a62aa2

kovtcharov mentioned this pull request Jun 26, 2026

[Feature] Support Microsoft Office (docx, pptx, xls) indexing #1072

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(rag): index Microsoft Word (.docx) documents#1866

feat(rag): index Microsoft Word (.docx) documents#1866
kovtcharov wants to merge 19 commits into
mainfrom
feat/rag-docx-indexing

kovtcharov commented Jun 25, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 25, 2026

🟡 Important

🟢 Minor

Strengths

Uh oh!

kovtcharov commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

kovtcharov commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why this matters

Heads-up: bundled agent migration (not in the title)

Test plan

Uh oh!

github-actions Bot commented Jun 25, 2026

Verdict: Approve with suggestions

🟡 Important

🟢 Minor

Strengths

Uh oh!

kovtcharov commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kovtcharov commented Jun 25, 2026 •

edited

Loading