feat(rag): index Microsoft Word (.docx) documents#1866

Open
kovtcharov wants to merge 19 commits into main from feat/rag-docx-indexing

@kovtcharov kovtcharov commented Jun 25, 2026

Collaborator

Why this matters

Word documents couldn't be indexed for RAG — the UI rejected .docx with a "not supported, save as PDF first" message and the SDK shipped no extractor. Anyone with a handbook, contract, or report in .docx had to convert to PDF before GAIA could answer questions about it. After this change .docx indexes directly, the same as PDF / PPTX / XLSX, with no conversion step.

Extraction walks the document body in order and captures paragraph text, table cells (including tables nested inside a cell and rows/cells wrapped in repeating-section content controls), and — critically for form/template documents — the text inside content controls (w:sdt), hyperlinks, and textboxes, which Word stores outside the direct runs that Paragraph.text exposes. So filled-in form values get indexed, not just the labels. Tabs and line breaks are preserved as whitespace (so Column1/Column2 don't glue into an unsearchable token) and the VML mc:Fallback twin of a textbox is skipped so shape text isn't double-counted. Corrupt / non-.docx files and a missing python-docx install fail loudly with actionable, file-named errors (no silent skip). Allow-lists and rejection messaging across the UI backend and React frontend are flipped so .docx flows end-to-end; legacy binary .doc/.ppt/.xls remain intentionally rejected.
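
A minimal sketch of the paragraph-level idea described above, using only the standard library rather than python-docx. The helper name `paragraph_text` and the sample XML are illustrative assumptions, not the code in this PR; a real textbox nests its text deeper inside the `mc:Choice` branch than this simplified sample does.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
MC = "http://schemas.openxmlformats.org/markup-compatibility/2006"

# Elements that should become whitespace instead of vanishing.
WHITESPACE = {f"{{{W}}}tab": "\t", f"{{{W}}}br": "\n", f"{{{W}}}cr": "\n"}


def paragraph_text(element):
    """Join every w:t descendant in document order (hypothetical helper)."""
    parts = []

    def walk(node):
        for child in node:
            if child.tag == f"{{{MC}}}Fallback":
                continue  # VML twin of a textbox; mc:Choice already covers it
            if child.tag == f"{{{W}}}t":
                parts.append(child.text or "")
            elif child.tag in WHITESPACE:
                parts.append(WHITESPACE[child.tag])
            else:
                walk(child)  # descends into w:r, w:sdt, w:hyperlink, ...

    walk(element)
    return "".join(parts)


SAMPLE = f"""
<w:p xmlns:w="{W}" xmlns:mc="{MC}">
  <w:r><w:t>Name:</w:t><w:tab/></w:r>
  <w:sdt><w:sdtContent>
    <w:r><w:t>Ada Love</w:t></w:r><w:r><w:t>lace</w:t></w:r>
  </w:sdtContent></w:sdt>
  <mc:AlternateContent>
    <mc:Choice><w:r><w:t> [box]</w:t></w:r></mc:Choice>
    <mc:Fallback><w:r><w:t> [box]</w:t></w:r></mc:Fallback>
  </mc:AlternateContent>
</w:p>
""".strip()

text = paragraph_text(ET.fromstring(SAMPLE))
print(repr(text))
```

The content-control value is kept, the intra-word run split ("Ada Love" + "lace") is joined without an inserted space, the tab survives as whitespace, and the textbox text appears once.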

Closes #1072

Heads-up: bundled agent migration (not in the title)

This branch also carries the RoutingAgent / DocumentQAAgent migration to standalone hub wheels (gaia-agent-routing / gaia-agent-docqa, following the #1102 pattern), which arrived via the base branch this work was cut from. It is intentional and the docs/CI are updated in lockstep, but it is a breaking import-path change: the modules behind from gaia.agents.routing.agent import RoutingAgent and from gaia.agents.docqa.agent import DocumentQAAgent move under hub/agents/python/{routing,docqa}/. Flagging it here for release notes so the import break isn't a surprise; no deprecation shim is included given the migration's scope.
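
For callers affected by the move, a class-path lookup that fails loudly might look like the sketch below. The path gaia_agent_routing.agent.RoutingAgent is the one named in the commit messages on this PR; `load_class` and the hint wording are hypothetical and are not the actual gaia.api.agent_registry code.

```python
import importlib


def load_class(class_path, install_hint):
    """Import ``pkg.module.ClassName`` or raise with an actionable hint."""
    module_name, _, class_name = class_path.rpartition(".")
    try:
        module = importlib.import_module(module_name)
    except ImportError as e:
        # "from e" keeps the original traceback attached to the new error.
        raise ImportError(f"Cannot import {class_path!r}: {install_hint}") from e
    return getattr(module, class_name)


# New location per this PR; the old gaia.agents.routing.agent path is gone.
ROUTING_AGENT = "gaia_agent_routing.agent.RoutingAgent"

try:
    RoutingAgent = load_class(ROUTING_AGENT, "install the gaia-agent-routing wheel")
except ImportError as err:
    RoutingAgent = None
    print(err)
```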

Test plan

  • pytest tests/unit/rag/test_docx_extraction.py — paragraphs, table cells, document order, content controls (inline + block), nested tables, repeating-section (sdt-wrapped) rows, hyperlinks, textbox single-capture, tab/break whitespace + intra-word integrity, corrupt/missing-file errors, dispatcher routing (15 tests)
  • pytest tests/unit/rag/ tests/unit/chat/ui/test_server.py tests/integration/test_files_router.py — 226 passed, 2 skipped (allow-list + legacy-office rejection tests updated)
  • End-to-end local run: generated a real .docx (heading + paragraphs + table), confirmed full RAGSDK.index_document() indexes it (planted fact + table cell retrievable) and corrupt/missing .docx is rejected with an actionable error naming the file
  • python util/lint.py --all — Black, isort, Pylint, Flake8 green
  • Frontend UnsupportedFeature.test.tsx updated to assert .docx is now supported (vitest)
  • Reviewer: index a .docx via gaia chat --index <file>.docx and ask a question about its contents
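
The allow-list behaviour that the tests above exercise can be pictured roughly as follows. The set contents, function name, and message wording are assumptions made for illustration; the real lists live in the UI backend and the React frontend.

```python
from pathlib import Path

# Hypothetical allow-list; .docx now sits alongside the other Office formats.
SUPPORTED = {".pdf", ".docx", ".pptx", ".xlsx"}
# Legacy binary formats stay rejected on purpose.
LEGACY_OFFICE = {".doc", ".ppt", ".xls"}


def check_indexable(filename):
    """Return None if the file type is indexable, else raise ValueError."""
    ext = Path(filename).suffix.lower()
    if ext in SUPPORTED:
        return None
    if ext in LEGACY_OFFICE:
        raise ValueError(
            f"{filename}: legacy binary {ext} files are not supported; "
            f"save the file as {ext}x and index that instead"
        )
    raise ValueError(f"{filename}: unsupported file type {ext or '(none)'}")


check_indexable("Handbook.DOCX")  # accepted, extension match is case-insensitive
```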

Ovtcharov and others added 15 commits June 4, 2026 14:56
DocumentQAAgent and RoutingAgent were the last two agents left in the
core source tree under src/gaia/agents/. They now ship as standalone
gaia-agent-docqa / gaia-agent-routing wheels under hub/agents/python/,
completing the "strip src/gaia/agents/ to framework only" goal for #1102
(only base/, tools/, registry.py, builder/ — plus the chat family and
email — remain in core).

docqa is a building-block RAG agent: it registers via the gaia.agent
entry point as a hidden agent (mirroring fileio), default model
Qwen3.5-35B-A3B-GGUF. routing is infrastructure — a meta-agent loaded by
class path from the OpenAI API server, not a registry agent — so it ships
without a gaia.agent entry point; gaia.api.agent_registry now resolves it
at gaia_agent_routing.agent.RoutingAgent and fails loudly with an install
hint when the wheel is absent.

Self-review follow-up to the docqa/routing migration: the gaia-agent-code
CLI imported RoutingAgent from the old in-tree path
(gaia.agents.routing.agent), which the migration broke. Repoint it at
gaia_agent_routing.agent and declare gaia-agent-routing as a dependency of
gaia-agent-code, since the `gaia-code` query path routes through
RoutingAgent for language/project-type detection. No reverse dependency
(routing → code) — routing resolves CodeAgent through the registry at
runtime, avoiding a cycle.

Also clears the now-dead RoutingAgent allowance in the agent-conventions
checker (it only applied while routing lived under src/gaia/agents/).

# Conflicts:
#	hub/agents/python/docqa/tests/test_docqa_agent.py
# Conflicts:
#	.github/workflows/test_gaia_cli.yml
#	setup.py

Merging main surfaced three stale references the migration missed:

- test_default_max_steps imported the now-migrated gaia.agents.docqa;
  repoint it at the core BuilderAgentConfig, which exercises the same
  field(default_factory=default_max_steps) inheritance.
- test_agent_pypi_publish asserted every published wheel declares a
  gaia.agent entry point, but routing is infrastructure loaded by
  class-path and intentionally ships without one. Exempt it explicitly.
- Routing module path + source links in the docs still pointed at
  src/gaia/agents/routing; repoint to the gaia_agent_routing wheel.

Also preserve the original traceback on the gaia-code ImportError
re-raise (raise ... from e) now that the block is being edited.

gaia-agent-code now depends on gaia-agent-routing>=0.1.0, which isn't
published to PyPI. The Test Code Agent workflow installed code straight
from the hub dir, so uv tried to resolve routing from the registry and
failed. Install the local routing package first so the dep resolves
locally. End users are unaffected: both wheels publish together on tag.

The API streaming tests target the 'gaia-code' model, which routes
through RoutingAgent. Pre-migration routing lived in core, so it
resolved automatically; now it ships as the gaia-agent-routing wheel
that the API Tests job didn't install — so 3 streaming tests hit the
(correct) missing-wheel error instead of a real agent. Install the
local routing+code hub packages, and re-run API tests when either
hub package changes.

CLAUDE.md still pointed DocumentQAAgent/RoutingAgent at the old
src/gaia/agents/{docqa,routing} locations and listed docqa in the source
tree — stale after the hub migration and misleading since CLAUDE.md loads
as context on every session. Point both at their hub wheels and drop the
docqa tree entry.

errors.py FRAMEWORK_PATHS carried a dead 'gaia/agents/routing' entry; the
wheel's frames are already filtered by 'site-packages/'. Remove it and
update the test that asserted its presence.

Word documents previously could not be indexed for RAG: the UI rejected
.docx with a "not supported, save as PDF" message and the SDK had no
extractor. Users with handbooks, contracts, and reports in .docx had to
convert to PDF first. Now .docx indexes directly like PDF/PPTX/XLSX.

Extraction walks the document body in order, capturing paragraph text,
table cells (including tables nested in a cell), and — importantly for
form/template docs — text inside content controls (w:sdt) and hyperlinks,
which Word stores outside the direct runs that Paragraph.text exposes.
Corrupt/non-.docx files and a missing python-docx install fail loudly with
actionable errors. Allow-lists and rejection messaging across the UI
backend and frontend are updated so .docx flows end-to-end.

Closes #1072
@github-actions github-actions Bot added the documentation, dependencies, devops, rag, tests, performance, and agents labels Jun 25, 2026
@github-actions

Contributor

Verdict: Approve with suggestions

Word (.docx) documents now index directly for RAG instead of being rejected with a "save as PDF first" message — the extractor walks the document body in order and, importantly, pulls text out of content controls and hyperlinks that Word hides from the simple paragraph API, so filled-in form values get indexed rather than just the labels. The allow-lists and rejection messaging are flipped consistently across the SDK, UI backend, and React frontend, with a zip-bomb guard mirroring the existing .pptx path. Test coverage is genuinely thorough (paragraphs, tables, nested tables, inline + block content controls, hyperlinks, corrupt-file errors, dispatcher routing).

Two things to be aware of, neither blocking:

  • The PR bundles a second, larger change the title/description don't mention: RoutingAgent and DocumentQAAgent are migrated out of the core package into standalone gaia-agent-routing / gaia-agent-docqa hub wheels (Agent Hub: Restructure — move production agents to hub/agents/ #1102). This changes their public import paths — anyone doing from gaia.agents.routing.agent import RoutingAgent will break. The migration itself is clean and the docs are updated in lockstep, but it should be called out in the description / release notes so the breaking import change isn't a surprise.
  • A couple of minor nits below.

No correctness or security issues found.

🔍 Technical details

🟡 Important

Bundled agent migration is a breaking SDK change not surfaced in the PR description. src/gaia/agents/routing/ and src/gaia/agents/docqa/ move to hub/agents/python/{routing,docqa}/, changing the documented import path (docs/sdk/agents/routing.mdx previously showed from gaia.agents.routing.agent import RoutingAgent). This follows the established #1102 pattern (code/jira/docker/blender/sd already migrated), the docs are all updated, and there's no in-core dangling reference (verified grep over src/gaia/ is clean), so it's intentional and well-executed, not a code defect. The only gap is visibility: the description is "feat(rag): index .docx" and says nothing about the migration or the import-path break. Add a line to the description / changelog so release notes capture it. No deprecation shim is needed given the migration's scope.

🟢 Minor

  • # TODO(#1072) lives inside the docstring (src/gaia/rag/sdk.py:1621) — it renders as literal docstring text with a stray #. Move it below the closing """ as a normal comment, or drop the #:
            Known omissions: header/footer text (separate XML parts, usually
            repeated boilerplate) and embedded images (TODO #1072: VLM
            extraction for images embedded in .docx files).
    
  • gaia-code install hint references a doc path (src/gaia/api/agent_registry.py:1029) — the error string points users to docs/spec/agent-hub-restructure.mdx; worth a quick confirm that page exists on the rendered site, otherwise the actionable error sends users to a 404.
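
To illustrate the first nit: a TODO placed after the closing quotes is an ordinary comment and never reaches the rendered docstring. The function name and body below are placeholders, not the code at src/gaia/rag/sdk.py.

```python
def extract_docx_text(path):
    """Extract text from a .docx file (placeholder for illustration).

    Known omissions: header/footer text (separate XML parts, usually
    repeated boilerplate) and embedded images.
    """
    # TODO(#1072): VLM extraction for images embedded in .docx files.
    raise NotImplementedError(path)


# The docstring that help() and doc tools render carries no stray marker.
print(extract_docx_text.__doc__)
```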

Strengths

  • The content-control / hyperlink handling (_paragraph_text joining every w:t descendant + recursive _emit for block-level w:sdt and nested tables) is the right call — it's exactly the text that form/template .docx files hide from Paragraph.text, and the tests prove it (test_inline_content_control_captured, test_nested_table_in_cell_captured).
  • Fail-loud error handling done well: corrupt/non-.docx and missing python-docx raise actionable, file-named ValueErrors with raise ... from e, and the zip-bomb guard reuses the proven .pptx pattern.
  • Migration hygiene is strong — INFRA_ONLY_AGENT_IDS exemption for routing's missing gaia.agent entry point, the pytest.importorskip guards for framework-only envs, the loud install hint in agent_registry.py, and synchronized CI workflows all show the breaking move was threaded through carefully.
  • Allow-list changes are consistent across all four surfaces (ui/utils.py, ui/routers/files.py, UnsupportedFeature.tsx, frontend tests), so .docx flows end-to-end with no contradicting "unsupported" message left behind.
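
As a rough picture of the fail-loud and zip-bomb points above, the sketch below validates the .docx container with the standard library before any parsing. The size cap, function names, and message wording are assumptions; the actual guard in this PR reuses the existing .pptx pattern, whose details are not shown on this page.

```python
import io
import zipfile

MAX_UNCOMPRESSED_BYTES = 200 * 1024 * 1024  # hypothetical cap, not GAIA's value


def check_docx_archive(source, name, limit=MAX_UNCOMPRESSED_BYTES):
    """Validate a .docx container; return its declared uncompressed size."""
    try:
        with zipfile.ZipFile(source) as archive:
            total = sum(info.file_size for info in archive.infolist())
            names = set(archive.namelist())
    except (zipfile.BadZipFile, OSError) as e:
        # Corrupt, missing, or unreadable file: name it and keep the cause.
        raise ValueError(f"Cannot read {name!r} as a .docx file: {e}") from e
    if "word/document.xml" not in names:
        raise ValueError(f"{name!r} is a zip archive but has no word/document.xml")
    if total > limit:
        raise ValueError(f"{name!r} expands to {total} bytes, over the {limit}-byte limit")
    return total


def make_docx_bytes(body):
    """Build a tiny in-memory archive that only looks like a .docx."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", body)
    return buffer.getvalue()


ok_size = check_docx_archive(io.BytesIO(make_docx_bytes(b"x" * 1000)), "report.docx")
print(ok_size)
```

The check relies on the sizes declared in the archive's directory, so it is a cheap first line of defence rather than a complete one.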

Adversarial review of the XML-walk extractor surfaced three cases that
silently degraded exactly the form/template/report documents the feature
targets:

- Textboxes/shapes (mc:AlternateContent) were emitted twice — once from the
  DrawingML mc:Choice and once from the VML mc:Fallback twin — and glued
  onto the host paragraph. Skip mc:Fallback so shape text is captured once.
- Tabs and line/page breaks (w:tab/w:br/w:cr) were dropped, gluing adjacent
  words into unsearchable tokens (e.g. "Column1Column2"). Translate them to
  whitespace while leaving intra-word run splits untouched.
- Rows/cells wrapped in repeating-section content controls (w:sdt around
  w:tr/w:tc) were skipped by the direct-child findall. Descend through the
  wrappers.

Also wrap missing-file / directory / permission OSErrors in the same
actionable message as the corrupt-file path instead of a raw traceback.

Adds regression tests for each case (textbox single-capture, tab/break
whitespace, intra-word integrity, sdt-wrapped rows).
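
The third case can be reproduced with a small standard-library sketch: a direct-child lookup misses a row wrapped in w:sdt, while a walk that descends through the wrappers finds it. Helper names and the sample table are illustrative assumptions, and for brevity the cell text here also sweeps in any nested-table text.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
TR, TC, T = f"{{{W}}}tr", f"{{{W}}}tc", f"{{{W}}}t"
WRAPPERS = {f"{{{W}}}sdt", f"{{{W}}}sdtContent"}


def iter_through_wrappers(parent, wanted):
    """Yield `wanted` children, descending only through w:sdt wrappers."""
    for child in parent:
        if child.tag == wanted:
            yield child
        elif child.tag in WRAPPERS:
            yield from iter_through_wrappers(child, wanted)


def table_rows(tbl):
    """Return the table as a list of rows, each a list of cell strings."""
    rows = []
    for tr in iter_through_wrappers(tbl, TR):
        cells = []
        for tc in iter_through_wrappers(tr, TC):
            cells.append("".join(t.text or "" for t in tc.iter(T)))
        rows.append(cells)
    return rows


def cell(text):
    return f"<w:tc><w:p><w:r><w:t>{text}</w:t></w:r></w:p></w:tc>"


SAMPLE = (
    f'<w:tbl xmlns:w="{W}">'
    f"<w:tr>{cell('Item')}{cell('Qty')}</w:tr>"
    # Repeating-section row: the w:tr (and one w:tc) sit inside w:sdt wrappers.
    f"<w:sdt><w:sdtContent><w:tr>{cell('Widget')}"
    f"<w:sdt><w:sdtContent>{cell('3')}</w:sdtContent></w:sdt>"
    f"</w:tr></w:sdtContent></w:sdt>"
    f"</w:tbl>"
)

tbl = ET.fromstring(SAMPLE)
direct_only = len(tbl.findall(TR))  # direct children only: misses the wrapped row
rows = table_rows(tbl)
print(direct_only, rows)
```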
@kovtcharov kovtcharov requested a review from itomek June 25, 2026 23:48
@kovtcharov

Collaborator Author

Thanks for the review — addressed:

  • Bundled routing/docqa migration (Important): surfaced in the PR description under "Heads-up: bundled agent migration," including the breaking import-path change for release notes. The migration rides in from the base branch this was cut from; per the review it's clean/intentional, so it stays — now it's documented rather than silent.
  • # TODO(#1072) rendering inside the docstring (Minor): moved into the "Known omissions" sentence so there's no stray # in the rendered docstring (fa01b9d3).
  • gaia-code install-hint doc path (Minor): verified — docs/spec/agent-hub-restructure.mdx exists, so the actionable error doesn't 404. No change needed.

Also merged latest main (only conflict was CLAUDE.md, resolved to main's authoritative version since this PR doesn't touch it). Full suite green after the merge: 226 passed, 2 skipped; lint clean.

Labels

agents, dependencies, devops, documentation, performance, rag, tests

Development

Successfully merging this pull request may close these issues.

[Feature] Support Microsoft Office (docx, pptx, xls) indexing

2 participants