feat(search): broaden /api/search — field discovery, field-scoped query, facet drill-down (Phase 2) #58

Draft
joewiz wants to merge 8 commits into eXist-db:develop from joewiz:feat/api-search-field-scope

joewiz commented Jun 11, 2026

Member

[This PR was co-authored with Claude Code. -Joe]

Summary

Phase-2 broadening of /api/search beyond the fixed-field keyword query. Three capabilities move it toward an Elasticsearch-shaped search surface:

  1. Field discovery: GET /api/search/fields[?scope=…][&field=…] lists the searchable fields/facets/vectors configured under a scope, each with its contract (kind, elements, analyzer, type, returnable), filtered by field-level security. Backed by the native ft:fields.
  2. Field-scoped query: &field=<name>&scope=<path> on /api/search restricts a query to one discovered field, under one or more collections. Standard ft:query, FLS-gated (a field the caller can't see → 403). Works on a stock eXist.
  3. Facet drill-down: &facet=<dim>:<value> (repeatable; same dim → OR, different → AND), with ES post_filter semantics: selecting a value narrows the returned hits but the bucket counts stay stable (they reflect the base query). app/section become shortcuts for facet=site-app:…/site-section:…. Works on a stock eXist.
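For readers wiring up a client, here is a minimal sketch of how the three parameters combine on one request. The helper name `buildSearchUrl`, the base URL, and the `q` parameter name are assumptions for illustration; only `field`, `scope`, `facet`, and the `/api/search` path come from this description, and the real mount point may differ.

```javascript
// Hypothetical client-side helper, not part of this PR.
// Assumes the keyword parameter is called "q" and the API is mounted at the host root.
function buildSearchUrl(base, { q, field, scopes = [], facets = [] }) {
  const url = new URL("/api/search", base);
  url.searchParams.set("q", q);
  // Restrict the query to one discovered field (optional).
  if (field) url.searchParams.set("field", field);
  // scope is repeatable: one parameter per collection path.
  for (const scope of scopes) url.searchParams.append("scope", scope);
  // facet is repeatable: one "<dim>:<value>" parameter per selection.
  for (const [dim, value] of facets) {
    url.searchParams.append("facet", `${dim}:${value}`);
  }
  return url.toString();
}

// Example request: one field, one scope, two values of the same facet dimension (OR).
const exampleUrl = buildSearchUrl("http://localhost:8080", {
  q: "index",
  field: "site-content",
  scopes: ["/db/apps/docs"],
  facets: [["site-app", "docs"], ["site-app", "blog"]],
});
```

Repeating a parameter rather than comma-joining values keeps the client consistent with the array-typed declaration described in the facet commit.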

The FLS policy (public/restricted/visible) lives in its own field-policy.xqm with no ft:fields dependency, imported by both search.xqm and fields.xqm — so the field-scoped query and facet drill-down compile and run on a stock eXist even though discovery needs ft:fields.

⚠️ Merge gate

The discovery endpoint's fields.xqm calls ft:fields (eXist-db/exist#6459), and api.xq imports fields.xqm — so until #6459 is in the eXist release this package targets, the app would fail to compile (XPST0017) on a stock eXist. Merge after #6459 ships. (If we want discovery to ship sooner and degrade gracefully on a stock eXist, I can make the ft:fields call dynamic via function-lookup, so the module compiles everywhere and the route returns a clean error when the function is absent — de-gating this PR. Say the word.)

Testing

Validated on the trio integration instance (:19110, an ft:fields-enabled eXist + the real corpus). Cypress: search.cy.js (9), search-fields.cy.js (discovery + FLS), search-field-scope.cy.js (7 — field isolation, scope, guest-403, dba override), search-facet.cy.js (5 — buckets, drill, post_filter count stability, multi-value OR, app-shortcut). All green; the field-scoped + facet paths additionally verified on a stock beta3 (no ft:fields).

Notes

  • Implements the oxygen field-picker contract (the plugin's discovery + field-scoped + facet UI consume these) and design (c).
  • Vector similarity (oxygen (d)) is a follow-up — the core pipeline (vector:embed + ft:query-field-vector) is confirmed present; it needs an embedding corpus to exercise.

joewiz and others added 8 commits June 15, 2026 22:22
…policy

Prototype of the field-discovery half of the broaden-/api/search design: a
consumer (e.g. the Oxygen plugin's field picker) can ask "what can I search
here?" and "what is this field's contract?" before querying.

Two separated layers, per the ES model:
- CATALOG: enumerate every configured field/facet under a collection scope,
  with its contract (kind, indexed element(s), analyzer, type, returnable),
  read with privilege. This XQuery collection.xconf parser is a STAND-IN for
  the native ft:fields($scope) the lucene session will build; configs live
  under /db/system/config (admin-only) and the schema is system-managed, so
  the catalog read is privileged and permission-agnostic. (The privilege need
  is exactly why ft:fields is worth building natively — it reads the resolved
  LuceneConfig via the broker and skips the system-config read entirely.)
- FLS: a group->fields policy decides which catalog entries THIS caller sees,
  applied after the privileged read — keyed off $request?user. Field access
  lives in the policy, never as an ACL on the field (the Elasticsearch lesson).
  Default: public site-* fields for everyone incl. guest; everything else for
  authenticated callers; per-field group restrictions supported.
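The FLS tiers described above can be sketched as a filter applied after the catalog read. This is a JavaScript model of the policy, not the XQuery in this commit: the `restricted` map, the `internal-draft` field, the `editors` group, and the user object shape are all hypothetical.

```javascript
// Hypothetical policy table: field name -> groups allowed to see it.
// Fields absent from the table fall back to the default tiers.
const restricted = { "internal-draft": ["editors"] };

function isVisible(fieldName, user) {
  if (user.groups.includes("dba")) return true;    // dba sees everything
  if (fieldName.startsWith("site-")) return true;  // public tier, including guest
  if (user.name === "guest") return false;         // non-public fields need authentication
  const allowed = restricted[fieldName];
  // Unrestricted non-public fields are visible to any authenticated caller.
  return allowed === undefined || allowed.some((g) => user.groups.includes(g));
}

// The policy filters catalog records; it never touches the index itself.
function visibleFields(catalog, user) {
  return catalog.filter((record) => isVisible(record.field, user));
}
```

Because the decision depends only on the field name and the caller, the same predicate can gate both discovery and querying.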

Name-independent of ft:query-scope/ft:search-scope (those names are still in
review on eXist-db/exist#6455), so this can proceed now; the field-param query
cutover waits for that function to ship.

Validated on the live 3-producer instance: guest sees only public site-* fields;
an authenticated caller additionally sees the docs app's non-public index fields
(category/definition/function-name/term) plus a seeded secret-notes; dba sees
all. site-content's contract correctly dedups to one record listing the 7
elements it is indexed on across apps.

NOT for merge as-is: the privileged read uses system:as-user (prototype); the
production form swaps to ft:fields once it lands, and this gains an api.json
route + XQSuite tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the collection.xconf-parsing + system:as-user stand-in with the
native ft:fields($scope) (now in eXist-db/exist#6459). The catalog read is
now permission-agnostic and credential-free, exactly as designed.

Two ft:fields behaviours had to be handled in the API layer (worth a core
follow-up — see the handoff note):

1. ft:fields does NOT aggregate across collections: it resolves the single
   config for a given collection/doc-set, so ft:fields("/db/apps") is empty
   when each app's config lives on its own data collection, and a sequence
   scope resolves to only the first collection's config. For site-wide
   discovery we union ft:fields over every descendant collection in scope
   (fields:descendant-collections). Collapses to a single ft:fields($scope)
   if it gains native cross-collection aggregation.
2. ft:fields also emits element-level text-index records (a plain <text qname>
   with no named <field> yields a map with only "element"); those aren't
   named, field:(...)-queryable fields, so the catalog drops maps without a
   "field" key.

dedup now surfaces per-field analyzer VARIANCE as an array (a shared field
indexed with different analyzers on different elements — e.g. site-content
StandardAnalyzer vs WordDelimiter — is reported as both, not hidden).
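As a rough model of the catalog steps in this commit (union over collections, drop field-less records, dedup with analyzer variance), consider the sketch below. `fieldsFor` is a hypothetical stand-in for a per-collection ft:fields call, and the record shape (`field`, `elements`, `analyzer`) is assumed for illustration. Keeping a single analyzer as a plain string is also an assumption; the commit only states that variance is reported as an array.

```javascript
// Hypothetical JavaScript model of the catalog build; the real code is XQuery.
function buildCatalog(collections, fieldsFor) {
  const byName = new Map();
  for (const collection of collections) {
    for (const record of fieldsFor(collection)) {
      // Element-level text-index records carry no "field" key: not queryable by name.
      if (record.field === undefined) continue;
      const entry = byName.get(record.field) ??
        { field: record.field, elements: [], analyzers: [] };
      // Union the indexed elements across collections, preserving first-seen order.
      for (const el of record.elements ?? []) {
        if (!entry.elements.includes(el)) entry.elements.push(el);
      }
      // Collect every distinct analyzer instead of keeping only the first one.
      if (record.analyzer && !entry.analyzers.includes(record.analyzer)) {
        entry.analyzers.push(record.analyzer);
      }
      byName.set(record.field, entry);
    }
  }
  // One analyzer stays a string (assumed); several are surfaced as an array.
  return [...byName.values()].map(({ analyzers, ...rest }) => ({
    ...rest,
    analyzer: analyzers.length === 1 ? analyzers[0] : analyzers,
  }));
}
```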

Validated on a 2-app + bundled-apps bed: cross-collection union works
(site-content elements [a,b]); analyzer variance shows both analyzers; FLS
differentiates guest (public site-* only) from authenticated (also sees
function-name/secret-notes) from dba (all).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
eXist-db/exist#6459 (d724759) addressed both integration findings:
ft:fields now aggregates across every collection in scope, and every record
carries field + kind (field/facet/vector). So fields:catalog collapses to a
single ft:fields($scope) call — removing the descendant-collection union walk
and the [exists(?field)] filter.

Verified on the 2-app + bundled-apps bed: ft:fields("/db/apps") unions across
collections (site-content elements [a,b]), analyzer variance still surfaces as
an array, and FLS differentiates guest (public site-* only) / authenticated
(+ function-name, secret-notes, and the test-embedding vector field) / dba.

Confirmed for the core session: the previously field-less records were the
vector case (e.g. test-embedding, now kind:"vector"), not bare element-text
indexes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wire the field-discovery handler (fields:list) into the API:
- modules/api.xq: import the fields module so function-lookup resolves the
  "fields:list" operationId (same mechanism as search:query).
- modules/api.json: add the GET /api/search/fields path — scope (default
  /db/apps) + optional field params; documented response envelope
  (scope/user/total/fields[] with field/kind/elements/analyzer/type/returnable)
  and an example.
- src/test/cypress/e2e/search-fields.cy.js: self-contained suite (seeds a
  fixture collection with a public site-content field + a non-public field)
  asserting the envelope, the per-field contract, the field= filter, and that
  an authenticated caller sees non-public fields.

Validated at the handler level on the ft:fields bed (fields:list over synthetic
roaster $request maps): guest sees public site-* only; authenticated sees the
non-public fields too; field= narrows to one; default scope applied.

Depends on ft:fields (eXist-db/exist#6459): the route and the Cypress suite
require an eXist that ships ft:fields, so this is branch work until that lands
in a release (CI uses the stock image). Not for merge until then.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…FLS test

End-to-end HTTP testing (existdb-openapi installed on an ft:fields-enabled
eXist) surfaced a bug the handler-level test missed: the scope fallback
`($request?parameters?scope[. ne ""], $fields:default-scope)` always appended
the default, so a provided scope echoed twice (and was passed doubled to
ft:fields). Use an if/else so a provided scope (one or more) is used as-is and
the default applies only when none is given.
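The bug is easy to reproduce outside XQuery: a sequence constructor concatenates, it does not fall back. A JavaScript analogue, with hypothetical function names and arrays standing in for XQuery sequences:

```javascript
const DEFAULT_SCOPE = "/db/apps";

// Analogue of the buggy expression: the default is always appended,
// because building a sequence concatenates its operands.
function resolveScopeBuggy(provided) {
  return [...provided.filter((s) => s !== ""), DEFAULT_SCOPE];
}

// Analogue of the fix: an explicit branch, so the default applies
// only when no non-empty scope was given.
function resolveScope(provided) {
  const given = provided.filter((s) => s !== "");
  return given.length > 0 ? given : [DEFAULT_SCOPE];
}
```

With one provided scope, the buggy form yields two entries while the fixed form yields one, which is the doubling the HTTP test surfaced.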

The HTTP test also confirmed the route admits unauthenticated callers (identity
resolves to guest), so the guest "public-only" FLS tier is reachable over HTTP.
Added a Cypress assertion for it: a guest sees the public site-* fields but not
the non-public secret-notes; a dba sees both.

Verified on a full PoC bed (producers snapshot + the ft:fields lucene jar):
GET /api/search/fields returns, for the real corpus, site-content unioned
across 6 elements with mixed analyzers [StandardAnalyzer, SimpleAnalyzer];
admin sees 12 fields (field/facet/vector), guest sees 7 (public site-* only).
All five Cypress scenarios pass against the live route.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…earch

Implements the oxygen field-scoped-search contract (existdb-openapi#55):

- `field` (optional): restrict the query to one named field (a value from
  GET /api/search/fields) instead of the default shared site-content/site-title
  query. Built on standard ft:query (a field-qualified query string), so it
  works on a stock eXist — NOT gated on #6455/#6459. Only discovery needs ft:fields.
- `scope` (optional, repeatable): collection path(s) to search under, recursive;
  same semantics as /api/search/fields. Defaults to the sitewide /db/apps.
- Field-level security: a field the caller may not see is not queryable — returns
  403, enforced by the same policy /api/search/fields uses.
- Response shape unchanged (query/total/offset/limit/facets/results), so the
  plugin's parser is untouched; the deferred ~10-line plugin wiring can now land.
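The request flow for the `field` parameter can be modeled roughly as follows. The function name and return shape are hypothetical, and `canSee` stands for whatever predicate the shared policy module exposes; the `name:(terms)` form mirrors the field-qualified query string mentioned in the earlier commits.

```javascript
// Hypothetical model of the handler's decision, not the XQuery in this commit.
function fieldScopedQuery({ q, field }, canSee) {
  // No field given: the handler's default shared query applies (not modeled here).
  if (field === undefined) return { status: 200, query: q };
  // Field-level security: a field the caller may not see is not queryable.
  if (!canSee(field)) return { status: 403 };
  // Field-qualified query string, passed to a standard full-text query.
  return { status: 200, query: `${field}:(${q})` };
}
```

Checking visibility before building the query string means a forbidden field never reaches the index.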

To avoid a regression, the FLS policy (public/restricted/visible) is extracted to
a new field-policy.xqm with NO ft:fields dependency, imported by both search.xqm
and fields.xqm. Previously search would have transitively pulled ft:fields via
fields.xqm and failed to compile on a stock eXist (XPST0017); verified fixed —
/api/search compiles and field-scoped search works on stock beta3 (ft:fields absent).

7 self-contained Cypress tests (field isolation, scope, guest-403, public-200,
dba override, stable default); all green. Verified on the trio instance (:19110).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…st-db#55 / oxygen c)

Adds &facet=<dimension>:<value> to /api/search (repeatable; same dimension -> OR,
different dimensions -> AND), implementing the oxygen facet drill-down design (c).

ES post_filter semantics: selecting a facet value narrows the returned HITS but
NOT the bucket counts — counts are computed on the base query (q + scope) so they
stay stable as the user drills (the "blog (12)" still shows after filtering to
docs). Implemented as: one base query for the facets map + ft:score ranking; a
second drill-down query only when a facet is selected, to narrow the hits.
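A small in-memory model may make the post_filter behaviour concrete. Everything here is illustrative: `docs` is a plain array, `matches` stands in for the base query, and a single `app` property stands in for the site-app facet.

```javascript
// Hypothetical model: counts come from the base hits, narrowing applies to hits only.
function searchWithPostFilter(docs, matches, selectedApps) {
  const baseHits = docs.filter(matches);
  // Bucket counts are computed on the base query, before any facet selection.
  const facets = {};
  for (const hit of baseHits) {
    facets[hit.app] = (facets[hit.app] ?? 0) + 1;
  }
  // The selection narrows only the returned hits.
  const hits = selectedApps.length === 0
    ? baseHits
    : baseHits.filter((hit) => selectedApps.includes(hit.app));
  return { total: hits.length, facets, hits };
}
```

Filtering to one app leaves the other app's bucket count visible, so a user can switch between values without clearing the selection first.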

- The app/section params become shortcuts for facet=site-app:… / facet=site-section:…
  (generalized into the one mechanism).
- facet (and scope, also documented repeatable) are declared array-typed in
  api.json so roaster accepts repetition; the handler unwraps roaster's array(*)
  to a sequence. Values grouped by dimension with explicit for/where (NOT a
  ?key-in-predicate, which eXist mis-handles as XPTY0004 for >1 item).
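The grouping rule (same dimension → OR, different dimensions → AND) and the app/section shortcuts can be sketched like this. The function names and the parsed-parameter object are hypothetical, and splitting each pair at the first colon is an assumption about how values containing colons should be treated.

```javascript
// Hypothetical model of grouping repeated facet parameters by dimension.
function groupFacets({ facet = [], app, section }) {
  const pairs = [...facet];
  // The app/section shortcuts expand into the one general mechanism.
  if (app) pairs.push(`site-app:${app}`);
  if (section) pairs.push(`site-section:${section}`);
  const byDim = new Map();
  for (const pair of pairs) {
    const i = pair.indexOf(":");
    if (i < 1) continue; // skip malformed entries with no dimension
    const dim = pair.slice(0, i);
    const value = pair.slice(i + 1);
    if (!byDim.has(dim)) byDim.set(dim, []);
    byDim.get(dim).push(value);
  }
  return byDim;
}

// A hit passes when, for every selected dimension (AND),
// its value is one of the selected values (OR).
function matchesSelection(hitFacets, byDim) {
  for (const [dim, values] of byDim) {
    if (!values.includes(hitFacets[dim])) return false;
  }
  return true;
}
```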

Self-contained Cypress suite (5): bucket counts, drill narrows hits, post_filter
count stability, multi-value OR, app-shortcut equivalence. search.cy.js (9) and
search-field-scope.cy.js (7) stay green (no regression from moving facet counts
onto the base query).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… / oxygen c)

The facets-option ft:query (Lucene drill-down) doesn't collect per-match
offsets, so its result nodes can't drive ft:highlight-field-matches/KWIC — the
faceted path returned a full-body snippet with no <mark> and empty highlights.
Intersect the drill set with $base-hits (which carry the match data) by node
identity, so the returned hits are the match-bearing nodes narrowed to the facet
selection. Preserves post_filter narrowing + the base-query facet counts.
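The intersection step can be modeled as below. Node identity is approximated with a hypothetical `id` property, and `matches` stands for the per-match offset data that only the base hits carry.

```javascript
// Hypothetical model: keep the base hits (which carry match data)
// whose identity also appears in the drill-down result.
function narrowToDrillSet(baseHits, drillHits) {
  const drillIds = new Set(drillHits.map((hit) => hit.id));
  return baseHits.filter((hit) => drillIds.has(hit.id));
}
```

Filtering the base hits, rather than returning the drill-down nodes themselves, is what preserves both the highlight data and the base query's ordering.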

Reproduced + fix verified on :19110: faceted hit went from 0 to 2 highlight
matches; counts unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
joewiz force-pushed the feat/api-search-field-scope branch from 0427632 to df95f00 on June 16, 2026 02:22

joewiz commented Jun 16, 2026

Member Author

Moving to draft. This PR uses ft:fields, added upstream in eXist-db/exist#6459, which isn't merged yet — so it's absent from the existdb/existdb:latest image CI runs against.

joewiz marked this pull request as draft on June 16, 2026 03:25