feat(search): broaden /api/search — field discovery, field-scoped query, facet drill-down (Phase 2)#58
Draft
joewiz wants to merge 8 commits into
…policy

Prototype of the field-discovery half of the broaden-/api/search design: a consumer (e.g. the Oxygen plugin's field picker) can ask "what can I search here?" and "what is this field's contract?" before querying. Two separated layers, per the ES model:

- CATALOG: enumerate every configured field/facet under a collection scope, with its contract (kind, indexed element(s), analyzer, type, returnable), read with privilege. This XQuery collection.xconf parser is a STAND-IN for the native ft:fields($scope) the lucene session will build; configs live under /db/system/config (admin-only) and the schema is system-managed, so the catalog read is privileged and permission-agnostic. (The privilege need is exactly why ft:fields is worth building natively: it reads the resolved LuceneConfig via the broker and skips the system-config read entirely.)
- FLS: a group->fields policy decides which catalog entries THIS caller sees, applied after the privileged read and keyed off $request?user. Field access lives in the policy, never as an ACL on the field (the Elasticsearch lesson). Default: public site-* fields for everyone incl. guest; everything else for authenticated callers; per-field group restrictions supported.

Name-independent of ft:query-scope/ft:search-scope (those names are still in review on eXist-db/exist#6455), so this can proceed now; the field-param query cutover waits for that function to ship.

Validated on the live 3-producer instance: guest sees only public site-* fields; an authenticated caller additionally sees the docs app's non-public index fields (category/definition/function-name/term) plus a seeded secret-notes; dba sees all. site-content's contract correctly dedups to one record listing the 7 elements it is indexed on across apps.

NOT for merge as-is: the privileged read uses system:as-user (prototype); the production form swaps to ft:fields once it lands, and this gains an api.json route + XQSuite tests.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
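The FLS layer described in this commit (public site-* fields for everyone, everything else for authenticated callers, optional per-field group restrictions, dba sees all) can be modelled roughly as below. This is a minimal Python sketch of the policy logic only, not the XQuery in the PR; the `internal-audit` field, the `auditors` group, and all function names are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical policy data; none of these names come from the PR's XQuery module.
PUBLIC_PATTERNS = ["site-*"]                   # visible to everyone, including guest
RESTRICTED = {"internal-audit": {"auditors"}}  # assumed per-field group restriction


def visible(field, user, groups):
    """Return True if this caller may see the named field."""
    if "dba" in groups:
        return True                            # dba sees everything
    if any(fnmatchcase(field, p) for p in PUBLIC_PATTERNS):
        return True                            # public tier
    if user == "guest":
        return False                           # guest gets the public tier only
    allowed = RESTRICTED.get(field)
    # Unrestricted non-public fields are open to any authenticated caller.
    return allowed is None or bool(allowed & set(groups))


def apply_fls(catalog, user, groups):
    """Filter catalog records AFTER the privileged read, as the commit describes."""
    return [rec for rec in catalog if visible(rec["field"], user, groups)]
```

The point of the shape is that the catalog read and the visibility decision stay separate: the catalog never consults the caller, and the policy never touches the index configuration.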
Replace the collection.xconf-parsing + system:as-user stand-in with the native ft:fields($scope) (now in eXist-db/exist#6459). The catalog read is now permission-agnostic and credential-free, exactly as designed.

Two ft:fields behaviours had to be handled in the API layer (worth a core follow-up; see the handoff note):

1. ft:fields does NOT aggregate across collections: it resolves the single config for a given collection/doc-set, so ft:fields("/db/apps") is empty when each app's config lives on its own data collection, and a sequence scope resolves to only the first collection's config. For site-wide discovery we union ft:fields over every descendant collection in scope (fields:descendant-collections). Collapses to a single ft:fields($scope) if it gains native cross-collection aggregation.
2. ft:fields also emits element-level text-index records (a plain <text qname> with no named <field> yields a map with only "element"); those aren't named, field:(...)-queryable fields, so the catalog drops maps without a "field" key.

dedup now surfaces per-field analyzer VARIANCE as an array (a shared field indexed with different analyzers on different elements, e.g. site-content StandardAnalyzer vs WordDelimiter, is reported as both, not hidden).

Validated on a 2-app + bundled-apps bed: cross-collection union works (site-content elements [a,b]); analyzer variance shows both analyzers; FLS differentiates guest (public site-* only) from authenticated (also sees function-name/secret-notes) from dba (all).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
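The union-and-dedup step in this commit can be pictured as follows. It is a hedged Python sketch of the merge logic only: the per-record shape (`field`, `element`, `analyzer` keys) and the function name are assumptions for illustration, not the actual ft:fields output or the PR's XQuery.

```python
def merge_catalog(records):
    """Union per-collection records into one entry per named field.

    Records without a "field" key are dropped, and differing analyzers
    for the same field are kept side by side instead of being collapsed.
    """
    merged = {}
    for rec in records:
        if "field" not in rec:
            continue  # not a named, queryable field
        entry = merged.setdefault(
            rec["field"],
            {"field": rec["field"], "elements": [], "analyzer": []},
        )
        if rec["element"] not in entry["elements"]:
            entry["elements"].append(rec["element"])
        if rec["analyzer"] not in entry["analyzer"]:
            entry["analyzer"].append(rec["analyzer"])
    return list(merged.values())
```

Keeping `analyzer` as a list even when there is a single value is a simplification of this sketch; the commit only states that variance is surfaced as an array.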
eXist-db/exist#6459 (d724759) addressed both integration findings: ft:fields now aggregates across every collection in scope, and every record carries field + kind (field/facet/vector). So fields:catalog collapses to a single ft:fields($scope) call, removing the descendant-collection union walk and the [exists(?field)] filter.

Verified on the 2-app + bundled-apps bed: ft:fields("/db/apps") unions across collections (site-content elements [a,b]), analyzer variance still surfaces as an array, and FLS differentiates guest (public site-* only) / authenticated (+ function-name, secret-notes, and the test-embedding vector field) / dba.

Confirmed for the core session: the previously field-less records were the vector case (e.g. test-embedding, now kind:"vector"), not bare element-text indexes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wire the field-discovery handler (fields:list) into the API:

- modules/api.xq: import the fields module so function-lookup resolves the "fields:list" operationId (same mechanism as search:query).
- modules/api.json: add the GET /api/search/fields path: scope (default /db/apps) + optional field params; documented response envelope (scope/user/total/fields[] with field/kind/elements/analyzer/type/returnable) and an example.
- src/test/cypress/e2e/search-fields.cy.js: self-contained suite (seeds a fixture collection with a public site-content field + a non-public field) asserting the envelope, the per-field contract, the field= filter, and that an authenticated caller sees non-public fields.

Validated at the handler level on the ft:fields bed (fields:list over synthetic roaster $request maps): guest sees public site-* only; authenticated sees the non-public fields too; field= narrows to one; default scope applied.

Depends on ft:fields (eXist-db/exist#6459): the route and the Cypress suite require an eXist that ships ft:fields, so this is branch work until that lands in a release (CI uses the stock image). Not for merge until then.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
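A consumer could sanity-check the documented envelope along these lines. The key names come from the commit message above; that every record carries all six contract keys is an assumption of this sketch, and the helper name is hypothetical.

```python
# Key names as listed in the commit message; requiring all of them on every
# record is an assumption made for this sketch.
ENVELOPE_KEYS = {"scope", "user", "total", "fields"}
RECORD_KEYS = {"field", "kind", "elements", "analyzer", "type", "returnable"}


def missing_keys(envelope):
    """Return a list of human-readable problems; empty means the shape is as expected."""
    problems = []
    for key in sorted(ENVELOPE_KEYS - set(envelope)):
        problems.append(f"envelope: missing {key}")
    for i, rec in enumerate(envelope.get("fields", [])):
        for key in sorted(RECORD_KEYS - set(rec)):
            problems.append(f"fields[{i}]: missing {key}")
    return problems
```

The values a real response carries (for example whether `analyzer` is a string or an array) are not checked here, since the commit does not pin them down.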
…FLS test

End-to-end HTTP testing (existdb-openapi installed on an ft:fields-enabled eXist) surfaced a bug the handler-level test missed: the scope fallback `($request?parameters?scope[. ne ""], $fields:default-scope)` always appended the default, so a provided scope echoed twice (and was passed doubled to ft:fields). Use an if/else so a provided scope (one or more) is used as-is and the default applies only when none is given.

The HTTP test also confirmed the route admits unauthenticated callers (identity resolves to guest), so the guest "public-only" FLS tier is reachable over HTTP. Added a Cypress assertion for it: a guest sees the public site-* fields but not the non-public secret-notes; a dba sees both.

Verified on a full PoC bed (producers snapshot + the ft:fields lucene jar): GET /api/search/fields returns, for the real corpus, site-content unioned across 6 elements with mixed analyzers [StandardAnalyzer, SimpleAnalyzer]; admin sees 12 fields (field/facet/vector), guest sees 7 (public site-* only). All five Cypress scenarios pass against the live route.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
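The scope-fallback bug is easiest to see side by side. The sketch below models the XQuery sequence constructor with Python list concatenation; the function names are illustrative and the default value is the one named in the commits.

```python
DEFAULT_SCOPE = "/db/apps"


def resolve_scope_buggy(scopes):
    # Mirrors (provided-scopes, default): the default is ALWAYS appended,
    # so a caller passing the default scope gets it twice.
    return [s for s in scopes if s != ""] + [DEFAULT_SCOPE]


def resolve_scope(scopes):
    # Mirrors the if/else fix: use what was provided, else fall back.
    provided = [s for s in scopes if s != ""]
    return provided if provided else [DEFAULT_SCOPE]
```

A sequence constructor concatenates rather than selecting the first non-empty operand, which is why the original expression could not act as a fallback.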
…earch

Implements the oxygen field-scoped-search contract (existdb-openapi#55):

- `field` (optional): restrict the query to one named field (a value from GET /api/search/fields) instead of the default shared site-content/site-title query. Built on standard ft:query (a field-qualified query string), so it works on a stock eXist; NOT gated on #6455/#6459. Only discovery needs ft:fields.
- `scope` (optional, repeatable): collection path(s) to search under, recursive; same semantics as /api/search/fields. Defaults to the sitewide /db/apps.
- Field-level security: a field the caller may not see is not queryable (returns 403), enforced by the same policy /api/search/fields uses.
- Response shape unchanged (query/total/offset/limit/facets/results), so the plugin's parser is untouched; the deferred ~10-line plugin wiring can now land.

To avoid a regression, the FLS policy (public/restricted/visible) is extracted to a new field-policy.xqm with NO ft:fields dependency, imported by both search.xqm and fields.xqm. Previously search would have transitively pulled ft:fields via fields.xqm and failed to compile on a stock eXist (XPST0017); verified fixed: /api/search compiles and field-scoped search works on stock beta3 (ft:fields absent).

7 self-contained Cypress tests (field isolation, scope, guest-403, public-200, dba override, stable default); all green. Verified on the trio instance (:19110).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
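The gate-then-qualify flow for the `field` parameter might look like the sketch below. It assumes the Lucene-style `name:(terms)` form hinted at by "field:(...)-queryable" in an earlier commit; the function name, the `can_see` callback, and returning the unqualified query for the default case are all simplifications for illustration.

```python
def field_query(q, field=None, can_see=lambda name: True):
    """Return (status, query_string) for a search request.

    Without `field`, the query is passed through untouched (standing in for
    the default shared query). With `field`, the policy is consulted first.
    """
    if field is None:
        return 200, q
    if not can_see(field):
        return 403, None           # invisible fields are not queryable
    return 200, f"{field}:({q})"   # assumed field-qualified query form
```

Checking visibility before building the query means a caller cannot probe a hidden field's contents through search results, which matches the "same policy as discovery" rule in the commit.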
…st-db#55 / oxygen c)

Adds &facet=<dimension>:<value> to /api/search (repeatable; same dimension -> OR, different dimensions -> AND), implementing the oxygen facet drill-down design (c).

ES post_filter semantics: selecting a facet value narrows the returned HITS but NOT the bucket counts; counts are computed on the base query (q + scope) so they stay stable as the user drills (the "blog (12)" still shows after filtering to docs). Implemented as: one base query for the facets map + ft:score ranking; a second drill-down query only when a facet is selected, to narrow the hits.

- The app/section params become shortcuts for facet=site-app:… / facet=site-section:… (generalized into the one mechanism).
- facet (and scope, also documented repeatable) are declared array-typed in api.json so roaster accepts repetition; the handler unwraps roaster's array(*) to a sequence. Values grouped by dimension with explicit for/where (NOT a ?key-in-predicate, which eXist mis-handles as XPTY0004 for >1 item).

Self-contained Cypress suite (5): bucket counts, drill narrows hits, post_filter count stability, multi-value OR, app-shortcut equivalence. search.cy.js (9) and search-field-scope.cy.js (7) stay green (no regression from moving facet counts onto the base query).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
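The drill-down semantics in this commit (OR within a dimension, AND across dimensions, counts taken from the base query) can be sketched in a few lines of Python. The in-memory hit shape, with one value per dimension, and the function names are assumptions for illustration; the real implementation runs two ft:query calls instead of filtering a list.

```python
from collections import Counter, defaultdict


def parse_facets(params):
    """Group repeated "dimension:value" params by dimension."""
    selected = defaultdict(set)
    for param in params:
        dim, _, value = param.partition(":")
        selected[dim].add(value)
    return dict(selected)


def search_with_facets(base_hits, facet_params):
    """base_hits: list of dicts, each with a "facets" map of dimension -> value."""
    selected = parse_facets(facet_params)

    # Bucket counts come from the BASE hits, so they do not move while drilling.
    counts = defaultdict(Counter)
    for hit in base_hits:
        for dim, value in hit["facets"].items():
            counts[dim][value] += 1

    # Same dimension -> OR (membership in the set); different dimensions -> AND.
    results = [
        hit for hit in base_hits
        if all(hit["facets"].get(dim) in values for dim, values in selected.items())
    ]
    return {"facets": {d: dict(c) for d, c in counts.items()}, "results": results}
```

With this shape, filtering to one app still reports the other apps' bucket counts, which is the post_filter behaviour the Cypress suite asserts.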
… / oxygen c)

The facets-option ft:query (Lucene drill-down) doesn't collect per-match offsets, so its result nodes can't drive ft:highlight-field-matches/KWIC: the faceted path returned a full-body snippet with no <mark> and empty highlights. Intersect the drill set with $base-hits (which carry the match data) by node identity, so the returned hits are the match-bearing nodes narrowed to the facet selection. Preserves post_filter narrowing + the base-query facet counts.

Reproduced + fix verified on :19110: faceted hit went from 0 to 2 highlight matches; counts unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
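The intersection fix can be modelled as below: keep the base-query hits (which carry match data) and use the drill-down result only as a membership filter. This is a Python sketch with dictionaries standing in for nodes; the `id` key plays the role of node identity and is an assumption of the example.

```python
def narrow_to_drill(base_hits, drill_hits, node_id=lambda node: node["id"]):
    """Return the base hits that also appear in the drill-down set.

    The returned objects are the BASE hits, so any match data attached to
    them survives; base-query order (ranking) is preserved as well.
    """
    drill_ids = {node_id(node) for node in drill_hits}
    return [node for node in base_hits if node_id(node) in drill_ids]
```

Returning the drill-down nodes directly would give the same set of documents but lose the per-match offsets, which is the regression this commit describes.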
0427632 to
df95f00
Moving to draft. This PR uses |
[This PR was co-authored with Claude Code. -Joe]
Summary
Phase-2 broadening of `/api/search` beyond the fixed-field keyword query. Three capabilities move it toward an Elasticsearch-shaped search surface:

- `GET /api/search/fields[?scope=…][&field=…]` lists the searchable fields/facets/vectors configured under a scope, each with its contract (`kind`, `elements`, `analyzer`, `type`, `returnable`), filtered by field-level security. Backed by the native `ft:fields`.
- `&field=<name>&scope=<path>` on `/api/search`: restrict a query to one discovered field, under one or more collections. Standard `ft:query`, FLS-gated (a field the caller can't see → 403). Works on a stock eXist.
- `&facet=<dim>:<value>` (repeatable; same dim → OR, different → AND), with ES `post_filter` semantics: selecting a value narrows the returned hits but the bucket counts stay stable (they reflect the base query). `app`/`section` become shortcuts for `facet=site-app:…`/`site-section:…`. Works on a stock eXist.

The FLS policy (public/restricted/visible) lives in its own `field-policy.xqm` with no `ft:fields` dependency, imported by both `search.xqm` and `fields.xqm`, so the field-scoped query and facet drill-down compile and run on a stock eXist even though discovery needs `ft:fields`.

The discovery endpoint's `fields.xqm` calls `ft:fields` (eXist-db/exist#6459), and `api.xq` imports `fields.xqm`, so until #6459 is in the eXist release this package targets, the app would fail to compile (XPST0017) on a stock eXist. Merge after #6459 ships. (If we want discovery to ship sooner and degrade gracefully on a stock eXist, I can make the `ft:fields` call dynamic via `function-lookup`, so the module compiles everywhere and the route returns a clean error when the function is absent, de-gating this PR. Say the word.)

Testing
Validated on the trio integration instance (`:19110`, an `ft:fields`-enabled eXist + the real corpus). Cypress: `search.cy.js` (9), `search-fields.cy.js` (discovery + FLS), `search-field-scope.cy.js` (7: field isolation, scope, guest-403, dba override), `search-facet.cy.js` (5: buckets, drill, post_filter count stability, multi-value OR, app-shortcut). All green; the field-scoped + facet paths additionally verified on a stock beta3 (no `ft:fields`).

Notes
vector:embed+ft:query-field-vector) is confirmed present; it needs an embedding corpus to exercise.
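For a consumer such as the plugin's field picker, building a request against the parameters described in the Summary might look like this. It is a hedged Python sketch: the base URL is a placeholder, the `q` parameter name is inferred from "q + scope" in the commit messages, and the helper name is hypothetical.

```python
from urllib.parse import urlencode


def search_url(base, q, field=None, scopes=(), facets=()):
    """Build an /api/search URL with repeatable scope and facet parameters."""
    params = [("q", q)]                         # "q" is an assumed parameter name
    if field is not None:
        params.append(("field", field))
    params.extend(("scope", s) for s in scopes)
    params.extend(("facet", f) for f in facets)  # each is "dimension:value"
    return f"{base}/api/search?{urlencode(params)}"


url = search_url(
    "http://localhost:8080/exist/apps/example",  # placeholder base URL
    "lucene",
    field="site-title",
    scopes=["/db/apps/docs"],
    facets=["site-app:docs", "site-app:blog"],
)
```

Passing the parameters as a list of pairs, rather than a dict, is what lets `scope` and `facet` repeat in the query string.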