[feature] ft:fields: introspect configured Lucene fields/facets in a scope by joewiz · Pull Request #6459 · eXist-db/exist

joewiz · 2026-06-09T04:41:31Z

[This PR was co-authored with Claude Code. -Joe]

Stacked on #6455. This branch is based on feature/lucene-search-index (#6455), because ft:fields reuses that PR's LuceneScope scope-resolution helper. Until #6455 merges, the diff here includes its commits; the ft:fields-specific changes are the final two commits. Once #6455 lands, this rebases onto develop to a clean ft:fields-only diff. Names provisional — happy to rename with the rest of the family.

Summary

Adds ft:fields($scope) as map(*)* — the schema-discovery companion to ft:query-scope/ft:search-scope. Where they search an index scope, this describes it: the configured fields, facets, and vector fields and each one's contract. It's the engine for existdb-openapi's GET /api/search/fields (a _mapping-style "what can I search here, and what is each field's contract?" endpoint) and the field-level-security layer built on top of it.

ft:fields("/db/apps") (: => one map per configured field/facet/vector occurrence :)

map { 
    "field": "site-content", 
    "kind": "field", 
    "element": "xqdoc:function",
    "analyzer": "org.apache.lucene.analysis.core.SimpleAnalyzer",
    "type": "xs:string", 
    "returnable": true() 
}
map { 
    "field": "site-app", 
    "kind": "facet", 
    "element": "topic" 
}
map { 
    "field": "site-embedding", 
    "kind": "vector", 
    "element": "doc",
    "dimension": 384, 
    "similarity": "cosine", 
    "model": "all-MiniLM-L6-v2" 
} …

Motivation / why native (not xconf parsing)

Field names can be scraped from collection.xconf, but the contract cannot. Which analyzer a field uses, its type, whether it is stored/returnable, which element each field is indexed on, and (for vector fields) its dimension, similarity metric, and embedding model all live in the resolved LuceneConfig, which a parser would have to reconstruct (config inheritance across nested collections, analyzer-id resolution, merged qname/wildcard/named indexes). existdb-openapi prototyped this Phase-2 layer with an xconf-parser stand-in and hit exactly those walls; ft:fields reads the resolved config via the broker instead. The requirements below come from that prototype against a real 3-producer corpus.

What it returns, per requirement

Permission-agnostic, callable by any user (R1). It reads the resolved LuceneConfig via the broker, not /db/system/config (admin-only), so a guest call returns the full catalog. existdb-openapi then applies field-level security (group→allowed-fields) on top. The stand-in had to wrap config reads in system:as-user("admin", …); this removes that.
Distinguishes field vs facet vs vector (R2): kind: "field" | "facet" | "vector" (the same name is often both field and facet — e.g. site-app is declared as both a <facet dimension> and a <field name>).
Resolved analyzer, per (field × element) (R3): the analyzer is reported as the concrete class, resolving an @analyzer="simple" id to its class and unwrapping eXist's per-field MetaAnalyzer to the actual default/per-field analyzer. A shared field can be indexed with different analyzers on different elements (StandardAnalyzer vs SimpleAnalyzer), and that variance is preserved.
Resolved type + returnable (R4): XDM type (default xs:string) and the stored/returnable flag (default true), with defaults resolved.
Element each field/facet is indexed on (R5).
Scope = collection path(s), recursive, inheritance-aware (R6): the same $scope model and resolution (getLuceneConfig(broker, docs)) as ft:search-scope.
One record per (field × element) occurrence (R7): no pre-dedup, so the per-element analyzer/type variance is visible; the API collapses to a field-level view.
Vector-field metadata (R8): a <vector-field> is reported with kind: "vector" and three extra keys — dimension (xs:integer), similarity ("cosine" | "euclidean" | "dot_product"), and model (the embedding model id, present only when the field embeds text via embedding="local"/"http"; absent for a raw-vector field). This lets a discovery-driven vector-search endpoint resolve the embedding model from ft:fields alone: the client sends {text, field}, the server reads the field's model, embeds the query with it, and runs KNN — no client-side model knowledge. (Driven by existdb-openapi#62.)
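To make R7 concrete, here is a minimal Java sketch of the collapse a consumer might perform over the per-occurrence records. It is a standalone model, not code from this PR: the class FieldCatalogSketch, its FieldView type, and the string-only record maps are hypothetical, and only the field, kind, element, and analyzer keys come from the description above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical consumer-side helper; not part of eXist-db or this PR. */
final class FieldCatalogSketch {

    /** Field-level view: one entry per (kind, field), keeping per-element variance. */
    static final class FieldView {
        final String field;
        final String kind;
        final Set<String> elements = new LinkedHashSet<>();
        final Set<String> analyzers = new LinkedHashSet<>();

        FieldView(String field, String kind) {
            this.field = field;
            this.kind = kind;
        }
    }

    /** Collapses (field x element) occurrence records into field-level views. */
    static List<FieldView> collapse(List<Map<String, String>> occurrences) {
        // LinkedHashMap keeps first-seen order, so the output is deterministic.
        Map<String, FieldView> byKey = new LinkedHashMap<>();
        for (Map<String, String> occ : occurrences) {
            String field = occ.get("field");
            String kind = occ.get("kind");
            // The same name can be both a field and a facet, so kind is part of the key.
            FieldView view = byKey.computeIfAbsent(kind + ":" + field, k -> new FieldView(field, kind));
            view.elements.add(occ.get("element"));
            String analyzer = occ.get("analyzer");
            if (analyzer != null) { // facet records carry no analyzer in the example above
                view.analyzers.add(analyzer);
            }
        }
        return new ArrayList<>(byKey.values());
    }

    public static void main(String[] args) {
        List<FieldView> views = collapse(List.of(
                Map.of("field", "site-content", "kind", "field", "element", "a", "analyzer", "StandardAnalyzer"),
                Map.of("field", "site-content", "kind", "field", "element", "b", "analyzer", "SimpleAnalyzer"),
                Map.of("field", "site-app", "kind", "facet", "element", "topic")));
        for (FieldView v : views) {
            System.out.println(v.kind + " " + v.field + " " + v.elements + " " + v.analyzers);
        }
    }
}
```

Keeping the analyzers as a set rather than a single value is one way to surface the analyzer variance instead of hiding it when a shared field is indexed differently on different elements.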

What changed (`extensions/indexes/lucene`)

Fields.java (ft:fields) — walks the resolved LuceneConfig and emits a map per field/facet/vector field, reusing LuceneScope.resolveScope for the scope→documents→config bridge. The vector branch adds dimension/similarity/model.
LuceneVectorFieldConfig.getModelId() — new getter exposing the embedding model id (the dimension/similarity getters already existed); returns null for a raw-vector field.
LuceneConfig.getAllIndexConfigurations() — enumerates all index heads (qname paths + wildcardPaths + namedIndexes); the existing getIndexConfigurations() returned only qname paths.
LuceneFieldConfig — getType() / isStore() / isBinary() getters (the fields were protected).
MetaAnalyzer.getConfiguredAnalyzer(field) — exposes the concrete analyzer behind the per-field wrapper for introspection (no behavior change to indexing/search).
LuceneModule registers the signature; tests in ft-fields.xqm + the LuceneQueryScopeTests runner.

Test plan

22 ft:fields XQSuite tests: map shape; field/facet counts; type and returnable (incl. store="no"); the resolved analyzer class for a default-analyzer field (StandardAnalyzer) and an analyzer-id field (SimpleAnalyzer); facet entry shape; unknown scope → empty.
Vector field: kind="vector" count; dimension is an xs:integer (384); similarity (cosine); model (all-MiniLM-L6-v2). The vector field is configured on an element absent from the test doc, so the XQSuite exercises pure config introspection — no vector extension needed at runtime.
R1 permission-agnostic: a guest (system:as-user) gets the full catalog (config lives under admin-only /db/system/config, but ft:fields reads the resolved config via the broker).
62 tests across the query-scope suite, all green.
Codacy/PMD clean on the new/changed files.

Related

[feature] Index-first Lucene search: ft:query-scope (live nodes) and ft:search-scope (ES-shaped map) #6455: the parent ft:query-scope/ft:search-scope PR this stacks on.
existdb-openapi field-discovery + FLS design and the feat/api-search-fields-discovery prototype this is the drop-in engine for; existdb-openapi#62 (the vector-search endpoint) drives the vector-field metadata keys.

ft:search-index($scope, $query, $options?) queries the Lucene index directly over the documents in $scope and returns ALL matching nodes -- of any indexed element type -- with their Lucene scores and match highlighting attached, exactly as ft:query results carry them.

Unlike ft:query it does not evaluate relative to an XPath context node set, so:
- relevance is correct for every hit regardless of how deeply the matched element is nested (it avoids the //* descendant-wildcard ft:score-loss artifact by never using an XPath node set as the query unit), and
- it is element-name independent -- no need to enumerate or union the contributing element types, so content producers stay decoupled from the search aggregator.

The result is an ordinary node set, so ft:score, ft:facets, ft:field and ft:highlight-field-matches compose on it as usual. This is the focused native primitive underpinning the field-first ("eXlasticSearch") search design; the ES _search-style result map (hits/fields/facets/highlights/live-node) is assembled in XQuery on top of this node set.

Implementation reuses the existing scored XML-field search path: it builds a DocumentSet from the scope collections and calls LuceneIndexWorker.query(...) with a null contextSet (index-first, no descendant-of constraint) and null qnames (all defined indexes).

Tests (ft-search-index.xqm): searchable content in NESTED elements (para/caption) -- the case where //* loses ft:score -- proving search-index finds them across element types, scores each > 0, is name-independent, composes with ft:facets/ft:field/ft:score/ft:highlight-field-matches, returns live nodes, sorts by score, and matches all on an empty query.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Address review feedback on the ft:search-index draft:
- Add the missing LGPL license header to LuceneSearchIndexTests.java so the org CI RAT/license check passes (the sibling LuceneAnalyzersTests has it).
- Cover the 3-argument $options form, which was advertised but untested: facet drill-down (OPTION_FACETS, restricting "content:(array)" hits to the para vs caption facet value) and default-operator (flipping eXist's AND default to OR widens "array map" from 2 hits to 3, proving the options arg passes through). A 2-arg control documents the AND default.
- Comment SearchIndex.eval to explain that options is positionally the 3rd argument and parseOptions short-circuits to defaults when argCount < 3, so the 2-arg form never dereferences a missing argument.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… companion

Rename the index-first live-node function ft:search-index -> ft:query-scope (class SearchIndex -> QueryScope, with the test module/runner renamed to match). The name places it in the ft:query family it actually belongs to: same LuceneIndexWorker.query() path, live nodes, composes with ft:score/ft:field/ft:facets/ft:highlight-field-matches. "search-index" misread from an Elasticsearch mindset, where "index" is the corpus, not part of the verb.

Add ft-search-scope-map.xqm: the executable spec for an ES _search-shaped, map-returning companion (proposed native ft:search-scope), assembled in XQuery over ft:query-scope. It returns total/max-score/hits[]/facets, where each hit carries uri, node-id, score, a "source" map (requested stored fields), and an optional "highlight" snippet. Hit granularity defaults to the indexed element (honest to the index); a collapse option gives the ES-faithful one-hit-per-document view (group by document URI, best-scoring element), modeling the element-vs-document count discrepancy seen in /api/search.

10 tests pin the shape and both granularities; 23 tests total across the query-scope suite, all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…panion

Replace the XQuery reference module with a native ft:search-scope function, so the ES _search-shaped, map-returning companion lives in the ft: namespace alongside ft:query-scope. It returns map { total, max-score, hits[], facets }, where each hit carries uri, node-id, score, and a "source" map of requested stored fields.

The $options map shapes the result: fields, facets (dimensions to aggregate), collapse, limit. Hit granularity defaults to the indexed element; collapse=true() groups to one-hit-per-document (best-scoring element, total = distinct documents), modeling the element-vs-document count discrepancy.

Score is summed from the node's Lucene matches (as ft:score does); fields come from the worker's stored-field lookup; facets from each match's FacetsCollector, merged across queries (as ft:facets does). Highlighting and a stored-fields-only fast path (no node materialization) are noted follow-ups.

Factor the shared scope-resolution and index-first query execution out of QueryScope into LuceneScope, used by both functions.

26 XQSuite tests across the suite (13 query-scope + 13 search-scope), all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
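The grouping rule that the commit message above gives for collapse=true() (one hit per document, keeping the best-scoring element, with total equal to the number of distinct documents) can be modelled outside eXist roughly as follows. This is a hedged sketch only: CollapseSketch and Hit are hypothetical names, and the native function works on Lucene matches rather than on a plain list.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of the collapse option; not the ft:search-scope implementation. */
final class CollapseSketch {

    /** Minimal stand-in for a hit: document URI, node id, and score. */
    static final class Hit {
        final String uri;
        final String nodeId;
        final double score;

        Hit(String uri, String nodeId, double score) {
            this.uri = uri;
            this.nodeId = nodeId;
            this.score = score;
        }
    }

    /** Keeps the best-scoring element hit per document URI, ranked by descending score. */
    static List<Hit> collapseByDocument(List<Hit> elementHits) {
        Map<String, Hit> best = new LinkedHashMap<>();
        for (Hit hit : elementHits) {
            // On a URI collision keep whichever hit scores higher (ties keep the earlier one).
            best.merge(hit.uri, hit, (kept, candidate) -> candidate.score > kept.score ? candidate : kept);
        }
        List<Hit> collapsed = new ArrayList<>(best.values());
        collapsed.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return collapsed;
    }

    public static void main(String[] args) {
        List<Hit> collapsed = collapseByDocument(List.of(
                new Hit("/db/a.xml", "1.2", 0.4),
                new Hit("/db/b.xml", "1.1", 0.9),
                new Hit("/db/a.xml", "1.5", 0.7)));
        // In this model, total is the number of distinct documents.
        System.out.println("total = " + collapsed.size());
        for (Hit h : collapsed) {
            System.out.println(h.uri + " " + h.nodeId + " " + h.score);
        }
    }
}
```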

Address the two blockers from the existdb-openapi trial against a real corpus (223 docs / 2637 indexed function elements):
- "highlight" option (xs:string*): adds a per-hit "highlight" map whose values are the exist:field/exist:match nodes produced by the existing ft:highlight-field-matches engine. ft:search-scope already materializes the live node internally, so it highlights before detaching to the map. Field.highlightMatches is made package-private static for reuse.
- "offset" option (alias "from"): pages the ranked hits as ranked[offset, offset+limit). limit alone capped only the first page; total still reports the full count, so APIs can page past page 1.

Naming and element-default granularity were confirmed by the same trial. The stored-fields-only fast path (the map form is currently the slowest of the three options) remains the documented follow-up.

31 XQSuite tests across the suite (13 query-scope + 18 search-scope), all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
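The half-open window ranked[offset, offset+limit) with an unpaged total, as described for the "offset" option, might look like this in isolation. The sketch assumes nothing about the real option parsing: PagingSketch and Page are hypothetical names, and clamping out-of-range values is a choice made here for illustration, not behaviour confirmed by the commit.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of offset/limit paging; not the ft:search-scope implementation. */
final class PagingSketch {

    /** One page of hits plus the full (unpaged) hit count. */
    static final class Page<T> {
        final int total;
        final List<T> hits;

        Page(int total, List<T> hits) {
            this.total = total;
            this.hits = hits;
        }
    }

    /** Returns ranked[offset, offset+limit), clamped to the list bounds. */
    static <T> Page<T> page(List<T> ranked, int offset, int limit) {
        int from = Math.min(Math.max(offset, 0), ranked.size());
        // Widen to long so a very large limit cannot overflow the end index.
        int to = (int) Math.min((long) from + Math.max(limit, 0), ranked.size());
        // total is the full count, so a caller can compute how many pages remain.
        return new Page<>(ranked.size(), new ArrayList<>(ranked.subList(from, to)));
    }

    public static void main(String[] args) {
        Page<String> second = page(List.of("a", "b", "c", "d", "e"), 2, 2);
        System.out.println(second.hits + " of " + second.total);
    }
}
```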

…tion

Thread a facet drill-down into the search so callers can restrict by a facet value (e.g. a "section" within an app). The "filter" option is a map { dimension: value(s) } that becomes a Lucene DrillDownQuery on the search -- the ES post-filter analog. This must live in the query rather than be applied caller-side: filtering here keeps total/limit/paging consistent, which post-hoc filtering of the hit list cannot.

"filter" restricts the query; the other options (fields/highlight/facets/collapse/offset/limit) still shape the result. Facet aggregation continues to run over the (now filtered) match set. Other Lucene query options (default-operator, ...) are not yet threaded -- a follow-up.

Tests: drill-down restricts total and the hits array (kind=para drops the caption hit, 3 -> 2; kind=caption keeps 1). 34 XQSuite tests across the suite.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
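The argument that the filter has to run inside the query, not on the returned hit list, can be illustrated with a toy model. This sketch does not use Lucene or DrillDownQuery; FilterOrderSketch and the string-encoded hits are hypothetical and only contrast the two orderings of filter and limit.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Hypothetical toy model comparing filter-in-query with caller-side filtering. */
final class FilterOrderSketch {

    /** Filter first, then cut the page: the page is full whenever enough hits match. */
    static <T> List<T> filterThenPage(List<T> ranked, Predicate<T> filter, int limit) {
        return ranked.stream().filter(filter).limit(limit).collect(Collectors.toList());
    }

    /** Cut the page first, then filter caller-side: the page can come back short. */
    static <T> List<T> pageThenFilter(List<T> ranked, Predicate<T> filter, int limit) {
        return ranked.stream().limit(limit).filter(filter).collect(Collectors.toList());
    }

    /** The total a filtered query reports; an unfiltered query would report ranked.size(). */
    static <T> long filteredTotal(List<T> ranked, Predicate<T> filter) {
        return ranked.stream().filter(filter).count();
    }

    public static void main(String[] args) {
        List<String> ranked = List.of("para:1", "caption:1", "para:2");
        Predicate<String> isPara = hit -> hit.startsWith("para");
        System.out.println(filterThenPage(ranked, isPara, 2));
        System.out.println(pageThenFilter(ranked, isPara, 2));
        System.out.println(filteredTotal(ranked, isPara));
    }
}
```

With a limit of 2, the caller-side variant drops the caption hit after the page was already cut, so it returns a short page while the unfiltered total still says 3.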

…issions

Pin the document-level security guarantee that existdb-openapi's field-permission model relies on: neither scope function may return nodes or hits from documents the caller cannot read. They resolve scope through broker.allDocs(...) and materialize hits as persistent nodes through the broker, both of which enforce read permissions -- the same guarantee any collection()//x query honors.

scope-dls.xqm stores a public doc (world-readable) and a secret doc (rw-------) both matching a shared term, then queries as guest vs admin via system:as-user:
- a guest gets only the public hit (count 1, total 1), admin gets both (2);
- a term indexed only in the secret doc is unreachable to the guest (0) but visible to admin (1);
- the guest's single search-scope hit is always the public document.

Mirrors the visibility checks ft-search-binary.xqm makes for legacy ft:search.

43 XQSuite tests across the suite, all green -- DLS confirmed, not assumed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…scope

Add ft:fields($scope) as map(*)* -- the schema-discovery companion to ft:query-scope/ft:search-scope (built for existdb-openapi's /api/search/fields field-discovery + field-level-security layer). It returns one map per configured field or facet occurrence: { field, element, kind: "field"|"facet", analyzer, type, returnable }.

It is a thin wrapper over the resolved LuceneConfig (via getLuceneConfig over the scope's documents), reusing LuceneScope.resolveScope, so it handles config inheritance, analyzer-id resolution, and merged qname/wildcard/named indexes that a collection.xconf parser cannot.

Per the consumer requirements:
- permission-agnostic: reads the resolved config via the broker, so a non-dba caller gets the full catalog (the API applies field-level security on top) -- the config lives under admin-only /db/system/config, which is why a parser stand-in could not do this;
- distinguishes field vs facet (kind);
- reports the RESOLVED analyzer class per (field x element): a field's analyzer-id is resolved to its class, and the default/index analyzer is unwrapped from eXist's per-field MetaAnalyzer wrapper to the concrete class (a shared field can carry different analyzers on different elements);
- emits one record per occurrence (no pre-dedup), so per-element analyzer/type variance is preserved for the caller to collapse.

Adds getType()/isStore()/isBinary() getters to LuceneFieldConfig, LuceneConfig.getAllIndexConfigurations() (qname + wildcard + named heads), and MetaAnalyzer.getConfiguredAnalyzer() to expose the concrete per-field analyzer.

Stacked on the ft:query-scope/ft:search-scope branch (reuses LuceneScope). 11 ft-fields XQSuite tests; 54 across the query-scope suite, all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ry record

Address two findings from the existdb-openapi Phase 2 integration:

1. Aggregate across collections. LuceneIndexWorker.getLuceneConfig returns only the first collection's config it finds, so ft:fields($scope) over a scope spanning several producer collections (each with its config on its own data collection) returned just one collection's fields -- and a parent scope with no config of its own returned nothing. Iterate every distinct collection in the resolved document set and union their configs, so ft:fields($scope) discovers the full field set the way ft:search-scope aggregates documents. Removes the API-side per-collection union workaround.

2. Self-distinguish non-field/facet records. Records for entries that are neither a named <field> nor a <facet> (e.g. vector fields) previously carried only an "element" key, forcing the consumer to special-case a missing "field". Now every emitted map carries field + kind: vector fields report kind="vector"; any other configured entry reports kind="index" keyed by the element name. (A plain <text qname="..."/> with no named field/facet contributes no record -- it indexes element content but exposes no named, queryable field.)

14 ft-fields tests incl. cross-collection union + every-map-has-field-and-kind; 57 across the query-scope suite, all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…arity

ft:fields($scope) described a vector field with only {field, element, kind: "vector"}, so a discovery-driven client could see that a field embeds vectors but not how to embed a query against it. The openapi vector-search endpoint (existdb-openapi#62) needs the field's model id, dimension, and similarity metric so the client can send only text + field and have the server resolve the model.

Add three keys to a vector field's record:
- "dimension": xs:integer -- the configured vector dimension
- "similarity": xs:string -- "cosine" | "euclidean" | "dot_product"
- "model": xs:string -- the embedding model id, present only when the field embeds text (embedding="local" or "http"); absent for a raw-vector field

Adds a getModelId() getter to LuceneVectorFieldConfig (dimension and similarity getters already existed). The ft-fields.xqm XQSuite gains a vector-field fixture and assertions for the three keys; the field is placed on an element absent from the test doc so reindex never invokes the embedding provider (which lives in the separate vector extension).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
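The server-side lookup this enables (the client sends text plus a field name, and the server finds the embedding model from the catalog) could be sketched as below. This is an illustrative model, not the existdb-openapi handler: ModelLookupSketch and modelFor are hypothetical names, and the records are plain Java maps standing in for the maps ft:fields is described to return.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical consumer-side lookup over ft:fields-style records. */
final class ModelLookupSketch {

    /**
     * Finds the embedding model id for a vector field.
     * Returns empty when the field is unknown, is not a vector field,
     * or is a raw-vector field whose record carries no "model" key.
     */
    static Optional<String> modelFor(List<Map<String, Object>> records, String field) {
        return records.stream()
                .filter(r -> "vector".equals(r.get("kind")) && field.equals(r.get("field")))
                .map(r -> r.get("model"))
                .filter(m -> m instanceof String) // also drops the null of a missing key
                .map(m -> (String) m)
                .findFirst();
    }

    public static void main(String[] args) {
        List<Map<String, Object>> records = List.of(
                Map.<String, Object>of("field", "site-embedding", "kind", "vector", "element", "doc",
                        "dimension", 384, "similarity", "cosine", "model", "all-MiniLM-L6-v2"),
                // Hypothetical raw-vector field: no "model" key.
                Map.<String, Object>of("field", "raw-vec", "kind", "vector", "element", "doc",
                        "dimension", 384, "similarity", "cosine"));
        System.out.println(modelFor(records, "site-embedding"));
        System.out.println(modelFor(records, "raw-vec"));
    }
}
```

An empty result for a raw-vector field is the signal that the server cannot embed the query text itself and the caller has to supply a vector some other way.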

Replace the collection.xconf-parsing + system:as-user stand-in with the native ft:fields($scope) (now in eXist-db/exist#6459). The catalog read is now permission-agnostic and credential-free, exactly as designed.

Two ft:fields behaviours had to be handled in the API layer (worth a core follow-up -- see the handoff note):

1. ft:fields does NOT aggregate across collections: it resolves the single config for a given collection/doc-set, so ft:fields("/db/apps") is empty when each app's config lives on its own data collection, and a sequence scope resolves to only the first collection's config. For site-wide discovery we union ft:fields over every descendant collection in scope (fields:descendant-collections). Collapses to a single ft:fields($scope) if it gains native cross-collection aggregation.

2. ft:fields also emits element-level text-index records (a plain <text qname> with no named <field> yields a map with only "element"); those aren't named, field:(...)-queryable fields, so the catalog drops maps without a "field" key.

dedup now surfaces per-field analyzer VARIANCE as an array (a shared field indexed with different analyzers on different elements -- e.g. site-content StandardAnalyzer vs WordDelimiter -- is reported as both, not hidden).

Validated on a 2-app + bundled-apps bed: cross-collection union works (site-content elements [a,b]); analyzer variance shows both analyzers; FLS differentiates guest (public site-* only) from authenticated (also sees function-name/secret-notes) from dba (all).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

eXist-db/exist#6459 (d724759) addressed both integration findings: ft:fields now aggregates across every collection in scope, and every record carries field + kind (field/facet/vector). So fields:catalog collapses to a single ft:fields($scope) call, removing the descendant-collection union walk and the [exists(?field)] filter.

Verified on the 2-app + bundled-apps bed: ft:fields("/db/apps") unions across collections (site-content elements [a,b]), analyzer variance still surfaces as an array, and FLS differentiates guest (public site-* only) / authenticated (+ function-name, secret-notes, and the test-embedding vector field) / dba.

Confirmed for the core session: the previously field-less records were the vector case (e.g. test-embedding, now kind:"vector"), not bare element-text indexes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Wire the field-discovery handler (fields:list) into the API:
- modules/api.xq: import the fields module so function-lookup resolves the "fields:list" operationId (same mechanism as search:query).
- modules/api.json: add the GET /api/search/fields path: scope (default /db/apps) + optional field params; documented response envelope (scope/user/total/fields[] with field/kind/elements/analyzer/type/returnable) and an example.
- src/test/cypress/e2e/search-fields.cy.js: self-contained suite (seeds a fixture collection with a public site-content field + a non-public field) asserting the envelope, the per-field contract, the field= filter, and that an authenticated caller sees non-public fields.

Validated at the handler level on the ft:fields bed (fields:list over synthetic roaster $request maps): guest sees public site-* only; authenticated sees the non-public fields too; field= narrows to one; default scope applied.

Depends on ft:fields (eXist-db/exist#6459): the route and the Cypress suite require an eXist that ships ft:fields, so this is branch work until that lands in a release (CI uses the stock image). Not for merge until then.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

joewiz and others added 8 commits June 7, 2026 03:31

joewiz requested a review from a team as a code owner June 9, 2026 04:41

joewiz added the enhancement new features, suggestions, etc. label Jun 9, 2026

duncdrum added the Lucene issue is related to Lucene or its integration label Jun 9, 2026

This was referenced Jun 10, 2026

[feature] Search: discover searchable fields via /api/search/fields (Search-in picker) joewiz/existdb-oxygen-plugin#37

Merged

feat(search): broaden /api/search — field discovery, field-scoped query, facet drill-down (Phase 2) eXist-db/existdb-openapi#58

Draft

joewiz mentioned this pull request Jun 12, 2026

feat(search): vector-similarity ("Similar to…") search on /api/search eXist-db/existdb-openapi#60

Draft

joewiz wants to merge 10 commits into eXist-db:develop from joewiz:feature/ft-fields


2 participants
