[bugfix] Preserve ft:highlight-field-matches under facet drill-down by joewiz · Pull Request #6454 · eXist-db/exist

joewiz · 2026-06-08T14:43:04Z

[This PR was co-authored with Claude Code. -Joe]

Summary

A faceted ft:query (one with a facets drill-down option) silently returns empty highlights: ft:highlight-field-matches finds nothing, so any KWIC snippet built on it falls back to raw leading text. The same query without a facet filter highlights correctly. This is a pre-existing defect (reproduces on develop and on 7.0.0-beta3), surfaced while wiring an /api/search-style endpoint where ?q=term highlights but ?q=term&facet=value does not.

Root cause

In LuceneIndexWorker.query(), when a facet drill-down is requested the content query is wrapped in a Lucene DrillDownQuery before searching. That same wrapped query is then handed to the hit collector and stored on every LuceneMatch. ft:highlight-field-matches walks the stored match query — getTerms(match.getQuery()) → LuceneUtil.extractTerms(...) — to recover the per-field terms to mark. extractTerms cannot see into a DrillDownQuery, so it recovers zero terms and highlights nothing.

The give-away is that filterByIndexType also wraps the query (in a plain BooleanQuery of MUST + FILTER) and is applied to every query, yet unfaceted queries still highlight — extractTerms traverses a BooleanQuery fine. The opaque wrapper is specifically DrillDownQuery.
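The asymmetry described above can be reproduced in a toy model. This is a minimal sketch, not the eXist-db or Lucene code: the `Query` types and `extractTerms` below are hypothetical stand-ins that only imitate the shape of the real classes, assuming an extractor that handles the query types it knows and ignores everything else.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy stand-ins for Lucene query types; none of these are the real Lucene or eXist-db classes.
final class OpaqueWrapperSketch {

    interface Query {}

    // A single field:text term, the thing highlighting needs to recover.
    static final class TermQuery implements Query {
        final String field;
        final String text;
        TermQuery(String field, String text) { this.field = field; this.text = text; }
    }

    // A composite whose clauses are visible to a traversal.
    static final class BooleanQuery implements Query {
        final List<Query> clauses;
        BooleanQuery(Query... clauses) { this.clauses = Arrays.asList(clauses); }
    }

    // A wrapper the traversal below has no case for, modelling the opaque drill-down wrapper.
    static final class DrillDownQuery implements Query {
        final Query base;
        final String dimension;
        final String value;
        DrillDownQuery(Query base, String dimension, String value) {
            this.base = base;
            this.dimension = dimension;
            this.value = value;
        }
    }

    // Hypothetical extractor: walks only the query types it knows about.
    static List<String> extractTerms(Query query) {
        List<String> terms = new ArrayList<>();
        collect(query, terms);
        return terms;
    }

    private static void collect(Query query, List<String> terms) {
        if (query instanceof TermQuery) {
            TermQuery t = (TermQuery) query;
            terms.add(t.field + ":" + t.text);
        } else if (query instanceof BooleanQuery) {
            for (Query clause : ((BooleanQuery) query).clauses) {
                collect(clause, terms);
            }
        }
        // Any other type, including DrillDownQuery, contributes no terms.
    }

    public static void main(String[] args) {
        Query content = new TermQuery("title", "whale");
        Query filtered = new BooleanQuery(content);                    // like the index-type filter wrapper
        Query faceted = new DrillDownQuery(filtered, "kind", "para");  // like the facet drill-down wrapper

        System.out.println(extractTerms(filtered)); // prints [title:whale]
        System.out.println(extractTerms(faceted));  // prints []
    }
}
```

The point of the sketch is only that one extra layer of wrapping by a type the traversal does not understand is enough to lose every term, while a wrapper it does understand is harmless.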

Fix

Decouple the search query from the match query:

search with the DrillDownQuery (so drill-down filtering and facet counts are unchanged);
store the pre-drill-down query (still index-type-filtered and boosted) on the LuceneMatch, so term/highlight extraction sees a query it can traverse.

searchAndProcess gains a two-query overload (…, searchQuery, matchQuery, config); the existing single-query form delegates with the same query for both, so every non-faceted path is byte-for-byte unchanged. The drill-down/boost wrapping is reorganized around a small applyBoost helper. Only the query consumed by highlighting changes — hits, scores, and facet counts are unaffected.
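The delegation pattern can be sketched in isolation. This is a hypothetical miniature, not the actual `LuceneIndexWorker` code: queries are plain strings, the real parameter list is omitted, and the `Match` class exists only to record which query played which role.

```java
// Hypothetical miniature of the two-query overload; the method name mirrors the PR text,
// but the signatures and bodies are invented for illustration.
final class TwoQuerySketch {

    // Records which query was searched with and which was stored on the match.
    static final class Match {
        final String searchedWith;
        final String storedQuery;
        Match(String searchedWith, String storedQuery) {
            this.searchedWith = searchedWith;
            this.storedQuery = storedQuery;
        }
    }

    // Single-query form: delegates with the same query for both roles,
    // so callers that never drill down see no change.
    static Match searchAndProcess(String query) {
        return searchAndProcess(query, query);
    }

    // Two-query form: search with one query, store the other for term extraction.
    static Match searchAndProcess(String searchQuery, String matchQuery) {
        return new Match(searchQuery, matchQuery);
    }

    public static void main(String[] args) {
        String content = "title:whale";
        String drillDown = "DrillDown(" + content + ", kind=para)";

        Match plain = searchAndProcess(content);
        Match faceted = searchAndProcess(drillDown, content);

        System.out.println(plain.searchedWith.equals(plain.storedQuery)); // prints true
        System.out.println(faceted.searchedWith);                         // prints DrillDown(title:whale, kind=para)
        System.out.println(faceted.storedQuery);                          // prints title:whale
    }
}
```

Keeping the single-query form as a thin delegate is what lets the non-faceted paths stay untouched: only callers that build a drill-down need to know a second query exists.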

What changed

LuceneIndexWorker.java — both query() arities (string and XML) compute searchQuery (with drill-down) and matchQuery (without); searchAndProcess two-query overload passes matchQuery to the hit collector and searches with searchQuery; new applyBoost helper.
facet-drilldown-highlight.xqm (new XQSuite regression test).

Test plan

New regression test facet-drilldown-highlight.xqm: a faceted query now highlights and marks the queried term; a plain query still highlights; facet drill-down still selects and excludes by facet value.
Full lucene XQSuite green — 649 tests, 0 failures (the existing facets, highlighting, field, and search suites all pass, confirming no behavioral change to search/score/facets).
Codacy/PMD clean on the changed lines.

Notes

This is independent of any new function; it fixes faceted highlighting for ft:query itself (and therefore for anything built on it). A highlight-preserving workaround exists for callers in the meantime (run the unfaceted query and filter the hit set in XQuery, computing facet counts from the unfiltered set via ft:facets), but this is the correct fix.

A faceted ft:query wraps the content query in a Lucene DrillDownQuery before searching. The same wrapped query was stored on every LuceneMatch, and ft:highlight-field-matches walks that stored query (via getTerms -> LuceneUtil.extractTerms) to recover per-field term matches. extractTerms cannot see into a DrillDownQuery, so a facet drill-down silently yielded zero matches and produced empty highlights -- every faceted search lost its KWIC snippets, while the same query without a facet filter highlighted fine.

Decouple the search query from the match query: search with the DrillDownQuery, but store the pre-drill-down query (still index-type-filtered and boosted) on the LuceneMatch for term/highlight extraction. searchAndProcess gains a two-query overload; the single-query form delegates with the same query for both, so non-faceted paths are unchanged. The drilldown and boost wrapping is restructured around an applyBoost helper; search, scoring, and facet counts are unaffected (only the match query consumed by highlighting changes).

Regression test (facet-drilldown-highlight.xqm): a faceted query now highlights and marks the queried term, drill-down still selects/excludes by facet value, and a plain query still highlights. 649 lucene XQSuite tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

duncdrum

The bug is real and this fix works, but I think the change surface is larger than needed.

LuceneUtil.extractTermsFromDrillDown already exists specifically to handle DrillDownQuery in term extraction; it was added alongside the original facet feature. The problem is its implementation: it calls query.rewrite(new IndexSearcher(reader)), which in Lucene 10 expands the DrillDownQuery into a BooleanQuery containing both the content query and the dimension-filter clauses. Walking all clauses mixes content terms with internal dimension terms (e.g. $facets:kind$para), which don't appear in document text and apparently prevent correct highlight extraction.

DrillDownQuery exposes getBaseQuery(), which returns the content query directly: no rewrite, no dimension noise. That reduces the fix to one line in LuceneUtil:

private static void extractTermsFromDrillDown(DrillDownQuery query, ...) {
    extractTerms(query.getBaseQuery(), terms, reader, includeFields);
}

With that in place, there's no need to separate searchQuery from matchQuery in searchAndProcess, no applyBoost helper, and no duplication across the two query() overloads. The existing test suite (facet-drilldown-highlight.xqm) would still be the right regression harness.

Did you give this a try, or am I missing something?

duncdrum · 2026-06-09T15:16:10Z

                }
                searchAndProcess(contextId, qname, docs, contextSet, resultSet,
-                        returnAncestor, searcher, query, config);
+                        returnAncestor, searcher, applyBoost(searchQuery, config), applyBoost(query, config), config);


I'm not sure why we apply boost here; that is normally handled elsewhere. How is boost relevant to highlighting?

duncdrum · 2026-06-09T15:16:24Z

                    }
                    searchAndProcess(contextId, qname, docs, contextSet, resultSet,
-                            returnAncestor, searcher, query, config);
+                            returnAncestor, searcher, applyBoost(searchQuery, config), applyBoost(query, config), config);


joewiz requested a review from a team as a code owner June 8, 2026 14:43

joewiz mentioned this pull request Jun 8, 2026

[feature] Index-first Lucene search: ft:query-scope (live nodes) and ft:search-scope (ES-shaped map) #6455

Open

4 tasks

duncdrum added the Lucene issue is related to Lucene or its integration label Jun 9, 2026

duncdrum reviewed Jun 9, 2026

joewiz wants to merge 1 commit into eXist-db:develop from joewiz:bugfix/lucene-facet-drilldown-highlight
