
[bugfix] Preserve ft:highlight-field-matches under facet drill-down #6454

Open
joewiz wants to merge 1 commit into
eXist-db:develop from
joewiz:bugfix/lucene-facet-drilldown-highlight

@joewiz joewiz commented Jun 8, 2026

Member

[This PR was co-authored with Claude Code. -Joe]

Summary

A faceted ft:query (one with a facets drill-down option) silently returns empty highlights: ft:highlight-field-matches finds nothing, so any KWIC snippet built on it falls back to raw leading text. The same query without a facet filter highlights correctly. This is a pre-existing defect (reproduces on develop and on 7.0.0-beta3), surfaced while wiring an /api/search-style endpoint where ?q=term highlights but ?q=term&facet=value does not.

Root cause

In LuceneIndexWorker.query(), when a facet drill-down is requested, the content query is wrapped in a Lucene DrillDownQuery before searching. That same wrapped query is then handed to the hit collector and stored on every LuceneMatch. ft:highlight-field-matches walks the stored match query (getTerms(match.getQuery()) -> LuceneUtil.extractTerms(...)) to recover the per-field terms to mark. extractTerms cannot see into a DrillDownQuery, so it recovers zero terms and highlights nothing.

The give-away is that filterByIndexType also wraps the query (in a plain BooleanQuery of MUST + FILTER) and is applied to every query, yet unfaceted queries still highlight — extractTerms traverses a BooleanQuery fine. The opaque wrapper is specifically DrillDownQuery.
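
The failure mode can be illustrated with a minimal, self-contained sketch. It assumes a simplified query tree: the classes below (TermQuery, BooleanQuery, DrillDownWrapper) are hypothetical stand-ins written for this illustration, not the Lucene or eXist-db classes, and the walker only models the behavior described above (a composite type it knows is traversed, a type it does not know yields no terms).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-ins for a query tree; these are NOT the Lucene classes.
final class OpaqueWrapperSketch {

    interface Query {}

    // Leaf: a single term in a field.
    static final class TermQuery implements Query {
        final String field;
        final String text;
        TermQuery(String field, String text) { this.field = field; this.text = text; }
    }

    // Composite that the walker knows how to traverse.
    static final class BooleanQuery implements Query {
        final List<Query> clauses;
        BooleanQuery(Query... clauses) { this.clauses = Arrays.asList(clauses); }
    }

    // Composite that the walker has no branch for (models the drill-down wrapper).
    static final class DrillDownWrapper implements Query {
        final Query base;
        final String dimension;
        final String value;
        DrillDownWrapper(Query base, String dimension, String value) {
            this.base = base; this.dimension = dimension; this.value = value;
        }
    }

    // Collects "field:text" for every term the walker can reach.
    static List<String> extractTerms(Query query) {
        List<String> terms = new ArrayList<>();
        collect(query, terms);
        return terms;
    }

    private static void collect(Query query, List<String> out) {
        if (query instanceof TermQuery) {
            TermQuery t = (TermQuery) query;
            out.add(t.field + ":" + t.text);
        } else if (query instanceof BooleanQuery) {
            for (Query clause : ((BooleanQuery) query).clauses) {
                collect(clause, out);
            }
        }
        // Any other query type is skipped silently, so its terms are lost.
    }

    public static void main(String[] args) {
        Query content = new BooleanQuery(
                new TermQuery("title", "lucene"), new TermQuery("title", "index"));
        Query faceted = new DrillDownWrapper(content, "kind", "para");

        System.out.println(extractTerms(content)); // [title:lucene, title:index]
        System.out.println(extractTerms(faceted)); // []
    }
}
```

In this model the wrapped query still carries the content terms, but the walker never reaches them, which mirrors the empty-highlight symptom.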

Fix

Decouple the search query from the match query:

  • search with the DrillDownQuery (so drill-down filtering and facet counts are unchanged);
  • store the pre-drill-down query (still index-type-filtered and boosted) on the LuceneMatch, so term/highlight extraction sees a query it can traverse.

searchAndProcess gains a two-query overload (…, searchQuery, matchQuery, config); the existing single-query form delegates with the same query for both, so every non-faceted path is byte-for-byte unchanged. The drill-down/boost wrapping is reorganized around a small applyBoost helper. Only the query consumed by highlighting changes — hits, scores, and facet counts are unaffected.
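
The shape of the two-query overload can be sketched as follows. This is a toy model under stated assumptions, not the code in LuceneIndexWorker: Doc, Query, ContentQuery, DrillDown, and Match are hypothetical types invented here, and the signatures are far simpler than the real searchAndProcess. It only shows the delegation pattern and why the hit set stays the same while highlighting changes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model of "search with one query, store another on the match".
final class TwoQueryOverloadSketch {

    static final class Doc {
        final String text;
        final String kind;
        Doc(String text, String kind) { this.text = text; this.kind = kind; }
    }

    interface Query {
        boolean matches(Doc doc);   // used for searching
        List<String> terms();       // used for highlighting
    }

    static final class ContentQuery implements Query {
        final String term;
        ContentQuery(String term) { this.term = term; }
        public boolean matches(Doc doc) { return doc.text.contains(term); }
        public List<String> terms() { return Collections.singletonList(term); }
    }

    // Filters by facet value but exposes no terms (models the opaque wrapper).
    static final class DrillDown implements Query {
        final Query base;
        final String kind;
        DrillDown(Query base, String kind) { this.base = base; this.kind = kind; }
        public boolean matches(Doc doc) { return base.matches(doc) && kind.equals(doc.kind); }
        public List<String> terms() { return Collections.emptyList(); }
    }

    static final class Match {
        final Doc doc;
        final Query query; // the query later consulted for highlighting
        Match(Doc doc, Query query) { this.doc = doc; this.query = query; }
        String highlight() {
            String out = doc.text;
            for (String term : query.terms()) {
                out = out.replace(term, "[" + term + "]");
            }
            return out;
        }
    }

    // Single-query form: delegates with the same query in both roles.
    static List<Match> searchAndProcess(List<Doc> docs, Query query) {
        return searchAndProcess(docs, query, query);
    }

    // Two-query form: searchQuery selects hits, matchQuery is stored on each match.
    static List<Match> searchAndProcess(List<Doc> docs, Query searchQuery, Query matchQuery) {
        List<Match> matches = new ArrayList<>();
        for (Doc doc : docs) {
            if (searchQuery.matches(doc)) {
                matches.add(new Match(doc, matchQuery));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc("lucene facets", "para"), new Doc("lucene index", "head"));
        Query content = new ContentQuery("lucene");
        Query drill = new DrillDown(content, "para");

        // Wrapper stored on the match: one hit, nothing highlighted.
        System.out.println(searchAndProcess(docs, drill).get(0).highlight());          // lucene facets
        // Content query stored on the match: same single hit, term highlighted.
        System.out.println(searchAndProcess(docs, drill, content).get(0).highlight()); // [lucene] facets
    }
}
```

Both calls return exactly one match, because the drill-down query still decides which documents are hits; only the query kept for highlighting differs.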

What changed

  • LuceneIndexWorker.java — both query() arities (string and XML) compute searchQuery (with drill-down) and matchQuery (without); searchAndProcess two-query overload passes matchQuery to the hit collector and searches with searchQuery; new applyBoost helper.
  • facet-drilldown-highlight.xqm (new XQSuite regression test).

Test plan

  • New regression test facet-drilldown-highlight.xqm: a faceted query now highlights and marks the queried term; a plain query still highlights; facet drill-down still selects and excludes by facet value.
  • Full lucene XQSuite green — 649 tests, 0 failures (the existing facets, highlighting, field, and search suites all pass, confirming no behavioral change to search/score/facets).
  • Codacy/PMD clean on the changed lines.

Notes

  • This is independent of any new function; it fixes faceted highlighting for ft:query itself (and therefore for anything built on it). A highlight-preserving workaround exists for callers in the meantime (run the unfaceted query and filter the hit set in XQuery, computing facet counts from the unfiltered set via ft:facets), but this is the correct fix.

A faceted ft:query wraps the content query in a Lucene DrillDownQuery before
searching. The same wrapped query was stored on every LuceneMatch, and
ft:highlight-field-matches walks that stored query (via getTerms ->
LuceneUtil.extractTerms) to recover per-field term matches. extractTerms
cannot see into a DrillDownQuery, so a facet drill-down silently yielded zero
matches and produced empty highlights -- every faceted search lost its KWIC
snippets, while the same query without a facet filter highlighted fine.

Decouple the search query from the match query: search with the
DrillDownQuery, but store the pre-drill-down query (still index-type-filtered
and boosted) on the LuceneMatch for term/highlight extraction. searchAndProcess
gains a two-query overload; the single-query form delegates with the same query
for both, so non-faceted paths are unchanged. The drilldown and boost wrapping
is restructured around an applyBoost helper; search, scoring, and facet counts
are unaffected (only the match query consumed by highlighting changes).

Regression test (facet-drilldown-highlight.xqm): a faceted query now highlights
and marks the queried term, drill-down still selects/excludes by facet value,
and a plain query still highlights. 649 lucene XQSuite tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@joewiz joewiz requested a review from a team as a code owner June 8, 2026 14:43
@duncdrum duncdrum added the Lucene issue is related to Lucene or its integration label Jun 9, 2026

@duncdrum duncdrum left a comment

Contributor


The bug is real and this fix works, but I think the change surface is larger than needed.

LuceneUtil.extractTermsFromDrillDown already exists specifically to handle DrillDownQuery
in term extraction — it was added alongside the original facet feature. The problem is its
implementation: it calls query.rewrite(new IndexSearcher(reader)), which in Lucene 10
expands the DrillDownQuery into a BooleanQuery containing both the content query and the dimension-filter clauses. Walking all clauses mixes content terms with internal dimension terms (e.g. $facets:kind$para), which don't appear in document text and apparently prevent correct highlight extraction.

DrillDownQuery exposes getBaseQuery(), which returns the
content query directly — no rewrite, no dimension noise. That reduces the fix to one line in
LuceneUtil:

private static void extractTermsFromDrillDown(DrillDownQuery query, ...) {
    extractTerms(query.getBaseQuery(), terms, reader, includeFields);
}

With that in place, there's no need to separate searchQuery from matchQuery in
searchAndProcess, no applyBoost helper, and no duplication across the two query() overloads. The existing test suite (facet-drilldown-highlight.xqm) would still be the right regression harness.
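
The difference between the two extraction strategies can be sketched in a self-contained model. Everything below is a hypothetical stand-in rather than the Lucene API: DrillDownWrapper.rewrite() merely imitates the expansion described above (content query plus a dimension-filter term), and the $facets field name is taken from the example in this comment, not verified against the index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Models "walk the rewritten query" versus "walk only the base query".
final class UnwrapBaseQuerySketch {

    interface Query {}

    static final class TermQuery implements Query {
        final String field;
        final String text;
        TermQuery(String field, String text) { this.field = field; this.text = text; }
    }

    static final class BooleanQuery implements Query {
        final List<Query> clauses;
        BooleanQuery(Query... clauses) { this.clauses = Arrays.asList(clauses); }
    }

    static final class DrillDownWrapper implements Query {
        private final Query base;
        private final String dimension;
        private final String value;
        DrillDownWrapper(Query base, String dimension, String value) {
            this.base = base; this.dimension = dimension; this.value = value;
        }
        // Returns the content query only.
        Query getBaseQuery() { return base; }
        // Imitates an expansion into content query + dimension-filter term.
        Query rewrite() {
            return new BooleanQuery(base, new TermQuery("$facets", dimension + "$" + value));
        }
    }

    // unwrapBase = false walks the rewritten form; true walks only the base query.
    static List<String> extractTerms(Query query, boolean unwrapBase) {
        List<String> terms = new ArrayList<>();
        collect(query, unwrapBase, terms);
        return terms;
    }

    private static void collect(Query query, boolean unwrapBase, List<String> out) {
        if (query instanceof TermQuery) {
            TermQuery t = (TermQuery) query;
            out.add(t.field + ":" + t.text);
        } else if (query instanceof BooleanQuery) {
            for (Query clause : ((BooleanQuery) query).clauses) {
                collect(clause, unwrapBase, out);
            }
        } else if (query instanceof DrillDownWrapper) {
            DrillDownWrapper d = (DrillDownWrapper) query;
            collect(unwrapBase ? d.getBaseQuery() : d.rewrite(), unwrapBase, out);
        }
    }

    public static void main(String[] args) {
        Query faceted = new DrillDownWrapper(new TermQuery("title", "lucene"), "kind", "para");
        System.out.println(extractTerms(faceted, false)); // [title:lucene, $facets:kind$para]
        System.out.println(extractTerms(faceted, true));  // [title:lucene]
    }
}
```

In this model, walking the base query yields only terms that can occur in document text, which is the property the one-line change relies on.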

Did you give this a try, or am I missing something?

 }
 searchAndProcess(contextId, qname, docs, contextSet, resultSet,
-        returnAncestor, searcher, query, config);
+        returnAncestor, searcher, applyBoost(searchQuery, config), applyBoost(query, config), config);

Contributor


I'm not sure why we apply boost here; that is normally handled elsewhere. How is boost relevant to highlighting?

 }
 searchAndProcess(contextId, qname, docs, contextSet, resultSet,
-        returnAncestor, searcher, query, config);
+        returnAncestor, searcher, applyBoost(searchQuery, config), applyBoost(query, config), config);

Contributor


See above.
