fix: lucene_sanitize corrupts all words containing O/R/N/T/A/D #1595

Open
totto wants to merge 1 commit into getzep:main from totto:fix/lucene-sanitize-boolean-keywords

totto commented Jun 17, 2026

What

lucene_sanitize() included the individual uppercase letters O, R, N, T, A, D in its character escape map in an attempt to neutralise Lucene's boolean operators (AND, OR, NOT). As a result, a backslash was inserted before every occurrence of those uppercase letters, corrupting any word that contained one:

lucene_sanitize("Robot")    # "\Robot"
lucene_sanitize("Toronto")  # "\Toronto"
lucene_sanitize("ORANGE")   # "\O\R\A\NGE"
lucene_sanitize("NOT")      # "\N\O\T"  (this part worked, sort of)

Why it matters

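Any full-text search query containing a word with an uppercase O, R, N, T, A, or D (proper nouns, acronyms, capitalised sentence starts) is rewritten with stray backslashes before it reaches Lucene, which can produce empty result sets or query parse errors.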

Fix

Remove the letter entries from the escape map. Add a word-boundary regex substitution that lowercases only the standalone boolean keywords AND, OR, NOT. Lucene recognises these operators only in uppercase, so lowercase and/or/not are parsed as ordinary search terms:

sanitized = re.sub(r'\b(AND|OR|NOT)\b', lambda m: m.group(0).lower(), sanitized)

lucene_sanitize("Robot")            # "Robot"            ✓
lucene_sanitize("Toronto")          # "Toronto"          ✓
lucene_sanitize("status OR error")  # "status or error"  ✓ (OR neutralised)
lucene_sanitize("foo AND bar")      # "foo and bar"      ✓

re is already imported in helpers.py.
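
For reviewers who want to try the behaviour in isolation, here is a minimal standalone sketch of the fixed function. It is not a copy of helpers.py: the module-level names `_ESCAPE_MAP` and `_BOOLEAN_KEYWORDS` are hypothetical, and the set of escaped characters is assumed to be the usual Lucene special characters, which may differ from the real map.

```python
import re

# Assumed set of Lucene special characters to backslash-escape.
# No letters are included, so ordinary words pass through untouched.
_ESCAPE_MAP = str.maketrans({ch: '\\' + ch for ch in '+-&|!(){}[]^"~*?:\\/'})

# Standalone uppercase boolean keywords only; \b keeps "ORANGE" or "ANDES" intact.
_BOOLEAN_KEYWORDS = re.compile(r'\b(AND|OR|NOT)\b')


def lucene_sanitize(query: str) -> str:
    """Escape Lucene special characters and neutralise AND/OR/NOT by lowercasing."""
    sanitized = query.translate(_ESCAPE_MAP)
    return _BOOLEAN_KEYWORDS.sub(lambda m: m.group(0).lower(), sanitized)


if __name__ == "__main__":
    for q in ["Robot", "ORANGE", "status OR error", "title:foo"]:
        print(repr(q), "->", repr(lucene_sanitize(q)))
```

Running the substitution after the character escaping mirrors the order in the one-line change above, so the keywords are matched on the already-escaped string.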

…etters

The previous implementation included the individual uppercase letters O, R,
N, T, A, D in the character escape map in an attempt to neutralise Lucene's
AND, OR, NOT boolean operators. This corrupted every word containing those
uppercase letters -- "Robot" became "\Robot", "Toronto" became "\Toronto",
"ORANGE" became "\O\R\A\NGE", etc.

Fix: remove the letter entries and replace them with a word-boundary regex
substitution that lowercases only the standalone keywords AND, OR, NOT.
Lucene recognises these operators only in uppercase, so lowercase
and/or/not are parsed as ordinary terms. All other words containing those
letters are now passed through unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@zep-cla-assistant

Contributor


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. For privacy information, see our Privacy Notice. You can sign the CLA by posting a pull request comment in one of the formats below.


I have read the CLA Document and I hereby sign the CLA behalf on myself, e-mail: example@example.com

or

I have read the CLA Document and I hereby sign the CLA behalf of my company, e-mail: example@example.com

Signature is valid for 6 months.


This bot will be retriggered when the Contributor License Agreement comment has been provided. Posted by the CLA Assistant Lite bot.

totto commented Jun 19, 2026

Author

I have read the CLA Document and I hereby sign the CLA behalf on myself, e-mail: totto@exoreaction.com
