fix: lucene_sanitize corrupts all words containing O/R/N/T/A/D #1595
…etters

The previous implementation added individual uppercase letters O, R, N, T, A, D to the character escape map in an attempt to neutralise Lucene's AND, OR, NOT boolean operators. This corrupted every word containing those letters -- "Robot" became "\R\ob\ot", "Toronto" became "\T\or\on\t\o", etc.

Fix: remove the letter entries and replace with a word-boundary regex substitution that lowercases only the standalone keywords AND, OR, NOT. Lowercase and/or/not are not Lucene operators, so they are indexed as literal words. All other mixed-case words containing those letters are now passed through unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
I have read the CLA Document and I hereby sign the CLA behalf on myself, e-mail: totto@exoreaction.com
What
`lucene_sanitize()` added individual uppercase letters `O`, `R`, `N`, `T`, `A`, `D` to the character escape map in an attempt to neutralise Lucene's boolean operators (`AND`, `OR`, `NOT`). This corrupted every word containing those letters.
Why it matters
Any full-text search query containing a word with one of those capital letters is rewritten into corrupted Lucene syntax, producing no results or query parse errors.
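To make the failure mode concrete, here is a minimal sketch of how a letter-level escape map mangles ordinary text. It assumes the map was applied with `str.translate`; `_BUGGY_MAP` and `buggy_sanitize` are hypothetical names used for illustration, and the real code in `helpers.py` may differ.

```python
# Hypothetical reconstruction of the faulty entries: each uppercase letter
# used by AND/OR/NOT is escaped as if it were a special character.
_BUGGY_MAP = str.maketrans({ch: "\\" + ch for ch in "ORNTAD"})


def buggy_sanitize(query: str) -> str:
    """Escape every O, R, N, T, A, D, wherever it appears."""
    return query.translate(_BUGGY_MAP)


# Every matching capital letter gets a backslash, not just the operators.
print(buggy_sanitize("NATO DATA"))      # \N\A\T\O \D\A\T\A
print(buggy_sanitize("cats and dogs"))  # cats and dogs (no capitals, unchanged)
```

A character map cannot tell the operator `OR` apart from the letters O and R inside another word, which is why the problem has to be handled at the word level.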
Fix
Remove the letter entries from the escape map. Add a word-boundary regex substitution that lowercases only the standalone boolean keywords `AND`, `OR`, `NOT`. Lowercase `and`/`or`/`not` are not Lucene operators and are indexed as literal words.
`re` is already imported in `helpers.py`.
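The approach described here could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: the set of escaped special characters, the names `_ESCAPE_MAP` and `_BOOLEAN_KEYWORDS`, and the function name `lucene_sanitize_sketch` are all hypothetical.

```python
import re

# Assumed set of Lucene special characters; the real map in helpers.py may differ.
_ESCAPE_MAP = str.maketrans({ch: "\\" + ch for ch in '+-&|!(){}[]^"~*?:\\/'})

# \b anchors restrict the match to standalone keywords, so "ANDROID" or
# "Toronto" are never touched.
_BOOLEAN_KEYWORDS = re.compile(r"\b(AND|OR|NOT)\b")


def lucene_sanitize_sketch(query: str) -> str:
    """Escape special characters, then lowercase standalone AND/OR/NOT."""
    escaped = query.translate(_ESCAPE_MAP)
    return _BOOLEAN_KEYWORDS.sub(lambda m: m.group(0).lower(), escaped)


print(lucene_sanitize_sketch("Robot AND Toronto"))  # Robot and Toronto
print(lucene_sanitize_sketch("ANDROID OR iOS"))     # ANDROID or iOS
```

Lowercasing the keyword rather than escaping it keeps the word searchable as a plain term while ensuring the query parser does not read it as an operator.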