8384082: Incorrect surrogate handling in Pattern.hasBaseCharacter for word boundaries #31067
cushon wants to merge 1 commit into openjdk:master
👋 Welcome back cushon! A progress list of the required criteria for merging this PR into
❗ This change is not yet ready to be integrated.
The change means that boundary offsets will change for existing (but maybe buggy) code that uses
I can do more corpus analysis of existing code. I found this by fuzzing; I haven't seen any occurrences of it with real-world regexes yet. I suspect the compatibility impact is minimal.
Thanks. I just have a concern that there may be simple tokenizers or parsers using
Please consider this fix to the handling of word boundaries (\b) with non-spacing marks with non-BMP base characters, when UNICODE_CHARACTER_CLASS is not set.
See discussion in JDK-8384082.
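To make the scenario concrete, here is a minimal sketch of the kind of input involved: a non-BMP letter (U+1D400, stored as a surrogate pair) followed by a non-spacing mark (U+0301). The two helpers are hypothetical and written only for illustration; they are not the JDK's hasBaseCharacter implementation. They contrast a backward scan that steps one UTF-16 unit at a time with one that steps by whole code points. The \b output is printed without an expected value, since it depends on the JDK build in use.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordBoundarySketch {
    // Hypothetical helper (assumption, not JDK code): scans backward from
    // 'end' one UTF-16 unit at a time, skipping non-spacing marks.
    static boolean hasBaseCharWise(CharSequence seq, int end) {
        for (int pos = end - 1; pos >= 0; pos--) {
            char c = seq.charAt(pos);
            if (Character.isLetterOrDigit(c)) return true;
            // A lone surrogate has type SURROGATE, so the scan stops here.
            if (Character.getType(c) != Character.NON_SPACING_MARK) return false;
        }
        return false;
    }

    // Hypothetical helper (assumption, not JDK code): the same scan, but
    // stepping by whole code points so surrogate pairs stay intact.
    static boolean hasBaseCodePointWise(CharSequence seq, int end) {
        int pos = end;
        while (pos > 0) {
            int cp = Character.codePointBefore(seq, pos);
            if (Character.isLetterOrDigit(cp)) return true;
            if (Character.getType(cp) != Character.NON_SPACING_MARK) return false;
            pos -= Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        // U+1D400 (non-BMP letter) followed by U+0301 (combining acute accent).
        String s = new StringBuilder()
                .appendCodePoint(0x1D400)
                .appendCodePoint(0x0301)
                .toString();

        System.out.println(hasBaseCharWise(s, s.length()));      // false
        System.out.println(hasBaseCodePointWise(s, s.length())); // true

        // Boundary offsets reported by \b; these may differ between JDK builds.
        Matcher m = Pattern.compile("\\b").matcher(s);
        while (m.find()) {
            System.out.println("boundary at " + m.start());
        }
    }
}
```

The unit-wise scan reaches the low surrogate, which is neither a letter nor a non-spacing mark, and gives up. The code-point-wise scan sees the full base letter.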
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/31067/head:pull/31067
$ git checkout pull/31067
Update a local copy of the PR:
$ git checkout pull/31067
$ git pull https://git.openjdk.org/jdk.git pull/31067/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 31067
View PR using the GUI difftool:
$ git pr show -t 31067
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/31067.diff
Using Webrev
Link to Webrev Comment