8384082: Incorrect surrogate handling in Pattern.hasBaseCharacter for word boundaries #31067

Open
cushon wants to merge 1 commit into openjdk:master from cushon:JDK-8384082

cushon commented May 7, 2026

Please consider this fix to the handling of word boundaries (\b) for non-spacing marks whose base character is outside the BMP, when UNICODE_CHARACTER_CLASS is not set.

See discussion in JDK-8384082.
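
A minimal sketch for probing the case described above, not code from this PR. The class and method names are hypothetical, and the sample input (U+10400 followed by U+0301) is an assumption chosen for illustration rather than a reproducer taken from the issue. It lists every offset at which \b matches, with and without UNICODE_CHARACTER_CLASS. The offsets reported for the sample depend on the JDK build, so no expected output is shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryProbe {
    // Returns every char offset at which \b matches in the input.
    public static List<Integer> boundaries(String input, int flags) {
        Matcher m = Pattern.compile("\\b", flags).matcher(input);
        List<Integer> offsets = new ArrayList<>();
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    // Assumed sample: a non-BMP letter, a non-spacing mark, then " x".
    public static String sample() {
        return new StringBuilder()
                .appendCodePoint(0x10400) // DESERET CAPITAL LETTER LONG I (surrogate pair)
                .appendCodePoint(0x0301)  // COMBINING ACUTE ACCENT (non-spacing mark)
                .append(" x")
                .toString();
    }

    public static void main(String[] args) {
        String s = sample();
        System.out.println("default flags:           " + boundaries(s, 0));
        System.out.println("UNICODE_CHARACTER_CLASS: "
                + boundaries(s, Pattern.UNICODE_CHARACTER_CLASS));
    }
}
```

Comparing the two lines of output on builds with and without the change shows whether any boundary offsets move for this input.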



Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8384082: Incorrect surrogate handling in Pattern.hasBaseCharacter for word boundaries (Bug - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/31067/head:pull/31067
$ git checkout pull/31067

Update a local copy of the PR:
$ git checkout pull/31067
$ git pull https://git.openjdk.org/jdk.git pull/31067/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 31067

View PR using the GUI difftool:
$ git pr show -t 31067

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/31067.diff

Using Webrev

Link to Webrev Comment

@cushon cushon marked this pull request as ready for review May 7, 2026 09:29

bridgekeeper Bot commented May 7, 2026

👋 Welcome back cushon! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.


openjdk Bot commented May 7, 2026

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk Bot added the core-libs core-libs-dev@openjdk.org label May 7, 2026

openjdk Bot commented May 7, 2026

@cushon The following label will be automatically applied to this pull request:

  • core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk Bot added the rfr Pull request is ready for review label May 7, 2026

mlbridge Bot commented May 7, 2026

Webrevs

AlanBateman commented

The change means that boundary offsets will change for existing (but maybe buggy) code that uses \b. So I think we need to think about the compatibility impact and get some sense as to whether this could cause problems with existing code.


cushon commented May 7, 2026

> The change means that boundary offsets will change for existing (but maybe buggy) code that uses \b. So I think we need to think about the compatibility impact and get some sense as to whether this could cause problems with existing code.

I can do more corpus analysis of existing code.

I found this by fuzzing; I haven't seen any occurrences of it with real-world regexes yet. I suspect the compatibility impact is minimal.

AlanBateman commented

> I can do more corpus analysis of existing code.
>
> I found this by fuzzing; I haven't seen any occurrences of it with real-world regexes yet. I suspect the compatibility impact is minimal.

Thanks. I just have a concern that there may be simple tokenizers or parsers using \b.
