fix(context): extract non-ASCII (CJK/Hangul) keywords so context works for Korean queries #516
Open
andy-sg wants to merge 1 commit into
codegraph_context returned an empty context for non-ASCII task
descriptions (e.g. "로그인") even though codegraph_query found the
symbols. The two keyword extractors that feed context —
extractSymbolsFromQuery (src/context) and extractSearchTerms
(src/search/query-utils) — were built solely from ASCII patterns
([a-zA-Z] + the ASCII \b word boundary), so they yielded zero keywords
for a non-Latin query and searched for nothing.
Both extractors now also pick up runs of Unicode letters/digits and pass
them to the existing FTS path, which already tokenizes non-ASCII text via
unicode61. The change is additive: ASCII extraction is byte-identical
([a-zA-Z0-9] ⊂ [\p{L}\p{N}] and every ASCII separator stays a separator),
and only the FTS keyword-extraction preprocessing is touched — FTS index
columns are unchanged.
Adds Korean unit coverage for extractSearchTerms and an end-to-end
context.test.ts case proving Korean symbols now surface.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Problem
codegraph context <task> (and the codegraph_context MCP tool) returns an empty context for non-ASCII task descriptions (Korean, Japanese, Chinese, etc.) even though symbol search works fine for the same input.

Reproduction (before this PR):
Root cause
Symbol search already handles non-ASCII (FTS5's default unicode61 tokenizer + prefix/LIKE matching, and name COLLATE NOCASE = exact lookup). The break is purely in the keyword-extraction preprocessing that feeds context.

Both extractors are built entirely from ASCII patterns: [a-zA-Z] character classes plus JS's ASCII \b word boundary.

- extractSymbolsFromQuery (src/context/index.ts): camelCase / snake_case / SCREAMING / acronym / dot-notation / lowercase, all [a-zA-Z]-anchored.
- extractSearchTerms (src/search/query-utils.ts): final tokenization splits on /[^a-zA-Z0-9]+/, which treats every Hangul/CJK character as a separator, so a non-ASCII query tokenizes to nothing.

Result: a non-Latin description yields zero keywords, so the search/traversal pipeline receives nothing and the context comes back empty.
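The failure mode is easy to see in isolation. The sketch below is not the project's code: asciiOnlySplit and asciiWordMatches are hypothetical stand-ins that keep only the two ASCII-only building blocks named above (the /[^a-zA-Z0-9]+/ split and a \b-anchored [a-zA-Z] pattern) and leave out everything else the real extractors do.

```typescript
// Hypothetical stand-in for the old final tokenization step. The real
// extractSearchTerms does more (case splitting, stop words); only the
// ASCII-only split is reproduced here.
function asciiOnlySplit(query: string): string[] {
  return query.split(/[^a-zA-Z0-9]+/).filter((t) => t.length > 0);
}

// Hypothetical stand-in for an [a-zA-Z]-anchored symbol pattern. In JS,
// \b is defined over ASCII word characters, so it never anchors inside
// a Hangul or CJK run.
function asciiWordMatches(query: string): string[] {
  return query.match(/\b[a-zA-Z]+\b/g) ?? [];
}

console.log(asciiOnlySplit("login handler")); // ["login", "handler"]
console.log(asciiOnlySplit("로그인")); // []
console.log(asciiWordMatches("로그인 처리")); // []
```

Every Hangul character falls in the separator class, so the whole query is consumed as one long separator and no keyword reaches the search step.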
Fix (additive only)
- extractSymbolsFromQuery: after the existing ASCII patterns, also pull runs of Unicode letters/digits (/[\p{L}\p{N}_]+/gu) and keep only tokens that actually contain a non-ASCII character. ASCII-only tokens stay owned by the existing patterns, so nothing about ASCII behavior changes.
- extractSearchTerms: change the final word split from /[^a-zA-Z0-9]+/ to /[^\p{L}\p{N}]+/u. This is byte-identical for ASCII input ([a-zA-Z0-9] ⊂ [\p{L}\p{N}], and every ASCII separator, including space, punctuation, _, and ., remains a separator) while keeping Unicode letter runs intact. Non-ASCII tokens use a 2-char floor (Hangul/CJK pack a morpheme per character); ASCII keeps its 3-char floor.

Extracted non-ASCII tokens flow straight into the existing search path, which already tokenizes them via unicode61.

Out of scope (deliberately): indexing comments/string-literal bodies into FTS. The FTS index still covers only name/qualified_name/docstring/signature, unchanged. This PR fixes only the context keyword-extraction step.

After this PR, codegraph context "로그인" surfaces the 로그인 entry point, the 인증확인 related symbol, and their code blocks.

Tests
- __tests__/query-utils.test.ts (new): unit coverage for extractSearchTerms. ASCII behavior is unchanged (camelCase/snake_case/dot-notation split, stop-word + <3-char drop), plus Korean extraction (single token, multi-word split, 2-char floor, mixed ASCII+Korean).
- __tests__/context.test.ts (added): end-to-end. Index a Korean source file and assert findRelevantContext/buildContext now surface 로그인/인증확인/사용자관리자 (a regression guard for the reported empty-context bug).
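For reviewers who want to poke at the two regex changes from the Fix section without checking out the branch, here is a minimal sketch. It is not the code in this PR: splitTerms and unicodeSymbolTokens are hypothetical names, camelCase splitting and stop-word handling are omitted, and two details are assumptions made for the sketch (token length is counted in code points, and a mixed token such as login로그인 gets the 2-char floor because it contains a non-ASCII character).

```typescript
// True when the token contains at least one character outside ASCII.
const hasNonAscii = (token: string): boolean => /[^\x00-\x7F]/.test(token);

// Sketch of the new final split: any run of non-letter, non-digit
// characters is a separator, so Hangul/CJK runs survive as tokens.
// Assumed floors: 2 code points for non-ASCII tokens, 3 for ASCII tokens.
function splitTerms(query: string): string[] {
  return query
    .split(/[^\p{L}\p{N}]+/u)
    .filter((t) => t.length > 0)
    .filter((t) => Array.from(t).length >= (hasNonAscii(t) ? 2 : 3));
}

// Sketch of the additive symbol pass: take runs of Unicode letters,
// digits and underscores, then keep only runs with a non-ASCII character
// so ASCII-only tokens stay with the existing patterns.
function unicodeSymbolTokens(query: string): string[] {
  const runs = query.match(/[\p{L}\p{N}_]+/gu) ?? [];
  return runs.filter(hasNonAscii);
}

console.log(splitTerms("login 로그인 처리")); // ["login", "로그인", "처리"]
console.log(splitTerms("user_name.id")); // ["user", "name"]
console.log(unicodeSymbolTokens("fix login for 로그인_처리 v2")); // ["로그인_처리"]
```

The second call shows the ASCII side behaving as before: _ and . still separate, and the 2-character id falls under the 3-char floor.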