fix(context): extract non-ASCII (CJK/Hangul) keywords so context works for Korean queries by andy-sg · Pull Request #516 · colbymchenry/codegraph

andy-sg · 2026-05-28T07:55:23Z

Problem

codegraph context <task> (and the codegraph_context MCP tool) returns an empty context for non-ASCII task descriptions — Korean, Japanese, Chinese, etc. — even though symbol search works fine for the same input.

Reproduction (before this PR):

// src/auth.ts
export function 로그인(사용자명: string): boolean { return 인증확인(사용자명); }
export function 인증확인(사용자명: string): boolean { return 사용자명.length > 0; }
export class 사용자관리자 { 생성하기(이름: string): string { return 이름; } }

$ codegraph query "로그인"      # ✅ finds the 로그인 function
$ codegraph context "로그인"
## Code Context

**Query:** 로그인
                               # ❌ header + query only, zero symbols

Root cause

Symbol search already handles non-ASCII input: FTS5's default unicode61 tokenizer with prefix/LIKE matching, plus an exact lookup on name with COLLATE NOCASE. The break is purely in the keyword-extraction preprocessing that feeds context.

Both extractors are built entirely from ASCII patterns — [a-zA-Z] character classes plus JS's ASCII \b word boundary:

extractSymbolsFromQuery (src/context/index.ts) — camelCase / snake_case / SCREAMING / acronym / dot-notation / lowercase, all [a-zA-Z]-anchored.
extractSearchTerms (src/search/query-utils.ts) — final tokenization splits on /[^a-zA-Z0-9]+/, which treats every Hangul/CJK character as a separator, so a non-ASCII query tokenizes to nothing.

Result: a non-Latin description yields zero keywords, so the search/traversal pipeline receives nothing and the context comes back empty.
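The failure mode can be reproduced in isolation. The snippet below is a minimal sketch, not the project's code: `tokenizeAscii` is a hypothetical helper that applies only the final split pattern quoted above, and the sample queries are made up for illustration.

```typescript
// The final split pattern described above, as it was before the fix.
const asciiSplit = /[^a-zA-Z0-9]+/;

// Hypothetical helper (not from the codebase): split and drop empty strings.
function tokenizeAscii(query: string): string[] {
  return query.split(asciiSplit).filter((token) => token.length > 0);
}

// ASCII queries tokenize as expected.
const asciiTokens = tokenizeAscii("user login"); // ["user", "login"]

// Every Hangul character counts as a separator, so nothing survives.
const koreanTokens = tokenizeAscii("로그인"); // []

// Without the `u` flag, \b is defined relative to \w ([A-Za-z0-9_]),
// so a Hangul-only string contains no word boundary at all.
const boundaryMatch = /\b/.test("로그인"); // false

console.log(asciiTokens, koreanTokens, boundaryMatch);
```

The same reasoning applies to any `[a-zA-Z]`-anchored pattern: with no ASCII letter to anchor on, there is nothing to match.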

Fix (additive only)

extractSymbolsFromQuery: after the existing ASCII patterns, also pull runs of Unicode letters/digits (/[\p{L}\p{N}_]+/gu) and keep only tokens that actually contain a non-ASCII character. ASCII-only tokens stay owned by the existing patterns, so nothing about ASCII behavior changes.
extractSearchTerms: change the final word split from /[^a-zA-Z0-9]+/ to /[^\p{L}\p{N}]+/u. This is byte-identical for ASCII ([a-zA-Z0-9] ⊂ [\p{L}\p{N}], and every ASCII separator — space, punctuation, _, . — remains a separator) while keeping Unicode letter runs intact. Non-ASCII tokens use a 2-char floor (Hangul/CJK pack a morpheme per character); ASCII keeps its 3-char floor.
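The two changes above might look roughly like the following. This is a simplified sketch under stated assumptions: `extractNonAsciiTokens` and `splitSearchTerms` are hypothetical names, and the real extractors also apply the existing ASCII patterns, stop-word filtering, and camelCase/snake_case splitting, none of which is reproduced here.

```typescript
// Runs of Unicode letters, digits, and underscores (pattern quoted in the PR).
const UNICODE_RUN = /[\p{L}\p{N}_]+/gu;
// Matches any character outside the 7-bit ASCII range.
const NON_ASCII = /[^\x00-\x7F]/;

// Hypothetical helper: the additive step for extractSymbolsFromQuery.
// Keeps only tokens containing at least one non-ASCII character, so
// ASCII-only tokens remain the responsibility of the existing patterns.
function extractNonAsciiTokens(query: string): string[] {
  const runs = query.match(UNICODE_RUN) ?? [];
  return runs.filter((token) => NON_ASCII.test(token));
}

// Hypothetical helper: the final split for extractSearchTerms, with a
// 2-character floor for non-ASCII tokens and a 3-character floor for ASCII.
function splitSearchTerms(query: string): string[] {
  return query.split(/[^\p{L}\p{N}]+/u).filter((token) => {
    if (token.length === 0) return false;
    const minLength = NON_ASCII.test(token) ? 2 : 3;
    // Count code points rather than UTF-16 units.
    return Array.from(token).length >= minLength;
  });
}

console.log(extractNonAsciiTokens("fix 로그인 flow for user_id")); // ["로그인"]
console.log(splitSearchTerms("로그인 인증 a ab abc 가")); // ["로그인", "인증", "abc"]
```

Filtering on `NON_ASCII` is what keeps the change additive: an ASCII-only run such as `user_id` is matched by `UNICODE_RUN` but discarded, leaving it to the pre-existing patterns.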

Extracted non-ASCII tokens flow straight into the existing search path, which already tokenizes them via unicode61.

Out of scope (deliberately): indexing comments/string-literal bodies into FTS. The FTS index still covers only name / qualified_name / docstring / signature — unchanged. This PR fixes only the context keyword-extraction step.

After this PR, codegraph context "로그인" surfaces the 로그인 entry point, the 인증확인 related symbol, and their code blocks.

Tests

__tests__/query-utils.test.ts (new): unit coverage for extractSearchTerms — ASCII behavior unchanged (camelCase/snake_case/dot-notation split, stop-word + <3-char drop) plus Korean extraction (single token, multi-word split, 2-char floor, mixed ASCII+Korean).
__tests__/context.test.ts (added): end-to-end — index a Korean source file and assert findRelevantContext / buildContext now surface 로그인 / 인증확인 / 사용자관리자 (a regression guard for the reported empty-context bug).
Full suite: 49 files, 1080 passed / 2 skipped, no regressions. Verified the before (empty) → after (symbols surfaced) transition by building both states against the repro repo.
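The "ASCII behavior unchanged" property that the unit tests cover can also be spot-checked directly against the two split patterns. The following is an illustrative sketch rather than the PR's test code; the sample inputs and the `sameSplit` helper are assumptions chosen for the example.

```typescript
const oldSplit = /[^a-zA-Z0-9]+/;
const newSplit = /[^\p{L}\p{N}]+/u;

// Illustrative ASCII inputs covering spaces, punctuation, `_`, and `.`.
const asciiSamples = [
  "getUserById",
  "user_name lookup",
  "auth.login()",
  "MAX_RETRIES = 3",
  "  spaced   out  ",
];

// Hypothetical helper: true when both patterns split the input identically.
function sameSplit(input: string): boolean {
  const before = input.split(oldSplit);
  const after = input.split(newSplit);
  return before.length === after.length && before.every((t, i) => t === after[i]);
}

const allIdentical = asciiSamples.every((sample) => sameSplit(sample));

// The patterns diverge only once non-ASCII letters appear.
const koreanBefore = "로그인 인증확인".split(oldSplit).filter(Boolean); // []
const koreanAfter = "로그인 인증확인".split(newSplit).filter(Boolean); // ["로그인", "인증확인"]

console.log(allIdentical, koreanBefore, koreanAfter);
```

A handful of samples is not a proof, but the set argument in the PR description is: every ASCII letter and digit is in `\p{L}` or `\p{N}`, and no other ASCII character is.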

codegraph_context returned an empty context for non-ASCII task descriptions (e.g. "로그인") even though codegraph_query found the symbols.

The two keyword extractors that feed context, extractSymbolsFromQuery (src/context) and extractSearchTerms (src/search/query-utils), were built solely from ASCII patterns ([a-zA-Z] + the ASCII \b word boundary), so they yielded zero keywords for a non-Latin query and searched for nothing.

Both extractors now also pick up runs of Unicode letters/digits and pass them to the existing FTS path, which already tokenizes non-ASCII text via unicode61. The change is additive: ASCII extraction is byte-identical ([a-zA-Z0-9] ⊂ [\p{L}\p{N}] and every ASCII separator stays a separator), and only the FTS keyword-extraction preprocessing is touched; FTS index columns are unchanged.

Adds Korean unit coverage for extractSearchTerms and an end-to-end context.test.ts case proving Korean symbols now surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

andy-sg wants to merge 1 commit into colbymchenry:main from andy-sg:fix/context-non-ascii-keyword-extraction
