fix(context): extract non-ASCII (CJK/Hangul) keywords so context works for Korean queries #516

Open
andy-sg wants to merge 1 commit into colbymchenry:main from andy-sg:fix/context-non-ascii-keyword-extraction

andy-sg commented May 28, 2026

Problem

codegraph context <task> (and the codegraph_context MCP tool) returns an empty context for non-ASCII task descriptions — Korean, Japanese, Chinese, etc. — even though symbol search works fine for the same input.

Reproduction (before this PR):

// src/auth.ts
export function 로그인(사용자명: string): boolean { return 인증확인(사용자명); }
export function 인증확인(사용자명: string): boolean { return 사용자명.length > 0; }
export class 사용자관리자 { 생성하기(이름: string): string { return 이름; } }

$ codegraph query "로그인"      # ✅ finds the 로그인 function
$ codegraph context "로그인"
## Code Context

**Query:** 로그인
                               # ❌ header + query only, zero symbols

Root cause

Symbol search already handles non-ASCII input: FTS5's default unicode61 tokenizer, prefix/LIKE matching, and the exact-match lookup on name with COLLATE NOCASE all accept it. The break is purely in the keyword-extraction preprocessing that feeds context.

Both extractors are built entirely from ASCII patterns — [a-zA-Z] character classes plus JS's ASCII \b word boundary:

  • extractSymbolsFromQuery (src/context/index.ts) — camelCase / snake_case / SCREAMING / acronym / dot-notation / lowercase, all [a-zA-Z]-anchored.
  • extractSearchTerms (src/search/query-utils.ts) — final tokenization splits on /[^a-zA-Z0-9]+/, which treats every Hangul/CJK character as a separator, so a non-ASCII query tokenizes to nothing.

Result: a non-Latin description yields zero keywords, so the search/traversal pipeline receives nothing and the context comes back empty.
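
The failure is easy to reproduce in isolation. The snippet below is a minimal sketch, not the project's code: `asciiSplit` and `asciiWords` are hypothetical stand-ins for the old final split in extractSearchTerms and for an `[a-zA-Z]`-anchored `\b` pattern of the kind extractSymbolsFromQuery uses.

```typescript
// Hypothetical stand-in for the old final split in extractSearchTerms:
// every character outside [a-zA-Z0-9] is treated as a separator.
const asciiSplit = (query: string): string[] =>
  query.split(/[^a-zA-Z0-9]+/).filter((t) => t.length > 0);

// Hypothetical stand-in for an ASCII-anchored pattern such as those in
// extractSymbolsFromQuery: \b and [a-zA-Z] only ever see ASCII word characters.
const asciiWords = (query: string): string[] =>
  query.match(/\b[a-zA-Z][a-zA-Z0-9]*\b/g) ?? [];

console.log(asciiSplit("fix login bug"));    // [ 'fix', 'login', 'bug' ]
console.log(asciiSplit("로그인 인증확인"));    // []
console.log(asciiWords("로그인 버그 수정"));  // []
```

Both helpers return an empty array for the Korean inputs, which is the "zero keywords" state described above.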

Fix (additive only)

  • extractSymbolsFromQuery: after the existing ASCII patterns, also pull runs of Unicode letters/digits (/[\p{L}\p{N}_]+/gu) and keep only tokens that actually contain a non-ASCII character. ASCII-only tokens stay owned by the existing patterns, so nothing about ASCII behavior changes.
  • extractSearchTerms: change the final word split from /[^a-zA-Z0-9]+/ to /[^\p{L}\p{N}]+/u. The output is identical for ASCII input: [a-zA-Z0-9] ⊂ [\p{L}\p{N}], and every ASCII separator (space, punctuation, _, .) remains a separator, while Unicode letter runs now stay intact. Non-ASCII tokens use a 2-char floor, because a Hangul syllable block or CJK ideograph carries far more information than a single Latin letter; ASCII keeps its 3-char floor.
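
The two changes above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the PR's implementation: the function names `hasNonAscii`, `splitTerms`, and `nonAsciiTokens` are hypothetical, stop-word handling and the existing ASCII patterns are omitted, and measuring length in code points (instead of UTF-16 units) is a choice made here, not something the PR specifies.

```typescript
// Hypothetical helper: true if the token contains any character outside ASCII.
const hasNonAscii = (token: string): boolean => /[^\x00-\x7F]/.test(token);

// Sketch of the Unicode-aware final split with per-script length floors:
// 2 characters for tokens containing non-ASCII, 3 for pure-ASCII tokens.
function splitTerms(query: string): string[] {
  return query
    .split(/[^\p{L}\p{N}]+/u)
    .filter((t) => Array.from(t).length >= (hasNonAscii(t) ? 2 : 3));
}

// Sketch of the additive step: collect Unicode letter/digit runs and keep
// only those with a non-ASCII character, leaving ASCII to existing patterns.
function nonAsciiTokens(query: string): string[] {
  return (query.match(/[\p{L}\p{N}_]+/gu) ?? []).filter(hasNonAscii);
}

console.log(splitTerms("사용자 인증 확인"));                 // [ '사용자', '인증', '확인' ]
console.log(splitTerms("login 로그인 db"));                  // [ 'login', '로그인' ]
console.log(nonAsciiTokens("fix 로그인 in auth_service"));   // [ '로그인' ]
```

In this sketch the pure-ASCII token `db` is dropped by the 3-char floor while the 2-character Korean tokens survive, and `nonAsciiTokens` ignores `fix`, `in`, and `auth_service` entirely, so ASCII handling stays with whatever patterns ran before it.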

Extracted non-ASCII tokens flow straight into the existing search path, which already tokenizes them via unicode61.

Out of scope (deliberately): indexing comments/string-literal bodies into FTS. The FTS index still covers only name / qualified_name / docstring / signature — unchanged. This PR fixes only the context keyword-extraction step.

After this PR, codegraph context "로그인" surfaces the 로그인 entry point, the 인증확인 related symbol, and their code blocks.

Tests

  • __tests__/query-utils.test.ts (new): unit coverage for extractSearchTerms — ASCII behavior unchanged (camelCase/snake_case/dot-notation split, stop-word + <3-char drop) plus Korean extraction (single token, multi-word split, 2-char floor, mixed ASCII+Korean).
  • __tests__/context.test.ts (added): end-to-end — index a Korean source file and assert findRelevantContext / buildContext now surface 로그인 / 인증확인 / 사용자관리자 (a regression guard for the reported empty-context bug).
  • Full suite: 49 files, 1080 passed / 2 skipped, no regressions. Verified the before (empty) → after (symbols surfaced) transition by building both states against the repro repo.

codegraph_context returned an empty context for non-ASCII task
descriptions (e.g. "로그인") even though codegraph_query found the
symbols. The two keyword extractors that feed context —
extractSymbolsFromQuery (src/context) and extractSearchTerms
(src/search/query-utils) — were built solely from ASCII patterns
([a-zA-Z] + the ASCII \b word boundary), so they yielded zero keywords
for a non-Latin query and searched for nothing.

Both extractors now also pick up runs of Unicode letters/digits and pass
them to the existing FTS path, which already tokenizes non-ASCII text via
unicode61. The change is additive: ASCII extraction is byte-identical
([a-zA-Z0-9] ⊂ [\p{L}\p{N}] and every ASCII separator stays a separator),
and only the FTS keyword-extraction preprocessing is touched — FTS index
columns are unchanged.

Adds Korean unit coverage for extractSearchTerms and an end-to-end
context.test.ts case proving Korean symbols now surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>