[fix](coordinator) build TPlan within the table lock to prevent NPE in getSchemaByIndexId #63058

Closed
feiniaofeiafei wants to merge 2 commits into apache:branch-3.1 from feiniaofeiafei:fix/CIR-20142-schema-npe

@feiniaofeiafei
Contributor

What problem does this PR solve?

Fix NPE: Cannot invoke "org.apache.doris.catalog.MaterializedIndexMeta.getSchema()" because the return value of "java.util.Map.get(Object)" is null in OlapTable.getSchemaByIndexId.

Related: DORIS-23676, CIR-20142

Root cause (race condition):

  1. Nereids planner acquires table read lock, selects selectedIndexId = baseIndexId, then releases the lock after planning
  2. A concurrent SchemaChangeJobV2.onFinished (under write lock) calls deleteIndexInfo(originIdxName) which removes originIdxId from indexIdToMeta
  3. Later, OlapScanNode.toThrift() calls getSchemaByIndexId(selectedIndexId) without holding any table lock → returns null → NPE
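
The three steps above can be replayed deterministically on one thread. This is a minimal sketch, not Doris code: `RaceSketch`, its methods, and the index id `100L` are hypothetical stand-ins for `OlapTable`, the planner, and `SchemaChangeJobV2.onFinished`. It only shows that an id chosen under the read lock can be stale by the time it is looked up without a lock.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a table with per-index metadata (not the real OlapTable).
final class RaceSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<Long, String> indexIdToMeta = new HashMap<>();

    RaceSketch() {
        indexIdToMeta.put(100L, "base-schema");
    }

    // Step 1: the planner selects an index id under the read lock, then releases it.
    long plan() {
        lock.readLock().lock();
        try {
            return 100L;
        } finally {
            lock.readLock().unlock();
        }
    }

    // Step 2: the schema change removes the original index under the write lock.
    void finishSchemaChange() {
        lock.writeLock().lock();
        try {
            indexIdToMeta.remove(100L);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Step 3: a later lookup with no lock held sees the post-change map.
    String lookupWithoutLock(long indexId) {
        return indexIdToMeta.get(indexId);
    }
}
```

Calling `plan()`, then `finishSchemaChange()`, then `lookupWithoutLock(id)` yields `null`; dereferencing that result is the NPE reported above.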

Fix (same approach as #59298 for master):

  1. NereidsPlanner: call cacheThriftPlans() inside the lock callback (while table read lock is still held), serializing the TPlan Thrift under the lock
  2. PlanFragment: add cacheThriftPlan() that pre-serializes planRoot.treeToThrift() via a memoized Supplier; toThrift() reuses the cached result
  3. OlapTable.getSchemaByIndexId: add null guard that throws a clear RuntimeException with context info instead of an opaque NPE
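
The shape of the fix can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: `MemoSupplier`, `SketchTable`, and `SketchFragment` are hypothetical names, the "plan" is a plain string rather than a Thrift struct, and the memoizing supplier is hand-rolled so the example needs only the JDK (the PR does not say here which Supplier implementation it uses). The sketch forces the supplier inside `cacheThriftPlan()` so that the schema lookup happens while the caller still holds the read lock.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

// Hand-rolled memoizing Supplier (assumption): runs the delegate at most once.
final class MemoSupplier<T> implements Supplier<T> {
    private final Supplier<T> delegate;
    private boolean computed;
    private T value;

    MemoSupplier(Supplier<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized T get() {
        if (!computed) {
            value = delegate.get();
            computed = true;
        }
        return value;
    }
}

// Hypothetical stand-in for OlapTable.
final class SketchTable {
    private final String name;
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<Long, String> indexIdToSchema = new HashMap<>();

    SketchTable(String name) {
        this.name = name;
    }

    void readLock() { lock.readLock().lock(); }
    void readUnlock() { lock.readLock().unlock(); }

    void putIndex(long indexId, String schema) {
        lock.writeLock().lock();
        try { indexIdToSchema.put(indexId, schema); } finally { lock.writeLock().unlock(); }
    }

    void dropIndex(long indexId) {
        lock.writeLock().lock();
        try { indexIdToSchema.remove(indexId); } finally { lock.writeLock().unlock(); }
    }

    // Null guard: fail with context instead of an opaque NPE.
    String getSchemaByIndexId(long indexId) {
        String schema = indexIdToSchema.get(indexId);
        if (schema == null) {
            throw new RuntimeException("index " + indexId + " not found in table " + name
                    + ", available index ids: " + indexIdToSchema.keySet());
        }
        return schema;
    }
}

// Hypothetical stand-in for PlanFragment.
final class SketchFragment {
    private final SketchTable table;
    private final long selectedIndexId;
    private Supplier<String> cachedPlan;

    SketchFragment(SketchTable table, long selectedIndexId) {
        this.table = table;
        this.selectedIndexId = selectedIndexId;
    }

    private String serialize() {
        return "scan(" + table.getSchemaByIndexId(selectedIndexId) + ")";
    }

    // Intended to be called while the caller holds the table read lock.
    void cacheThriftPlan() {
        cachedPlan = new MemoSupplier<>(this::serialize);
        cachedPlan.get(); // force serialization now, under the lock
    }

    // Reuses the cached result if present; otherwise serializes on demand.
    String toThrift() {
        return cachedPlan != null ? cachedPlan.get() : serialize();
    }
}
```

With the plan cached under the read lock, a later `dropIndex` no longer affects `toThrift()`. A fragment that was never cached still hits the guarded lookup and fails with a message naming the table, the missing index id, and the ids that remain.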

Release note

None

Check List (For Author)

CIR-20142 / DORIS-23676: NullPointerException in OlapTable.getSchemaByIndexId
caused by a race condition between query execution and schema change completion.

Root cause:
- Nereids planner acquires table read lock, selects selectedIndexId (= baseIndexId
  for DUP_KEYS/UNIQUE tables), then releases the lock after planning completes
- Later, OlapScanNode.toThrift() calls getSchemaByIndexId(selectedIndexId) WITHOUT
  holding any table lock
- A concurrent SchemaChangeJobV2.onFinished (under write lock) calls
  deleteIndexInfo(originIdxName) which removes originIdxId from indexIdToMeta
- After the write lock is released, the query toThrift() gets null -> NPE

Fix (ported from apache#59298):
1. NereidsPlanner: call cacheThriftPlans() inside the lock callback (while table
   read lock is still held), so the TPlan Thrift serialization happens under the lock
2. PlanFragment: add cacheThriftPlan() that lazily serializes planRoot.treeToThrift()
   via a memoized Supplier; toThrift() uses the cached result
3. OlapTable.getSchemaByIndexId: add null guard that throws a clear RuntimeException
   with context info (table, index id, available ids) instead of an opaque NPE

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
feiniaofeiafei requested a review from morrySnow as a code owner May 7, 2026 10:14
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@feiniaofeiafei
Contributor Author

run buildall

@feiniaofeiafei
Contributor Author

/review

Fix alphabetical import order: TPartitionType must come before TPlan
(TPartitionType starts with 'TPa', which sorts before TPlan's 'TPl').

This was causing the COMPILE CI check to fail due to Checkstyle violation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
feiniaofeiafei reopened this May 7, 2026
@feiniaofeiafei
Contributor Author

run buildall

@feiniaofeiafei
Contributor Author

/review

@feiniaofeiafei
Contributor Author

run buildall
