[fix](coordinator) build TPlan within the table lock to prevent NPE in getSchemaByIndexId#63058
Closed
feiniaofeiafei wants to merge 2 commits intoapache:branch-3.1from
Closed
[fix](coordinator) build TPlan within the table lock to prevent NPE in getSchemaByIndexId#63058feiniaofeiafei wants to merge 2 commits intoapache:branch-3.1from
feiniaofeiafei wants to merge 2 commits intoapache:branch-3.1from
Conversation
CIR-20142 / DORIS-23676: NullPointerException in OlapTable.getSchemaByIndexId caused by a race condition between query execution and schema change completion. Root cause: - Nereids planner acquires table read lock, selects selectedIndexId (= baseIndexId for DUP_KEYS/UNIQUE tables), then releases the lock after planning completes - Later, OlapScanNode.toThrift() calls getSchemaByIndexId(selectedIndexId) WITHOUT holding any table lock - A concurrent SchemaChangeJobV2.onFinished (under write lock) calls deleteIndexInfo(originIdxName) which removes originIdxId from indexIdToMeta - After the write lock is released, the query toThrift() gets null -> NPE Fix (ported from apache#59298): 1. NereidsPlanner: call cacheThriftPlans() inside the lock callback (while table read lock is still held), so the TPlan Thrift serialization happens under the lock 2. PlanFragment: add cacheThriftPlan() that lazily serializes planRoot.treeToThrift() via a memoized Supplier; toThrift() uses the cached result 3. OlapTable.getSchemaByIndexId: add null guard that throws a clear RuntimeException with context info (table, index id, available ids) instead of an opaque NPE Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
Contributor
Author
|
run buildall |
Contributor
Author
|
/review |
Fix alphabetical import order: TPartitionType must come before TPlan (TPartitionType starts with 'Tpa' which sorts before TPlan's 'Tpl'). This was causing the COMPILE CI check to fail due to Checkstyle violation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
Author
|
run buildall |
Contributor
Author
|
/review |
Contributor
Author
|
run buildall |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What problem does this PR solve?
Fix NPE:
Cannot invoke "org.apache.doris.catalog.MaterializedIndexMeta.getSchema()" because the return value of "java.util.Map.get(Object)" is nullin OlapTable.getSchemaByIndexId.Related: DORIS-23676, CIR-20142
Root cause (race condition):
selectedIndexId = baseIndexId, then releases the lock after planningSchemaChangeJobV2.onFinished(under write lock) callsdeleteIndexInfo(originIdxName)which removesoriginIdxIdfromindexIdToMetaOlapScanNode.toThrift()callsgetSchemaByIndexId(selectedIndexId)without holding any table lock → returns null → NPEFix (same approach as #59298 for master):
NereidsPlanner: callcacheThriftPlans()inside the lock callback (while table read lock is still held), serializing the TPlan Thrift under the lockPlanFragment: addcacheThriftPlan()that pre-serializesplanRoot.treeToThrift()via a memoizedSupplier;toThrift()reuses the cached resultOlapTable.getSchemaByIndexId: add null guard that throws a clearRuntimeExceptionwith context info instead of an opaque NPERelease note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?