
HBASE-30161 Add paginated, single-RPC RegionLocator.getRegionLocations(startKey, limit) API for bulk meta-cache warmup#8236

Open
sanjeet006py wants to merge 5 commits into apache:branch-2 from sanjeet006py:HBASE-30161

Conversation

@sanjeet006py
Contributor

JIRA: HBASE-30161

Generated-by: Claude Opus 4.7

Sanjeet Malhotra added 2 commits May 14, 2026 13:49
public List<HRegionLocation> getRegionLocations(byte[] startKey, int limit) throws IOException {
// No need to page as region locations are already in-memory.
return getAllRegionLocations();
}
Contributor

This is actually breaking the contract established by the javadoc by always returning all regions.

Contributor Author

Yeah, I was wondering the same, thanks for bringing this up. Instead of implementing paging here, shall I throw an exception telling callers to use getAllRegionLocations? I don't see an advantage to paging here.

Contributor

Sure, that works too, as long as you call out that not all implementations may support it and that callers should fall back to getAllRegionLocations.
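The caller-side fallback being discussed could look roughly like the sketch below. It uses simplified stand-in types (a `Locator` interface returning region names as strings) rather than the real `RegionLocator`/`HRegionLocation` API, just to show the try-paged-then-fall-back shape.

```java
import java.util.Arrays;
import java.util.List;

// Minimal stand-in for RegionLocator to illustrate the caller-side fallback;
// names and shapes are simplified, not the actual HBase API.
interface Locator {
  List<String> getAllRegionLocations();

  // Paged variant; implementations that cannot page throw by default.
  default List<String> getRegionLocationsPage(byte[] startKey, int limit) {
    throw new UnsupportedOperationException("use getAllRegionLocations()");
  }
}

public class PagedFallback {
  // Try the paged call first; fall back to the full listing when the
  // implementation does not support paging.
  public static List<String> firstPageOrAll(Locator locator, byte[] startKey, int limit) {
    try {
      return locator.getRegionLocationsPage(startKey, limit);
    } catch (UnsupportedOperationException e) {
      return locator.getAllRegionLocations();
    }
  }

  public static void main(String[] args) {
    // Lambda implements only getAllRegionLocations; the default method throws.
    Locator inMemory = () -> Arrays.asList("r1", "r2", "r3");
    System.out.println(firstPageOrAll(inMemory, new byte[0], 2)); // falls back to the full list
  }
}
```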

* more regions exist
* @throws IOException if a remote or network exception occurs
*/
List<HRegionLocation> getRegionLocations(byte[] startKey, int limit) throws IOException;
Contributor

Since this changes a public interface on a stable branch, it will break external code implementing the interface. Why not have a default implementation that throws UnsupportedOperationException, or perhaps returns an empty list?
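The compatibility point above can be sketched with simplified stand-in types (these are illustrative, not the real RegionLocator): an "external" implementation written before the new method existed keeps compiling because the addition is a default method.

```java
import java.util.Collections;
import java.util.List;

// Stand-in interface, not the real org.apache.hadoop.hbase.client.RegionLocator.
interface RegionLocatorLike {
  List<String> getAllRegionLocations();

  // New API added as a default so pre-existing implementers still compile.
  // Throwing keeps the paging contract honest; returning an empty list is
  // the other option mentioned in the review.
  default List<String> getRegionLocationsPage(byte[] startKey, int limit) {
    throw new UnsupportedOperationException(
      "paging not supported; fall back to getAllRegionLocations()");
  }
}

// An "external" implementation written before the new method existed.
class LegacyLocator implements RegionLocatorLike {
  @Override
  public List<String> getAllRegionLocations() {
    return Collections.singletonList("region-0");
  }
}

public class DefaultMethodCompat {
  public static void main(String[] args) {
    RegionLocatorLike locator = new LegacyLocator(); // compiles with no override
    System.out.println(locator.getAllRegionLocations());
    try {
      locator.getRegionLocationsPage(new byte[0], 10);
    } catch (UnsupportedOperationException expected) {
      System.out.println("paged call not supported, as expected");
    }
  }
}
```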

* Ordering: regions are returned in ascending region start-key order (the natural order of
* {@code hbase:meta} rows for a single table). Within each region, replicas are returned in
* ascending replica-id order (replica 0, then 1, then 2, ...). Split parents and offline regions
* are filtered out, which may cause a page to contain fewer than {@code limit} regions but never
Contributor

Where is this filtering happening? I didn't see any test coverage either. The existing methods don't do any such filtering, correct? So is this even needed?

Contributor Author

Filtering is already implemented and happens inside MetaTableAccessor.DefaultVisitorBase#visit(). Call chain: HRegionLocator#getRegionLocations() -> MetaTableAccessor.TableVisitorBase#visit() -> MetaTableAccessor.DefaultVisitorBase#visit().

Contributor Author

@sanjeet006py May 14, 2026

I am reusing the existing filtering logic, so no new test coverage is needed.
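The filtering behavior described in the javadoc and this exchange can be modeled as below. This is an illustrative sketch only; the real logic lives in MetaTableAccessor.DefaultVisitorBase#visit(), and `RegionLike` is a simplified stand-in for RegionInfo.

```java
import java.util.ArrayList;
import java.util.List;

public class MetaFilterSketch {
  // Simplified stand-in for RegionInfo: just the two flags the filter checks.
  static final class RegionLike {
    final String name;
    final boolean splitParent;
    final boolean offline;
    RegionLike(String name, boolean splitParent, boolean offline) {
      this.name = name; this.splitParent = splitParent; this.offline = offline;
    }
  }

  // Keep only regions a client should route to. Note the result can contain
  // fewer entries than the rows scanned, which is why a page may come back
  // with fewer than `limit` regions.
  static List<String> visibleRegions(List<RegionLike> scanned) {
    List<String> out = new ArrayList<>();
    for (RegionLike r : scanned) {
      if (r.splitParent || r.offline) {
        continue; // dropped, mirroring the visitor's filtering
      }
      out.add(r.name);
    }
    return out;
  }

  public static void main(String[] args) {
    List<RegionLike> rows = new ArrayList<>();
    rows.add(new RegionLike("a", false, false));
    rows.add(new RegionLike("parent", true, false)); // split parent: dropped
    rows.add(new RegionLike("b", false, false));
    rows.add(new RegionLike("off", false, true));    // offline: dropped
    System.out.println(visibleRegions(rows));        // [a, b]
  }
}
```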

* more regions exist
* @throws IOException if a remote or network exception occurs
*/
List<HRegionLocation> getRegionLocations(byte[] startKey, int limit) throws IOException;
Contributor

Also, this overloads an existing method that takes a data row and finds its locations, which has totally different semantics, so reusing that name for this can be very confusing. Perhaps just call this getLocations, or maybe you can come up with a better name.

sanjeet006py pushed a commit to sanjeet006py/hbase that referenced this pull request May 14, 2026
… and add default-throws

Review comments on apache#8236 by haridsv:
- Adding an abstract method to RegionLocator / AsyncTableRegionLocator breaks
  external implementers. Convert the new method to a default that throws
  UnsupportedOperationException, with javadoc instructing callers to fall
  back to getAllRegionLocations().
- Reusing the getRegionLocations name overloads a method with completely
  different semantics (row -> all replicas of containing region). Rename to
  getRegionLocationsPage(byte[] startKey, int limit) to make the
  pagination/range intent explicit.
- SnapshotRegionLocator no longer needs an override; it inherits the
  default-throws and callers fall back to getAllRegionLocations().

Tests renamed to call getRegionLocationsPage; semantics and assertions
unchanged.
Sanjeet Malhotra added 3 commits May 15, 2026 01:57
… and add default-throws

Review comments by haridsv on apache#8236:
- Adding an abstract method to RegionLocator on a stable branch breaks
  external implementers. Convert the new method to a default that throws
  UnsupportedOperationException, with javadoc instructing callers to fall
  back to getAllRegionLocations().
- Reusing the getRegionLocations name overloads a method with completely
  different semantics (row -> all replicas of containing region). Rename to
  getRegionLocationsPage(byte[] startKey, int limit) to make the
  pagination/range intent explicit.
- SnapshotRegionLocator no longer needs an override; it inherits the
  default-throws and callers fall back to getAllRegionLocations().

Tests renamed to call getRegionLocationsPage; semantics and assertions
unchanged.
Address the canvas review point that the paged meta scan was making
ceil(limit / hbase.meta.scanner.caching) ScannerNext RPCs whenever the
limit exceeded the configured caching, contradicting the
"completes in a single RPC" javadoc on
MetaTableAccessor.scanMetaForTableRegions and the
"at most one RPC per invocation" claim on
RegionLocator.getRegionLocationsPage.

Plumb a private isPagedScan flag through MetaTableAccessor.scanMeta and
getMetaScan. When true, set setCaching to the caller-supplied
rowUpperLimit so the slice returns in a single ScannerNext RPC regardless
of the configured caching default. All other existing scanMeta callers
(scanByRegionEncodedName, getClosestRegionInfo, scanMetaForTableRegions
without rowLimit) keep their existing behavior by passing
isPagedScan = false.
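The commit's caching decision can be reduced to a small toy model. `ScanConfig` stands in for the real Scan object and `DEFAULT_META_SCANNER_CACHING` for the `hbase.meta.scanner.caching` setting; both are illustrative names, not the actual HBase internals.

```java
public class PagedScanCaching {
  // Stand-in for the configured hbase.meta.scanner.caching value.
  static final int DEFAULT_META_SCANNER_CACHING = 100;

  static final class ScanConfig {
    int caching;
  }

  // Mirrors the isPagedScan branch described above: a paged scan pins the
  // caching to the caller's row limit so the whole slice fits in one
  // ScannerNext round trip; every other caller keeps the configured default.
  static ScanConfig getMetaScan(int rowUpperLimit, boolean isPagedScan) {
    ScanConfig scan = new ScanConfig();
    if (isPagedScan && rowUpperLimit > 0) {
      scan.caching = rowUpperLimit;
    } else {
      scan.caching = DEFAULT_META_SCANNER_CACHING;
    }
    return scan;
  }

  public static void main(String[] args) {
    System.out.println(getMetaScan(500, true).caching);  // 500: the limit wins
    System.out.println(getMetaScan(500, false).caching); // 100: default kept
  }
}
```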
…scans

Mirrors the master-PR test (TestRegionLocatorPagedScanRpcCount on
apache#8237). On the sync path we proxy:

  Connection.getTable(META_TABLE_NAME) -> Table.getScanner(Scan) ->
  ResultScanner.close()

so we can flip Scan.setScanMetricsEnabled(true) on the way in and read
ScanMetrics.countOfRPCcalls before the underlying scanner is closed. The
RPC count is the natural sync analogue of the
AdvancedScanResultConsumer.onNext invocation count we used on master.

Cluster runs with hbase.meta.scanner.caching = 2 against a table of
5 user regions, asserting:
- limit <= caching: 1 ScannerNext RPC (baseline)
- limit > caching: 1 ScannerNext RPC (regression check; without the
                   isPagedScan fix this would be ceil(5/2) = 3)
- unbounded scan:  ceil(NUM_REGIONS / caching) = 3 ScannerNext RPCs
                   (proves the isPagedScan flag does not leak into the
                   non-paged callers)

Verified the test catches the regression: reverting the isPagedScan
branch in MetaTableAccessor.getMetaScan makes
testSingleRpcWhenLimitExceedsCaching fail with countOfRPCcalls = 3.
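The expected RPC counts asserted in that test come down to ceiling division, as a small sketch makes explicit (illustrative arithmetic only; names here are not part of the HBase API).

```java
public class ScannerRpcCount {
  // A scan returning `rows` results with client caching `caching` takes
  // ceil(rows / caching) ScannerNext RPCs under the model the test asserts.
  static int expectedScannerNextRpcs(int rows, int caching) {
    return (rows + caching - 1) / caching; // integer ceiling division
  }

  public static void main(String[] args) {
    // 5 user regions, caching = 2, unbounded scan: ceil(5/2) = 3 RPCs.
    System.out.println(expectedScannerNextRpcs(5, 2)); // 3
    // Paged scan with caching pinned to the limit: a single RPC.
    System.out.println(expectedScannerNextRpcs(5, 5)); // 1
  }
}
```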
@sanjeet006py sanjeet006py requested a review from haridsv May 15, 2026 05:45
* Cluster runs with {@code hbase.meta.scanner.caching = 2} so the {@code limit > caching} branch is
* exercised cheaply with a small table (5 user regions).
*/
@Category({ MediumTests.class, ClientTests.class })
Contributor

Considering that the purpose is to verify that the scanner caching value is correct, I feel a mini cluster test is unnecessary and adds too much to the test duration. Please check if you can achieve this using a mock.

Contributor Author

I tried to find a way to do this with a mock and got a bit stuck — wanted to share where I landed.

The property the test is asserting is "exactly one ScannerNext RPC fires for a paged scan, regardless of hbase.meta.scanner.caching." That isn't really a property of MetaTableAccessor in isolation — it's an emergent property of ClientScanner + ScannerCallable + the RegionServer responding to the scan's caching/limit, and ScanMetrics.countOfRPCcalls is incremented inside that machinery on real responses.

Two mock-based shapes I considered, both of which seem to weaken the test:

  1. Capture the Scan and assert scan.getCaching() == rowLimit. This restates the one-line fix in getMetaScan and wouldn't catch a future regression where caching is set correctly but the client-side batching contract changes.
  2. Hand-roll a fake ResultScanner that simulates ScannerNext batching from caching/limit. That re-implements the contract we're verifying inside the test fake and asserts our fake matches our expectation — feels tautological.

If you see a shape I missed that preserves the RPC-count assertion without a cluster, I'd be happy to switch. Otherwise I'd lean toward keeping this as an integration test.

* Release {@link #userRegionLock} previously acquired via {@link #takeUserRegionLock()} and
* record the held duration in metrics.
* @param lockStartTimeMs value of {@link EnvironmentEdgeManager#currentTime()} captured
* immediately after {@link #takeUserRegionLock()} returned
Contributor

wonder how spotless didn't take care of spaces.

Contributor Author

I think this spacing is valid: we are aligning the listed method parameters, which is why spotless didn't flag it either. Viewed in an editor, it lines up correctly.
[Screenshot: javadoc parameter alignment in the editor, 2026-05-15]
