Skip to content

HIVE-28755: Statistics Management Task#6438

Open
DanielZhu58 wants to merge 7 commits intoapache:masterfrom
DanielZhu58:HIVE-28755
Open

HIVE-28755: Statistics Management Task#6438
DanielZhu58 wants to merge 7 commits intoapache:masterfrom
DanielZhu58:HIVE-28755

Conversation

@DanielZhu58
Copy link
Copy Markdown
Contributor

@DanielZhu58 DanielZhu58 commented Apr 17, 2026

What changes were proposed in this pull request?

Added StatisticsManagementTask, a new MetastoreTaskThread that periodically scans TAB_COL_STATS and PART_COL_STATS and deletes column statistics whose lastAnalyzed timestamp is older than the configured retention period. Three new configs are introduced: metastore.column.statistics.auto.deletion (default: false), metastore.column.statistics.retention.period (default: 365 days), and metastore.column.statistics.management.task.frequency (default: 7 days). Individual tables can opt out by setting the table property statistics.auto.deletion.exclude.

Why are the changes needed?

Column statistics can become stale over time and consume unnecessary storage. There was no existing mechanism to automatically clean them up based on age, requiring manual intervention.

Does this PR introduce any user-facing change?

Yes. Three new metastore configuration properties are introduced. The feature is opt-in and disabled by default (metastore.column.statistics.auto.deletion=false), so existing deployments are unaffected after upgrade.

How was this patch tested?

Added TestStatisticsManagement with two unit tests: one verifying that expired table-level column statistics are deleted when the task runs, and one verifying that tables marked with the exclude property are left untouched.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new Metastore background task intended to automatically delete expired column statistics, and wires it into the metastore task thread list. The PR also introduces a benchmark hook and a new unit test for the statistics-management behavior.

Changes:

  • Introduce StatisticsManagementTask (a MetastoreTaskThread) that deletes column stats older than a configured retention window.
  • Add new metastore configuration knobs for stats-management frequency/retention/enablement and register the task in the task thread list.
  • Add a micro-benchmark entry and a new unit test class for statistics management.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
standalone-metastore/metastore-tools/metastore-benchmarks/src/main/java/org/apache/hadoop/hive/metastore/tools/HMSBenchmarks.java Adds a benchmark for StatisticsManagementTask (currently simulates/validates the wrong thing).
standalone-metastore/metastore-tools/metastore-benchmarks/src/main/java/org/apache/hadoop/hive/metastore/tools/BenchmarkTool.java Wires the new benchmark into the benchmark suite (one suite name is inconsistent).
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/TestStatisticsManagement.java New unit tests for stats auto-deletion (currently missing required statsDesc on the stats object).
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/StatisticsManagementTask.java New background task implementation (contains multiple correctness/compilation/runtime issues).
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/conf/MetastoreConf.java Adds new conf vars and registers the task; docstrings currently don’t match how the feature is enabled.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

@soumyakanti3578 soumyakanti3578 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't reviewed StatisticsManagementTask.java yet because there are too many sonar/checkstyle warning.

Please follow the hive coding conventions to get rid of those warnings: https://hive.apache.org/community/resources/howtocontribute/#coding-conventions

You can also see the checkstyle file for standalone-metastore at standalone-metastore/checkstyle/checkstyle.xml which has the default values that sonar checks against.

I think you can also use mvn checkstyle from the console or download the checkstyle plugin for IntelliJ IDEA to catch these warnings before submitting the PR.

In general, if you're creating a new file, it's a good idea to follow the official coding conventions, and if you're updating small parts of a file then it's okay to follow the existing style to maintain readability. :)

"Automatic partition management will look for tables using the specified table pattern"),

STATISTICS_MANAGEMENT_TASK_FREQUENCY("metastore.statistics.management.task.frequency",
"metastore.statistics.management.task.frequency",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first two arguments are the same. Should the second one (hiveName) be something else, like, hive.column.statistics.management.task.frequency? Not sure if it even matters tbh.

Similar issue for the other two props too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second argument (hiveName) exists for backward compatibility — when a config previously lived under a hive. prefix in older Hive versions, HMS needs to recognize both names. Since these three configs are brand new and have no prior hive.-prefixed equivalent, it's ok to keep them same.

String dbName = (String) row[0];
String tblName = (String) row[1];
String excludeVal = (String) row[2];
if (excludeVal != null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is excludeVal non null when a user sets statistics.auto.deletion.exclude = false? If the answer is yes, then maybe we should explicitly check if the property is set to true?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged and changed.

msc.deleteColumnStatistics(request);
}
} finally {
if (tblQuery != null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if tblQuery is AutoCloseable, we may not have close it manually. Maybe we should see if closeAll() is required at all or just a close() is fine. I believe if close() is sufficient, we can just remove the finally block and use try-with-resources but this is fine as-is. So I will leave it up to you.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Query is already wrapped in try-with-resources, so close() is called automatically when the block exits. In DataNucleus's JDO implementation, close() internally delegates to closeAll(), so the behavior is equivalent.

}
}
}
} No newline at end of file
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing newline.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged.

if (exclude) {
t.getParameters().put(
StatisticsManagementTask.STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY, "true");
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are only testing "true". We need another test for "false" or any other string value. I think this property should be explicitly checked for "true" and only then should we exclude the table from deletion. For all other values we should include the table.
What do you think?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense to me. It is a corner case that we need to cover. Added a new test which the value is false, and make sure it's still deleted.

boolean committed = false;
openTransaction();
try {
try (IMetaStoreClient msc = new HiveMetaStoreClient(conf)) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why create a HiveMetaStoreClient if we can use the method in RawStore directly?

LOG.debug("Auto statistics deletion started. Cleaning up table/partition column statistics over the retention period.");
long retentionMillis =
MetastoreConf.getTimeVar(conf, MetastoreConf.ConfVars.COLUMN_STATISTICS_RETENTION_PERIOD, TimeUnit.MILLISECONDS);
if (retentionMillis <= 0 || !MetastoreConf.getBoolVar(conf, MetastoreConf.ConfVars.COLUMN_STATISTICS_AUTO_DELETION)) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perform the check in runFrequency as well, so we don't schedule this empty task

if (!LOCK.tryLock()) {
return;
}
try {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unnecessary nested try

DeleteColumnStatisticsRequest request = new DeleteColumnStatisticsRequest(dbName, tblName);
request.setEngine("hive");
request.setTableLevel(true);
msc.deleteColumnStatistics(request);
Copy link
Copy Markdown
Member

@dengzhhu653 dengzhhu653 May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will it delete all stats even they are not stale?

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

Comment on lines +41 to +43
* <p>When {@code metastore.statistics.auto.deletion} is enabled, this task scans
* {@code TAB_COL_STATS} and {@code PART_COL_STATS} for rows whose {@code lastAnalyzed} timestamp
* is older than {@code metastore.statistics.retention.period}, and deletes them.
@Override
public void setConf(Configuration configuration) {
this.conf = new Configuration(configuration);
super.setConf(configuration);
Comment on lines +91 to +99
long now = System.currentTimeMillis();
long lastAnalyzedThreshold = (now - retentionMillis) / 1000;
PersistenceManager pm = getPersistenceManager();
boolean committed = false;
openTransaction();
try {
try (IMetaStoreClient msc = new HiveMetaStoreClient(conf)) {
deleteExpiredTableColStats(pm, msc, lastAnalyzedThreshold);
deleteExpiredPartitionColStats(pm, msc, lastAnalyzedThreshold);
Comment on lines +130 to +148
// partitionName does not exist on MTableColumnStatistics; omitted here
tblQuery.setResult(
"table.database.name, "
+ "table.tableName, "
+ "table.parameters.get(\"" + STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY + "\")");
@SuppressWarnings("unchecked")
List<Object[]> tblRows = (List<Object[]>) tblQuery.execute(lastAnalyzedThreshold);
for (Object[] row : tblRows) {
String dbName = (String) row[0];
String tblName = (String) row[1];
String excludeVal = (String) row[2];
if (excludeVal != null) {
LOG.info("Skipping auto deletion of table stats for {}.{} due to exclude property.",
dbName, tblName);
continue;
}
DeleteColumnStatisticsRequest request = new DeleteColumnStatisticsRequest(dbName, tblName);
request.setEngine("hive");
request.setTableLevel(true);
Comment on lines +182 to +197
for (Object[] row : partRows) {
String dbName = (String) row[0];
String tblName = (String) row[1];
String partName = (String) row[2];
String excludeVal = (String) row[3];
if (excludeVal != null) {
LOG.info("Skipping auto deletion of partition stats for {}.{} due to exclude property.",
dbName, tblName);
continue;
}
DeleteColumnStatisticsRequest request = new DeleteColumnStatisticsRequest(dbName, tblName);
request.setEngine("hive");
request.setTableLevel(false);
request.addToPart_names(partName);
msc.deleteColumnStatistics(request);
}
Comment on lines +103 to +115
@Test
public void testExpiredTableColStatsAreDeleted() throws Exception {
String dbName = "stats_db1";
String tableName = "tbl1";
createDbAndTable(dbName, tableName, false);
writeTableLevelColStats(dbName, tableName, "c1");
assertHasTableColStats(dbName, tableName, "c1");
makeAllTableColStatsOlderThanRetention(dbName, tableName);

runStatisticsManagementTask(conf);

assertNoTableColStats(dbName, tableName, "c1");
}
@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented May 8, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants