HIVE-28755: Statistics Management Task#6438
HIVE-28755: Statistics Management Task#6438DanielZhu58 wants to merge 7 commits intoapache:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new Metastore background task intended to automatically delete expired column statistics, and wires it into the metastore task thread list. The PR also introduces a benchmark hook and a new unit test for the statistics-management behavior.
Changes:
- Introduce
StatisticsManagementTask(aMetastoreTaskThread) that deletes column stats older than a configured retention window. - Add new metastore configuration knobs for stats-management frequency/retention/enablement and register the task in the task thread list.
- Add a micro-benchmark entry and a new unit test class for statistics management.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| standalone-metastore/metastore-tools/metastore-benchmarks/src/main/java/org/apache/hadoop/hive/metastore/tools/HMSBenchmarks.java | Adds a benchmark for StatisticsManagementTask (currently simulates/validates the wrong thing). |
| standalone-metastore/metastore-tools/metastore-benchmarks/src/main/java/org/apache/hadoop/hive/metastore/tools/BenchmarkTool.java | Wires the new benchmark into the benchmark suite (one suite name is inconsistent). |
| standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/TestStatisticsManagement.java | New unit tests for stats auto-deletion (currently missing required statsDesc on the stats object). |
| standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/StatisticsManagementTask.java | New background task implementation (contains multiple correctness/compilation/runtime issues). |
| standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/conf/MetastoreConf.java | Adds new conf vars and registers the task; docstrings currently don’t match how the feature is enabled. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
soumyakanti3578
left a comment
There was a problem hiding this comment.
I haven't reviewed StatisticsManagementTask.java yet because there are too many sonar/checkstyle warning.
Please follow the hive coding conventions to get rid of those warnings: https://hive.apache.org/community/resources/howtocontribute/#coding-conventions
You can also see the checkstyle file for standalone-metastore at standalone-metastore/checkstyle/checkstyle.xml which has the default values that sonar checks against.
I think you can also use mvn checkstyle from the console or download the checkstyle plugin for IntelliJ IDEA to catch these warnings before submitting the PR.
In general, if you're creating a new file, it's a good idea to follow the official coding conventions, and if you're updating small parts of a file then it's okay to follow the existing style to maintain readability. :)
| "Automatic partition management will look for tables using the specified table pattern"), | ||
|
|
||
| STATISTICS_MANAGEMENT_TASK_FREQUENCY("metastore.statistics.management.task.frequency", | ||
| "metastore.statistics.management.task.frequency", |
There was a problem hiding this comment.
The first two arguments are the same. Should the second one (hiveName) be something else, like, hive.column.statistics.management.task.frequency? Not sure if it even matters tbh.
Similar issue for the other two props too.
There was a problem hiding this comment.
The second argument (hiveName) exists for backward compatibility — when a config previously lived under a hive. prefix in older Hive versions, HMS needs to recognize both names. Since these three configs are brand new and have no prior hive.-prefixed equivalent, it's ok to keep them same.
| String dbName = (String) row[0]; | ||
| String tblName = (String) row[1]; | ||
| String excludeVal = (String) row[2]; | ||
| if (excludeVal != null) { |
There was a problem hiding this comment.
Is excludeVal non null when a user sets statistics.auto.deletion.exclude = false? If the answer is yes, then maybe we should explicitly check if the property is set to true?
There was a problem hiding this comment.
Acknowledged and changed.
| msc.deleteColumnStatistics(request); | ||
| } | ||
| } finally { | ||
| if (tblQuery != null) { |
There was a problem hiding this comment.
nit: if tblQuery is AutoCloseable, we may not have close it manually. Maybe we should see if closeAll() is required at all or just a close() is fine. I believe if close() is sufficient, we can just remove the finally block and use try-with-resources but this is fine as-is. So I will leave it up to you.
There was a problem hiding this comment.
Query is already wrapped in try-with-resources, so close() is called automatically when the block exits. In DataNucleus's JDO implementation, close() internally delegates to closeAll(), so the behavior is equivalent.
| } | ||
| } | ||
| } | ||
| } No newline at end of file |
| if (exclude) { | ||
| t.getParameters().put( | ||
| StatisticsManagementTask.STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY, "true"); | ||
| } |
There was a problem hiding this comment.
Here we are only testing "true". We need another test for "false" or any other string value. I think this property should be explicitly checked for "true" and only then should we exclude the table from deletion. For all other values we should include the table.
What do you think?
There was a problem hiding this comment.
Make sense to me. It is a corner case that we need to cover. Added a new test which the value is false, and make sure it's still deleted.
| boolean committed = false; | ||
| openTransaction(); | ||
| try { | ||
| try (IMetaStoreClient msc = new HiveMetaStoreClient(conf)) { |
There was a problem hiding this comment.
why create a HiveMetaStoreClient if we can use the method in RawStore directly?
| LOG.debug("Auto statistics deletion started. Cleaning up table/partition column statistics over the retention period."); | ||
| long retentionMillis = | ||
| MetastoreConf.getTimeVar(conf, MetastoreConf.ConfVars.COLUMN_STATISTICS_RETENTION_PERIOD, TimeUnit.MILLISECONDS); | ||
| if (retentionMillis <= 0 || !MetastoreConf.getBoolVar(conf, MetastoreConf.ConfVars.COLUMN_STATISTICS_AUTO_DELETION)) { |
There was a problem hiding this comment.
perform the check in runFrequency as well, so we don't schedule this empty task
| if (!LOCK.tryLock()) { | ||
| return; | ||
| } | ||
| try { |
| DeleteColumnStatisticsRequest request = new DeleteColumnStatisticsRequest(dbName, tblName); | ||
| request.setEngine("hive"); | ||
| request.setTableLevel(true); | ||
| msc.deleteColumnStatistics(request); |
There was a problem hiding this comment.
will it delete all stats even they are not stale?
| * <p>When {@code metastore.statistics.auto.deletion} is enabled, this task scans | ||
| * {@code TAB_COL_STATS} and {@code PART_COL_STATS} for rows whose {@code lastAnalyzed} timestamp | ||
| * is older than {@code metastore.statistics.retention.period}, and deletes them. |
| @Override | ||
| public void setConf(Configuration configuration) { | ||
| this.conf = new Configuration(configuration); | ||
| super.setConf(configuration); |
| long now = System.currentTimeMillis(); | ||
| long lastAnalyzedThreshold = (now - retentionMillis) / 1000; | ||
| PersistenceManager pm = getPersistenceManager(); | ||
| boolean committed = false; | ||
| openTransaction(); | ||
| try { | ||
| try (IMetaStoreClient msc = new HiveMetaStoreClient(conf)) { | ||
| deleteExpiredTableColStats(pm, msc, lastAnalyzedThreshold); | ||
| deleteExpiredPartitionColStats(pm, msc, lastAnalyzedThreshold); |
| // partitionName does not exist on MTableColumnStatistics; omitted here | ||
| tblQuery.setResult( | ||
| "table.database.name, " | ||
| + "table.tableName, " | ||
| + "table.parameters.get(\"" + STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY + "\")"); | ||
| @SuppressWarnings("unchecked") | ||
| List<Object[]> tblRows = (List<Object[]>) tblQuery.execute(lastAnalyzedThreshold); | ||
| for (Object[] row : tblRows) { | ||
| String dbName = (String) row[0]; | ||
| String tblName = (String) row[1]; | ||
| String excludeVal = (String) row[2]; | ||
| if (excludeVal != null) { | ||
| LOG.info("Skipping auto deletion of table stats for {}.{} due to exclude property.", | ||
| dbName, tblName); | ||
| continue; | ||
| } | ||
| DeleteColumnStatisticsRequest request = new DeleteColumnStatisticsRequest(dbName, tblName); | ||
| request.setEngine("hive"); | ||
| request.setTableLevel(true); |
| for (Object[] row : partRows) { | ||
| String dbName = (String) row[0]; | ||
| String tblName = (String) row[1]; | ||
| String partName = (String) row[2]; | ||
| String excludeVal = (String) row[3]; | ||
| if (excludeVal != null) { | ||
| LOG.info("Skipping auto deletion of partition stats for {}.{} due to exclude property.", | ||
| dbName, tblName); | ||
| continue; | ||
| } | ||
| DeleteColumnStatisticsRequest request = new DeleteColumnStatisticsRequest(dbName, tblName); | ||
| request.setEngine("hive"); | ||
| request.setTableLevel(false); | ||
| request.addToPart_names(partName); | ||
| msc.deleteColumnStatistics(request); | ||
| } |
| @Test | ||
| public void testExpiredTableColStatsAreDeleted() throws Exception { | ||
| String dbName = "stats_db1"; | ||
| String tableName = "tbl1"; | ||
| createDbAndTable(dbName, tableName, false); | ||
| writeTableLevelColStats(dbName, tableName, "c1"); | ||
| assertHasTableColStats(dbName, tableName, "c1"); | ||
| makeAllTableColStatsOlderThanRetention(dbName, tableName); | ||
|
|
||
| runStatisticsManagementTask(conf); | ||
|
|
||
| assertNoTableColStats(dbName, tableName, "c1"); | ||
| } |
|



What changes were proposed in this pull request?
Added StatisticsManagementTask, a new MetastoreTaskThread that periodically scans TAB_COL_STATS and PART_COL_STATS and deletes column statistics whose lastAnalyzed timestamp is older than the configured retention period. Three new configs are introduced: metastore.column.statistics.auto.deletion (default: false), metastore.column.statistics.retention.period (default: 365 days), and metastore.column.statistics.management.task.frequency (default: 7 days). Individual tables can opt out by setting the table property statistics.auto.deletion.exclude.
Why are the changes needed?
Column statistics can become stale over time and consume unnecessary storage. There was no existing mechanism to automatically clean them up based on age, requiring manual intervention.
Does this PR introduce any user-facing change?
Yes. Three new metastore configuration properties are introduced. The feature is opt-in and disabled by default (metastore.column.statistics.auto.deletion=false), so existing deployments are unaffected after upgrade.
How was this patch tested?
Added TestStatisticsManagement with two unit tests: one verifying that expired table-level column statistics are deleted when the task runs, and one verifying that tables marked with the exclude property are left untouched.