fix: correct accounting in DictEncoder::estimated_memory_size, Interner::estimated_memory_size #9720
Conversation
The returned value should estimate the actual memory usage, but instead it used the encoded size of the dictionary data and bypassed the hash table memory usage added by the Interner. The implementation of Storage::estimated_memory_size for the unique key storage was not correct either, but it was unused. Correct both problems.
Is there some way to add tests for this change?
alamb
left a comment
Thanks for this @mzabaluev
    fn estimated_memory_size(&self) -> usize {
-        self.interner.storage().size_in_bytes + self.indices.len() * std::mem::size_of::<usize>()
+        self.interner.estimated_memory_size() + self.indices.len() * std::mem::size_of::<usize>()
I think this is the right direction (in the sense of account for the size in the structures that hold the memory, rather than the wrapper)
However, it is not clear to me that KeyStorage includes the heap bytes for types like BYTE_ARRAY
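The concern above can be made concrete with a small sketch. This is a hypothetical model, not the parquet crate's actual KeyStorage: variable-length values such as BYTE_ARRAY are modeled here as `Vec<Vec<u8>>`, where `size_of::<Vec<u8>>()` covers only each value's pointer/len/capacity header, so the byte payloads behind them must be counted separately.

```rust
use std::mem::size_of;

// Hypothetical sketch of heap accounting for variable-length values:
// the headers are inline in the outer Vec, but each inner Vec owns a
// separate heap allocation that a naive size_of-based estimate misses.
fn estimated_memory_size(uniques: &[Vec<u8>]) -> usize {
    let header_bytes = uniques.len() * size_of::<Vec<u8>>();
    // Heap bytes behind each variable-length value, counted explicitly.
    let heap_bytes: usize = uniques.iter().map(|v| v.capacity()).sum();
    header_bytes + heap_bytes
}

fn main() {
    let uniques = vec![vec![0u8; 100], vec![0u8; 200]];
    let size = estimated_memory_size(&uniques);
    // The 300 payload bytes dominate the two small headers.
    assert!(size >= 300 + 2 * size_of::<Vec<u8>>());
    println!("estimated {size} bytes");
}
```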
I think we need to add some tests for this code to make sure we have it right
Testing upper bounds would require intimate knowledge of reallocation behavior of Vec and hashbrown, but I'll try to get some confirmation.
Oh, you are correct about the byte arrays, I need to account for variable- and fixed length array values.
I have accounted for the byte arrays' allocations and added some tests.
|
Nobody seems to have really tested these methods. This should multiply the hash table capacity by the key size, not add them; the fact that there are no tests and the code is wrong is probably related.
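To illustrate the add-versus-multiply mistake with toy numbers: for a hash table with 64 slots and 8-byte keys, adding the two quantities yields 72, while the memory actually reserved for keys is 64 * 8 = 512 bytes. The function names below are illustrative only.

```rust
// Toy illustration of the accounting bug discussed above.
fn buggy_estimate(capacity: usize, key_size: usize) -> usize {
    capacity + key_size // the `+` that should have been `*`
}

fn correct_estimate(capacity: usize, key_size: usize) -> usize {
    capacity * key_size // bytes reserved for keys across all slots
}

fn main() {
    assert_eq!(buggy_estimate(64, 8), 72);
    assert_eq!(correct_estimate(64, 8), 512);
}
```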
To estimate allocation size, the vector capacity is the right quantity to use, not the length.
As we cannot test the exact memory allocation behavior, or even give an upper bound, without delving into implementation details of Vec and hashbrown, the only reliable test is to confirm that lower bounds are respected: increases in memory use from the empty state (where all vectors are empty and have therefore made no allocations) must account for at least the expected amount of memory.
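The lower-bound strategy can be sketched on a toy storage type (hypothetical, not the real DictEncoder): the exact capacity after growth depends on Vec's reallocation policy, but `capacity >= len` always holds, so after pushing 1000 `u64` values the estimate must grow by at least 1000 * 8 bytes.

```rust
use std::mem::size_of;

// Toy stand-in for the unique-key storage, used to show the
// lower-bound testing strategy described above.
struct ToyStorage {
    uniques: Vec<u64>,
}

impl ToyStorage {
    fn estimated_memory_size(&self) -> usize {
        // Capacity, not length: this reflects what was actually allocated.
        self.uniques.capacity() * size_of::<u64>()
    }
}

fn main() {
    let mut storage = ToyStorage { uniques: Vec::new() };
    let baseline = storage.estimated_memory_size();
    assert_eq!(baseline, 0); // an empty Vec makes no heap allocation

    for i in 0..1000u64 {
        storage.uniques.push(i);
    }
    // Lower bound only: capacity >= len, so at least 1000 * 8 bytes grew.
    let grown = storage.estimated_memory_size() - baseline;
    assert!(grown >= 1000 * size_of::<u64>());
}
```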
DictEncoder::estimated_memory_size, Interner::estimated_memory_size
    }

    #[cfg(test)]
    mod tests {
Thank you @mzabaluev
I ran these tests without your code change and they all passed. Thus I don't think they are covering whatever issue you have found:
andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$ git diff
diff --git a/parquet/src/encodings/encoding/dict_encoder.rs b/parquet/src/encodings/encoding/dict_encoder.rs
index 37cfdb9ba1..2f32f9c0bb 100644
--- a/parquet/src/encodings/encoding/dict_encoder.rs
+++ b/parquet/src/encodings/encoding/dict_encoder.rs
@@ -64,12 +64,7 @@ impl<T: DataType> Storage for KeyStorage<T> {
}
fn estimated_memory_size(&self) -> usize {
- let uniques_heap_bytes = match T::get_physical_type() {
- Type::FIXED_LEN_BYTE_ARRAY => self.type_length * self.uniques.len(),
- _ => <Self::Value as ParquetValueType>::variable_length_bytes(&self.uniques)
- .unwrap_or(0) as usize,
- };
- self.uniques.capacity() * std::mem::size_of::<T::T>() + uniques_heap_bytes
+ self.size_in_bytes + self.uniques.capacity() * std::mem::size_of::<T::T>()
}
}
diff --git a/parquet/src/util/interner.rs b/parquet/src/util/interner.rs
index deae3720d5..34c7d1390f 100644
--- a/parquet/src/util/interner.rs
+++ b/parquet/src/util/interner.rs
@@ -77,7 +77,9 @@ impl<S: Storage> Interner<S> {
/// Return estimate of the memory used, in bytes
#[allow(dead_code)] // not used in parquet_derive, so is dead there
pub fn estimated_memory_size(&self) -> usize {
- self.storage.estimated_memory_size() + self.dedup.allocation_size()
+ self.storage.estimated_memory_size() +
+ // estimate size of dedup hashmap as just th size of the keys
+ self.dedup.capacity() + std::mem::size_of::<S::Key>()
}
    /// Returns the storage for this interner
And then
andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$ cargo test --lib -p parquet -- dict_encoder
Finished `test` profile [unoptimized + debuginfo] target(s) in 0.10s
Running unittests src/lib.rs (target/debug/deps/parquet-d5e640393e7492a1)
running 6 tests
test encodings::encoding::dict_encoder::tests::test_estimated_memory_size_primitive_with_duplicates ... ok
test encodings::encoding::dict_encoder::tests::test_estimated_memory_size_primitive_all_distinct ... ok
test encodings::encoding::dict_encoder::tests::test_estimated_memory_size_fixed_len_byte_array_with_duplicates ... ok
test encodings::encoding::dict_encoder::tests::test_estimated_memory_size_byte_array_with_duplicates ... ok
test encodings::encoding::dict_encoder::tests::test_estimated_memory_size_fixed_len_byte_array_all_distinct ... ok
test encodings::encoding::dict_encoder::tests::test_estimated_memory_size_byte_array_all_distinct ... ok
test result: ok. 6 passed; 0 failed; 0 ignored; 0 measured; 1051 filtered out; finished in 0.00s
andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$
Which issue does this PR close?
DictEncoder::estimated_memory_size #9719, Incorrect accounting in Interner::estimated_memory_size #9744.

Rationale for this change
The returned value should estimate the actual memory usage, but instead it uses the encoded size of the dictionary data, and bypasses the hash table memory usage added by the Interner member. The Storage::estimated_memory_size implementation for the unique key storage was not correct either, but it was unused.

What changes are included in this PR?
Correct both problems by making KeyStorage's implementation of estimated_memory_size return the size of the allocated uniques vector plus the values' sizes where applicable, and making DictEncoder::estimated_memory_size delegate to the interner, which calls the method of KeyStorage and adds accounting for its own data structure.

Are these changes tested?
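The delegation chain described above can be sketched with simplified stand-in types (hypothetical, not the parquet crate's actual API): DictEncoder defers to Interner, which calls its storage's estimate and adds bytes for its own hash table.

```rust
use std::collections::HashMap;
use std::mem::size_of;

// Hypothetical stand-ins for KeyStorage, Interner, and DictEncoder,
// showing each layer accounting only for the memory it owns.
struct KeyStorage {
    uniques: Vec<u64>,
}

impl KeyStorage {
    fn estimated_memory_size(&self) -> usize {
        self.uniques.capacity() * size_of::<u64>()
    }
}

struct Interner {
    storage: KeyStorage,
    dedup: HashMap<u64, usize>,
}

impl Interner {
    fn estimated_memory_size(&self) -> usize {
        // Storage bytes plus (roughly) the hash table's own allocation.
        self.storage.estimated_memory_size()
            + self.dedup.capacity() * size_of::<(u64, usize)>()
    }
}

struct DictEncoder {
    interner: Interner,
    indices: Vec<usize>,
}

impl DictEncoder {
    fn estimated_memory_size(&self) -> usize {
        // Delegate to the interner rather than reaching into its storage.
        self.interner.estimated_memory_size()
            + self.indices.capacity() * size_of::<usize>()
    }
}

fn main() {
    let mut enc = DictEncoder {
        interner: Interner {
            storage: KeyStorage { uniques: Vec::new() },
            dedup: HashMap::new(),
        },
        indices: Vec::new(),
    };
    for v in 0..100u64 {
        enc.interner.storage.uniques.push(v);
        enc.interner.dedup.insert(v, v as usize);
        enc.indices.push(v as usize);
    }
    // Lower bound: at least the uniques and indices bytes are counted.
    assert!(enc.estimated_memory_size() >= 100 * (size_of::<u64>() + size_of::<usize>()));
}
```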
Added tests verifying that at least the expected added amounts are accounted for when values are added. Overreporting is hard to disprove due to dependency on allocation behavior internal to other libraries.
Are there any user-facing changes?
No.