fix: correct accounting in DictEncoder::estimated_memory_size, Interner::estimated_memory_size #9720
Conversation
The returned value should estimate the actual memory usage, but instead it used the encoded size of the dictionary data and bypassed the hash table memory usage added by the Interner. The implementation of Storage::estimated_memory_size for the unique key storage was not correct either, but it was unused. Correct both problems.
Is there some way to add tests for this change?
alamb
left a comment
Thanks for this @mzabaluev
    fn estimated_memory_size(&self) -> usize {
-        self.interner.storage().size_in_bytes + self.indices.len() * std::mem::size_of::<usize>()
+        self.interner.estimated_memory_size() + self.indices.len() * std::mem::size_of::<usize>()
I think this is the right direction (in the sense of account for the size in the structures that hold the memory, rather than the wrapper)
However, it is not clear to me that KeyStorage includes the heap bytes for types like BYTE_ARRAY
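The concern above can be made concrete with a small sketch. This is a hypothetical model, not the parquet crate's actual KeyStorage: variable-length values such as BYTE_ARRAY are modeled here as `Vec<Vec<u8>>`, where `size_of::<Vec<u8>>()` covers only each value's pointer/len/capacity header, so the byte payloads behind them must be counted separately.

```rust
use std::mem::size_of;

// Hypothetical sketch of heap accounting for variable-length values:
// the headers are inline in the outer Vec, but each inner Vec owns a
// separate heap allocation that a naive size_of-based estimate misses.
fn estimated_memory_size(uniques: &[Vec<u8>]) -> usize {
    let header_bytes = uniques.len() * size_of::<Vec<u8>>();
    // Heap bytes behind each variable-length value, counted explicitly.
    let heap_bytes: usize = uniques.iter().map(|v| v.capacity()).sum();
    header_bytes + heap_bytes
}

fn main() {
    let uniques = vec![vec![0u8; 100], vec![0u8; 200]];
    let size = estimated_memory_size(&uniques);
    // The 300 payload bytes dominate the two small headers.
    assert!(size >= 300 + 2 * size_of::<Vec<u8>>());
    println!("estimated {size} bytes");
}
```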
I think we need to add some tests for this code to make sure we have it right
Testing upper bounds would require intimate knowledge of reallocation behavior of Vec and hashbrown, but I'll try to get some confirmation.
Oh, you are correct about the byte arrays, I need to account for variable- and fixed length array values.
I have accounted for the byte arrays' allocations and added some tests.
|
Nobody seems to have really tested these methods. This should multiply the hash table capacity by the key size, not add them; the fact that there are no tests and the code is wrong is probably related.
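To illustrate the add-versus-multiply mistake with toy numbers: for a hash table with 64 slots and 8-byte keys, adding the two quantities yields 72, while the memory actually reserved for keys is 64 * 8 = 512 bytes. The function names below are illustrative only.

```rust
// Toy illustration of the accounting bug discussed above.
fn buggy_estimate(capacity: usize, key_size: usize) -> usize {
    capacity + key_size // the `+` that should have been `*`
}

fn correct_estimate(capacity: usize, key_size: usize) -> usize {
    capacity * key_size // bytes reserved for keys across all slots
}

fn main() {
    assert_eq!(buggy_estimate(64, 8), 72);
    assert_eq!(correct_estimate(64, 8), 512);
}
```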
To estimate allocation size, the vector capacity is the right quantity to use, not the length.
As we cannot test the exact memory allocation behavior, or even give an upper bound, without delving into implementation details of Vec and hashbrown, the only reliable test is to confirm that lower bounds are respected: increases in memory use from the empty state (where all vectors are empty and have therefore made no allocations) must account for at least the expected amount of memory.
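The lower-bound strategy can be sketched on a toy storage type (hypothetical, not the real DictEncoder): the exact capacity after growth depends on Vec's reallocation policy, but `capacity >= len` always holds, so after pushing 1000 `u64` values the estimate must grow by at least 1000 * 8 bytes.

```rust
use std::mem::size_of;

// Toy stand-in for the unique-key storage, used to show the
// lower-bound testing strategy described above.
struct ToyStorage {
    uniques: Vec<u64>,
}

impl ToyStorage {
    fn estimated_memory_size(&self) -> usize {
        // Capacity, not length: this reflects what was actually allocated.
        self.uniques.capacity() * size_of::<u64>()
    }
}

fn main() {
    let mut storage = ToyStorage { uniques: Vec::new() };
    let baseline = storage.estimated_memory_size();
    assert_eq!(baseline, 0); // an empty Vec makes no heap allocation

    for i in 0..1000u64 {
        storage.uniques.push(i);
    }
    // Lower bound only: capacity >= len, so at least 1000 * 8 bytes grew.
    let grown = storage.estimated_memory_size() - baseline;
    assert!(grown >= 1000 * size_of::<u64>());
}
```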
DictEncoder::estimated_memory_size, Interner::estimated_memory_size
    }

    #[cfg(test)]
    mod tests {
Thank you @mzabaluev
I ran these tests without your code change and they all passed. Thus I don't think they are covering whatever issue you have found:
andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$ git diff
diff --git a/parquet/src/encodings/encoding/dict_encoder.rs b/parquet/src/encodings/encoding/dict_encoder.rs
index 37cfdb9ba1..2f32f9c0bb 100644
--- a/parquet/src/encodings/encoding/dict_encoder.rs
+++ b/parquet/src/encodings/encoding/dict_encoder.rs
@@ -64,12 +64,7 @@ impl<T: DataType> Storage for KeyStorage<T> {
}
fn estimated_memory_size(&self) -> usize {
- let uniques_heap_bytes = match T::get_physical_type() {
- Type::FIXED_LEN_BYTE_ARRAY => self.type_length * self.uniques.len(),
- _ => <Self::Value as ParquetValueType>::variable_length_bytes(&self.uniques)
- .unwrap_or(0) as usize,
- };
- self.uniques.capacity() * std::mem::size_of::<T::T>() + uniques_heap_bytes
+ self.size_in_bytes + self.uniques.capacity() * std::mem::size_of::<T::T>()
}
}
diff --git a/parquet/src/util/interner.rs b/parquet/src/util/interner.rs
index deae3720d5..34c7d1390f 100644
--- a/parquet/src/util/interner.rs
+++ b/parquet/src/util/interner.rs
@@ -77,7 +77,9 @@ impl<S: Storage> Interner<S> {
/// Return estimate of the memory used, in bytes
#[allow(dead_code)] // not used in parquet_derive, so is dead there
pub fn estimated_memory_size(&self) -> usize {
- self.storage.estimated_memory_size() + self.dedup.allocation_size()
+ self.storage.estimated_memory_size() +
+ // estimate size of dedup hashmap as just th size of the keys
+ self.dedup.capacity() + std::mem::size_of::<S::Key>()
}
    /// Returns the storage for this interner
And then
andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$ cargo test --lib -p parquet -- dict_encoder
Finished `test` profile [unoptimized + debuginfo] target(s) in 0.10s
Running unittests src/lib.rs (target/debug/deps/parquet-d5e640393e7492a1)
running 6 tests
test encodings::encoding::dict_encoder::tests::test_estimated_memory_size_primitive_with_duplicates ... ok
test encodings::encoding::dict_encoder::tests::test_estimated_memory_size_primitive_all_distinct ... ok
test encodings::encoding::dict_encoder::tests::test_estimated_memory_size_fixed_len_byte_array_with_duplicates ... ok
test encodings::encoding::dict_encoder::tests::test_estimated_memory_size_byte_array_with_duplicates ... ok
test encodings::encoding::dict_encoder::tests::test_estimated_memory_size_fixed_len_byte_array_all_distinct ... ok
test encodings::encoding::dict_encoder::tests::test_estimated_memory_size_byte_array_all_distinct ... ok
test result: ok. 6 passed; 0 failed; 0 ignored; 0 measured; 1051 filtered out; finished in 0.00s
andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$
Which issue does this PR close?
DictEncoder::estimated_memory_size #9719, Incorrect accounting in Interner::estimated_memory_size #9744.

Rationale for this change
The returned value should estimate the actual memory usage, but instead it uses the encoded size of the dictionary data, and bypasses the hash table memory usage added by the Interner member. The Storage::estimated_memory_size implementation for the unique key storage was not correct either, but it was unused.

What changes are included in this PR?
Correct both problems by making KeyStorage's implementation of estimated_memory_size return the size of the allocated uniques vector plus the values' sizes where applicable, and making DictEncoder::estimated_memory_size delegate to the interner, which calls the method of KeyStorage and adds accounting for its own data structure.

Are these changes tested?
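The delegation chain described above can be sketched with simplified stand-in types (hypothetical, not the parquet crate's actual API): DictEncoder defers to Interner, which calls its storage's estimate and adds bytes for its own hash table.

```rust
use std::collections::HashMap;
use std::mem::size_of;

// Hypothetical stand-ins for KeyStorage, Interner, and DictEncoder,
// showing each layer accounting only for the memory it owns.
struct KeyStorage {
    uniques: Vec<u64>,
}

impl KeyStorage {
    fn estimated_memory_size(&self) -> usize {
        self.uniques.capacity() * size_of::<u64>()
    }
}

struct Interner {
    storage: KeyStorage,
    dedup: HashMap<u64, usize>,
}

impl Interner {
    fn estimated_memory_size(&self) -> usize {
        // Storage bytes plus (roughly) the hash table's own allocation.
        self.storage.estimated_memory_size()
            + self.dedup.capacity() * size_of::<(u64, usize)>()
    }
}

struct DictEncoder {
    interner: Interner,
    indices: Vec<usize>,
}

impl DictEncoder {
    fn estimated_memory_size(&self) -> usize {
        // Delegate to the interner rather than reaching into its storage.
        self.interner.estimated_memory_size()
            + self.indices.capacity() * size_of::<usize>()
    }
}

fn main() {
    let mut enc = DictEncoder {
        interner: Interner {
            storage: KeyStorage { uniques: Vec::new() },
            dedup: HashMap::new(),
        },
        indices: Vec::new(),
    };
    for v in 0..100u64 {
        enc.interner.storage.uniques.push(v);
        enc.interner.dedup.insert(v, v as usize);
        enc.indices.push(v as usize);
    }
    // Lower bound: at least the uniques and indices bytes are counted.
    assert!(enc.estimated_memory_size() >= 100 * (size_of::<u64>() + size_of::<usize>()));
}
```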
Added tests verifying that at least the expected added amounts are accounted for when values are added. Overreporting is hard to disprove due to dependency on allocation behavior internal to other libraries.
Are there any user-facing changes?
No.