-
Notifications
You must be signed in to change notification settings - Fork 135
perf: reduce allocations on the chunk reading hot path #339
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
matshch
wants to merge
4
commits into
buildbarn:main
Choose a base branch
from
matshch:fix-buffer-reuse
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
4 commits
Select commit
Hold shift + click to select a range
dea11c9
pkg/blobstore/grpcservers: add ByteStream Read benchmarks
matshch 8d97513
pkg/blobstore/buffer: add benchmark for readerBackedChunkReader
matshch 66f0f6f
pkg/blobstore/buffer: reuse chunk buffers across Read() calls
matshch b993b67
pkg/blobstore/grpcservers: use grpc.PreparedMsg to safely reuse chunk…
matshch File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,26 @@ | ||
| package buffer | ||
|
|
||
| import "sync" | ||
|
|
||
| // chunkBufferPool caches []byte buffers used by ChunkReader | ||
| // implementations that allocate a fixed-size working buffer per | ||
| // stream. Buffers are stored as *[]byte to avoid the interface-boxing | ||
| // allocation that occurs when storing a non-pointer value in | ||
| // sync.Pool. | ||
| var chunkBufferPool sync.Pool | ||
|
|
||
| func getChunkBuffer(size int) *[]byte { | ||
| if v := chunkBufferPool.Get(); v != nil { | ||
| b := v.(*[]byte) | ||
| if cap(*b) >= size { | ||
| *b = (*b)[:size] | ||
| return b | ||
| } | ||
| } | ||
| b := make([]byte, size) | ||
| return &b | ||
| } | ||
|
|
||
| func putChunkBuffer(b *[]byte) { | ||
| chunkBufferPool.Put(b) | ||
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
82 changes: 82 additions & 0 deletions
82
pkg/blobstore/buffer/reader_backed_chunk_reader_bench_test.go
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,82 @@ | ||
| package buffer_test | ||
|
|
||
| import ( | ||
| "bytes" | ||
| "testing" | ||
|
|
||
| "github.com/buildbarn/bb-storage/pkg/blobstore/buffer" | ||
| ) | ||
|
|
||
| type readAtCloser struct{ *bytes.Reader } | ||
|
|
||
| func (readAtCloser) Close() error { return nil } | ||
|
|
||
| const benchChunkSize = 1 << 16 | ||
|
|
||
| var benchBlobSizes = []struct { | ||
| name string | ||
| blobSize int64 | ||
| }{ | ||
| {"BlobSize=4KiB", 4 << 10}, | ||
| {"BlobSize=64KiB", 64 << 10}, | ||
| {"BlobSize=256KiB", 256 << 10}, | ||
| {"BlobSize=4MiB", 4 << 20}, | ||
| {"BlobSize=64MiB", 64 << 20}, | ||
| } | ||
|
|
||
| // NewValidatedBufferFromReaderAt is the cheapest public entry point | ||
| // that resolves to newReaderBackedChunkReader without wrapping | ||
| // decorators. | ||
| func BenchmarkReaderBackedChunkReader(b *testing.B) { | ||
| for _, c := range benchBlobSizes { | ||
| b.Run(c.name, func(b *testing.B) { | ||
| data := make([]byte, c.blobSize) | ||
| for i := range data { | ||
| data[i] = byte(i) | ||
| } | ||
|
|
||
| b.SetBytes(c.blobSize) | ||
| b.ReportAllocs() | ||
| b.ResetTimer() | ||
|
|
||
| for i := 0; i < b.N; i++ { | ||
| buf := buffer.NewValidatedBufferFromReaderAt(readAtCloser{bytes.NewReader(data)}, c.blobSize) | ||
| r := buf.ToChunkReader(0, benchChunkSize) | ||
| for { | ||
| if _, err := r.Read(); err != nil { | ||
| break | ||
| } | ||
| } | ||
| r.Close() | ||
| } | ||
| }) | ||
| } | ||
| } | ||
|
|
||
| func BenchmarkReaderBackedChunkReaderParallel(b *testing.B) { | ||
| for _, c := range benchBlobSizes { | ||
| b.Run(c.name, func(b *testing.B) { | ||
| data := make([]byte, c.blobSize) | ||
| for i := range data { | ||
| data[i] = byte(i) | ||
| } | ||
|
|
||
| b.SetBytes(c.blobSize) | ||
| b.ReportAllocs() | ||
| b.ResetTimer() | ||
|
|
||
| b.RunParallel(func(pb *testing.PB) { | ||
| for pb.Next() { | ||
| buf := buffer.NewValidatedBufferFromReaderAt(readAtCloser{bytes.NewReader(data)}, c.blobSize) | ||
| r := buf.ToChunkReader(0, benchChunkSize) | ||
| for { | ||
| if _, err := r.Read(); err != nil { | ||
| break | ||
| } | ||
| } | ||
| r.Close() | ||
| } | ||
| }) | ||
| }) | ||
| } | ||
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,71 @@ | ||
| package buffer_test | ||
|
|
||
| import ( | ||
| "bytes" | ||
| "io" | ||
| "testing" | ||
|
|
||
| "github.com/buildbarn/bb-storage/pkg/blobstore/buffer" | ||
| "github.com/stretchr/testify/require" | ||
| ) | ||
|
|
||
| func TestReaderBackedChunkReaderReusesBuffer(t *testing.T) { | ||
| const ( | ||
| chunkSize = 16 | ||
| blobSize = int64(chunkSize * 4) | ||
| ) | ||
| data := make([]byte, blobSize) | ||
| for i := range data { | ||
| data[i] = byte(i) | ||
| } | ||
|
|
||
| b := buffer.NewValidatedBufferFromReaderAt(readAtCloser{bytes.NewReader(data)}, blobSize) | ||
| r := b.ToChunkReader(0, chunkSize) | ||
| defer r.Close() | ||
|
|
||
| first, err := r.Read() | ||
| require.NoError(t, err) | ||
| require.Equal(t, chunkSize, len(first)) | ||
| firstAddr := &first[0] | ||
|
|
||
| for i := 0; i < 3; i++ { | ||
| next, err := r.Read() | ||
| require.NoError(t, err) | ||
| require.Equal(t, chunkSize, len(next)) | ||
| require.Same(t, firstAddr, &next[0], | ||
| "successive chunks must share backing storage") | ||
| } | ||
|
|
||
| _, err = r.Read() | ||
| require.Equal(t, io.EOF, err) | ||
| } | ||
|
|
||
| // TestReaderBackedChunkReaderClosePoolsBuffer retries because | ||
| // sync.Pool may drop entries between calls (e.g. on GC). | ||
| func TestReaderBackedChunkReaderClosePoolsBuffer(t *testing.T) { | ||
| const chunkSize = 4096 | ||
| data := make([]byte, chunkSize) | ||
|
|
||
| var observedReuse bool | ||
| for attempt := 0; attempt < 100 && !observedReuse; attempt++ { | ||
| b1 := buffer.NewValidatedBufferFromReaderAt(readAtCloser{bytes.NewReader(data)}, int64(chunkSize)) | ||
| r1 := b1.ToChunkReader(0, chunkSize) | ||
| chunk, err := r1.Read() | ||
| require.NoError(t, err) | ||
| addr1 := &chunk[0] | ||
| _, err = r1.Read() | ||
| require.Equal(t, io.EOF, err) | ||
| r1.Close() | ||
|
|
||
| b2 := buffer.NewValidatedBufferFromReaderAt(readAtCloser{bytes.NewReader(data)}, int64(chunkSize)) | ||
| r2 := b2.ToChunkReader(0, chunkSize) | ||
| chunk2, err := r2.Read() | ||
| require.NoError(t, err) | ||
| if &chunk2[0] == addr1 { | ||
| observedReuse = true | ||
| } | ||
| r2.Close() | ||
| } | ||
| require.True(t, observedReuse, | ||
| "expected sync.Pool to return a previously-released buffer at least once across 100 attempts") | ||
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand, you're trying to solve the issue that transmitting a large file causes O(n) allocations. However, with this specific change you're trying to reduce this to 'less-than-1' allocation by using a sync.Pool. Is that really necessary?
In other words, why don't we just write this instead?
Sure, it's one more allocation. But are we sure that that's actually part of the bottleneck?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main issue is that bb-storage on our workload is GC-bound:

And the biggest source of allocations is this place:

We have all kinds of blob sizes in CAS. Many of them do fit in the default chunk size, but we also have a huge tail of blobs which are in the realm of gigabytes:

With this distribution of sizes it looks like we need to optimize both small and huge files, that is why I've looked into
sync.Poolfirst, and then found that we can actually reuse one buffer (except for the gRPC part, which is really unfortunate).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, can't we first do some benchmarking to see whether making a single allocation is sufficient?
I am not a fan of using sync.Pool here. I get why sync.Pool exists. Namely, you often want to use it for holding objects that are complex to initialize. But in our case that shouldn't need to apply. We're just talking about simple byte arrays here. If we already need to use sync.Pool for storing simple byte slices, then Go's memory management must be really bad.
I also suspect that a big amount of waste is coming from the fact that we always allocate
maximumChunkSizeBytesinstead of something like:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Over-allocating is wasteful, but not the kind of waste that introduces GC overhead. GC overhead is sensitive to the number of allocations and not especially to their size.
I agree that
sync.Poolis a hack here.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not exactly true. https://pkg.go.dev/sync#Pool:
https://victoriametrics.com/blog/go-sync-pool/ provides several examples where the standard library utilizes pools for simple objects. https://cs.opensource.google/go/go/+/master:src/net/http/server.go;l=857-870;drc=15b9fc2659f77608548cb279c5e0565b0664cfca;bpv=1;bpt=1 is basically our exact case — just allocating a buffer of constant size.
The point of the pool here is to make sure that we allocate less buffers overall, so GC needs to do less work overall. It is usually negligible when we have other work to do, but in case of bb-storage it might be noticeable depending on the workload.
I have done some benchmarking with both
sync.Pooland without it:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be a big amount of waste from the point of allocated memory, but I wouldn't say we have significant problems in that regard. And speaking about CPU usage:
sync.Pool.But there is another interesting angle to all of that — Go has a significantly more complex data path for allocations > 32 KB. Maybe it makes sense to decrease the size of these buffers to 32 KB to get lock-free allocations.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have conducted several more experiments:
sync.Pooldefinitely helps to reduce CPU usage compared to allocating new constant-size buffers each time, especially in case of smaller blobs, as we can reuse them quicker.sync.Poolstill outperforms this idea by 10-30% on blobs <64 KB.