-
-
Notifications
You must be signed in to change notification settings - Fork 34.5k
gh-90533: Implement BytesIO.peek() #30808
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 30 commits
b833b83
50a2cfb
eaa7672
00457ae
882579d
c1eed72
afc200c
79ab9a4
b493914
2a1c85c
d398717
26d1e81
9a19ff9
9300ade
d214089
d6691b8
3e51adb
3661b65
cd40d77
04372bd
6b9ae8c
f7406f6
d9528e2
bc8134b
b6ffca8
5fe5645
4126a64
1ea40c2
77e04d6
4d2f2dd
c16bebf
08bd7da
6174fca
b8b8cf4
7ac914e
abbd8f0
07d9e4d
3d57f45
023ad25
203749b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -739,6 +739,15 @@ than raw I/O does. | |
|
|
||
| Return :class:`bytes` containing the entire contents of the buffer. | ||
|
|
||
| .. method:: peek(size=1, /) | ||
|
|
||
| Return bytes from the current position onwards without advancing the position. | ||
| At least one byte of data is returned if not at EOF. | ||
| Return an empty :class:`bytes` object at EOF. | ||
| If the size argument is negative or larger than the number of available bytes, | ||
| a copy of the buffer from the current position until the end is returned. | ||
|
vstinner marked this conversation as resolved.
Outdated
|
||
|
|
||
| .. versionadded:: 3.15 | ||
|
|
||
| .. method:: read1(size=-1, /) | ||
|
|
||
|
|
@@ -772,8 +781,13 @@ than raw I/O does. | |
|
|
||
| .. method:: peek(size=0, /) | ||
|
|
||
| Return bytes from the stream without advancing the position. The number of | ||
| bytes returned may be less or more than requested. If the underlying raw | ||
| Return bytes from the current position onwards without advancing the position. | ||
| At least one byte of data is returned if not at EOF. | ||
| Return an empty :class:`bytes` object at EOF. | ||
| At most one single read on the underlying raw stream is done to satisfy the call. | ||
|
marcelm marked this conversation as resolved.
Outdated
|
||
| The *size* argument is ignored. | ||
| The number of read bytes depends on the buffer size and the current position in the internal buffer. | ||
| If the underlying raw | ||
|
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @cmaloney Can you confirm whether this change of the
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I actually think it's inaccurate (it doesn't actually guarantee a single read since PEP-0475 EINTR retry was added...). To me is out of scope for this change (Adding BytesIO.peek). Possibly worth fixing in a separate github issue + discussion. |
||
| stream is non-blocking and the operation would block, returns empty bytes. | ||
|
|
||
| .. method:: read(size=-1, /) | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| Add :meth:`io.BytesIO.peek`. |
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not default to
size=0likeBufferedReader?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Read through the comments. In general anything returning a
bytesrather than abytes-likememoryview is going to require allocating and copying potentially many bytes... If the copy is really concerning I'd lean returning a memoryview rather than abyteswhich is a mandatory copy.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
memoryviewcame up in this comment and the three following it: #30808 (comment)I don’t know what the conclusion is given your comment. Should a memoryview be returned instead? Most important to me is compatibility with what
BufferedReader.peek()returns.I am not too concerned about the extra memory for a
bytesobject as long as the default issize=1.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 to not using memoryview / definitely a good rationale in that thread.
On 3.14.0 BufferedReader (intentionally limiting buffer size) gives:
Which doesn't match the behavior here or defaults. That in particular concerns me because it seems like people will build and test I/O stack pieces (ex. file parsing) expecting the
.peekbehavior ofBytesIOthen get something different when they read actual files/data...Looking for more alternatives, the documentation page says (https://docs.python.org/3/library/io.html#io.BufferedReader.peek):
To me that leaves an intentional gap where we can always return less data than the total amount available. Peek gives no guarantees. That makes me wonder: Could we default to 0 and just always slice
[:DEFAULT_BUFFER_SIZE]?That would mean you always get
DEFAULT_BUFFER_SIZEwhich matches defaultBufferedReaderbehavior unless there is less data available. If called in a loop yes that's a DEFAULT_BUFFER_SIZE repeated copy (BufferedIO also does that / thebytesrequires a copy), but it's a lot less than "the whole buffer".There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m fine with changing the default size, but let me describe what my thought was.
My concern with choosing a default size larger than 1 was that people might test their parsing code using
BytesIOthat relies on the buffer always having a certain size, but then when they useBufferedReader.peekin non-testing code it could fail to work becauseBufferedReader.peekonly guarantees that you get a single byte (if available). And if you’re almost at the end of the buffer, this does happen. Copying from an earlier comment:My thought was that by defaulting to size=1, people would be forced to at least think about this edge case. It felt like the more "conservative" choice.
That said, maybe
peekis mostly used at the start of a file, and in that case, evenBufferedReader.peekcan be relied on to return not just a single byte.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely sympathize with wanting to vary size over time... not sure of a simple way to do that in a single change though. For
BufferedReaderthere end up being two cases here:fd:peek()returns as much data as buffered currently (up to DEFAULT_BUFFER_SIZE), if there is no buffered data will attempt aread()to get some. If that returns no data / would block, will return empty bufferFor both those cases a length 1 peek return can happen as part of regular operation, as can shorter/longer.
I'm most used to
peekbeing used in parsing algorithms where looking ahead is often used as a way to disambiguate. For things where need a fixed readahead though usually have to build some other mechanism so that the read buffer is "guaranteed filled" to a minimum size which CPYthon.peek()on Buffered at the moment doesn't provide.One of my thoughts was potentially a feature extension on top of this work on
BytesIO: Being able to specify a "buffer size" window which is read through which gives the "peek return length varies"... Ideally that in time could pair with changingBufferedReader.peekto try and guarantee a specific length of data available which would make lexers and parsers simpler to write.I'm not sure the risk in changing defaults of arguments in CPython / would need to ask core developers with relevant experience there. In general CPython is pretty hesitant to make incompatible changes (https://peps.python.org/pep-0387/#making-incompatible-changes) and not sure where exactly changing a default arg from
1to0would fit in that framework. That has me hesitant to start with a distinct value fromBufferedReaderfor because it may be difficult to change later.I do think giving your thought of "definitely not whole file" answer to
BytesIO.peekis good design here. The best approach I have so far is to always just limit to a fixed length (which matches BufferedReader in some ways: You get the whole file if it fits). Open to other approaches as well; I have a preference to keep the design internal to BytesIO implementation rather than function signatures/default arguments as that is easier to change.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reached out to some core developers to learn about the risks both from a "duck typing" and backwards incompatibility pieces. Will hopefully have some clear guidance shortly / find out if
size=1is actually risky from people with a lot more experience.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summarizing from the discussion: Adding a cap is within the document and unlikely to break anything. If wanted to document the cap and guarantee the behavior would want to likely add a new argument (ex.
max_size). Changingsize=to be meaningful generally could definitely be nice.So lets match default of
size=0and cap atDEFAULT_BUFFER_SIZEfor now, that's the path I think is viable before feature freeze next Tuesday. Can expand in the future if needed and adjusting/refining in 3.16 is very doable.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I changed the default to
size=0and added a cap ofDEFAULT_BUFFER_SIZE, which is always applied even if an explicitsizeargument was provided. Note thatDEFAULT_BUFFER_SIZEhas recently been increased from 8K to 128K.