fix(postgresql-proxy): prevent SSL COPY stalls by draining nonblocking reads#11
Conversation
e7d5df8 to
1c51600
Compare
pinzon
left a comment
There was a problem hiding this comment.
So basically, the SSL handshake is now done incrementally in the event loop instead of blocking on accept, and once it's complete we keep draining the SSL buffer until it's empty so data doesn't get stranded. Right?
Thank you for also adding a test workflow, now we can test changes here without relying solely on LS. I'll let the final say to @bentsku or @cloutierMat
cloutierMat
left a comment
There was a problem hiding this comment.
This PR does a lot of things that it is hard to review properly. I think the issue is much simpler that this and there is a lot of rewrite of ssl/socket functionality that is most likely handle properly by their respective libraries.
Maybe it is also tackling many other issues I don't know of, but a bit of research revealed this stack overflow post.
My understanding here is that self.listen calls selector.select but as there might still be decrypted data left in the socket, it would be lost and unseen by select. There might be a smarter way to solve but using the sample in the post I could fix the ssl hanging issue by changing the service_connection to handle pending bytes in sock.pending
if recv_data:
LOG.debug('%s received data:\n%s', conn.name, recv_data)
conn.received(recv_data)
# excerpt from https://docs.python.org/3/library/ssl.html#ssl-nonblocking
# Conversely, since the SSL layer has its own framing, a SSL socket may still have data available
# for reading without select() being aware of it. Therefore, you should first call SSLSocket.recv()
# to drain any potentially available data, and then only block on a select() call if still necessary.
while isinstance(sock, ssl.SSLSocket) and sock.pending() > 0:
extra = sock.recv(min(4096, sock.pending()))
if extra:
LOG.debug('%s received pending SSL data:\n%s', conn.name, extra)
conn.received(extra)There might still be something cleaner we can do by calling sock.recv in listen but I think this PR is kinda overboard and makes a lot of changes that would be hard to measure the impact.
I love the addition of a pipeline and cleaner testing, but I would also argue that it would be easier to review and comment, if we would seperate it in multiple PRs
…g reads Co-authored-by: GitHub Copilot <copilot@github.com>
Co-authored-by: GitHub Copilot <copilot@github.com>
1c51600 to
46aa6f6
Compare
|
Thanks for the comment @cloutierMat and for the critical thinking. While reverting, I also caught a related pre-existing bug: on macOS, accepted sockets inherit Total change to |
Motivation
Fixes
UNC-470: Postgres proxy could hang forsslmode=requiresessions running largepsql -f/ COPY-heavy workloads.Root cause
selector.select()operates at the TCP layer and is unaware of data already decrypted and buffered inside the SSL layer. After an initialrecv(), buffered SSL bytes would be silently dropped until the nextselect()wake-up, stalling large transfers.Changes (
postgresql_proxy/proxy.py)recv()inservice_connection(), drain any remaining SSL-buffered bytes usingsock.pending()(per Python SSL non-blocking docs).setblocking(True)on accepted sockets before SSL negotiation. On macOS/BSD, accepted sockets inheritO_NONBLOCKfrom the listening socket (unlike Linux), causing intermittentBlockingIOErroron theMSG_PEEKrecv during SSL startup.Expected impact
Prevents SSL COPY-session stalls for large dump-like workloads without changing the proxy's overall non-blocking architecture.
Test coverage
test_psql_ssl_file_batch_stress_no_hangfor the UNC-470 scenario.