Skip to content

fix(postgresql-proxy): prevent SSL COPY stalls by draining nonblocking reads#11

Open
nik-localstack wants to merge 2 commits intomasterfrom
unc-472-postgresql-proxy-drain-nonblocking-ssl-reads-to-prevent-copy
Open

fix(postgresql-proxy): prevent SSL COPY stalls by draining nonblocking reads#11
nik-localstack wants to merge 2 commits intomasterfrom
unc-472-postgresql-proxy-drain-nonblocking-ssl-reads-to-prevent-copy

Conversation

@nik-localstack
Copy link
Copy Markdown

@nik-localstack nik-localstack commented May 7, 2026

Motivation

Fixes UNC-470: Postgres proxy could hang for sslmode=require sessions running large psql -f / COPY-heavy workloads.

Root cause

selector.select() operates at the TCP layer and is unaware of data already decrypted and buffered inside the SSL layer. After an initial recv(), buffered SSL bytes would be silently dropped until the next select() wake-up, stalling large transfers.

Changes (postgresql_proxy/proxy.py)

  • After each recv() in service_connection(), drain any remaining SSL-buffered bytes using sock.pending() (per Python SSL non-blocking docs).
  • Add explicit setblocking(True) on accepted sockets before SSL negotiation. On macOS/BSD, accepted sockets inherit O_NONBLOCK from the listening socket (unlike Linux), causing intermittent BlockingIOError on the MSG_PEEK recv during SSL startup.

Expected impact

Prevents SSL COPY-session stalls for large dump-like workloads without changing the proxy's overall non-blocking architecture.

Test coverage

  • Added basic test coverage and a specific test test_psql_ssl_file_batch_stress_no_hang for the UNC-470 scenario.
  • Added CI support.

@nik-localstack nik-localstack self-assigned this May 7, 2026
@nik-localstack nik-localstack force-pushed the unc-472-postgresql-proxy-drain-nonblocking-ssl-reads-to-prevent-copy branch 11 times, most recently from e7d5df8 to 1c51600 Compare May 8, 2026 12:00
@nik-localstack nik-localstack marked this pull request as ready for review May 8, 2026 12:15
Copy link
Copy Markdown
Member

@pinzon pinzon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So basically, the SSL handshake is now done incrementally in the event loop instead of blocking on accept, and once it's complete we keep draining the SSL buffer until it's empty so data doesn't get stranded. Right?

Thank you for also adding a test workflow, now we can test changes here without relying solely on LS. I'll let the final say to @bentsku or @cloutierMat

Copy link
Copy Markdown
Member

@cloutierMat cloutierMat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR does a lot of things that it is hard to review properly. I think the issue is much simpler that this and there is a lot of rewrite of ssl/socket functionality that is most likely handle properly by their respective libraries.

Maybe it is also tackling many other issues I don't know of, but a bit of research revealed this stack overflow post.

My understanding here is that self.listen calls selector.select but as there might still be decrypted data left in the socket, it would be lost and unseen by select. There might be a smarter way to solve but using the sample in the post I could fix the ssl hanging issue by changing the service_connection to handle pending bytes in sock.pending

                if recv_data:
                    LOG.debug('%s received data:\n%s', conn.name, recv_data)
                    conn.received(recv_data)
                    # excerpt from https://docs.python.org/3/library/ssl.html#ssl-nonblocking
                    # Conversely, since the SSL layer has its own framing, a SSL socket may still have data available
                    # for reading without select() being aware of it. Therefore, you should first call SSLSocket.recv()
                    # to drain any potentially available data, and then only block on a select() call if still necessary.
                    while isinstance(sock, ssl.SSLSocket) and sock.pending() > 0:
                        extra = sock.recv(min(4096, sock.pending()))
                        if extra:
                            LOG.debug('%s received pending SSL data:\n%s', conn.name, extra)
                            conn.received(extra)

There might still be something cleaner we can do by calling sock.recv in listen but I think this PR is kinda overboard and makes a lot of changes that would be hard to measure the impact.

I love the addition of a pipeline and cleaner testing, but I would also argue that it would be easier to review and comment, if we would seperate it in multiple PRs

nik-localstack and others added 2 commits May 11, 2026 13:11
…g reads

Co-authored-by: GitHub Copilot <copilot@github.com>
Co-authored-by: GitHub Copilot <copilot@github.com>
@nik-localstack nik-localstack force-pushed the unc-472-postgresql-proxy-drain-nonblocking-ssl-reads-to-prevent-copy branch from 1c51600 to 46aa6f6 Compare May 11, 2026 10:12
@nik-localstack
Copy link
Copy Markdown
Author

Thanks for the comment @cloutierMat and for the critical thinking.
Indeed this way is quite simpler and it solved the problem. I reworked the fix to match your suggestion.

While reverting, I also caught a related pre-existing bug: on macOS, accepted sockets inherit O_NONBLOCK from the listening socket which caused _handle_ssl_negotiation to raise BlockingIOError on the MSG_PEEK recv ~1/10 times. Fixed with an explicit setblocking(True) before SSL negotiation.

Total change to proxy.py is now 12 lines. Happy to split it if you still prefer or keep this as a single PR given the scope, let me know what do you think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants