Skip to content

Update poll for Mac POLLHUP behavior#128

Open
ddustin wants to merge 1 commit into
rustyrussell:masterfrom
ddustin:ddustin/mac_connect_fix
Open

Update poll for Mac POLLHUP behavior#128
ddustin wants to merge 1 commit into
rustyrussell:masterfrom
ddustin:ddustin/mac_connect_fix

Conversation

@ddustin
Copy link
Copy Markdown
Collaborator

@ddustin ddustin commented May 20, 2026

When connecting on Linux, a rejection typically returns “POLLIN|POLLERR” from poll on Linux.

Mac differs in this situation returning one of two values from poll (depending on Mac version): “POLLHUP|POLLERR”
“POLLOUT|POLLHUP”

Mac also sets the socket’s error to ECONNREFUSED in this case.

We can support this behavior by getting the SO_ERROR off the socket in the POLLHUP case, which has the added benefit of more explicity error messages generally.

Signed-off-by: Dusty Daemon @dusty_daemon

When `connect`ing on Linux, a rejection typically returns “POLLIN|POLLERR” from `poll` on Linux.

Mac differs in this situation returning one of two values from poll (depending on Mac version):
“POLLHUP|POLLERR”
“POLLOUT|POLLHUP”

Mac also sets the socket’s error to ECONNREFUSED in this case.

We can support this behavior by getting the SO_ERROR off the socket in the POLLHUP case, which has the added benefit of more explicity error messages generally.

Signed-off-by: Dusty Daemon @dusty_daemon
@nGoline
Copy link
Copy Markdown
Collaborator

nGoline commented May 21, 2026

I've also made a modification in io.c in do_connect around line 318:

	if (ret < 0) {
		/* On some systems (e.g. macOS) getsockopt(SO_ERROR) can
		 * spuriously return EBADF on a still-valid socket after a
		 * failed connect.  Only propagate EBADF if the fd itself
		 * is genuinely invalid; otherwise use ECONNABORTED so the
		 * caller sees a network-level error, not a descriptor one. */
		if (errno == EBADF && fcntl(fd, F_GETFL) >= 0)
			errno = ECONNABORTED;
		return -1;
	}

This solved for another error I was getting on macOS.

@ddustin
Copy link
Copy Markdown
Collaborator Author

ddustin commented May 27, 2026

Interesting write up of historical pitfalls of non-blocking connect: https://cr.yp.to/docs/connect.html

Apparently this is a problem many have run into.

@ddustin
Copy link
Copy Markdown
Collaborator Author

ddustin commented May 27, 2026

I've also made a modification in io.c in do_connect around line 318:

	if (ret < 0) {
		/* On some systems (e.g. macOS) getsockopt(SO_ERROR) can
		 * spuriously return EBADF on a still-valid socket after a
		 * failed connect.  Only propagate EBADF if the fd itself
		 * is genuinely invalid; otherwise use ECONNABORTED so the
		 * caller sees a network-level error, not a descriptor one. */
		if (errno == EBADF && fcntl(fd, F_GETFL) >= 0)
			errno = ECONNABORTED;
		return -1;
	}

This solved for another error I was getting on macOS.

Which test did this fix?

@ddustin
Copy link
Copy Markdown
Collaborator Author

ddustin commented May 27, 2026

Looking through Mac's kernel, EBADF can be returned by sodefunct which is called when Mac needs to "forcibly invalidate and tear down an active network socket."

This defunct status happens to our fds as mac decides our processes can be semi-purged, for instance if it thinks the process is "in the background" without user interaction and resources get tight.

It looks like other server programs that target mac specify to the kernel that we're important and not to do this. It probably still will but do it less.

It can also happen if the network interface switches over (ie from ethernet to wifi, or changing wifi) but that is probably not relevant here.

The kernel code is a little opaque to my eyes, but there are two instances where the SO_ERROR on the socket itself is set to EBADF in uipc_socket.c. These two instances are while the kernel is handling these 'defunct' situations. EBADF seems like an obnoxious choice by Mac but c'est la vie.

Some of our tests are particularly aggressive and I wouldn't be much surprised at all that they put the kernel into this purging 'defunct' mode, and this might help explain many tests being flaky on mac.

So there are probably two larger things to tackle here:

  1. Make sure we're handing when the kernel goes into purge mode
  2. Request from the kernel that we not be purged

For 2 there are a few more "c style approaches":

/* Tell the kernel we are not a 'background' process */
setpriority(PRIO_DARWIN_PROCESS, 0, 0);
/* Tell kernel we're super important in the POSIX style */
setpriority(PRIO_PROCESS, pid, -20)

There are some Mac specific ways to do it setiopolicy_np might work. But the best one seems to be this swift code:

let activityToken = ProcessInfo.processInfo.beginActivity(
    options: [.userInitiated, .idleSystemSleepDisabled, .suddenTerminationDisabled],
    reason: "Running a critical networking priority service"
)

Lots to consider 🤔

@nGoline
Copy link
Copy Markdown
Collaborator

nGoline commented May 28, 2026

Looking through Mac's kernel, EBADF can be returned by [sodefunct](https://github.com/apple-oss- distributions/xnu/blob/f6217f891ac0bb64f3d375211650a4c1ff8ca1ea/bsd/kern/uipc_socket.c#L7475) which is called when Mac needs to "forcibly invalidate and tear down an active network socket."

That sounds OK though, if the kernel returns this EBADF we just pass it along.

Some of our tests are particularly aggressive and I wouldn't be much surprised at all that they put the kernel into this purging 'defunct' mode, and this might help explain many tests being flaky on mac.

From my experience the test fails even of it's the only test running.

@nGoline
Copy link
Copy Markdown
Collaborator

nGoline commented May 28, 2026

I've also made a modification in io.c in do_connect around line 318:

	if (ret < 0) {
		/* On some systems (e.g. macOS) getsockopt(SO_ERROR) can
		 * spuriously return EBADF on a still-valid socket after a
		 * failed connect.  Only propagate EBADF if the fd itself
		 * is genuinely invalid; otherwise use ECONNABORTED so the
		 * caller sees a network-level error, not a descriptor one. */
		if (errno == EBADF && fcntl(fd, F_GETFL) >= 0)
			errno = ECONNABORTED;
		return -1;
	}

This solved for another error I was getting on macOS.

Which test did this fix?

ElementsProject/lightning#9140 (review)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants