Skip to content

fix(isolation): close rootfs automount reload race (#83)#84

Merged
andypost merged 2 commits into
freeunitorg:pre-1.35.6from
andypost:fix/83-rootfs-automount-reload-race
Jun 24, 2026
Merged

fix(isolation): close rootfs automount reload race (#83)#84
andypost merged 2 commits into
freeunitorg:pre-1.35.6from
andypost:fix/83-rootfs-automount-reload-race

Conversation

@andypost

Copy link
Copy Markdown

Summary

Fixes the rootfs automount reload race tracked in #83 — the root cause behind the flaky test_go_isolation_rootfs_automount_tmpfs[_regression] failure on go-1.25 (green on go-1.26 purely by timing; see PR #78 run 27793786107).

Stacking note: the substantive change is a single commit — 9b3f26cc fix(isolation): close rootfs automount reload race (40 lines in src/nxt_isolation.c). Because the pre-1.35.6 base does not yet carry the in-flight 1.35.6 hardening commits (incl. 03e6b0ee, the openat2(RESOLVE_BENEATH) work this fix sits on top of), the diff currently also shows those commits. Review the last commit. Once the hardening lands on pre-1.35.6, this collapses to the one-commit fix.

Root cause

For a rootfs app with namespaces.mount: true, the child sets up as:
unshare(CLONE_NEWNS)nxt_isolation_prepare_rootfs() mounts <rootfs>/proconly then pivot_root() severs propagation.

unshare(CLONE_NEWNS) copies the host mount tree and its propagation type; on a systemd host / (and /tmp, where the test rootfs lives) is MS_SHARED, so the new namespace starts as a peer of the host. During the mount loop the new <rootfs>/proc is still propagation-linked to the host — exactly while the previous worker generation's host-side teardown does umount2(<rootfs>/proc, MNT_DETACH) on the same path. They collide → mount(...proc...) ENOENT → prototype exits 1 → respawn storm → never converges (status=None) → ~53 leaked router fds.

The fix — two additive, CLONE_NEWNS-gated changes in src/nxt_isolation.c

  1. nxt_isolation_prepare_rootfs() — make the new namespace root recursively private (mount("", "/", MS_REC|MS_PRIVATE)) before the mount loop (above the openat2 check). MS_PRIVATE, not MS_SLAVE: a slave still receives propagation inbound, so a host umount would propagate back in. Blocks both directions.
  2. nxt_isolation_unmount_all() — when the worker had its own mount namespace, return early instead of host-side umount2. Those mounts are reaped with the namespace on exit; the host umount2 was the (22: Invalid argument) no-op that raced the next generation.

The non-namespaced (chroot) path is unchanged and still unmounts in the host namespace. The openat2 RESOLVE_BENEATH check is untouched.

Verification

  • Builds clean (make, no warnings; unitd links).
  • Authoritative check is the go-1.25 matrix job (the failing toolchain) — local repro isn't possible on go-1.26.
    sudo pytest-3 test/test_go_isolation.py::test_go_isolation_rootfs_automount_tmpfs_regression
  • The transient mount(...proc...) (2: No such file or directory) alert should no longer appear, and router fds should stay bounded across the 100× reload loop.

Follow-up (separate commit, after this is green on go-1.25)

Remove skip_alert(_TMPFS_RELOAD_ALERT) in test/test_go_isolation.py (:413/:422) so the alert becomes a hard regression guard rather than being skipped.

Closes #83

🤖 Generated with Claude Code

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive set of security hardenings and bug fixes across FreeUnit, focusing on bounds-checking, resource isolation, and memory safety. Key improvements include validating TLS private keys, tightening file-descriptor lifetimes, and adding robust bounds checks for WebSockets, shared memory, and language bindings. The review feedback identifies a prefix-matching bug in the rootfs validation within src/nxt_isolation.c that could lead to false positives, a critical integer underflow vulnerability in src/nxt_unit.c when checking sptr bounds with small buffer sizes, and a recommendation to use NULL instead of empty strings in mount calls for better compatibility.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/nxt_isolation.c Outdated
Comment thread src/nxt_unit.c Outdated
Comment thread src/nxt_isolation.c

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9b3f26cc97

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/nxt_file.c Outdated
An app with rootfs + namespaces.mount + automount under a rapid config
reload loop hits a race: the new prototype mounts <rootfs>/proc while the
previous worker generation's host-side teardown does umount2(<rootfs>/proc,
MNT_DETACH) on the same path, with no serialization. The loser gets
mount(...proc...) ENOENT, the prototype exits 1, a respawn storm follows,
the generation never converges and dangling app<->router fds trip the
fd-leak teardown check. The Go toolchain only shifts the timing (surfaces
on go-1.25, hidden on go-1.26).

Root cause: unshare(CLONE_NEWNS) copies the host mount tree and its
propagation type; on a systemd host "/" (and "/tmp") is MS_SHARED, so the
new namespace starts as a peer of the host. pivot_root() only severs
propagation after nxt_isolation_prepare_rootfs() has already mounted proc,
so during the mount loop the new proc mount is still propagation-linked to
the host exactly while the old generation umount2()s the same path.

Two additive, CLONE_NEWNS-gated changes:

  * nxt_isolation_prepare_rootfs(): make the new namespace root recursively
    private (mount("", "/", MS_REC|MS_PRIVATE)) before the mount loop.
    MS_PRIVATE (not MS_SLAVE) is required: a slave still receives propagation
    inbound, so a host umount would propagate back in.

  * nxt_isolation_unmount_all(): when the worker had its own mount namespace,
    return early instead of host-side umount2 — those mounts are reaped with
    the namespace on exit; the host umount2 was the EINVAL no-op that raced
    the next generation.

The non-namespaced (chroot) path is unchanged and still unmounts in the host
namespace.

Refs nginx#83

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@andypost

Copy link
Copy Markdown
Author

@claude review once

@claude

claude Bot commented Jun 22, 2026

Copy link
Copy Markdown

Claude finished @andypost's task in 2m 11s —— View job


Code Review

  • Gather context and read the diff
  • Analyze isolation/mount changes
  • Review against existing review comments
  • Post final review

The two-part fix is logically sound and correctly targets the race. Here's a detailed look at each change.


Change 1 — nxt_isolation_unmount_all(): early return on NEWNS (src/nxt_isolation.c:760-774)

Correct. When a worker exits with CLONE_NEWNS, the kernel tears down its private mount namespace — including every mount inside it — atomically on process exit. Calling umount2(<rootfs>/proc, MNT_DETACH) from the host namespace after that point returns EINVAL because those paths are simply not mounted in the host (hence the (22: Invalid argument) log lines mentioned in the PR). Worse, doing so races the next-generation prototype that is simultaneously mounting the same path. The early return is the right fix.

One thing worth verifying: nxt_isolation_unmount_all is only ever invoked in the parent/router context (via process->isolation.cleanup, set at :602), never from within the child's namespace. From reading the code that appears to be the case, so the guard is placed correctly.


Change 2 — nxt_isolation_prepare_rootfs(): MS_REC|MS_PRIVATE before mount loop (src/nxt_isolation.c:810-832)

Correct and well-timed. The race window is: unshare(CLONE_NEWNS) inherits MS_SHARED propagation from the host → mount loop begins → pivot_root severs propagation. The fix closes this window by making the namespace private before the loop, so no mount event can propagate out to the host peer group or be received from it.

One notable interaction with nxt_isolation_pivot_root (:918): that function opens with:

mount("", "/", "", MS_SLAVE|MS_REC, "")

Once the namespace is already MS_PRIVATE, this call has no master peer to slave to, making it a no-op for propagation purposes (Linux silently accepts it; the mount stays PRIVATE). This is harmless — PRIVATE is strictly stronger than SLAVE — but the MS_SLAVE call in pivot_root is now dead code in the NEWNS path. That function is outside this PR's scope, so just noting it as a follow-up cleanup opportunity.

Similarly, the nxt_isolation_make_private_mount() call inside pivot_root (which reads /proc/self/mounts and privates specific ancestor mounts) is now fully redundant — everything is already PRIVATE. Again harmless, just extra work.


On "" vs NULL in the mount() call (:827)

mount("", "/", "", MS_REC | MS_PRIVATE, "")

The kernel ignores source, filesystemtype, and data for propagation-only remounts, so both "" and NULL work. The author's response to the Gemini comment is correct: this matches the pre-existing pattern at line 918 (MS_SLAVE|MS_REC). The minor inconsistency with line 946 (MS_SLAVE | MS_REC, NULL) already existed before this PR. Not a blocker.


Comments / documentation

The block comments are unusually thorough and accurate, explaining the MS_SHARED systemd host scenario, the PRIVATE-vs-SLAVE distinction, and the race mechanism. Good for future readers.


Summary

Both changes are correct, complementary, and minimal. No issues that would block merge. The main follow-up items (removing skip_alert(_TMPFS_RELOAD_ALERT) in the test, and eventual cleanup of the now-redundant MS_SLAVE in nxt_isolation_pivot_root) are already tracked in the PR description. The authoritative signal is the go-1.25 CI matrix job as called out in the PR.

@andypost

Copy link
Copy Markdown
Author

Folded into the 1.35.6 hardening PR #85 as 75e3b77f — the rootfs automount reload-race fix now ships as part of that stack so it lands together (it was green on go-1.25 standalone here). Closing this PR as subsumed; the fix itself is unchanged.

@andypost andypost closed this Jun 22, 2026
@andypost andypost deleted the fix/83-rootfs-automount-reload-race branch June 22, 2026 16:00
@andypost andypost restored the fix/83-rootfs-automount-reload-race branch June 22, 2026 16:00
@andypost andypost reopened this Jun 22, 2026
@andypost andypost marked this pull request as draft June 22, 2026 16:02
@andypost andypost marked this pull request as ready for review June 22, 2026 16:02
@andypost

Copy link
Copy Markdown
Author

Correction: this PR is not subsumed — disregard the earlier "folded into #85" note. Since #84 is approved (thanks @TuzelKO) and focused/green, it stays the vehicle for the #83 fix. I reverted the fold: #85 was force-pushed back to the pure 14-commit hardening stack (the 75e3b77f cherry-pick is dropped), so this PR is again the standalone fix for #83 and can merge on its own. #85's go-1.25 job will stay red until this lands on pre-1.35.6; I'll rebase #85 afterward.

Add a 1.35.6 Bugfix entry for the rootfs automount reload race fixed in
this PR (freeunitorg#83).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@andypost

Copy link
Copy Markdown
Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a rootfs automount reload race condition. It modifies nxt_isolation_unmount_all to skip unmounting when CLONE_NEWNS is active, since the kernel automatically reaps those mounts upon namespace termination. It also updates nxt_isolation_prepare_rootfs to detach the namespace from the host peer group using MS_REC | MS_PRIVATE before mounting, preventing propagation races. There are no review comments, and I have no additional feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@andypost andypost merged commit eb0801f into freeunitorg:pre-1.35.6 Jun 24, 2026
21 checks passed
@andypost andypost deleted the fix/83-rootfs-automount-reload-race branch June 24, 2026 04:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants