fix(isolation): close rootfs automount reload race (#83) #84
Code Review
This pull request introduces a comprehensive set of security hardenings and bug fixes across FreeUnit, focusing on bounds-checking, resource isolation, and memory safety. Key improvements include validating TLS private keys, tightening file-descriptor lifetimes, and adding robust bounds checks for WebSockets, shared memory, and language bindings. The review feedback identifies a prefix-matching bug in the rootfs validation within src/nxt_isolation.c that could lead to false positives, a critical integer underflow vulnerability in src/nxt_unit.c when checking sptr bounds with small buffer sizes, and a recommendation to use NULL instead of empty strings in mount calls for better compatibility.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9b3f26cc97
An app with rootfs + namespaces.mount + automount under a rapid config
reload loop hits a race: the new prototype mounts <rootfs>/proc while the
previous worker generation's host-side teardown does umount2(<rootfs>/proc,
MNT_DETACH) on the same path, with no serialization. The loser gets
mount(...proc...) ENOENT, the prototype exits 1, a respawn storm follows,
the generation never converges and dangling app<->router fds trip the
fd-leak teardown check. The Go toolchain only shifts the timing (surfaces
on go-1.25, hidden on go-1.26).
Root cause: unshare(CLONE_NEWNS) copies the host mount tree and its
propagation type; on a systemd host "/" (and "/tmp") is MS_SHARED, so the
new namespace starts as a peer of the host. pivot_root() only severs
propagation after nxt_isolation_prepare_rootfs() has already mounted proc,
so during the mount loop the new proc mount is still propagation-linked to
the host exactly while the old generation umount2()s the same path.
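
The shared starting state described above can be checked from /proc/self/mountinfo, where a shared mount carries a shared:N tag among its optional fields. The following is a minimal diagnostic sketch, not code from this PR: sketch_mountinfo_is_shared is a hypothetical helper name, and it only inspects one mountinfo line passed in as a string.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of FreeUnit): returns true when one line
 * of /proc/<pid>/mountinfo carries a "shared:N" optional field, i.e. the
 * mount belongs to a shared peer group.
 */
bool
sketch_mountinfo_is_shared(const char *line)
{
    int         field;
    const char  *p, *sep;

    /* The optional fields end at the " - " separator. */
    sep = strstr(line, " - ");
    if (sep == NULL) {
        return false;
    }

    /* Skip the six fixed fields that precede the optional fields. */
    p = line;
    for (field = 0; field < 6; field++) {
        p = memchr(p, ' ', (size_t) (sep - p));
        if (p == NULL) {
            return false;   /* no optional fields at all */
        }
        p++;
    }

    /* Scan the space-separated optional fields for "shared:". */
    while (p < sep) {
        if ((size_t) (sep - p) >= 7 && strncmp(p, "shared:", 7) == 0) {
            return true;
        }
        p = memchr(p, ' ', (size_t) (sep - p));
        if (p == NULL) {
            break;
        }
        p++;
    }

    return false;
}
```

Feeding it the line for "/" read inside the freshly unshared namespace would be one way to confirm the peer relationship before and after the privatization step.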
Two additive, CLONE_NEWNS-gated changes:
* nxt_isolation_prepare_rootfs(): make the new namespace root recursively
private (mount("", "/", MS_REC|MS_PRIVATE)) before the mount loop.
MS_PRIVATE (not MS_SLAVE) is required: a slave still receives propagation
inbound, so a host umount would propagate back in.
* nxt_isolation_unmount_all(): when the worker had its own mount namespace,
return early instead of host-side umount2 — those mounts are reaped with
the namespace on exit; the host umount2 was the EINVAL no-op that raced
the next generation.
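
The first change can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual body of nxt_isolation_prepare_rootfs(): sketch_privatize_root and its clone_flags parameter are hypothetical, and the arguments that mount(2) ignores for a propagation change are passed as NULL here, whereas the commit message writes them as empty strings.

```c
#define _GNU_SOURCE
#include <sched.h>      /* CLONE_NEWNS */
#include <stddef.h>     /* NULL */
#include <sys/mount.h>  /* mount(), MS_REC, MS_PRIVATE */

#ifndef CLONE_NEWNS
#define CLONE_NEWNS 0x00020000  /* fallback if <sched.h> hid the GNU names */
#endif

/*
 * Hypothetical helper: if the process runs in its own mount namespace,
 * mark "/" and everything below it private so that mount and umount
 * events no longer propagate to or from the host peer group.
 * Returns 0 on success (or when there is nothing to do), -1 with errno
 * set otherwise.  The mount() call requires CAP_SYS_ADMIN.
 */
int
sketch_privatize_root(unsigned long clone_flags)
{
    if ((clone_flags & CLONE_NEWNS) == 0) {
        /* chroot-only path: leave the host's propagation alone */
        return 0;
    }

    /* source, fstype and data are ignored for propagation changes */
    return mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);
}
```

The call has to run before the first mount in the loop; once proc is mounted while "/" is still shared, the window described above is already open.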
The non-namespaced (chroot) path is unchanged and still unmounts in the host
namespace.
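
The teardown decision, including the unchanged chroot path, might look like the sketch below. The function name, its parameters, and the failure-count return value are assumptions for illustration and do not mirror the real nxt_isolation_unmount_all() signature.

```c
#define _GNU_SOURCE
#include <sched.h>      /* CLONE_NEWNS */
#include <stddef.h>     /* size_t */
#include <sys/mount.h>  /* umount2(), MNT_DETACH */

#ifndef CLONE_NEWNS
#define CLONE_NEWNS 0x00020000  /* fallback if <sched.h> hid the GNU names */
#endif

/*
 * Hypothetical stand-in for the teardown path.  Returns the number of
 * paths that could not be unmounted.
 */
int
sketch_unmount_all(unsigned long clone_flags, const char *const *paths,
    size_t n)
{
    int     failed;
    size_t  i;

    if (clone_flags & CLONE_NEWNS) {
        /*
         * The worker had its own mount namespace: its mounts go away
         * with the namespace, so there is nothing to do on the host.
         */
        return 0;
    }

    /* Non-namespaced (chroot) path: lazily detach each mount. */
    failed = 0;
    for (i = 0; i < n; i++) {
        if (umount2(paths[i], MNT_DETACH) != 0) {
            failed++;
        }
    }

    return failed;
}
```

Returning early keeps the host from touching a path that the next generation may be mounting at the same moment.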
Refs nginx#83
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
9b3f26c to 129a60a
@claude review once
Claude finished @andypost's task in 2m 11s
Code Review
The two-part fix is logically sound and correctly targets the race. Here's a detailed look at each change. Change 1 —
Folded into the 1.35.6 hardening PR #85 as
Correction: this PR is not subsumed; disregard the earlier "folded into #85" note. Since #84 is approved (thanks @TuzelKO) and focused/green, it stays the vehicle for the #83 fix. I reverted the fold: #85 was force-pushed back to the pure 14-commit hardening stack (the
Add a 1.35.6 Bugfix entry for the rootfs automount reload race fixed in this PR (freeunitorg#83).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
/gemini review
Code Review
This pull request addresses a rootfs automount reload race condition. It modifies nxt_isolation_unmount_all to skip unmounting when CLONE_NEWNS is active, since the kernel automatically reaps those mounts upon namespace termination. It also updates nxt_isolation_prepare_rootfs to detach the namespace from the host peer group using MS_REC | MS_PRIVATE before mounting, preventing propagation races. There are no review comments, and I have no additional feedback to provide.
Summary
Fixes the rootfs automount reload race tracked in #83, the root cause behind the flaky test_go_isolation_rootfs_automount_tmpfs[_regression] failure on go-1.25 (green on go-1.26 purely by timing; see PR #78 run 27793786107).
Root cause
For a rootfs app with namespaces.mount: true, the child sets up as: unshare(CLONE_NEWNS) → nxt_isolation_prepare_rootfs() mounts <rootfs>/proc → only then pivot_root() severs propagation. unshare(CLONE_NEWNS) copies the host mount tree and its propagation type; on a systemd host / (and /tmp, where the test rootfs lives) is MS_SHARED, so the new namespace starts as a peer of the host. During the mount loop the new <rootfs>/proc is still propagation-linked to the host, exactly while the previous worker generation's host-side teardown does umount2(<rootfs>/proc, MNT_DETACH) on the same path. They collide → mount(...proc...) ENOENT → prototype exits 1 → respawn storm → never converges (status=None) → ~53 leaked router fds.
The fix: two additive, CLONE_NEWNS-gated changes in src/nxt_isolation.c
- nxt_isolation_prepare_rootfs(): make the new namespace root recursively private (mount("", "/", MS_REC|MS_PRIVATE)) before the mount loop (above the openat2 check). MS_PRIVATE, not MS_SLAVE: a slave still receives propagation inbound, so a host umount would propagate back in. Blocks both directions.
- nxt_isolation_unmount_all(): when the worker had its own mount namespace, return early instead of host-side umount2. Those mounts are reaped with the namespace on exit; the host umount2 was the (22: Invalid argument) no-op that raced the next generation.
The non-namespaced (chroot) path is unchanged and still unmounts in the host namespace. The openat2 RESOLVE_BENEATH check is untouched.
Verification
- Build (make, no warnings; unitd links).
- sudo pytest-3 test/test_go_isolation.py::test_go_isolation_rootfs_automount_tmpfs_regression: the mount(...proc...) (2: No such file or directory) alert should no longer appear, and router fds should stay bounded across the 100× reload loop.
Follow-up (separate commit, after this is green on go-1.25)
Remove skip_alert(_TMPFS_RELOAD_ALERT) in test/test_go_isolation.py (:413/:422) so the alert becomes a hard regression guard rather than being skipped.
Closes #83
🤖 Generated with Claude Code