Skip to content

machine-config-daemon-firstboot: disable ostree fsync during bootstrap#6122

Open
sdodson wants to merge 1 commit into
openshift:mainfrom
sdodson:mcd-firstboot-fsync
Open

machine-config-daemon-firstboot: disable ostree fsync during bootstrap#6122
sdodson wants to merge 1 commit into
openshift:mainfrom
sdodson:mcd-firstboot-fsync

Conversation

@sdodson
Copy link
Copy Markdown
Member

@sdodson sdodson commented Jun 2, 2026

Saves a few seconds on ostree rebase during firstboot, measured at ~3-5s on AWS m6a.xlarge (n=18). Assumes we tolerate the outcome that corruption due to power failure means we have to re-ignite a node from scratch. Suggested by @dustymabe and original analysis at https://github.com/dustymabe/20250529-ostree-fsync-analysis

- What I did

Bind mount a tmpfs-backed copy of the ostree repo config over /sysroot/ostree/repo/config before the rebase. Setting core.fsync=false is then ephemeral: if the host crashes or reboots before ExecStopPost runs, the kernel tears down the bind mount and the original on-disk config is automatically restored.

We append a second [core] section to the tmpfs copy (GKeyFile merges duplicate groups, last value wins) rather than using ostree config set. The reason: ostree writes config changes via an atomic rename, which fails with EBUSY when the target file is itself a bind-mount point. Discovered during testing — journald shows renameat(./tmp.zXuFKh, config): Device or resource busy if ostree config set is attempted after the bind mount is in place.

- How to verify it

Prior to this change, the node journal will contain many Starting/Completed syncfs() for system repo entries from rpm-ostree during the rebase. After this change, those per-object syncfs calls are absent during the rebase window (the bind mount is confirmed active by sysroot-ostree-repo-config.mount: Deactivated successfully at service stop).

- Description for the changelog

Disabled ostree fsync during bootstrapping via ephemeral bind mount to save ~3-5s on rebase

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 2, 2026

Walkthrough

Adds transient ExecStartPre steps to copy and modify the ostree repo config into a tmpfs, bind-mount it over /sysroot/ostree/repo/config to set core.fsync=false for the unit lifetime, runs existing podman firstboot commands, and unmounts the bind in ExecStopPost.

Changes

OStree fsync toggle during firstboot

Layer / File(s) Summary
Machine-config-daemon firstboot fsync toggle
templates/common/_base/units/machine-config-daemon-firstboot.service.yaml
Adds ExecStartPre steps that create a tmpfs-backed copy of the ostree repo config, append [core] fsync = false, and bind-mount it over /sysroot/ostree/repo/config before running the existing podman firstboot-complete-machineconfig ExecStart commands; adds ExecStopPost to unmount the bind mount afterward.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 15
✅ Passed checks (15 passed)
Check name Status Explanation
Title check ✅ Passed The title 'machine-config-daemon-firstboot: disable ostree fsync during bootstrap' directly describes the main change: disabling ostree fsync during the firstboot phase by bind-mounting a tmpfs-backed config copy.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed This PR only modifies a YAML template file (machine-config-daemon-firstboot.service) and does not introduce or modify any Ginkgo test files. The check is not applicable to non-test files.
Test Structure And Quality ✅ Passed This PR modifies only a systemd service YAML template file; no Ginkgo test code or test files are added or modified, making the check not applicable.
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests are added in this PR. The only change is to a systemd service unit YAML file, which is not subject to the MicroShift test compatibility check.
Single Node Openshift (Sno) Test Compatibility ✅ Passed This PR only modifies a YAML service unit file. No new Ginkgo e2e tests are added, so the SNO compatibility check does not apply.
Topology-Aware Scheduling Compatibility ✅ Passed The PR modifies a systemd unit file template, not a Kubernetes deployment or operator code. No Kubernetes scheduling constraints are introduced.
Ote Binary Stdout Contract ✅ Passed PR modifies a systemd unit YAML file, not OTE binary code. The check is inapplicable as it concerns OTE binaries communicating with openshift-tests.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR only modifies a YAML systemd service template file. No new Ginkgo e2e tests are added, so this check does not apply.
No-Weak-Crypto ✅ Passed PR adds a systemd service unit file that disables ostree fsync during bootstrap via bind-mounting. No weak crypto algorithms, custom implementations, or non-constant-time secret comparisons detected.
Container-Privileges ✅ Passed PR introduces privileged (--privileged, --net=host, --pid=host) podman containers in machine-config-daemon-firstboot.service for legitimate system bootstrap with clear justification.
No-Sensitive-Data-In-Logs ✅ Passed The PR adds one systemd service file with only standard system configuration commands. No sensitive data, credentials, or PII are exposed in logging.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

# Run this via podman because we want to use the nmstatectl binary in our container
ExecStart=/usr/bin/podman run --rm --privileged --net=host -v /:/rootfs --entrypoint machine-config-daemon '{{ .Images.machineConfigOperator }}' firstboot-complete-machineconfig --persist-nics
ExecStart=/usr/bin/podman run --rm --privileged --pid=host --net=host -v /:/rootfs --entrypoint machine-config-daemon '{{ .Images.machineConfigOperator }}' firstboot-complete-machineconfig
ExecStopPost=/usr/bin/ostree config --repo=/sysroot/ostree/repo set core.fsync true
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe unset is better?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also loses whatever setting might've been set directly.
Also agree with the coderabbit comment re. error handling.

One way to structure this is to copy the config to somewhere in /run and then bind-mount on top of the config and then ostree config set it. That way it's really obvious that it's scoped to firstboot.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 2, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sdodson
Once this PR has been reviewed and has the lgtm label, please assign yuqi-zhang for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@templates/common/_base/units/machine-config-daemon-firstboot.service.yaml`:
- Around line 19-23: The ExecStartPre/ExecStopPost pair in
machine-config-daemon-firstboot.service.yaml currently disables core.fsync and
only re-enables it on clean stop, so add an idempotent recovery that always
ensures ostree core.fsync is true on subsequent boots: implement a small
follow-up systemd unit (or add to the existing machine-config-daemon.service)
that runs ostree config --repo=/sysroot/ostree/repo set core.fsync true
unconditionally at boot (or checks and sets it if false), enable it with
appropriate After= and WantedBy= so it always runs on startup, and keep the
original ExecStartPre/ExecStopPost as-is for the performance window.
- Line 19: The ExecStartPre that disables ostree core.fsync is currently not
marked as best-effort and will fail the unit if it errors; update the
machine-config-daemon-firstboot.service template so the ExecStartPre that runs
"/usr/bin/ostree ... set core.fsync false" is prefixed with "-" to make it
best-effort (so failures don't block firstboot), and also add a preceding
best-effort ExecStartPre that attempts to reset core.fsync to true
("/usr/bin/ostree ... set core.fsync true") to guard against previous runs
leaving fsync disabled; reference the ExecStartPre lines manipulating
"core.fsync" in the service unit and ensure both are best-effort so they won't
cause unit failure.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e1649fad-a0af-48f6-b2ce-d0cf41d462d6

📥 Commits

Reviewing files that changed from the base of the PR and between bf66c0b and 9afb1fc.

📒 Files selected for processing (1)
  • templates/common/_base/units/machine-config-daemon-firstboot.service.yaml

Comment thread templates/common/_base/units/machine-config-daemon-firstboot.service.yaml Outdated
Comment thread templates/common/_base/units/machine-config-daemon-firstboot.service.yaml Outdated
Copy link
Copy Markdown
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, the way we'd add a e.g. --disable-fsync to rpm-ostree rebase (or ostree container image pull) when we call it. But the even better fix here is to probably something like --disable-layers-fsync which skips it only when committing layers.

# Run this via podman because we want to use the nmstatectl binary in our container
ExecStart=/usr/bin/podman run --rm --privileged --net=host -v /:/rootfs --entrypoint machine-config-daemon '{{ .Images.machineConfigOperator }}' firstboot-complete-machineconfig --persist-nics
ExecStart=/usr/bin/podman run --rm --privileged --pid=host --net=host -v /:/rootfs --entrypoint machine-config-daemon '{{ .Images.machineConfigOperator }}' firstboot-complete-machineconfig
ExecStopPost=/usr/bin/ostree config --repo=/sysroot/ostree/repo set core.fsync true
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also loses whatever setting might've been set directly.
Also agree with the coderabbit comment re. error handling.

One way to structure this is to copy the config to somewhere in /run and then bind-mount on top of the config and then ostree config set it. That way it's really obvious that it's scoped to firstboot.

@sdodson
Copy link
Copy Markdown
Member Author

sdodson commented Jun 3, 2026

Ideally, the way we'd add a e.g. --disable-fsync to rpm-ostree rebase (or ostree container image pull) when we call it. But the even better fix here is to probably something like --disable-layers-fsync which skips it only when committing layers.

I can pursue the bindmount option if you think that's good enough to move forward.

If we really want to pursue the rpm-ostree change that feels like it's not worth it to achieve 4-10s improvement, at this time. We're taking ~ 40 seconds to rebase from RHCOS to OCP Node even when 100% of RHCOS chunks are present and if we're going to invest that level of effort I'd hope to get that cut at least in half. Here's the breakdown based on 4.22 logs

  Summary: The "51 chunks already present" saves what would be ~200-400 MB of base RHCOS content in an older-version scenario. But even with a perfect version match, you still pay for:
  1. ~22s for the extensions sidecar (always pulled from registry, never on AMI)
  2. ~22s for 219.6 MB of custom OCP layers (kubelet/CRI-O/OVS — same deal)
  3. ~10s of silent SELinux work
  4. ~9s of ostree staging

Number 1 is solved in #5905 just not in 4.22.

The time savings may not actually come via rpm-ostree but perhaps by fetching content for number 2 ahead of this stage, there's already about 30 seconds between when the network is up and this stage. But I'm not sure what item 3 is and if we can address that. Here's number 2 in detail from the logs.

  02:53:25  rpm-ostree rebase invoked
  02:53:26  ostree chunk layers already present: 51  ← the "same version" payoff
  02:53:26  custom layers needed: 2 (219.6 MB)      ← but these are always missing
  02:53:48  [0/2] Fetching bdc94d91... (219.6 MB)...done
  02:53:48  [1/2] Fetching 692e2beb... (6.6 kB)...done

I'm curious what's happening in number 3 and if it can be avoided in some way.

@dustymabe
Copy link
Copy Markdown
Member

add a e.g. --disable-fsync to rpm-ostree rebase (or ostree container image pull) when we call it. But the even better fix here is to probably something like --disable-layers-fsync

I'm assuming here that --disable-fsync or --disable-layers-fsync don't exist today and are things we'd have to add to rpm-ostree. I agree with @sdodson it's not worh the effort to implement that for 4-10s of gain.

I can pursue the bindmount option if you think that's good enough to move forward.

I think it is. I thought about the bind mount option before but you implemented it this way and I thought it was good enough. That's a long way of me saying I do like the bind mount option better if it works as well.

@dustymabe
Copy link
Copy Markdown
Member

The time savings may not actually come via rpm-ostree but perhaps by fetching content for number 2 ahead of this stage, there's already about 30 seconds between when the network is up and this stage.

Is that 30s number true even with #5905 ?

@jlebon
Copy link
Copy Markdown
Member

jlebon commented Jun 4, 2026

add a e.g. --disable-fsync to rpm-ostree rebase (or ostree container image pull) when we call it. But the even better fix here is to probably something like --disable-layers-fsync

I'm assuming here that --disable-fsync or --disable-layers-fsync don't exist today and are things we'd have to add to rpm-ostree. I agree with @sdodson it's not worh the effort to implement that for 4-10s of gain.

Not for this effort, I agree. But it's generally desirable of course. The impact of this likely gets worse the more layers you have.

I can pursue the bindmount option if you think that's good enough to move forward.

I think it is. I thought about the bind mount option before but you implemented it this way and I thought it was good enough. That's a long way of me saying I do like the bind mount option better if it works as well.

Good enough for me too to be clear!

@sdodson sdodson force-pushed the mcd-firstboot-fsync branch from 9afb1fc to c33eca9 Compare June 4, 2026 14:52
@sdodson
Copy link
Copy Markdown
Member Author

sdodson commented Jun 4, 2026

/hold
Debugging, it's failing to scale new nodes.

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 4, 2026
Saves a few seconds, measured at around 4-10s. Assumes we tolerate
the outcome that corruption due to powerfailure means we have to re-ignite
a node from scratch. Suggested by @dustymabe and original analysis at
https://github.com/dustymabe/20250529-ostree-fsync-analysis

Rather than writing fsync=false to disk and re-enabling it on stop, we
bind mount a tmpfs-backed copy of the ostree repo config over the real
path. The ostree config command then writes to the bind mount instead of
the underlying file, so the on-disk config is never modified. If the host
crashes or reboots before ExecStopPost runs, the kernel tears down the
bind mount and the original config is automatically restored.
@sdodson sdodson force-pushed the mcd-firstboot-fsync branch from c33eca9 to a50480e Compare June 5, 2026 17:42
sdodson added a commit to sdodson/machine-config-operator that referenced this pull request Jun 5, 2026
…ring firstboot

Two related optimizations to reduce I/O contention during node scale-up:

1. Defer rpm-ostree rollback cleanup by 60s +jitter (removeRollback)

   On boot 2, MCD calls `rpm-ostree cleanup -r` shortly after starting,
   which triggers three syncfs() calls totalling ~43s. These run
   concurrently with CNI image pulls and compete for disk bandwidth during
   the kubelet-to-ready window. Sleeping before cleanup lets the node reach
   Ready first. Jitter (up to 30s) spreads cleanup across nodes that scale
   up in the same burst.

2. Disable ostree fsync during machine-config-daemon-firstboot (from openshift#6122)

   Bind mount a tmpfs-backed copy of the ostree repo config over the real
   path before the rebase so that setting core.fsync=false is ephemeral.
   If the host crashes or reboots, the bind mount disappears and the
   on-disk config is unchanged automatically.

   We append a second [core] section (GKeyFile merges duplicate groups)
   rather than using `ostree config set` because ostree uses an atomic
   rename which fails with EBUSY on a bind-mounted file.

   Saves ~3-5s on rebase fetch. Measured at n=18 on AWS m6a.xlarge.
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 5, 2026

@sdodson: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants