
[ddmd] add --api-only flag for test fixtures and Linux build #729

Merged
zeeshanlakhani merged 5 commits into main from zl/ddmd-no-state-machine
May 20, 2026

@zeeshanlakhani
Contributor

@zeeshanlakhani zeeshanlakhani commented May 7, 2026

Omicron's oxidecomputer/omicron#10381 introduces a stubbed ddmd admin endpoint because spawning a real ddmd in a generic test toolchain is not viable: the routing state machine (discovery, exchange, route synchronization) depends on illumos networking facilities the toolchain does not provide. Consumers of the stub, e.g., Nexus RPW (multicast members), sled-agent's DDM reconciler, and anything that resolves the DDM internal-DNS service name, cannot exercise the real admin surface from Omicron's test harness.

This work adds an opt-in --api-only flag to ddmd that runs only the admin API server and skips the state machine entirely, allowing the fixture to spawn the real binary. This is analogous to mgd --no-bgp-dispatcher, which Omicron's MgdInstance already uses for the same purpose.

To make the fixture path usable on Linux, ddmd itself must build on Linux. The previous code pulled the illumos-only crates libnet, dpd-client, opte-ioctl, and oxide-vpc unconditionally through ddm, which failed to link on Linux (-lzfs, -ldlpi). This change introduces a backend feature in both ddm and ddmd (default-on, mirroring mgd's mg-lower pattern) that marks those four crates optional. The buildomat linux.sh job now builds ddmd and ddmadm, with ddmd invoked as cargo build --bin ddmd --no-default-features.

The illumos-only halves of ddm are isolated by the feature gate:

  • The routing state machine implementation moves from sm.rs into sm/state.rs.
  • The exchange runtime (HTTP push/pull and route programming) moves from exchange.rs into exchange/runtime.rs.
  • The discovery runtime (UDPv6 solicitation/advertisement loops) moves from discovery.rs into discovery/runtime.rs.

Each parent mod.rs keeps the platform-agnostic types and re-exports the runtime surface so existing call sites resolve unchanged on illumos. The runtime submodules are gated as a unit by #[cfg(all(feature = "backend", target_os = "illumos"))]. We also remove the single-function ddm/src/util.rs, inlining the function into discovery/runtime.rs, where its sole caller lives.
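
As a rough illustration of the gating described above, here is a minimal sketch of a parent module that keeps a platform-agnostic type and re-exports a runtime submodule only when the `backend` feature is enabled on illumos. The module, type, and function names (`discovery::Version`, `runtime::start`, `backend_compiled`) are hypothetical and are not taken from the `ddm` crate; only the `cfg` predicate comes from the description.

```rust
// Sketch only: all names below are hypothetical, not the ddm crate's items.
mod discovery {
    /// Platform-agnostic type: compiled on every target.
    #[derive(Debug, Clone, PartialEq, Eq)]
    pub struct Version(pub u8);

    /// illumos-only runtime, gated as a unit by the feature and the target OS.
    #[cfg(all(feature = "backend", target_os = "illumos"))]
    mod runtime {
        pub fn start() -> &'static str {
            "discovery runtime started"
        }
    }

    // Re-export so call sites keep writing `discovery::start()` on illumos.
    #[cfg(all(feature = "backend", target_os = "illumos"))]
    pub use self::runtime::start;
}

/// Reports whether the gated runtime was compiled into this build.
pub fn backend_compiled() -> bool {
    cfg!(all(feature = "backend", target_os = "illumos"))
}

fn main() {
    let v = discovery::Version(2);
    println!("platform-agnostic type is always available: {:?} ({})", v, v.0);
    println!("backend compiled in: {}", backend_compiled());

    // This call only exists when the gate is satisfied.
    #[cfg(all(feature = "backend", target_os = "illumos"))]
    println!("{}", discovery::start());
}
```

With a layout like this, a `--no-default-features` build on Linux never sees the `runtime` module at all, so none of its dependencies need to link there.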

The SIGTERM cleanup handler is installed regardless of the flag, so Ctrl-C still exits cleanly in --api-only mode. The imported route sets are empty in that mode, so the cleanup itself is a noop. --api-only and --addr are mutually exclusive at the clap level (conflicts_with), so passing them together is rejected at parse time.
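
To make the parse-time rejection concrete, the following is a standard-library-only stand-in for the rule that clap's `conflicts_with` enforces. It is not the PR's code: `ddmd` uses clap, whereas this sketch hand-rolls a tiny parser, and the `Opts` struct, the `parse` function, the error strings, and the treatment of `--addr` as a plain string are all assumptions made for illustration.

```rust
/// Hypothetical parsed options; field names and types are illustrative only.
#[derive(Debug, PartialEq)]
pub struct Opts {
    pub api_only: bool,
    pub addr: Option<String>,
}

/// Parses `--api-only` and `--addr <value>`, rejecting the combination
/// before any daemon logic would run.
pub fn parse<'a>(args: impl IntoIterator<Item = &'a str>) -> Result<Opts, String> {
    let mut opts = Opts { api_only: false, addr: None };
    let mut it = args.into_iter();
    while let Some(arg) = it.next() {
        match arg {
            "--api-only" => opts.api_only = true,
            "--addr" => {
                let value = it.next().ok_or("--addr requires a value")?;
                opts.addr = Some(value.to_string());
            }
            other => return Err(format!("unexpected argument: {}", other)),
        }
    }
    // The mutual-exclusion check: both flags together is a parse error.
    if opts.api_only && opts.addr.is_some() {
        return Err("--api-only cannot be used with --addr".to_string());
    }
    Ok(opts)
}

fn main() {
    println!("{:?}", parse(["--api-only"]));
    println!("{:?}", parse(["--addr", "fe80::1"]));
    println!("{:?}", parse(["--api-only", "--addr", "fe80::1"]));
}
```

Rejecting the combination at parse time, rather than warning and ignoring `--addr` as the earlier revision did, means a misconfigured fixture fails immediately instead of running with an option that silently has no effect.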

@zeeshanlakhani zeeshanlakhani requested review from jgallagher and taspelund and removed request for jgallagher and taspelund May 7, 2026 01:59
@zeeshanlakhani zeeshanlakhani force-pushed the zl/ddmd-no-state-machine branch from 6a10770 to 6dcf010 Compare May 7, 2026 03:23
@zeeshanlakhani zeeshanlakhani changed the title [ddmd] add --no-state-machine flag for test fixtures [ddmd] add --no-state-machine flag for test fixtures and Linux build May 7, 2026
Omicron's oxidecomputer/omicron#10381 introduces a stubbed `ddmd`
admin endpoint because spawning a real `ddmd` in a generic test
toolchain is not viable: the routing state machine (discovery, exchange, route
synchronization) depends on illumos networking facilities the toolchain does not
provide. Consumers of the stub, e.g., Nexus RPW (multicast members),
sled-agent's DDM reconciler, and anything that resolves the DDM internal-DNS
service name, cannot exercise the real admin surface from Omicron's test harness.

This work adds an opt-in `--no-state-machine` flag to `ddmd` that runs only
the admin API server and skips the state machine entirely, allowing the fixture
to spawn the real binary. This is analogous to `mgd --no-bgp-dispatcher`, which
Omicron's `MgdInstance` already uses for the same purpose.

To make the fixture path usable on Linux, `ddmd` itself must build on Linux.
The previous code pulled the illumos-only crates `libnet`, `dpd-client`,
`opte-ioctl`, and `oxide-vpc` unconditionally through `ddm`, which failed to
link on Linux (`-lzfs`, `-ldlpi`). This change introduces an `illumos` feature
in both `ddm` and `ddmd` (default-on, mirroring `mgd`'s `mg-lower` pattern) that
marks those four crates optional. The buildomat `linux.sh` job now builds `ddmd`
and `ddmadm`, with `ddmd` invoked as `cargo build --bin ddmd --no-default-features`.

The illumos-only halves of `ddm` are isolated by the feature gate:

- The routing state machine implementation moves from `sm.rs` into
  `sm/state.rs`.
- The exchange runtime (HTTP push/pull and route programming) moves from
  `exchange.rs` into `exchange/runtime.rs`.
- The discovery runtime (UDPv6 solicitation/advertisement loops) moves from
  `discovery.rs` into `discovery/runtime.rs`.

Each parent `mod.rs` keeps the platform-agnostic types and re-exports the
runtime surface so existing call sites resolve unchanged on illumos. The runtime
submodules are gated as a unit by `#[cfg(all(feature = "illumos",
target_os = "illumos"))]`. We also remove the single-function `ddm/src/util.rs`,
inlining the function into `discovery/runtime.rs`, where its sole caller lives.

The SIGTERM cleanup handler is installed regardless of the flag, so
Ctrl-C still exits cleanly in `--no-state-machine` mode. The imported
route sets are empty in that mode, so the cleanup itself is a noop.
Passing `--addr` alongside `--no-state-machine` is harmless but ignored,
with a warning logged.
@zeeshanlakhani zeeshanlakhani force-pushed the zl/ddmd-no-state-machine branch from 6dcf010 to 3b54e16 Compare May 7, 2026 04:25
zeeshanlakhani added a commit to oxidecomputer/omicron that referenced this pull request May 7, 2026
…fixture

We address @jgallagher's review by:

- Replacing the four positional `u16` arguments in `DnsConfigBuilder::host_zone_switch`
  with a `HostSwitchZonePorts` named-fields structure.

- Replacing the dropshot-based stubbed `DdmInstance` in test-utils with a
  fixture that spawns and supervises a real `ddmd` subprocess running with
  `--no-state-machine`, analogous to `MgdInstance` and `mgd --no-bgp-dispatcher`.
  Only the switch-zone `ddmd` is registered in internal DNS, while sled-global-zone
  instances are accessed locally by their own host and don't need DNS registration.

  This **does** require maghemite changes, already PR'ed to oxidecomputer/maghemite#729.

  To make this all work, we wire `ddmd` into the developer xtask toolchain.
  `cargo xtask download maghemite-ddmd` reuses the existing `mg-ddm.tar.gz`
  illumos zone artifact (extracting `ddmd`/`ddmadm`). On Linux it overlays a
  raw `ddmd` binary, and on macOS it builds from source.

Also, we had to bump `oxnet` from 0.1.4 to 0.1.5 to satisfy the new maghemite pin.
Contributor

@jgallagher jgallagher left a comment

Just a few notes on the mechanics of the split; I'll defer to folks who know maghemite better for the code organization.

Comment thread ddmd/src/main.rs Outdated
Comment thread ddmd/src/main.rs Outdated
Comment thread ddm/Cargo.toml Outdated
Includes:

- Reject `--no-state-machine` together with `--addr` at clap level via `conflicts_with`
- Collapse the two cfg-gated `termination_handler` variants into one cfg-gated body.
- Rename the `illumos` Cargo feature to `state-machine` so that it describes the gated 
  functionality (and matches the CLI flag) rather than colliding semantically with 
  `target_os = "illumos"`.
@zeeshanlakhani zeeshanlakhani requested a review from jgallagher May 15, 2026 05:55
@taspelund
Contributor

taspelund commented May 15, 2026

Code seems ok to me overall, but I haven't spent enough time in the ddm side of this repo to have strong opinions on the organization.

The feature and CLI flag naming could be a little more intuitive, as I don't think it's immediately obvious what effect "no state machine" would have to someone who isn't familiar with ddm or the feature itself.

Maybe the CLI flag could be something like --api-only and the cargo feature could be "backend"?

- CLI flag: `--no-state-machine` -> `--api-only` (describes what the
  daemon serves, not what it skips).
- Cargo feature: `state-machine` -> `backend` (gates the illumos-only
  routing backend: state machine, exchange/discovery runtime, sys layer).
@zeeshanlakhani
Contributor Author

@taspelund updated.

@zeeshanlakhani zeeshanlakhani changed the title [ddmd] add --no-state-machine flag for test fixtures and Linux build [ddmd] add --api-only flag for test fixtures and Linux build May 18, 2026
@zeeshanlakhani zeeshanlakhani self-assigned this May 18, 2026
zeeshanlakhani added a commit to oxidecomputer/omicron that referenced this pull request May 19, 2026
Picks up recent oxidecomputer/maghemite#729 (ddmd --api-only flag) and the
preceding main changes that moved canonical types out of the auto-generated
client into the `mg-api-types` crate.

Includes:

- replaces `rdb-types` (removed upstream) with `mg-api-types` as a direct
  workspace dep
- bumps `num_enum` 0.7.5 -> 0.7.6 to satisfy maghemite's workspace pin
- migrates types
- renames `bgp_apply_v2` callers to `bgp_apply`
- `DdmInstance` fixture is renamed from `--no-state-machine` to `--api-only` to
  match the new clap flag.
Contributor

@jgallagher jgallagher left a comment

I'm not sure who should approve this - @taspelund and I both deferred. I'll slap my approval on with a couple comments about code that changed during the move - other than that this looks like just rearranging organization to support the new flags. But big 👍 if you want to find and wait for whoever really owns this to take a look.

Comment thread ddm/src/admin.rs Outdated
}
.to_logger("admin")
.map_err(|e| e.to_string())?;
let ds_log = log.new(o!(
Contributor

This is a change to the logging level too, right? Previously we constructed a StderrTerminal at the Error level, but now we're inheriting whatever level log has - presumably Info? Is that okay?

Contributor Author

Yeah, good discussion point. Structured logging aside (which is good here), the move to Info matches what mgd does rather than what ddm had: mgd/src/admin.rs has been constructing its dropshot logger as log.new(o!("component" => COMPONENT_MGD, "module" => MOD_ADMIN, "unit" => UNIT_API_SERVER)), which logs at whatever level the parent runs at (typically Info). The Error cap was ddm-only.

So the question is: do we want both daemons quiet normally, logging only on error? I think I'll keep what ddm did previously and file an issue to look at logging consistency.

Contributor

I'm going to tag in @rcgoodfellow to help give some historical context -- I just don't have the history with ddm to know why we'd filter the logs down to just >= error.

I would think info should generally not be so cluttered that we need to avoid them in the logger. If the reason for the filter was truly just noise, then we should consider converting noisy logging calls to debug instead of leaving them at info... although I'm ok if that means filing a follow-up issue to address it later, rather than holding up this PR

Contributor Author

#740 opened to do the logging level change separately.

Comment thread ddm/src/exchange/runtime.rs Outdated
@zeeshanlakhani
Contributor Author

I discussed this with @taspelund, and he was good with it. We'll tackle the logging level separately as per the issue.

@zeeshanlakhani zeeshanlakhani merged commit 2636abd into main May 20, 2026
16 checks passed
@zeeshanlakhani zeeshanlakhani deleted the zl/ddmd-no-state-machine branch May 20, 2026 01:36
zeeshanlakhani added a commit to oxidecomputer/omicron that referenced this pull request May 20, 2026
This brings main forward and updates maghemite to current main
(9bb5037167c1ff0d812299f668841c9b7bda4480, including the merged PR 
oxidecomputer/maghemite#729 with the ddmd --api-only flag). 

We also bump workspace clap from 4.5 to 4.6 to satisfy the
new maghemite constraint. The lockfile cascades through to align
omicron-as-git refs at 915f229 too.
zeeshanlakhani added a commit to oxidecomputer/omicron that referenced this pull request May 20, 2026
This brings main forward and updates maghemite to current main
(9bb5037167c1ff0d812299f668841c9b7bda4480, including the merged PR 
oxidecomputer/maghemite#729 with the ddmd --api-only flag). 

We also bump workspace clap from 4.5 to 4.6 to satisfy the
new maghemite constraint. The lockfile cascades through to align
omicron-as-git refs at 915f229 too. 

Finally, we patch `oxlog` to the `[patch."github.com/oxidecomputer/omicron"]`
list to resolve a duplicate-package error from maghemite's transitive
illumos-utils -> oxlog pull.
zeeshanlakhani added a commit to oxidecomputer/omicron that referenced this pull request May 21, 2026
This brings main forward and updates maghemite to current main
(9bb5037167c1ff0d812299f668841c9b7bda4480, including the merged PR 
oxidecomputer/maghemite#729 with the ddmd --api-only flag). 

We also bump workspace clap from 4.5 to 4.6 to satisfy the
new maghemite constraint. The lockfile cascades through to align
omicron-as-git refs at 915f229 too. 

Finally, we patch `oxlog` to the `[patch."github.com/oxidecomputer/omicron"]`
list to resolve a duplicate-package error from maghemite's transitive
illumos-utils -> oxlog pull.

3 participants