[ddmd] add --api-only flag for test fixtures and Linux build#729
Conversation
6a10770 to
6dcf010
Compare
Omicron's oxidecomputer/omicron#10381 introduces a stubbed `ddmd` admin endpoint because spawning a real `ddmd` in a generic test toolchain is not viable: the routing state machine (discovery, exchange, route synchronization) depends on illumos networking facilities the toolchain does not provide. Consumers of the stub, e.g., Nexus RPW (multicast members), sled-agent's DDM reconciler, and anything that resolves the DDM internal-DNS service name, cannot exercise the real admin surface from Omicron's test harness. This work adds an opt-in `--no-state-machine` flag to `ddmd` that runs only the admin API server and skips the state machine entirely, allowing the fixture to spawn the real binary. This is analogous to `mgd --no-bgp-dispatcher`, which Omicron's `MgdInstance` already uses for the same purpose. To make the fixture path usable on Linux, `ddmd` itself must build on Linux. The previous code pulled the illumos-only crates `libnet`, `dpd-client`, `opte-ioctl`, and `oxide-vpc` unconditionally through `ddm`, which failed to link on Linux (`-lzfs`, `-ldlpi`). This change introduces an `illumos` feature in both `ddm` and `ddmd` (default-on, mirroring `mgd`'s `mg-lower` pattern) that marks those four crates optional. The buildomat `linux.sh` job now builds `ddmd` and `ddmadm`, with `ddmd` invoked as `cargo build --bin ddmd --no-default-features`. The illumos-only halves of `ddm` are isolated by the feature gate: - The routing state machine implementation moves from `sm.rs` into `sm/state.rs`. - The exchange runtime (HTTP push/pull and route programming) moves from `exchange.rs` into `exchange/runtime.rs`. - The discovery runtime (UDPv6 solicitation/advertisement loops) moves from `discovery.rs` into `discovery/runtime.rs`. Each parent `mod.rs` keeps the platform-agnostic types and re-exports the runtime surface so existing call sites resolve unchanged on illumos. The runtime submodules are gated as a unit by `#[cfg(all(feature = "illumos", target_os = "illumos"))]`. We also remove the single-function `ddm/src/util.rs`, inlining the function into `discovery/runtime.rs`, where its sole caller lives. The SIGTERM cleanup handler is installed regardless of the flag, so Ctrl-C still exits cleanly in `--no-state-machine` mode. The imported route sets are empty in that mode, so the cleanup itself is a noop. Passing `--addr` alongside `--no-state-machine` is harmless but ignored, with a warning logged.
6dcf010 to
3b54e16
Compare
…fixture We address @jgallagher's review by: - Replacing the four positional `u16` arguments in `DnsConfigBuilder::host_zone_switch` with a `HostSwitchZonePorts` named-fields structure. - Replacing the dropshot-based stubbed `DdmInstance` in test-utils with a fixture that spawns and supervises a real `ddmd` subprocess running with `--no-state-machine`, analogous to `MgdInstance` and `mgd --no-bgp-dispatcher`. Only the switch-zone `ddmd` is registered in internal DNS, while sled-global-zone instances are accessed locally by their own host and don't need DNS registration. This **does** require maghemite changes, already PR'ed to oxidecomputer/maghemite#729. To make this all work, we wire `ddmd` into the developer xtask toolchain. `cargo xtask download maghemite-ddmd` reuses the existing `mg-ddm.tar.gz` illumos zone artifact (extracting `ddmd`/`ddmadm`). On Linux it overlays a raw `ddmd` binary, and on macOS it builds from source. Also, we had to bump `oxnet` from 0.1.4 to 0.1.5 to satisfy the new maghemite pin.
jgallagher
left a comment
There was a problem hiding this comment.
Just a few notes on the mechanics of the split; I'll defer to folks who know maghemite better for the code organization.
Includes: - Reject `--no-state-machine` together with `--addr` at clap level via `conflicts_with` - Collapse the two cfg-gated `termination_handler` variants into one cfg-gated body. - Rename the `illumos` Cargo feature to `state-machine` so that it describes the gated functionality (and matches the CLI flag) rather than colliding semantically with `target_os = "illumos"`.
|
Code seems ok to me overall, but I haven't spent enough time in the ddm side of this repo to have strong opinions on the organization. The feature and CLI flag naming could be a little more intuitive, as I don't think it's immediately obvious what effect "no state machine" would have to someone who isn't familiar with ddm or the feature itself. Maybe the CLI flag could be something like --api-only and the cargo feature could be "backend"? |
- CLI flag: `--no-state-machine` -> `--api-only` (describes what the daemon serves, not what it skips). - Cargo feature: `state-machine` -> `backend` (gates the illumos-only routing backend: state machine, exchange/discovery runtime, sys layer).
|
@taspelund updated. |
Picks up recent oxidecomputer/maghemite#729 (ddmd --api-only flag) and the preceding main changes that moved canonical types out of the auto-generated client into the `mg-api-types` crate. Includes: - replaces `rdb-types` (removed upstream) with `mg-api-types` as a direct workspace dep - bumps `num_enum` 0.7.5 -> 0.7.6 to satisfy maghemite's workspace pin - migrates types - renames `bgp_apply_v2` callers to `bgp_apply` - `DdmInstance` fixture is renamed from `--no-state-machine` to `--api-only` to match the new clap flag.
jgallagher
left a comment
There was a problem hiding this comment.
I'm not sure who should approve this - @taspelund and I both deferred. I'll slap my approval on with a couple comments about code that changed during the move - other than that this looks like just rearranging organization to support the new flags. But big 👍 if you want to find and wait for whoever really owns this to take a look.
| } | ||
| .to_logger("admin") | ||
| .map_err(|e| e.to_string())?; | ||
| let ds_log = log.new(o!( |
There was a problem hiding this comment.
This is a change to the logging level too, right? Previously we constructed a StderrTerminal at the Error level, but now we're inheriting whatever level log has - presumably Info? Is that okay?
There was a problem hiding this comment.
Yeah, good discussion point. Structured-logging aside (which is good here), the move to Info matches what mgd does vs what ddm had ~ mgd/src/admin.rs has been constructing its dropshot logger as log.new(o!("component" => COMPONENT_MGD, "module" => MOD_ADMIN, "unit" => UNIT_API_SERVER)), doing what the the parent level runs at (typically Info). The Error cap was a ddm-only.
So the question is do we want both daemons quiet normally vs on error? I think, I'll keep what ddm did previously and file an issue to look at logging consistency.
There was a problem hiding this comment.
I'm going to tag in @rcgoodfellow in to help give some historical context -- I just don't have the history with ddm to know why we'd filter the logs down to just >= error.
I would think info should generally not be so cluttered that we need to avoid them in the logger. If the reason for the filter was truly just noise, then we should consider converting noisy logging calls to debug instead of leaving them at info... although I'm ok if that means filing a follow-up issue to address it later, rather than holding up this PR
There was a problem hiding this comment.
#740 opened to do the logging level change separately.
|
I discussed this with @taspelund, and he was good with it. We'll tackle the logging level separately as per the issue. |
This brings main forward and updates maghemite to current main (9bb5037167c1ff0d812299f668841c9b7bda4480, including the merged PR oxidecomputer/maghemite#729 with the ddmd --api-only flag). We also bump workspace clap from 4.5 to 4.6 to satisfy the new maghemite constraint. The lockfile cascades through to align omicron-as-git refs at 915f229 too.
This brings main forward and updates maghemite to current main (9bb5037167c1ff0d812299f668841c9b7bda4480, including the merged PR oxidecomputer/maghemite#729 with the ddmd --api-only flag). We also bump workspace clap from 4.5 to 4.6 to satisfy the new maghemite constraint. The lockfile cascades through to align omicron-as-git refs at 915f229 too. Finally, we patch `oxlog` to the `[patch."github.com/oxidecomputer/omicron"]` list to resolve a duplicate-package error from maghemite's transitive illumos-utils -> oxlog pull.
This brings main forward and updates maghemite to current main (9bb5037167c1ff0d812299f668841c9b7bda4480, including the merged PR oxidecomputer/maghemite#729 with the ddmd --api-only flag). We also bump workspace clap from 4.5 to 4.6 to satisfy the new maghemite constraint. The lockfile cascades through to align omicron-as-git refs at 915f229 too. Finally, we patch `oxlog` to the `[patch."github.com/oxidecomputer/omicron"]` list to resolve a duplicate-package error from maghemite's transitive illumos-utils -> oxlog pull.
This brings main forward and updates maghemite to current main (9bb5037167c1ff0d812299f668841c9b7bda4480, including the merged PR oxidecomputer/maghemite#729 with the ddmd --api-only flag). We also bump workspace clap from 4.5 to 4.6 to satisfy the new maghemite constraint. The lockfile cascades through to align omicron-as-git refs at 915f229 too. Finally, we patch `oxlog` to the `[patch."github.com/oxidecomputer/omicron"]` list to resolve a duplicate-package error from maghemite's transitive illumos-utils -> oxlog pull.
This brings main forward and updates maghemite to current main (9bb5037167c1ff0d812299f668841c9b7bda4480, including the merged PR oxidecomputer/maghemite#729 with the ddmd --api-only flag). We also bump workspace clap from 4.5 to 4.6 to satisfy the new maghemite constraint. The lockfile cascades through to align omicron-as-git refs at 915f229 too. Finally, we patch `oxlog` to the `[patch."github.com/oxidecomputer/omicron"]` list to resolve a duplicate-package error from maghemite's transitive illumos-utils -> oxlog pull.
This brings main forward and updates maghemite to current main (9bb5037167c1ff0d812299f668841c9b7bda4480, including the merged PR oxidecomputer/maghemite#729 with the ddmd --api-only flag). We also bump workspace clap from 4.5 to 4.6 to satisfy the new maghemite constraint. The lockfile cascades through to align omicron-as-git refs at 915f229 too. Finally, we patch `oxlog` to the `[patch."github.com/oxidecomputer/omicron"]` list to resolve a duplicate-package error from maghemite's transitive illumos-utils -> oxlog pull.
This brings main forward and updates maghemite to current main (9bb5037167c1ff0d812299f668841c9b7bda4480, including the merged PR oxidecomputer/maghemite#729 with the ddmd --api-only flag). We also bump workspace clap from 4.5 to 4.6 to satisfy the new maghemite constraint. The lockfile cascades through to align omicron-as-git refs at 915f229 too. Finally, we patch `oxlog` to the `[patch."github.com/oxidecomputer/omicron"]` list to resolve a duplicate-package error from maghemite's transitive illumos-utils -> oxlog pull.
Omicron's oxidecomputer/omicron#10381 introduces a stubbed
ddmdadmin endpoint because spawning a realddmdin a generic test toolchain is not viable: the routing state machine (discovery, exchange, route synchronization) depends on illumos networking facilities the toolchain does not provide. Consumers of the stub, e.g., Nexus RPW (multicast members), sled-agent's DDM reconciler, and anything that resolves the DDM internal-DNS service name, cannot exercise the real admin surface from Omicron's test harness.This work adds an opt-in
--api-onlyflag toddmdthat runs only the admin API server and skips the state machine entirely, allowing the fixture to spawn the real binary. This is analogous tomgd --no-bgp-dispatcher, which Omicron'sMgdInstancealready uses for the same purpose.To make the fixture path usable on Linux,
ddmditself must build on Linux. The previous code pulled the illumos-only crateslibnet,dpd-client,opte-ioctl, andoxide-vpcunconditionally throughddm, which failed to link on Linux (-lzfs,-ldlpi). This change introduces abackendfeature in bothddmandddmd(default-on, mirroringmgd'smg-lowerpattern) that marks those four crates optional. The buildomatlinux.shjob now buildsddmdandddmadm, withddmdinvoked ascargo build --bin ddmd --no-default-features.The illumos-only halves of
ddmare isolated by the feature gate:sm.rsintosm/state.rs.exchange.rsintoexchange/runtime.rs.discovery.rsintodiscovery/runtime.rs.Each parent
mod.rskeeps the platform-agnostic types and re-exports the runtime surface so existing call sites resolve unchanged on illumos. The runtime submodules are gated as a unit by#[cfg(all(feature = "backend", target_os = "illumos"))]. We also remove the single-functionddm/src/util.rs, inlining the function intodiscovery/runtime.rs, where its sole caller lives.The SIGTERM cleanup handler is installed regardless of the flag, so Ctrl-C still exits cleanly in
--api-onlymode. The imported route sets are empty in that mode, so the cleanup itself is a noop.--api-onlyand--addrare mutually exclusive at the clap level (conflicts_with), so passing them together is rejected at parse time.