
[internal-dns] register and publish ddmd in the switch zone#10381

Draft
zeeshanlakhani wants to merge 7 commits into main from zl/ddm-internal-dns


@zeeshanlakhani
Collaborator

@zeeshanlakhani zeeshanlakhani commented May 6, 2026

DDMD has always run in the switch zone alongside Dendrite, MGS, and MGD, but it was never registered in internal DNS, leaving no path for a cross-host consumer to discover it. This adds `ServiceName::Ddm`, plumbs `ddm_port` through the host-zone switch (RSS plan + reconfigurator DNS execution), and threads an `Overridables::ddm_ports` map for the test suite. It also adds a `DdmInstance` test fixture in test-utils that spawns a real `ddmd` subprocess via `--no-state-machine` (matching `MgdInstance`'s pattern), so the test harness registers a real DDM port in DNS the same way it does for the other switch-zone services.
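To make the registration concrete, here is a minimal Rust sketch of what a `Ddm` service name and a switch-zone-only publishing rule might look like. Everything in it is a hypothetical stand-in: the trimmed-down `ServiceName` enum, the SRV labels (such as `_ddm._tcp`), `ZoneKind`, and `publishes_ddm` are assumptions for illustration, not internal-dns's actual API. The switch-zone-only rule reflects the review follow-up, which registers only the switch-zone `ddmd` because sled global-zone instances are reached locally.

```rust
/// Hypothetical, trimmed-down stand-in for internal-dns's `ServiceName`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ServiceName {
    Dendrite,
    Mgs,
    Mgd,
    /// Newly registered switch-zone service.
    Ddm,
}

impl ServiceName {
    /// Service label used in the SRV name (illustrative values only).
    fn label(self) -> &'static str {
        match self {
            ServiceName::Dendrite => "_dendrite",
            ServiceName::Mgs => "_mgs",
            ServiceName::Mgd => "_mgd",
            ServiceName::Ddm => "_ddm",
        }
    }

    /// SRV-style name, e.g. `_ddm._tcp`.
    fn srv_name(self) -> String {
        format!("{}._tcp", self.label())
    }
}

/// Where a ddmd instance runs.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ZoneKind {
    Switch,
    SledGlobal,
}

/// Only the switch-zone ddmd is published in DNS; sled global-zone
/// instances are reached locally by their own host.
fn publishes_ddm(kind: ZoneKind) -> bool {
    matches!(kind, ZoneKind::Switch)
}

/// Services a switch zone would publish under this sketch.
fn switch_zone_services() -> [ServiceName; 4] {
    [
        ServiceName::Dendrite,
        ServiceName::Mgs,
        ServiceName::Mgd,
        ServiceName::Ddm,
    ]
}
```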

We also drop the duplicate `DDMD_PORT` const in `ddm-admin-client` in favor of the canonical `omicron_common::address::DDMD_PORT`. Same-host callers continue to use `Client::localhost()`.
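A sketch of how a test override map could sit on top of the canonical port, assuming overrides are keyed per sled and fall back to `DDMD_PORT` when absent. `SledId`, this `Overridables` shape, and its methods are hypothetical; the constant mirrors `omicron_common::address::DDMD_PORT`, but check the real definition rather than relying on the value written here.

```rust
use std::collections::HashMap;

/// Stand-in mirroring `omicron_common::address::DDMD_PORT`.
const DDMD_PORT: u16 = 8000;

/// Hypothetical sled identifier used as the override key.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct SledId(u32);

/// Hypothetical test-only overrides, in the spirit of `Overridables::ddm_ports`.
#[derive(Default)]
struct Overridables {
    ddm_ports: HashMap<SledId, u16>,
}

impl Overridables {
    /// Records a test-assigned port (e.g. one chosen by a spawned fixture).
    fn override_ddm_port(&mut self, sled: SledId, port: u16) {
        self.ddm_ports.insert(sled, port);
    }

    /// Port to publish for `sled`: the override if present, else the
    /// canonical port.
    fn ddm_port(&self, sled: SledId) -> u16 {
        self.ddm_ports.get(&sled).copied().unwrap_or(DDMD_PORT)
    }
}
```

Keeping the fallback inside one lookup means production paths never consult the map while tests can point DNS at whatever port the fixture actually bound.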

The real-subprocess fixture depends on oxidecomputer/maghemite#729, which adds a `--no-state-machine` flag to `ddmd` that skips the kernel-related state machine and leaves only the admin API running.

This was extracted from the multicast PR (zl/multicast-mgd-ddm), which is the first to resolve ddmd cross-host through DNS, with Nexus as the consumer.

References

DDMD has always run in the switch zone alongside Dendrite, MGS,
and MGD, but it was never registered in internal DNS, leaving no path for a
cross-host consumer to discover it. This adds `ServiceName::Ddm`,
plumbs `ddm_port` through the host-zone switch (RSS plan + reconfigurator
DNS execution), threads an `Overridables::ddm_ports` map for the
test suite, and adds a dropshot-based `DdmInstance` sim to test-utils so
that the test harness registers a real DDM port in DNS the same way it does
for the other switch-zone services.

We also drop the duplicate `DDMD_PORT` const in `ddm-admin-client` in favor of
the canonical `omicron_common::address::DDMD_PORT`. Same-host
callers continue to use `Client::localhost()`.

This was extracted from the multicast PR (zl/multicast-mgd-ddm), which
is the first to resolve ddmd cross-host through DNS, with Nexus as the consumer.
Contributor

@jgallagher jgallagher left a comment

Thank you very much for splitting this out!

Comment thread internal-dns/types/src/config.rs Outdated
Comment thread test-utils/src/dev/maghemite.rs Outdated
Comment thread test-utils/src/dev/maghemite.rs Outdated
zeeshanlakhani added a commit to oxidecomputer/maghemite that referenced this pull request May 7, 2026
Omicron's oxidecomputer/omicron#10381 introduces a stubbed `ddmd`
admin endpoint because spawning a real `ddmd` in a generic test
toolchain is not viable: the routing state machine (discovery, exchange, route
synchronization) depends on illumos networking facilities the toolchain does not
provide. Consumers of the stub, e.g., Nexus RPW (multicast members),
sled-agent's DDM reconciler, and anything that resolves the DDM internal-DNS
service name, cannot exercise the real admin surface from Omicron's test harness.

This work adds an opt-in `--no-state-machine` flag to `ddmd` that runs only
the admin API server and skips the state machine entirely, allowing the fixture
to spawn the real binary. This is analogous to `mgd --no-bgp-dispatcher`, which
Omicron's `MgdInstance` already uses for the same purpose.

To make the fixture path usable on Linux, `ddmd` itself must build on Linux.
The previous code pulled the illumos-only crates `libnet`, `dpd-client`,
`opte-ioctl`, and `oxide-vpc` unconditionally through `ddm`, which failed to
link on Linux (`-lzfs`, `-ldlpi`). This change introduces an `illumos` feature
in both `ddm` and `ddmd` (default-on, mirroring `mgd`'s `mg-lower` pattern) that
marks those four crates optional. The buildomat `linux.sh` job now builds `ddmd`
and `ddmadm`, with `ddmd` invoked as `cargo build --bin ddmd --no-default-features`.

The illumos-only halves of `ddm` are isolated by the feature gate:

- The routing state machine implementation moves from `sm.rs` into
  `sm/state.rs`.
- The exchange runtime (HTTP push/pull and route programming) moves from
  `exchange.rs` into `exchange/runtime.rs`.
- The discovery runtime (UDPv6 solicitation/advertisement loops) moves from
  `discovery.rs` into `discovery/runtime.rs`.

Each parent `mod.rs` keeps the platform-agnostic types and re-exports the
runtime surface so existing call sites resolve unchanged on illumos. The runtime
submodules are gated as a unit by
`#[cfg(all(feature = "illumos", target_os = "illumos"))]`. We also remove the
single-function `ddm/src/util.rs`, inlining the function into
`discovery/runtime.rs`, where its sole caller lives.
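The module layout described above can be sketched in a single file, assuming inline modules in place of separate `mod.rs`/`runtime.rs` files. `PeerInfo`, `start`, and `runtime_compiled` are hypothetical names; only the `cfg` predicate comes from the text (a later revision of this commit renames the feature to `backend`). Built without the feature, or on any non-illumos target, the runtime submodule and its re-export simply disappear while the shared types still compile.

```rust
mod discovery {
    /// Platform-agnostic type (hypothetical) that compiles everywhere.
    #[derive(Debug, Clone, PartialEq, Eq)]
    pub struct PeerInfo {
        pub name: String,
    }

    /// Illumos-only runtime half; stripped entirely when the predicate is false.
    #[cfg(all(feature = "illumos", target_os = "illumos"))]
    mod runtime {
        /// The UDPv6 solicitation/advertisement loops would start here.
        pub fn start() {}
    }

    /// Re-export so existing call sites keep using `discovery::start`.
    #[cfg(all(feature = "illumos", target_os = "illumos"))]
    pub use runtime::start;

    /// Reports whether the runtime half was compiled into this build.
    pub const fn runtime_compiled() -> bool {
        cfg!(all(feature = "illumos", target_os = "illumos"))
    }
}
```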

The SIGTERM cleanup handler is installed regardless of the flag, so
Ctrl-C still exits cleanly in `--no-state-machine` mode. The imported
route sets are empty in that mode, so the cleanup itself is a no-op.
Passing `--addr` alongside `--no-state-machine` is harmless: the addresses
are ignored and a warning is logged.
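The no-op cleanup falls out naturally if teardown just drains whatever was imported. A minimal sketch, assuming a hypothetical `ImportedRoute` type and a caller-supplied `withdraw` callback; the signal-handler plumbing itself is omitted.

```rust
/// Hypothetical stand-in for a route learned from a peer.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ImportedRoute {
    prefix: String,
}

/// Withdraws every imported route and returns how many were withdrawn.
/// With an empty set (as in API-only mode), `withdraw` is never called.
fn cleanup<F: FnMut(&ImportedRoute)>(
    imported: &mut Vec<ImportedRoute>,
    mut withdraw: F,
) -> usize {
    let count = imported.len();
    for route in imported.drain(..) {
        withdraw(&route);
    }
    count
}
```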
zeeshanlakhani added a commit to oxidecomputer/maghemite that referenced this pull request May 7, 2026
…fixture

We address @jgallagher's review by:

- Replacing the four positional `u16` arguments in `DnsConfigBuilder::host_zone_switch`
  with a `HostSwitchZonePorts` named-fields structure.

- Replacing the dropshot-based stubbed `DdmInstance` in test-utils with a
  fixture that spawns and supervises a real `ddmd` subprocess running with
  `--no-state-machine`, analogous to `MgdInstance` and `mgd --no-bgp-dispatcher`.
  Only the switch-zone `ddmd` is registered in internal DNS, while sled-global-zone
  instances are accessed locally by their own host and don't need DNS registration.

  This **does** require maghemite changes, already submitted as oxidecomputer/maghemite#729.

  To make this all work, we wire `ddmd` into the developer xtask toolchain.
  `cargo xtask download maghemite-ddmd` reuses the existing `mg-ddm.tar.gz`
  illumos zone artifact (extracting `ddmd`/`ddmadm`). On Linux it overlays a
  raw `ddmd` binary, and on macOS it builds from source.
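The first review item trades four positional `u16` arguments for named fields, so a swapped pair of ports becomes a visible mistake at the call site instead of a silent one. A sketch, assuming field names `dendrite`, `mgs`, `mgd`, and `ddm`, with a hypothetical `DnsConfigBuilderSketch` that records pairs instead of emitting DNS records; the port numbers in the test are arbitrary.

```rust
/// Hypothetical named-fields port bundle; field names are assumptions.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct HostSwitchZonePorts {
    pub dendrite: u16,
    pub mgs: u16,
    pub mgd: u16,
    pub ddm: u16,
}

/// Stand-in builder that records (service, port) pairs instead of DNS records.
#[derive(Default)]
pub struct DnsConfigBuilderSketch {
    records: Vec<(&'static str, u16)>,
}

impl DnsConfigBuilderSketch {
    pub fn host_zone_switch(&mut self, ports: HostSwitchZonePorts) {
        // Destructuring without `..` means a newly added field fails to
        // compile here until it is handled.
        let HostSwitchZonePorts { dendrite, mgs, mgd, ddm } = ports;
        self.records.extend([
            ("dendrite", dendrite),
            ("mgs", mgs),
            ("mgd", mgd),
            ("ddm", ddm),
        ]);
    }

    pub fn records(&self) -> &[(&'static str, u16)] {
        &self.records
    }
}
```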

Also, we had to bump `oxnet` from 0.1.4 to 0.1.5 to satisfy the new maghemite pin.
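The per-OS behavior of `cargo xtask download maghemite-ddmd` described in the review commit can be summarized as a small dispatch on the host OS. This sketch is hypothetical: `DdmdStep` and `ddmd_steps_for` are not the xtask's real types, and it assumes "overlays" means the Linux path extracts the zone artifact and then replaces its illumos `ddmd` with a Linux build. Whether the macOS path also extracts `ddmadm` is not stated, so it is left out.

```rust
/// Hypothetical steps for fetching ddmd in a developer toolchain.
#[derive(Debug, Clone, PartialEq, Eq)]
enum DdmdStep {
    /// Extract `ddmd`/`ddmadm` from the illumos zone artifact.
    ExtractZoneArtifact { archive: &'static str },
    /// Replace the extracted illumos `ddmd` with a Linux build.
    OverlayLinuxBinary,
    /// Build `ddmd` from source.
    BuildFromSource,
}

/// Maps a host OS name (as in `std::env::consts::OS`) to the steps above.
fn ddmd_steps_for(os: &str) -> Vec<DdmdStep> {
    let extract = DdmdStep::ExtractZoneArtifact { archive: "mg-ddm.tar.gz" };
    match os {
        "illumos" => vec![extract],
        "linux" => vec![extract, DdmdStep::OverlayLinuxBinary],
        "macos" => vec![DdmdStep::BuildFromSource],
        // Other hosts are not covered by the description.
        _ => Vec::new(),
    }
}
```

A caller would pass `std::env::consts::OS`, which reports `"illumos"`, `"linux"`, or `"macos"` on those hosts.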
@zeeshanlakhani zeeshanlakhani requested a review from taspelund May 7, 2026 06:56
zeeshanlakhani added a commit to oxidecomputer/maghemite that referenced this pull request May 7, 2026
@zeeshanlakhani zeeshanlakhani self-assigned this May 7, 2026
@zeeshanlakhani zeeshanlakhani force-pushed the zl/ddm-internal-dns branch from e3c6a18 to 9824436 Compare May 8, 2026 01:35
@zeeshanlakhani zeeshanlakhani force-pushed the zl/ddm-internal-dns branch from f69d6d4 to e212660 Compare May 8, 2026 07:38
zeeshanlakhani added a commit to oxidecomputer/maghemite that referenced this pull request May 8, 2026
Picks up the recent oxidecomputer/maghemite#729 (`ddmd --api-only` flag) and the
preceding changes on maghemite's main branch that moved canonical types out of the
auto-generated client into the `mg-api-types` crate.

Includes:

- replaces `rdb-types` (removed upstream) with `mg-api-types` as a direct
  workspace dep
- bumps `num_enum` 0.7.5 -> 0.7.6 to satisfy maghemite's workspace pin
- migrates uses of the moved types from the generated client to `mg-api-types`
- renames `bgp_apply_v2` callers to `bgp_apply`
- the `DdmInstance` fixture now passes `--api-only` instead of
  `--no-state-machine`, matching the renamed clap flag.
@zeeshanlakhani zeeshanlakhani requested a review from jgallagher May 19, 2026 12:53
Contributor

@jgallagher jgallagher left a comment

Changes all LGTM - just one comment about PR ordering.

Comment thread tools/maghemite_ddm_openapi_version Outdated
zeeshanlakhani added a commit to oxidecomputer/maghemite that referenced this pull request May 20, 2026
Omicron's oxidecomputer/omicron#10381 introduces a stubbed `ddmd` admin endpoint because spawning a real `ddmd` in a generic test toolchain is not viable: the routing state machine (discovery, exchange, route synchronization) depends on illumos networking facilities the toolchain does not provide. Consumers of the stub, e.g., Nexus RPW (multicast members), sled-agent's DDM reconciler, and anything that resolves the DDM internal-DNS service name, cannot exercise the real admin surface from Omicron's test harness.

This work adds an opt-in `--api-only` flag to `ddmd` that runs only the admin API server and skips the state machine entirely, allowing the fixture to spawn the real binary. This is analogous to `mgd --no-bgp-dispatcher`, which Omicron's `MgdInstance` already uses for the same purpose.
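A fixture in the style of `MgdInstance` mostly comes down to spawning the binary with the API-only flag and making sure the child is killed and reaped when the test ends. The following is a minimal sketch, not the real `DdmInstance`: `DdmInstanceSketch` is hypothetical, and any listen-address or port options the real fixture passes are left to `extra_args` rather than guessed.

```rust
use std::io;
use std::path::Path;
use std::process::{Child, Command, Stdio};

/// Runs only ddmd's admin API server. Earlier revisions of the maghemite
/// change called this flag `--no-state-machine`.
const API_ONLY_FLAG: &str = "--api-only";

/// Hypothetical, minimal stand-in for a `DdmInstance`-style fixture.
struct DdmInstanceSketch {
    child: Option<Child>,
}

impl DdmInstanceSketch {
    /// Builds the command line without spawning it.
    fn command(binary: &Path, extra_args: &[&str]) -> Command {
        let mut cmd = Command::new(binary);
        cmd.arg(API_ONLY_FLAG).args(extra_args).stdin(Stdio::null());
        cmd
    }

    /// Spawns the subprocess.
    fn start(binary: &Path, extra_args: &[&str]) -> io::Result<Self> {
        let child = Self::command(binary, extra_args).spawn()?;
        Ok(Self { child: Some(child) })
    }

    /// Kills and reaps the subprocess; safe to call more than once.
    fn cleanup(&mut self) -> io::Result<()> {
        if let Some(mut child) = self.child.take() {
            // Ignore kill errors (for example, a child that already exited)
            // and still reap it below.
            let _ = child.kill();
            child.wait()?;
        }
        Ok(())
    }
}

impl Drop for DdmInstanceSketch {
    fn drop(&mut self) {
        // Best effort: a test that panics should not leak a ddmd process.
        let _ = self.cleanup();
    }
}
```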

To make the fixture path usable on Linux, `ddmd` itself must build on Linux. The previous code pulled the illumos-only crates `libnet`, `dpd-client`, `opte-ioctl`, and `oxide-vpc` unconditionally through `ddm`, which failed to link on Linux (`-lzfs`, `-ldlpi`). This change introduces a `backend` feature in both `ddm` and `ddmd` (default-on, mirroring `mgd`'s `mg-lower` pattern) that marks those four crates optional. The buildomat `linux.sh` job now builds `ddmd` and `ddmadm`, with `ddmd` invoked as `cargo build --bin ddmd --no-default-features`.

The illumos-only halves of `ddm` are isolated by the feature gate:

- The routing state machine implementation moves from `sm.rs` into `sm/state.rs`.
- The exchange runtime (HTTP push/pull and route programming) moves from `exchange.rs` into `exchange/runtime.rs`.
- The discovery runtime (UDPv6 solicitation/advertisement loops) moves from `discovery.rs` into `discovery/runtime.rs`.

Each parent `mod.rs` keeps the platform-agnostic types and re-exports the runtime surface so existing call sites resolve unchanged on illumos. The runtime submodules are gated as a unit by `#[cfg(all(feature = "backend", target_os = "illumos"))]`. We also remove the single-function `ddm/src/util.rs`, inlining the function into `discovery/runtime.rs`, where its sole caller lives.

The SIGTERM cleanup handler is installed regardless of the flag, so Ctrl-C still exits cleanly in `--api-only` mode. The imported route sets are empty in that mode, so the cleanup itself is a no-op. `--api-only` and `--addr` are mutually exclusive at the clap level (`conflicts_with`), so passing them together is rejected at parse time.
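For readers unfamiliar with clap's `conflicts_with`, this std-only sketch re-implements the same parse-time rule by hand. It is not `ddmd`'s parser: `Mode` and `parse_mode` are hypothetical, and treating `--addr` as taking one value per occurrence is an assumption.

```rust
/// Hypothetical parsed run mode.
#[derive(Debug, PartialEq, Eq)]
enum Mode {
    ApiOnly,
    Full { addrs: Vec<String> },
}

/// Rejects `--api-only` combined with `--addr`, regardless of order.
fn parse_mode<'a, I: IntoIterator<Item = &'a str>>(args: I) -> Result<Mode, String> {
    let mut api_only = false;
    let mut addrs = Vec::new();
    let mut it = args.into_iter();
    while let Some(arg) = it.next() {
        match arg {
            "--api-only" => api_only = true,
            "--addr" => {
                // Assumed: each `--addr` carries exactly one value.
                let v = it.next().ok_or("--addr requires a value")?;
                addrs.push(v.to_string());
            }
            other => return Err(format!("unexpected argument: {other}")),
        }
    }
    // The conflict is checked after all flags are seen, as clap does.
    if api_only && !addrs.is_empty() {
        return Err("--api-only cannot be used with --addr".to_string());
    }
    Ok(if api_only { Mode::ApiOnly } else { Mode::Full { addrs } })
}
```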
This brings main forward and updates maghemite to current main
(9bb5037167c1ff0d812299f668841c9b7bda4480, which includes the merged
oxidecomputer/maghemite#729 with the `ddmd --api-only` flag).

We also bump workspace clap from 4.5 to 4.6 to satisfy the
new maghemite constraint. The lockfile changes cascade through to align the
omicron-as-git refs at 915f229 as well.
@zeeshanlakhani zeeshanlakhani requested a review from jgallagher May 20, 2026 12:50
@zeeshanlakhani zeeshanlakhani marked this pull request as draft May 20, 2026 13:31
@zeeshanlakhani
Collaborator Author

Reworking some deps shenanigans with maghemite upstream first.
