[fm] Add simple disk diagnoser based on zpool health by smklein · Pull Request #10460 · oxidecomputer/omicron

smklein · 2026-05-19T00:13:33Z

The first fault management diagnosis engine: opens a case for any
non-Online zpool whose backing physical disk is currently in service
in the control plane, and closes it on recovery or expungement.

Supporting infrastructure introduced along the way:

- DiagnosisEngineKind::Disk variant (Rust + DB enum)
- fm_case_fact child table for per-engine state (one case has 0..N
  immutable facts; stable UUIDs across sitreps; participates in
  copy-forward + GC like other sitrep child tables)
- CaseBuilder::{add_fact, remove_fact, facts} API
- InServiceDisk nexus-types projection consumed by FM, populated from
  the existing zpool_list_all_external_batched datastore method with
  policy filtering done in the background task

smklein · 2026-05-19T00:36:48Z

+pub(super) fn analyze(
+    input: &Input,
+    builder: &mut SitrepBuilder<'_>,
+) -> anyhow::Result<()> {


So, the whole point of this PR is "to be able to build something here, and re-use it", but ironically the contents of this particular DE are especially prone to change.

The "short version" of what we're doing:

- Look at inventory, DB state, old sitreps
- Make sure a case exists for each unhealthy zpool, with a corresponding "DiskFact"
- Close old cases if their zpool is now healthy (or expunged)

We're doing this with a jumble of indices, iterations, etc. I think those will change. I think this DE will grow to track other state about these disks. I think each of these cases will potentially grow to have different facts.
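
For readers following along, here is a minimal sketch of the three steps above. It is not the code in this PR: `ZpoolHealth`, `DiskFact`, `Sitrep`, and the `u32` zpool IDs are hypothetical stand-ins, and a zpool missing from the observed map is assumed to mean "expunged or no longer in service".

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for the zpool health reported by inventory.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ZpoolHealth {
    Online,
    Degraded,
    Faulted,
}

/// Hypothetical per-case fact recording why the case was opened.
#[derive(Debug, Clone, PartialEq, Eq)]
struct DiskFact {
    zpool_id: u32,
    health: ZpoolHealth,
}

/// Hypothetical, heavily simplified sitrep: one case (and fact) per zpool.
#[derive(Debug, Default)]
struct Sitrep {
    /// Open cases keyed by the zpool they track.
    open: BTreeMap<u32, DiskFact>,
    /// Zpools whose cases have been closed.
    closed: BTreeSet<u32>,
}

/// `observed` maps each in-service zpool to its current health.
/// A zpool absent from `observed` is treated as expunged.
fn analyze(observed: &BTreeMap<u32, ZpoolHealth>, sitrep: &mut Sitrep) {
    // Ensure a case (with a DiskFact) exists for each unhealthy zpool.
    for (&zpool_id, &health) in observed {
        if health != ZpoolHealth::Online {
            sitrep
                .open
                .entry(zpool_id)
                .or_insert(DiskFact { zpool_id, health });
        }
    }

    // Close cases whose zpool has recovered or is no longer in service.
    let to_close: Vec<u32> = sitrep
        .open
        .keys()
        .copied()
        .filter(|id| matches!(observed.get(id), None | Some(ZpoolHealth::Online)))
        .collect();
    for id in to_close {
        sitrep.open.remove(&id);
        sitrep.closed.insert(id);
    }
}
```

Existing cases are left untouched by `or_insert`, which is one way to keep a fact stable across sitreps while its zpool stays unhealthy.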

smklein · 2026-05-19T01:03:18Z


+    /// Fetch all `fm_case_fact` rows belonging to cases in the given sitrep,
+    /// grouped by `case_id`.
+    async fn fm_case_facts_read_on_conn(


By reading facts alongside cases, there isn't really a need to mark "DE" on the fact table, so I removed it. It's redundant data anyway.

(Figured I'd mention this because it diverges slightly from the DB structure we talked about - but still sorts facts into case-specific buckets, so we can still "parse by the case DE type").
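
As an illustration of the "case-specific buckets" idea, here is a rough sketch of grouping fact rows by `case_id`. The `FactRow` shape and the `u32` IDs are assumptions made for the example; the real `fm_case_fact` model uses UUIDs and has more columns.

```rust
use std::collections::BTreeMap;

/// Hypothetical row shape for the example. Note there is no DE column:
/// the owning DE is looked up through the case.
#[derive(Debug, Clone, PartialEq)]
struct FactRow {
    id: u32,
    case_id: u32,
    payload: String,
}

/// Bucket fact rows by the case that owns them.
fn group_by_case(rows: Vec<FactRow>) -> BTreeMap<u32, Vec<FactRow>> {
    let mut by_case: BTreeMap<u32, Vec<FactRow>> = BTreeMap::new();
    for row in rows {
        by_case.entry(row.case_id).or_default().push(row);
    }
    by_case
}
```

Once facts are bucketed this way, the caller already knows which case (and therefore which DE) it is parsing payloads for.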

Hm, I still think it's probably worth including in the DB record as a structured field, even if only for debugging reasons for now.

Also, at some point, I think we are going to probably have to figure out a way to allow multiple DEs to add facts to a case, although we don't have to cross that bridge yet. Consider the example of an ereport.data_loss.possible ereport indicating that a service processor has restarted and will need to be health-checked, as described in RFD 589. Suppose we have a trivial DE for handling data loss reports from SPs by doing a complete health check of that SP. This might open a case, and then request additional health checking of that DE, which might record some facts. Suppose one of those facts includes data that another DE would use to diagnose a fault. We should figure out how that flow will work, although we don't have to in this PR...

Couldn't each of those DEs just make a duplicate copy of that fact in their own cases? That seems like it helps keep the fact lifecycle scoped "per-case", which is what we want.

I really hesitate to include this data "just to have it" because then it means we need to handle the case where "fact.de != fact.case.de", which is an impossible data corruption case we could just avoid by omitting the column

Specifically with your data-loss case: my main argument is that "facts are associated with cases", regardless of how they're generated.

So: in the case where we have "DE 1 which does something, but wants to write down a fact for a case managed by DE 2" - I think we can make this happen in-memory during sitrep construction, but on-disk, this could look like:

DE1 has a case C1, queries for data

(next sitrep) DE1 sees new data for C1, decides to open a case C2 for analysis by a different DE (DE2). It can also pass along a fact for C2

On-disk: That fact is associated with C2. We could have a "comment" about how it was originally noticed by DE1/C1? But that origination doesn't really matter
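
A rough sketch of that handoff, under the assumption that a fact carries no DE field of its own and belongs only to its case. `DeKind`, `Case`, `Fact`, and `hand_off` are hypothetical names for this example, not types from the PR.

```rust
/// Hypothetical diagnosis engine kinds for the example.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DeKind {
    SpHealthCheck,
    Other,
}

/// A fact has no DE field; its DE is whatever DE owns its case.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Fact {
    payload: String,
    comment: String,
}

#[derive(Debug)]
struct Case {
    id: u32,
    de: DeKind,
    facts: Vec<Fact>,
}

/// The DE owning `source` opens a new case for `target` and passes along
/// a copy of `fact`. The origin survives only as a free-form comment.
fn hand_off(source: &Case, fact: &Fact, new_id: u32, target: DeKind) -> Case {
    Case {
        id: new_id,
        de: target,
        facts: vec![Fact {
            payload: fact.payload.clone(),
            comment: format!(
                "originally noticed by {:?} in case {}",
                source.de, source.id
            ),
        }],
    }
}
```

Because the new fact is a copy owned by the new case, closing or garbage-collecting the source case does not affect it.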

andrewjstone

It's really exciting to see this coming together!

I think it makes sense to use JSON for payloads in the DB due to the explosion in types as discussed in chat. I wonder about the versioning strategy though. The DEs in Nexus are the only things that need to interpret payloads, but they are essentially client-side versioned. During an update, Nexus will not understand new payloads. Do we plan to use a two-phase update model where reporters can't issue newly added reports until a second update, or will DEs just ignore payloads they can't understand?

smklein · 2026-05-19T16:27:40Z

> During an update, Nexus will not understand new payloads. Do we plan to use a two-phase update model where reporters can't issue newly added reports until a second update, or will DE's just ignore payloads they can't understand?

Nexus is performing an atomic handoff from "old" to "new" before the database can be accessed, right? I don't think we need to worry about a mixed-version Nexus scenario - I believe we'll have "old Nexus, working with old data", then we'll perform handoff, and only worry about "new Nexus working with old + new data, which it can migrate"

Regardless, there are a bunch of strategies we could use for doing "fact payload" schema migration:

- We could use the existing DB migration tools to perform "data-only" migrations (look at all fm_case_fact rows where diagnosis_engine = x and payload->variant = y, and rewrite the payload).
- We could rely on the regeneration of sitreps to have a phase where we load old facts and update them to the new fact format. For example, CaseFact::VariantFoo1 could be read and updated in memory to CaseFact::VariantFoo2, which gets written out in the next sitrep.
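
To make the second option concrete, here is a minimal sketch of an in-memory upgrade during copy-forward. The variant names come from the comment above; their fields, the `upgrade` method, and `copy_forward` are invented for illustration and say nothing about the real payload types.

```rust
/// Hypothetical fact payload with two generations of the same variant.
#[derive(Debug, Clone, PartialEq)]
enum CaseFact {
    /// Old shape: health stored as a free-form string.
    VariantFoo1 { zpool: String, health: String },
    /// New shape: health reduced to a boolean.
    VariantFoo2 { zpool: String, online: bool },
}

impl CaseFact {
    /// Upgrade an old payload to the newest shape; new payloads pass through.
    fn upgrade(self) -> CaseFact {
        match self {
            CaseFact::VariantFoo1 { zpool, health } => CaseFact::VariantFoo2 {
                zpool,
                online: health == "online",
            },
            newest @ CaseFact::VariantFoo2 { .. } => newest,
        }
    }
}

/// Facts loaded from the parent sitrep are upgraded before being written
/// into the next one.
fn copy_forward(parent: Vec<CaseFact>) -> Vec<CaseFact> {
    parent.into_iter().map(CaseFact::upgrade).collect()
}
```

With this approach the old variant only needs to stay readable until every live sitrep has been regenerated at least once.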

Schema migration: add-disk-de-and-facts (version 260) adds the 'disk' enum value and creates fm_case_fact.

andrewjstone · 2026-05-19T17:43:45Z

> Nexus is performing an atomic handoff from "old" to "new" before the database can be accessed, right? I don't think we need to worry about a mixed-version Nexus scenario - I believe we'll have "old Nexus, working with old data", then we'll perform handoff, and only worry about "new Nexus working with old + new data, which it can migrate"

Ah, I must be misunderstanding how payloads get populated. I was presuming that it's possible for the ingester of the payload to write to the database without actually knowing the format of the payload. But if we limit ingestion of new payloads until Nexus is updated, then I agree there is no problem.

hawkw

Here's an incomplete review focusing on the database models and domain types; I haven't actually gotten as far as the actual diagnosis engine yet. I figured it would be more useful to leave a smaller review sooner rather than waiting to get to the "other half" of this PR.

hawkw · 2026-05-19T17:36:46Z


+    /// Fetch all `fm_case_fact` rows belonging to cases in the given sitrep,
+    /// grouped by `case_id`.
+    async fn fm_case_facts_read_on_conn(


Hm, I still think it's probably worth including in the DB record as a structured field, even if only for debugging reasons for now.

Also, at some point, I think we are going to probably have to figure out a way to allow multiple DEs to add facts to a case, although we don't have to cross that bridge yet. Consider the example of an ereport.data_loss.possible ereport indicating that a service processor has restarted and will need to be health-checked, as described in RFD 589. Suppose we have a trivial DE for handling data loss reports from SPs by doing a complete health check of that SP. This might open a case, and then request additional health checking of that DE, which might record some facts. Suppose one of those facts includes data that another DE would use to diagnose a fault. We should figure out how that flow will work, although we don't have to in this PR...

hawkw · 2026-05-19T18:25:27Z

+                writeln!(f, "{:>indent$}{PAYLOAD:<WIDTH$} {payload}", "")?;
+                writeln!(f, "{:>indent$}{COMMENT:<WIDTH$} {comment}\n", "")?;


nit: i might put the comment before the payload, and also consider making the JSON multiline...though you might have to indent it nicely to make it not look bad.

Sure, patched in 58c780f

hawkw · 2026-05-19T18:26:09Z

+            for CaseFact { id, payload, comment } in facts.iter() {
+                const PAYLOAD: &str = "payload:";
+                const COMMENT: &str = "comment:";
+                const WIDTH: usize = const_max_len(&[PAYLOAD, COMMENT]);
+
+                writeln!(f, "{BULLET:>indent$}fact {id}")?;
+                writeln!(f, "{:>indent$}{PAYLOAD:<WIDTH$} {payload}", "")?;
+                writeln!(f, "{:>indent$}{COMMENT:<WIDTH$} {comment}\n", "")?;


nit: i would love to have an indented displayer for facts and make this just call that for each fact, since we might want to use that elsewhere. not a huge deal though.

Refactored in 58c780f
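
For context, one possible shape for such an indented displayer is sketched below. This is not the code from 58c780f: the `Fact` fields mirror the ones destructured in the quoted snippet, while `DisplayIndented`, `display_indented`, and the `u32` ID are assumptions for the example. It puts the comment before the payload and indents each payload line, so multi-line JSON stays aligned.

```rust
use std::fmt;

/// Hypothetical fact shape, mirroring the fields in the quoted snippet.
struct Fact {
    id: u32,
    payload: String,
    comment: String,
}

/// Wrapper that renders a fact at a given indentation level.
struct DisplayIndented<'a> {
    fact: &'a Fact,
    indent: usize,
}

impl Fact {
    fn display_indented(&self, indent: usize) -> DisplayIndented<'_> {
        DisplayIndented { fact: self, indent }
    }
}

impl fmt::Display for DisplayIndented<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let Fact { id, payload, comment } = self.fact;
        let indent = self.indent;
        // Padding an empty string to `indent` columns yields the indentation.
        writeln!(f, "{:indent$}fact {id}", "")?;
        writeln!(f, "{:indent$}  comment: {comment}", "")?;
        writeln!(f, "{:indent$}  payload:", "")?;
        // Indent every payload line so multi-line JSON stays aligned.
        for line in payload.lines() {
            writeln!(f, "{:indent$}    {line}", "")?;
        }
        Ok(())
    }
}
```

A caller formatting a case could then write `fact.display_indented(indent)` for each fact rather than repeating the `writeln!` calls.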

hawkw · 2026-05-19T18:26:46Z

+CREATE TABLE IF NOT EXISTS omicron.public.fm_case_fact (
+    id UUID NOT NULL,
+    sitrep_id UUID NOT NULL,
+    case_id UUID NOT NULL,


would like a created_sitrep_id column here:

Suggested change

-    case_id UUID NOT NULL,
+    case_id UUID NOT NULL,
+    created_sitrep_id UUID NOT NULL,

Fixed in dc03b8e9

hawkw · 2026-05-19T18:28:36Z

        let mut support_bundles_requested = Vec::new();
        let mut bundle_data_selections_requested = Vec::new();
        let mut case_ereports = Vec::new();
+        let mut case_facts = Vec::new();


would be nice to be able to with_capacity this to be as long as the case's facts map...but i also notice we are not doing this for any of the other ones so it's kinda fine i guess...
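
A tiny sketch of what that preallocation could look like, assuming a case exposes its facts as a map with a known length. `Case` and `fact_rows` here are hypothetical and much simpler than the real insertion path.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a case holding a map of facts.
struct Case {
    facts: BTreeMap<u32, String>,
}

/// Flatten one case's facts into rows, reserving the space up front.
fn fact_rows(case: &Case) -> Vec<(u32, String)> {
    // The final length is known, so the pushes below never reallocate.
    let mut case_facts = Vec::with_capacity(case.facts.len());
    for (id, payload) in &case.facts {
        case_facts.push((*id, payload.clone()));
    }
    case_facts
}
```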

hawkw · 2026-05-19T18:29:57Z

+    /// Open cases from the parent sitrep, copied forward into this analysis
+    /// input. Closed cases live separately on the (crate-private)
+    /// `closed_cases_copied_forward` accessor.
+    pub fn open_cases(&self) -> &IdOrdMap<fm::Case> {


probably just me being extremely persnickety but i would have kind of rather we refactor this in a separate smaller PR. not a big deal though.

smklein commented May 19, 2026

smklein force-pushed the fm-disk-diagnoser branch 2 times, most recently from a3cddcc to 26f2ade, May 19, 2026 01:28

andrewjstone reviewed May 19, 2026

smklein force-pushed the fm-disk-diagnoser branch from 26f2ade to 67b661f, May 19, 2026 16:28

smklein force-pushed the fm-disk-diagnoser branch from 67b661f to 793b1ec, May 19, 2026 17:12

hawkw self-requested a review May 19, 2026 17:31

hawkw reviewed May 19, 2026

smklein added 5 commits May 19, 2026 12:07

- dfabfb4 Debuggable facts
- 7555b0b more logging
- bcf5b68 CaseFact -> Fact, add created_sitrep_id to facts, fm- prefix
- a415d27 payload_to, deserialization error context
- 58c780f case formatting
