
generalize crate for multi-device PCIe passthrough #1573

Open
cheese-head wants to merge 3 commits into NVIDIA:main from cheese-head:feat/vfio-multi-device-passthrough

Conversation

@cheese-head

Summary

Generalize openshell-vfio beyond GPU-only / single-device passthrough so it can serve as the binding/validation primitive layer the VM driver needs to implement RFC-0004's resource_requirements model. Adds atomic IOMMU-group binding, dry-run validation for ValidateSandboxCreate-style paths, class-agnostic device enumeration, and a correctness fix for partially-failed binds. Purely additive plus one bug fix; consumer crates are unchanged.

Related Issue

Foundational for RFC-0004 sandbox resource requirements (#1360). Unblocks multi-device-per-sandbox passthrough that the existing single-device API could not express when devices shared an IOMMU group (consumer GPUs + HDA + USB-C, multi-PF NICs, devices behind ACS-deficient PCIe switches).

Changes

  • Add prepare_pci_group_for_passthrough / release_pci_group_from_passthrough for atomic bind/release of multiple PCIe devices sharing one IOMMU group. Rollback only restores devices newly bound by the call, so it does not steal bindings owned by other guards.
  • Add validate_pci_for_passthrough / validate_pci_group_for_passthrough as dry-run pre-flight checks for ValidateSandboxCreate-style paths. They perform every structural and IOMMU-peer check without touching driver_override or any other kernel state. Each prepare_* function now delegates to its validate counterpart to keep the two in lockstep.
  • Add probe_host_vfio_candidates(sysfs, vendor_filter) for vendor-filtered, class-agnostic enumeration of passthrough-eligible PCI devices so consumers can advertise DeviceClassCapability for arbitrary classes (GPUs, NICs, VFs) instead of being limited to probe_host_nvidia_vfio_readiness.
  • Add PciBindGuard::companion_bdfs() accessor for consumer-side persistence of grouped bindings (crash-recovery state, status reporting).
  • Add VfioError::GroupMismatch and VfioError::EmptyGroup for typed validation responses.
  • Fix bind_device_to_vfio to clear driver_override and re-probe the host driver on drivers_probe failure and on post-probe polling timeout. Previously a failed bind could leave the device wedged with driver_override="vfio-pci" pinned on disk, causing the next probe event to silently re-bind to vfio-pci.
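The rollback rule for grouped binds (restore only the devices newly bound by the failing call) can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the crate's implementation: `FakeHost`, `bind_vfio`, and `bind_group` are hypothetical names invented for the example, and a `HashMap` stands in for sysfs driver state.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for sysfs driver state: BDF -> bound driver.
#[derive(Debug, Default)]
pub struct FakeHost {
    pub drivers: HashMap<String, String>,
    /// BDFs whose bind attempt should fail (illustration hook only).
    pub fail_on: Vec<String>,
}

impl FakeHost {
    fn bind_vfio(&mut self, bdf: &str) -> Result<(), String> {
        if self.fail_on.iter().any(|b| b == bdf) {
            return Err(format!("bind failed for {bdf}"));
        }
        self.drivers.insert(bdf.to_string(), "vfio-pci".to_string());
        Ok(())
    }
}

/// Binds every device in `bdfs` to vfio-pci, or leaves the host as it was.
/// On success, returns the BDFs that this call newly bound.
pub fn bind_group(host: &mut FakeHost, bdfs: &[&str]) -> Result<Vec<String>, String> {
    if bdfs.is_empty() {
        return Err("empty group".to_string());
    }
    // (bdf, previous driver) for each device this call actually changed.
    let mut newly_bound: Vec<(String, Option<String>)> = Vec::new();
    for &bdf in bdfs {
        let prev = host.drivers.get(bdf).cloned();
        if prev.as_deref() == Some("vfio-pci") {
            // Already bound before this call: not ours to roll back.
            continue;
        }
        if let Err(e) = host.bind_vfio(bdf) {
            // Undo only what this call did, in reverse order.
            for (done, prev) in newly_bound.into_iter().rev() {
                match prev {
                    Some(driver) => {
                        host.drivers.insert(done, driver);
                    }
                    None => {
                        host.drivers.remove(&done);
                    }
                }
            }
            return Err(e);
        }
        newly_bound.push((bdf.to_string(), prev));
    }
    Ok(newly_bound.into_iter().map(|(bdf, _)| bdf).collect())
}
```

In this model, a peer that was already on vfio-pci before the call is never recorded in `newly_bound`, so a failure partway through cannot hand it back to a host driver.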

Testing

  • mise run pre-commit passes
  • cargo test -p openshell-vfio passes (52/52, up from 32)
  • cargo clippy -p openshell-vfio --all-targets -- -D warnings clean
  • cargo check -p openshell-driver-vm clean (consumer crate compiles unchanged)
  • Unit tests added (20 new), covering: atomic group bind happy path, rollback safety when a pre-bound peer must not be stolen, rejection of undeclared peers / mixed groups / duplicates / missing devices / empty slices, batch release continues after per-device errors, dry-run validators write no kernel state, and vendor-filtered probing including skip-on-no-IOMMU-group.
  • E2E tests added/updated (N/A — primitive crate, exercised end-to-end by consumer crates which are out of scope for this PR)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable) — N/A, no public surface in docs/ covers openshell-vfio today

Signed-off-by: Patrick Riel <priel@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@drew drew requested a review from elezar May 26, 2026 20:58
@drew
Collaborator

drew commented May 26, 2026

@elezar adding this for your review as well since it's adjacent to gpu work

…ng restart reconciliation, without rebinding or mutating sysfs.

Signed-off-by: Patrick Riel <priel@nvidia.com>
Comment thread crates/openshell-vfio/src/gpu.rs Outdated
Comment on lines +96 to +102
let vendor = read_sysfs_trimmed(&dev_dir.join("vendor"))?;
if vendor != NVIDIA_VENDOR_ID {
    return Err(VfioError::NotNvidia {
        bdf: bdf.to_string(),
        vendor,
    });
}
Member

Question: Is this code NVIDIA-specific? If so, we may want to update the function name.

static VFIO_ID_REFCOUNTS: LazyLock<Mutex<HashMap<String, usize>>> =
    LazyLock::new(|| Mutex::new(HashMap::new()));

pub(crate) fn current_driver_name(sysfs: &SysfsRoot, bdf: &str) -> Option<String> {
Member

We introduced an abstraction over sysfs paths to enable testing. Does an abstraction for a device also make sense? We have a number of places where we:

  1. Get the path for a specific bdf
  2. Append a path to it
  3. Read a link or contents of a file to determine a property.

Would hiding this behind named methods be useful?
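One possible shape for such a device abstraction is sketched below. It is an illustration only: `PciDevice` and its methods are hypothetical names, not part of openshell-vfio, and the sketch takes a plain `Path` for the sysfs root rather than the crate's `SysfsRoot` type, whose API is not shown in this thread. The attribute names (`vendor`, `driver`, `iommu_group`) are the standard entries under a PCI device directory in Linux sysfs.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical handle for one PCI device under a sysfs-like root.
pub struct PciDevice {
    dir: PathBuf,
}

impl PciDevice {
    /// `sysfs_root` would be "/sys" in production or a temp dir in tests.
    pub fn new(sysfs_root: &Path, bdf: &str) -> Self {
        Self {
            dir: sysfs_root.join("bus/pci/devices").join(bdf),
        }
    }

    /// Reads an attribute file and trims surrounding whitespace.
    fn read_trimmed(&self, attr: &str) -> io::Result<String> {
        Ok(fs::read_to_string(self.dir.join(attr))?.trim().to_string())
    }

    /// Returns the final path component of a symlink attribute, if present.
    fn link_basename(&self, attr: &str) -> Option<String> {
        let target = fs::read_link(self.dir.join(attr)).ok()?;
        Some(target.file_name()?.to_string_lossy().into_owned())
    }

    /// Vendor ID as written by the kernel, e.g. "0x10de".
    pub fn vendor(&self) -> io::Result<String> {
        self.read_trimmed("vendor")
    }

    /// Name of the currently bound driver, or None if unbound.
    pub fn driver_name(&self) -> Option<String> {
        self.link_basename("driver")
    }

    /// IOMMU group number as a string, or None if the device has no group.
    pub fn iommu_group(&self) -> Option<String> {
        self.link_basename("iommu_group")
    }
}
```

With a wrapper like this, call sites ask for `device.driver_name()` instead of repeating the join, read, and trim steps, and the path layout lives in one place.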

Signed-off-by: Patrick Riel <priel@nvidia.com>