[BUG] cudf-polars error in a MIG sandbox #22388

@mlubin

(Claude-generated bug report follows)

cudf-polars: NVMLError_NoPermission on MIG-partitioned GPUs

Status: present in cudf-polars-cu12 26.4.0 with Polars 1.40.1.

pl.GPUEngine raises polars.exceptions.ComputeError: 'cuda' conversion failed: NVMLError_NoPermission: Insufficient Permissions on every
.collect() when the visible device is a MIG slice.

Environment

  • Driver: 580.126.20 (CUDA 13.0)
  • GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition with MIG mode
    enabled (1g.24gb slices).
  • CUDA_VISIBLE_DEVICES=MIG-<UUID> pointed at one slice.
  • cudf-polars-cu12==26.4.0, polars==1.40.1, Python 3.13.

Plain cudf.DataFrame(...).sum() works in the same environment, so
this is specific to the cudf-polars callback path, not RMM/cuDF
itself.

Repro

import polars as pl
gpu = pl.GPUEngine(raise_on_fail=True, executor="streaming")
df = pl.DataFrame({"a": [1, 2, 3, 4, 5]})
df.lazy().sum().collect(engine=gpu)
# polars.exceptions.ComputeError:
#   'cuda' conversion failed: NVMLError_NoPermission: Insufficient Permissions

Run with CUDA_VISIBLE_DEVICES=MIG-<uuid-of-a-slice>.

Root cause

cudf_polars/utils/config.py::get_device_handle (lines 70–92 in 26.4.0)
detects when the visible device is a MIG slice and upgrades the
NVML handle from the MIG slice to its parent device:

handle = pynvml.nvmlDeviceGetHandleByUUID(str.encode(index))
if pynvml.nvmlDeviceIsMigDeviceHandle(handle):
    # Additionally get parent device handle
    # if the device itself is a MIG instance
    handle = pynvml.nvmlDeviceGetDeviceHandleFromMigDeviceHandle(handle)

get_total_device_memory then calls
pynvml.nvmlDeviceGetMemoryInfo(parent_handle) to size the RMM pool.
From inside a MIG sandbox, querying the parent device's memory is
privileged: it raises NVMLError_NoPermission, not
NVMLError_NotSupported, so the existing
except pynvml.NVMLError_NotSupported clause doesn't catch it. The
exception propagates, the cudf-polars callback fails to register, and
Polars surfaces the error as a ComputeError at .collect() time.

The upgrade was presumably added to support nested MIG handles or to
get a "canonical" handle, but on the GetMemoryInfo path it's the wrong
direction: the MIG handle itself answers GetMemoryInfo from inside the
sandbox, and the value returned (the slice's memory, not the full
parent's) is the correct number to use when sizing the RMM pool,
because RMM allocates inside the slice, not the parent.
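
The difference between the two handles can be checked against NVML directly, without Polars in the loop. The following is a minimal sketch, not code from cudf-polars: probe_memory is a hypothetical helper, and it takes the NVML module as a parameter (pynvml on a real system) so it can also be exercised with a stand-in object on a machine without a GPU.

```python
from typing import Any


def probe_memory(nvml: Any, uuid: str) -> dict[str, Any]:
    """Query memory on the visible handle and, for a MIG slice, on its parent.

    `nvml` is assumed to be the `pynvml` module or an object exposing the
    same functions. Each entry is the total memory in bytes, or the name of
    the NVML exception class raised by the query.
    """

    def total_or_error(handle: Any) -> Any:
        try:
            return nvml.nvmlDeviceGetMemoryInfo(handle).total
        except nvml.NVMLError as exc:
            return type(exc).__name__

    handle = nvml.nvmlDeviceGetHandleByUUID(uuid.encode())
    result = {"visible": total_or_error(handle)}
    if nvml.nvmlDeviceIsMigDeviceHandle(handle):
        # Same upgrade that get_device_handle performs today.
        parent = nvml.nvmlDeviceGetDeviceHandleFromMigDeviceHandle(handle)
        result["parent"] = total_or_error(parent)
    return result
```

On a real system, call pynvml.nvmlInit() first and pass the MIG-<UUID> value from CUDA_VISIBLE_DEVICES as uuid. If the analysis in this report holds, "visible" should contain the slice's memory and "parent" should contain "NVMLError_NoPermission".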

Verification

Removing the upgrade (using the MIG handle directly) makes the bug go
away:

             handle = pynvml.nvmlDeviceGetHandleByUUID(str.encode(index))
-            if pynvml.nvmlDeviceIsMigDeviceHandle(handle):
-                # Additionally get parent device handle
-                # if the device itself is a MIG instance
-                handle = pynvml.nvmlDeviceGetDeviceHandleFromMigDeviceHandle(handle)

After the patch, the repro above succeeds and returns a frame of
shape (1, 1) whose single column a holds the value 15.

Suggested fix

Drop the parent-handle upgrade in get_device_handle. If the upgrade
was needed somewhere else, scope it to that call rather than caching
the parent handle for everyone (get_device_handle is decorated with
@functools.cache, so any caller that reuses it inherits the
parent-handle problem).
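
One way to scope the upgrade, sketched under assumptions: keep the cached function returning the handle for the visible device, and derive the parent handle in a separate, uncached helper that only the callers needing it invoke. visible_handle and parent_handle are hypothetical names and the nvml parameter stands in for the pynvml module; neither function exists in cudf-polars.

```python
import functools
from typing import Any


@functools.cache
def visible_handle(nvml: Any, uuid: str) -> Any:
    """Return the NVML handle for the visible device, MIG slice or not."""
    return nvml.nvmlDeviceGetHandleByUUID(uuid.encode())


def parent_handle(nvml: Any, uuid: str) -> Any:
    """Return the parent GPU handle for a MIG slice, else the handle itself.

    Not cached together with visible_handle, so callers that only need the
    slice never receive the parent handle.
    """
    handle = visible_handle(nvml, uuid)
    if nvml.nvmlDeviceIsMigDeviceHandle(handle):
        return nvml.nvmlDeviceGetDeviceHandleFromMigDeviceHandle(handle)
    return handle
```

With this split, memory sizing would use visible_handle, and any code that genuinely needs the physical GPU would opt in through parent_handle.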

A minimal alternative is to broaden the exception clause in
get_total_device_memory so that NVMLError_NoPermission also returns
None (treating "we can't size the pool" as an absent device-memory
signal). That hides the bug in the pool-sizing path but doesn't fix
it for any future caller of get_device_handle().
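
A sketch of that minimal alternative, assuming both exception classes are exposed as attributes of the NVML module, as pynvml does. total_device_memory is a hypothetical stand-in for get_total_device_memory, with the module and handle passed in explicitly to keep the example self-contained.

```python
from typing import Any, Optional


def total_device_memory(nvml: Any, handle: Any) -> Optional[int]:
    """Return total device memory in bytes, or None if NVML cannot report it."""
    try:
        return nvml.nvmlDeviceGetMemoryInfo(handle).total
    except (nvml.NVMLError_NotSupported, nvml.NVMLError_NoPermission):
        # NotSupported is the case the existing clause already handles.
        # NoPermission covers a parent-device query from inside a MIG sandbox.
        return None
```

Any other NVML error still propagates, so unrelated failures are not silently swallowed.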

Other affected configurations

Likely affects any MIG-mode GPU under cudf-polars. Tested only on
RTX PRO 6000 Blackwell, but the issue is structural in NVML's
permission model — nvmlDeviceGetMemoryInfo on the parent device
from inside a MIG sandbox is documented as restricted.

Notes

  • cudf proper (without the -polars callback) does not hit this
    path; it sizes RMM differently.
  • The benchmarks-side helper at
    cudf_polars/experimental/benchmarks/utils.py has the same NVML
    dance but catches NVMLError_NotSupported only, so it would
    presumably hit the same wall on MIG. Not exercised in this repro
    but worth fixing in the same PR.
