[BUG] cudf-polars error in a MIG sandbox #22388

(Claude-generated bug report follows)

# cudf-polars: `NVMLError_NoPermission` on MIG-partitioned GPUs

**Status:** present in `cudf-polars-cu12` 26.4.0 with Polars 1.40.1.

`pl.GPUEngine` raises `polars.exceptions.ComputeError: 'cuda' conversion
failed: NVMLError_NoPermission: Insufficient Permissions` on every
`.collect()` when the visible device is a MIG slice.

## Environment

- Driver: `580.126.20` (CUDA 13.0)
- GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition with MIG mode
  enabled (`1g.24gb` slices).
- `CUDA_VISIBLE_DEVICES=MIG-<UUID>` pointed at one slice.
- `cudf-polars-cu12==26.4.0`, `polars==1.40.1`, Python 3.13.

Plain `cudf.DataFrame(...).sum()` works in the same environment, so
this is specific to the cudf-polars callback path, not RMM/cuDF
itself.

## Repro

```python
import polars as pl
gpu = pl.GPUEngine(raise_on_fail=True, executor="streaming")
df = pl.DataFrame({"a": [1, 2, 3, 4, 5]})
df.lazy().sum().collect(engine=gpu)
# polars.exceptions.ComputeError:
#   'cuda' conversion failed: NVMLError_NoPermission: Insufficient Permissions
```

Run with `CUDA_VISIBLE_DEVICES=MIG-<uuid-of-a-slice>`.
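Because the repro only triggers when a MIG slice is the visible device, a quick preflight check can confirm the environment before running it. This is a minimal sketch: the helper name `visible_mig_devices` is hypothetical, and it assumes MIG entries in `CUDA_VISIBLE_DEVICES` carry the `MIG-` prefix shown above.

```python
import os


def visible_mig_devices(env=None):
    """Return the CUDA_VISIBLE_DEVICES entries that name MIG slices.

    `env` defaults to os.environ; passing a plain dict makes this testable.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    # Entries are comma-separated; assume MIG slices start with "MIG-".
    entries = [entry.strip() for entry in raw.split(",")]
    return [entry for entry in entries if entry.startswith("MIG-")]


if __name__ == "__main__":
    found = visible_mig_devices()
    print(found if found else "no MIG slice visible; the repro will not trigger")
```

If the list is empty, the handle upgrade described under "Root cause" is never taken and the error should not appear.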

## Root cause

`cudf_polars/utils/config.py::get_device_handle` (lines 70–92 in 26.4.0)
detects when the visible device is a MIG slice and **upgrades** the
NVML handle from the MIG slice to its parent device:

```python
handle = pynvml.nvmlDeviceGetHandleByUUID(str.encode(index))
if pynvml.nvmlDeviceIsMigDeviceHandle(handle):
    # Additionally get parent device handle
    # if the device itself is a MIG instance
    handle = pynvml.nvmlDeviceGetDeviceHandleFromMigDeviceHandle(handle)
```

`get_total_device_memory` then calls
`pynvml.nvmlDeviceGetMemoryInfo(parent_handle)` to size the RMM pool.
From inside a MIG sandbox, querying the parent device's memory is
privileged: it raises `NVMLError_NoPermission`, not
`NVMLError_NotSupported`, so the existing
`except pynvml.NVMLError_NotSupported` clause doesn't catch it. The
exception propagates, the cudf-polars callback fails to register, and
Polars surfaces the error as a `ComputeError` at `.collect()` time.
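The exception mismatch can be illustrated without a GPU. The sketch below uses stand-in classes, not the real `pynvml` ones, and assumes that `pynvml`'s per-error exceptions are sibling subclasses of `NVMLError`. The function names are hypothetical.

```python
# Stand-in exception classes (hypothetical) mirroring the assumed pynvml layout.
class NVMLError(Exception):
    """Base class for all NVML errors."""


class NVMLError_NotSupported(NVMLError):
    """Raised when the device cannot answer the query at all."""


class NVMLError_NoPermission(NVMLError):
    """Raised when the caller is not allowed to make the query."""


def query_parent_memory():
    # Stand-in for nvmlDeviceGetMemoryInfo(parent_handle) inside a MIG sandbox.
    raise NVMLError_NoPermission("Insufficient Permissions")


def total_memory_narrow():
    # Same shape as the existing clause: only NotSupported is handled.
    try:
        return query_parent_memory()
    except NVMLError_NotSupported:
        return None


def outcome():
    # Report which exception, if any, escapes the narrow handler.
    try:
        total_memory_narrow()
    except NVMLError as exc:
        return type(exc).__name__
    return "handled"
```

Since the two error classes are siblings, the narrow `except` never matches and `NVMLError_NoPermission` escapes to the caller.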

The upgrade was presumably added to support nested MIG handles or to
get a "canonical" handle, but on the `nvmlDeviceGetMemoryInfo` path it
goes in the wrong direction. The MIG handle itself answers
`nvmlDeviceGetMemoryInfo`, and the value it returns (the slice's
memory, not the full parent's) is the correct number to use when
sizing the RMM pool, because RMM allocates inside the slice, not the
parent.

## Verification

Removing the upgrade (using the MIG handle directly) makes the bug go
away:

```diff
             handle = pynvml.nvmlDeviceGetHandleByUUID(str.encode(index))
-            if pynvml.nvmlDeviceIsMigDeviceHandle(handle):
-                # Additionally get parent device handle
-                # if the device itself is a MIG instance
-                handle = pynvml.nvmlDeviceGetDeviceHandleFromMigDeviceHandle(handle)
```

After the patch, the repro above succeeds and returns a
`shape: (1, 1)` frame whose single column `a` holds `15`.

## Suggested fix

Drop the parent-handle upgrade in `get_device_handle`. If the upgrade
was needed somewhere else, scope it to that call rather than caching
the parent handle for everyone (`get_device_handle` is
`@functools.cache`d, so any caller that reuses it inherits the
parent-handle problem).
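One possible shape for the scoped variant is sketched below. It is not the cudf-polars implementation: `make_handle_getters`, `device_handle`, and `parent_handle` are hypothetical names, and the NVML module is passed in as a parameter so the sketch can run against a test double. In real use, `pynvml` would be passed instead.

```python
import functools
from types import SimpleNamespace


def make_handle_getters(nvml):
    """Build handle lookups around an NVML-like module (e.g. pynvml)."""

    @functools.cache
    def device_handle(uuid: str):
        # Cache whatever NVML returns, MIG slice handles included.
        return nvml.nvmlDeviceGetHandleByUUID(uuid.encode())

    def parent_handle(uuid: str):
        # Opt-in upgrade for callers that really need the physical GPU.
        handle = device_handle(uuid)
        if nvml.nvmlDeviceIsMigDeviceHandle(handle):
            return nvml.nvmlDeviceGetDeviceHandleFromMigDeviceHandle(handle)
        return handle

    return device_handle, parent_handle


# Test double standing in for pynvml; handles are plain tuples here.
fake_nvml = SimpleNamespace(
    nvmlDeviceGetHandleByUUID=lambda raw: ("mig", raw),
    nvmlDeviceIsMigDeviceHandle=lambda handle: handle[0] == "mig",
    nvmlDeviceGetDeviceHandleFromMigDeviceHandle=lambda handle: ("gpu", handle[1]),
)
device_handle, parent_handle = make_handle_getters(fake_nvml)
```

With this split, the cached lookup never hands out a parent handle by default, and only callers that explicitly ask for `parent_handle` take on the permission risk.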

A minimal alternative is to broaden the exception clause in
`get_total_device_memory` so `NVMLError_NoPermission` also returns
`None` (treating "we can't size the pool" as an absent device-memory
signal). That hides the bug in the pool-sizing path but doesn't fix
it for any future caller of `get_device_handle()`.

## Other affected configurations

Likely affects any MIG-mode GPU under cudf-polars. Tested only on
RTX PRO 6000 Blackwell, but the issue is structural in NVML's
permission model — `nvmlDeviceGetMemoryInfo` on the parent device
from inside a MIG sandbox is documented as restricted.

## Notes

- `cudf` proper (without the `-polars` callback) does not hit this
  path; it sizes RMM differently.
- The benchmarks-side helper at
  `cudf_polars/experimental/benchmarks/utils.py` has the same NVML
  dance but catches `NVMLError_NotSupported` only, so it would
  presumably hit the same wall on MIG. Not exercised in this repro
  but worth fixing in the same PR.

