(Claude-generated bug report follows)
cudf-polars: NVMLError_NoPermission on MIG-partitioned GPUs
Status: present in cudf-polars-cu12 26.4.0 with Polars 1.40.1. pl.GPUEngine raises polars.exceptions.ComputeError: 'cuda' conversion failed: NVMLError_NoPermission: Insufficient Permissions on every .collect() when the visible device is a MIG slice.
Environment
- Driver: 580.126.20 (CUDA 13.0)
- GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition with MIG mode enabled (1g.24gb slices). CUDA_VISIBLE_DEVICES=MIG-<UUID> pointed at one slice.
- Software: cudf-polars-cu12==26.4.0, polars==1.40.1, Python 3.13.
- Plain cudf.DataFrame(...).sum() works in the same environment, so this is specific to the cudf-polars callback path, not RMM/cuDF itself.
Repro

```python
import polars as pl

gpu = pl.GPUEngine(raise_on_fail=True, executor="streaming")
df = pl.DataFrame({"a": [1, 2, 3, 4, 5]})
df.lazy().sum().collect(engine=gpu)
# polars.exceptions.ComputeError:
# 'cuda' conversion failed: NVMLError_NoPermission: Insufficient Permissions
```

Run with CUDA_VISIBLE_DEVICES=MIG-<uuid-of-a-slice>.
Root cause

cudf_polars/utils/config.py::get_device_handle (lines 70–92 in 26.4.0) detects when the visible device is a MIG slice and upgrades the NVML handle from the MIG slice to its parent device:

```python
handle = pynvml.nvmlDeviceGetHandleByUUID(str.encode(index))
if pynvml.nvmlDeviceIsMigDeviceHandle(handle):
    # Additionally get parent device handle
    # if the device itself is a MIG instance
    handle = pynvml.nvmlDeviceGetDeviceHandleFromMigDeviceHandle(handle)
```
get_total_device_memory then calls pynvml.nvmlDeviceGetMemoryInfo(parent_handle) to size the RMM pool. From inside a MIG sandbox, querying the parent device's memory is privileged: it raises NVMLError_NoPermission, not NVMLError_NotSupported, so the existing except pynvml.NVMLError_NotSupported clause doesn't catch it. The exception propagates, the cudf-polars callback fails to register, and Polars surfaces the error as a ComputeError at .collect() time.
The upgrade was presumably added to support nested MIG handles or to obtain a "canonical" handle, but on the GetMemoryInfo path it goes the wrong direction: the MIG handle itself answers GetMemoryInfo without complaint, and the value it returns (the slice's memory, not the full parent's) is the correct number for sizing the RMM pool, because RMM allocates inside the slice, not the parent.
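To make the sizing point concrete, here is a small sketch with hypothetical numbers (the 0.8 pool fraction and the parent total are illustrative, not cudf-polars' actual values): whichever handle answers GetMemoryInfo sets the pool ceiling, so the slice's figure is the only safe one.

```python
SLICE_TOTAL = 24 << 30   # 1g.24gb slice: ~24 GiB usable by this process
PARENT_TOTAL = 96 << 30  # hypothetical full-device total

def rmm_pool_size(total_memory: int, fraction: float = 0.8) -> int:
    """Size an RMM pool as a fraction of reported device memory.

    `total_memory` is whatever nvmlDeviceGetMemoryInfo(...).total
    returned -- slice or parent, depending on which handle was queried.
    """
    return int(total_memory * fraction)

# Sizing from the slice stays within what the sandbox can allocate;
# sizing from the parent would request far more memory than the slice has.
```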
Verification

Removing the upgrade (using the MIG handle directly) makes the bug go away:

```diff
  handle = pynvml.nvmlDeviceGetHandleByUUID(str.encode(index))
- if pynvml.nvmlDeviceIsMigDeviceHandle(handle):
-     # Additionally get parent device handle
-     # if the device itself is a MIG instance
-     handle = pynvml.nvmlDeviceGetDeviceHandleFromMigDeviceHandle(handle)
```

After the patch, the repro above succeeds and returns:

```text
shape: (1, 1)
┌─────┐
│ a   │
╞═════╡
│ 15  │
└─────┘
```
Suggested fix

Drop the parent-handle upgrade in get_device_handle. If the upgrade was needed somewhere else, scope it to that call rather than caching the parent handle for everyone (get_device_handle is cached with functools.cache, so any caller that reuses it inherits the parent-handle problem).
A minimal alternative is to broaden the exception clause in get_total_device_memory so that NVMLError_NoPermission also returns None (treating "we can't size the pool" as an absent device-memory signal). That hides the bug in the pool-sizing path but doesn't fix it for any future caller of get_device_handle().
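The broadened fallback might look like this (the NVML error classes are stubbed here so the pattern can be run without a GPU; real code would catch the pynvml classes of the same names):

```python
class NVMLError_NotSupported(Exception):
    """Stand-in for pynvml.NVMLError_NotSupported."""

class NVMLError_NoPermission(Exception):
    """Stand-in for pynvml.NVMLError_NoPermission."""

def get_total_device_memory(query_memory_info):
    """Return total device memory in bytes, or None when NVML refuses.

    `query_memory_info` stands in for pynvml.nvmlDeviceGetMemoryInfo
    bound to a handle; it returns an object with a `.total` attribute.
    """
    try:
        return query_memory_info().total
    except (NVMLError_NotSupported, NVMLError_NoPermission):
        # Unsupported OR unprivileged (e.g. a parent-device query from
        # inside a MIG sandbox) both mean "we can't size the pool".
        return None
```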
Other affected configurations
Likely affects any MIG-mode GPU under cudf-polars. Tested only on
RTX PRO 6000 Blackwell, but the issue is structural in NVML's
permission model — nvmlDeviceGetMemoryInfo on the parent device
from inside a MIG sandbox is documented as restricted.
Notes
- cudf proper (without the -polars callback) does not hit this path; it sizes RMM differently.
- The benchmarks-side helper at cudf_polars/experimental/benchmarks/utils.py has the same NVML dance but catches NVMLError_NotSupported only, so it would presumably hit the same wall on MIG. Not exercised in this repro but worth fixing in the same PR.