[CUDA] JIT-compile qmm_naive by zcbenz · Pull Request #3576 · ml-explore/mlx

zcbenz · 2026-05-21T22:59:46Z

Bundle the headers of CUTLASS and JIT-compile the qmm_naive kernels, which reduces the binary size (#3567) and is required for meeting the size limit of PyPI. Most of the changes are moving code to backend/cuda/device and reducing uses of advanced C++ features to make NVRTC happy.

An unfortunate side effect is that test_quantized.py now takes half an hour to run. I will see if I can make some sub-tests run in parallel or add proper caching in CI.
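
As a rough illustration of the caching idea, here is a minimal sketch of a content-addressed cache key for JIT-compiled kernels. It is not code from this PR: `fnv1a64` and `jit_cache_key` are hypothetical names, and the assumption is that a compiled kernel can be reused whenever its source, target architecture, and compile options are all unchanged.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <string_view>
#include <vector>

// 64-bit FNV-1a. Unlike std::hash, the result is the same across runs and
// platforms, which matters if the key names a file in a persistent cache.
inline std::uint64_t fnv1a64(
    std::string_view data,
    std::uint64_t hash = 0xcbf29ce484222325ull) {
  for (unsigned char c : data) {
    hash ^= c;
    hash *= 0x100000001b3ull;
  }
  return hash;
}

// Hypothetical helper (not part of MLX): folds everything assumed to affect
// the compiled output into one 16-character hex key.
inline std::string jit_cache_key(
    std::string_view source,
    std::string_view arch,
    const std::vector<std::string>& options) {
  std::uint64_t h = fnv1a64(source);
  // A separator byte keeps ("ab", "c") and ("a", "bc") from hashing alike.
  h = fnv1a64("\x1f", h);
  h = fnv1a64(arch, h);
  for (const auto& opt : options) {
    h = fnv1a64("\x1f", h);
    h = fnv1a64(opt, h);
  }
  char buf[17];
  std::snprintf(
      buf, sizeof(buf), "%016llx", static_cast<unsigned long long>(h));
  return std::string(buf);
}
```

A CI job could then look up a file named after `jit_cache_key(source, "sm_90", {"-O3"})` in a cache directory restored between runs, and only invoke the compiler on a miss. Whether this matches how the existing JIT cache in the CUDA backend is keyed is not stated in the PR, so treat the inputs to the key as an assumption.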

[CUDA] JIT-compile qmm_naive

53f9633

zcbenz force-pushed the cutlass-jit branch from e2cfcec to 53f9633 on May 21, 2026 23:03

zcbenz wants to merge 1 commit into ml-explore:main from zcbenz:cutlass-jit
