Add new radius search cuda kernels for geotransolver #1763

Draft
shrek wants to merge 11 commits into NVIDIA:main from shrek:add-new-radius-search-algo

shrek commented Jun 29, 2026

Contributor

PhysicsNeMo Pull Request

Description

This PR implements two new radius search kernels for geotransolver. In the benchmarks below, both outperform the current Warp-based radius search implementation. The kernels are written in CUDA C++ and wrapped with CuPy.

Compact Cell Points

In this algorithm, points are first hashed into cells whose side length equals the search radius. The per-cell counts are then prefix-summed, and the points in each cell are scattered into contiguous locations in an array. Finally, the radius search runs with one warp per query point: distances are computed to the points in the 27 cells surrounding the query point's cell (the cell itself and its 26 neighbors).
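The steps above can be sketched as a CPU reference in NumPy. This is only an illustration under stated assumptions, not the PR's CUDA kernels: a dense grid with a linear cell id stands in for the hash, a stable sort stands in for the scatter, and a plain Python loop replaces the one-warp-per-query launch. The names `cell_coords`, `build_compact_cells`, and `radius_search_compact` are hypothetical.

```python
import itertools
import numpy as np

def cell_coords(xyz, origin, radius):
    """Integer cell coordinate of each position; cells have side length `radius`."""
    return np.floor((xyz - origin) / radius).astype(np.int64)

def build_compact_cells(points, radius):
    origin = points.min(axis=0)
    coords = cell_coords(points, origin, radius)
    dims = coords.max(axis=0) + 1
    # Assumption: a dense grid with a linear cell id stands in for the hash.
    cell_id = (coords[:, 0] * dims[1] + coords[:, 1]) * dims[2] + coords[:, 2]
    counts = np.bincount(cell_id, minlength=int(dims.prod()))
    # Exclusive prefix sum: cell c owns packed[offsets[c]:offsets[c + 1]].
    offsets = np.concatenate(([0], np.cumsum(counts)))
    # Scatter step: a stable sort by cell id packs each cell contiguously.
    order = np.argsort(cell_id, kind="stable")
    return origin, dims, offsets, order

def radius_search_compact(points, queries, radius):
    """Return, per query, the sorted indices of points within `radius`."""
    origin, dims, offsets, order = build_compact_cells(points, radius)
    packed = points[order]
    results = []
    for q in queries:  # one warp per query in the PR; a plain loop here
        qc = cell_coords(q, origin, radius)
        hits = [np.empty(0, dtype=np.int64)]
        for d in itertools.product((-1, 0, 1), repeat=3):  # 27 cells
            c = qc + np.array(d)
            if np.any(c < 0) or np.any(c >= dims):
                continue  # neighbor cell lies outside the grid
            cid = (c[0] * dims[1] + c[1]) * dims[2] + c[2]
            lo, hi = offsets[cid], offsets[cid + 1]
            d2 = np.sum((packed[lo:hi] - q) ** 2, axis=1)
            hits.append(order[lo:hi][d2 <= radius * radius])
        results.append(np.sort(np.concatenate(hits)))
    return results
```

Because the cell side equals the radius, any point within the radius of a query differs from it by at most one cell index per axis, which is why checking 27 cells is sufficient.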

Morton Cell Points

In this approach, points are sorted by the Morton code of their radius-sized cell. A cell directory is then constructed that holds, for each cell, its point count and the offset of its points in a contiguous array. One warp then processes each query: the 27 adjacent cells are located by binary search in the Morton order, and distances are computed to the points in those cells.
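A CPU reference for the Morton variant might look like the following NumPy sketch. It is an illustration only, not the PR's kernels: the 10-bits-per-axis Morton code, the use of `np.unique` to build the cell directory, and the names `morton3d`, `build_morton_cells`, and `radius_search_morton` are all assumptions made for this example.

```python
import itertools
import numpy as np

def morton3d(coords, bits=10):
    """Interleave the low `bits` bits of x, y, z into one Morton code per row."""
    code = np.zeros(len(coords), dtype=np.int64)
    for b in range(bits):
        for axis in range(3):
            code |= ((coords[:, axis] >> b) & 1) << (3 * b + axis)
    return code

def build_morton_cells(points, radius, bits=10):
    origin = points.min(axis=0)
    coords = np.floor((points - origin) / radius).astype(np.int64)
    assert coords.max() < (1 << bits), "grid too large for this sketch"
    codes = morton3d(coords, bits)
    order = np.argsort(codes, kind="stable")  # Morton sort
    # Cell directory: one entry per occupied cell (code, start offset, count).
    cell_codes, cell_start, cell_count = np.unique(
        codes[order], return_index=True, return_counts=True)
    return origin, order, cell_codes, cell_start, cell_count

NEIGHBORS = np.array(list(itertools.product((-1, 0, 1), repeat=3)))

def radius_search_morton(points, queries, radius, bits=10):
    """Return, per query, the sorted indices of points within `radius`."""
    origin, order, cell_codes, cell_start, cell_count = build_morton_cells(
        points, radius, bits)
    packed = points[order]
    results = []
    for q in queries:  # one warp per query in the PR; a plain loop here
        qc = np.floor((q - origin) / radius).astype(np.int64)
        nbr = qc + NEIGHBORS
        nbr = nbr[np.all((nbr >= 0) & (nbr < (1 << bits)), axis=1)]
        codes = morton3d(nbr, bits)
        slots = np.searchsorted(cell_codes, codes)  # binary search per cell
        hits = [np.empty(0, dtype=np.int64)]
        for s, code in zip(slots, codes):
            if s < len(cell_codes) and cell_codes[s] == code:  # cell occupied
                lo = cell_start[s]
                hi = lo + cell_count[s]
                d2 = np.sum((packed[lo:hi] - q) ** 2, axis=1)
                hits.append(order[lo:hi][d2 <= radius * radius])
        results.append(np.sort(np.concatenate(hits)))
    return results
```

Unlike a dense grid, the directory stores only occupied cells, so its size scales with the number of non-empty cells rather than with the bounding-box volume; the price is one binary search per neighbor cell.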

Performance

Hopper

Configuration

  - model=geotransolver_volume
  - dataset=drivaer_ml_volume
  - compile=true
  - profile=false
  - sampling_resolution=100000
  - training.seed=42
  - 29 training steps per epoch
  - Epoch 1 excluded
  - 261 measured steps per backend
  - NVIDIA H100 80GB

For radius search alone:

   Backend    Total per step (12 calls)    Mean per call    Speedup
  ━━━━━━━━━  ━━━━━━━━━━━━━━━━━━━━━━━━━━━  ━━━━━━━━━━━━━━━  ━━━━━━━━━
   Warp                       159.03 ms         13.25 ms      1.00×
  ─────────  ───────────────────────────  ───────────────  ─────────
   Compact                     23.59 ms          1.97 ms      6.74×
  ─────────  ───────────────────────────  ───────────────  ─────────
   Morton                      25.89 ms          2.16 ms      6.14×

  Values are averaged across five profiled training steps. Each step contains 12 radius-search calls.
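The derived columns of the radius-search table follow directly from the per-step totals. A short check, using only the numbers reported above (the dictionary names are just for illustration):

```python
CALLS_PER_STEP = 12

# Total radius-search time per training step, in milliseconds (from the table).
total_ms = {"Warp": 159.03, "Compact": 23.59, "Morton": 25.89}

# Mean per call = per-step total divided by the 12 calls in each step.
mean_ms = {name: t / CALLS_PER_STEP for name, t in total_ms.items()}

# Speedup = Warp total divided by the backend's total.
speedup = {name: total_ms["Warp"] / t for name, t in total_ms.items()}

for name in total_ms:
    print(f"{name:8s} {mean_ms[name]:6.2f} ms/call  {speedup[name]:.2f}x")
```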

For the entire training step:

   Backend            Mean step      Median         p95    Steps/sec       Speedup vs. Warp
  ━━━━━━━━━━━━━━  ━━━━━━━━━━━━━━  ━━━━━━━━━━  ━━━━━━━━━━  ━━━━━━━━━━━  ━━━━━━━━━━━━━━━━━━━━━
   Warp                0.7111 s    0.7213 s    0.7581 s         1.41                  1.00×
  ──────────────  ──────────────  ──────────  ──────────  ───────────  ─────────────────────
   Compact-cell        0.4829 s    0.4857 s    0.5175 s         2.07                  1.47×
  ──────────────  ──────────────  ──────────  ──────────  ───────────  ─────────────────────
   Morton-cell         0.4981 s    0.4919 s    0.5111 s         2.01                  1.43×

Blackwell

  model: geotransolver_volume
  dataset: drivaer_ml_volume
  precision: bfloat16
  compile: true
  sampling_resolution: 200000

   Backend                Mean step time       Throughput    Speedup vs. Warp
  ━━━━━━━━━━━━━━━━━━━━━  ━━━━━━━━━━━━━━━━  ━━━━━━━━━━━━━━━  ━━━━━━━━━━━━━━━━━━
   warp                         0.8206 s    1.219 steps/s               1.00×
  ─────────────────────  ────────────────  ───────────────  ──────────────────
   compact_cell_points          0.4824 s    2.073 steps/s               1.70×
  ─────────────────────  ────────────────  ───────────────  ──────────────────
   morton_cell_points           0.4858 s    2.059 steps/s               1.69×

TBD

Queries could also be assigned to cells and Morton-sorted, so that queries in the same cell are searched together by a single block, improving memory locality.
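One possible shape for this follow-up, sketched in NumPy on the CPU: sort the queries by the Morton code of their cell, find the runs of queries that share a cell, and hand each run to one block. This is a hypothetical illustration of the idea, not code from the PR; `morton3d` and `group_queries_by_cell` are assumed names, and the clipping of out-of-range queries only affects grouping, not search results.

```python
import numpy as np

def morton3d(coords, bits=10):
    """Interleave the low `bits` bits of x, y, z into one Morton code per row."""
    code = np.zeros(len(coords), dtype=np.int64)
    for b in range(bits):
        for axis in range(3):
            code |= ((coords[:, axis] >> b) & 1) << (3 * b + axis)
    return code

def group_queries_by_cell(queries, origin, radius, bits=10):
    """Return a Morton ordering of the queries plus [start, end) runs per cell."""
    coords = np.floor((queries - origin) / radius).astype(np.int64)
    # Clip so that out-of-grid queries still get a valid code for grouping.
    coords = np.clip(coords, 0, (1 << bits) - 1)
    codes = morton3d(coords, bits)
    order = np.argsort(codes, kind="stable")
    sorted_codes = codes[order]
    # A new run starts wherever the cell code changes.
    is_start = np.r_[True, sorted_codes[1:] != sorted_codes[:-1]]
    starts = np.flatnonzero(is_start)
    ends = np.r_[starts[1:], len(order)]
    return order, starts, ends
```

Each run `order[s:e]` would map to one block, whose queries all read the same 27 neighbor cells; results computed in sorted order can be written back through `order` to restore the caller's original query order.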

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

copy-pr-bot Bot commented Jun 29, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
