diff --git a/CHANGELOG.md b/CHANGELOG.md
index daebc11c77..90f205a2f5 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -36,6 +36,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   coordinates with no learnable parameters.
 - Adds radiation transport example (`examples/nuclear_engineering/radiation_transport`)
 - Adds agent skills structure, and initial skill for 'discoverability'.
+- Adds the `physicsnemo-functional-builder` agent skill: a standalone workflow
+  for adding a new `physicsnemo.nn.functional` op (or a Warp/cuML/SciPy backend
+  for an existing op) via `FunctionSpec`, with cross-backend equivalence tests.
 - Adds xDeepONet to experimental models
   (`physicsnemo.experimental.models.xdeeponet.DeepONet`).  A single
   dimension-generic (2D/3D) DeepONet that accepts a spatial or MLP branch,
diff --git a/skills/physicsnemo-functional-builder/BENCHMARK.md b/skills/physicsnemo-functional-builder/BENCHMARK.md
new file mode 100644
index 0000000000..a084786dcf
--- /dev/null
+++ b/skills/physicsnemo-functional-builder/BENCHMARK.md
@@ -0,0 +1,87 @@
+# Evaluation Report
+
+Pre-publication evaluation of the `physicsnemo-functional-builder` skill with NVSkills-Eval.
+
+This benchmark summarizes the NVSkills-Eval 3-tier evaluation results for the skill. The goal is to document whether the skill is safe, discoverable, effective, and useful for agents before it is published for broader workflow use.
+
+> **Status: pending.** The results, Tier-1/Tier-2 findings, and verdict below are
+> populated by an NVSkills-Eval run prior to publication. The evaluation dataset
+> (`evals/evals.json`) and target agents are committed; run the harness and
+> refresh this file before publishing.
+
+## Evaluation Summary
+
+- Skill: `physicsnemo-functional-builder`
+- Evaluation date: _pending_
+- NVSkills-Eval profile: `external`
+- Environment: `local`
+- Dataset: 4 evaluation tasks (`evals/evals.json`)
+- Attempts per task: 2
+- Pass threshold: 50%
+- Overall verdict: _pending_
+
+## Agents Used
+
+- `claude-code`
+- `codex`
+
+## Metrics Used
+
+Reported benchmark dimensions:
+
+- Security: checks whether skill-assisted execution avoids unsafe behavior such as secret leakage, destructive commands, or unauthorized access.
+- Correctness: checks whether the agent follows the expected workflow and produces the correct final output.
+- Discoverability: checks whether the agent loads the skill when relevant and avoids using it when irrelevant.
+- Effectiveness: checks whether the agent performs measurably better with the skill than without it.
+- Efficiency: checks whether the agent uses fewer tokens and avoids redundant work.
+
+Underlying evaluation signals used in this run:
+
+- `security` (Security): checks for unsafe operations, secret leakage, and unauthorized access.
+- `skill_execution` (Skill Execution): verifies that the agent loaded the expected skill and workflow.
+- `skill_efficiency` (Efficiency): checks routing quality, decoy avoidance, and redundant tool usage.
+- `accuracy` (Accuracy): grades final-answer correctness against the reference answer.
+- `goal_accuracy` (Goal Accuracy): checks whether the overall user task completed successfully.
+- `behavior_check` (Behavior Check): verifies expected behavior steps, including safety expectations.
+- `token_efficiency` (Token Efficiency): compares token usage with and without the skill.
+
+## Test Tasks
+
+The benchmark dataset contains 4 evaluation tasks:
+
+- Positive tasks: 2 tasks where the skill was expected to activate (add a new functional op with a Warp backend; add an optional cuML/SciPy backend to an existing op).
+- Negative tasks: 2 tasks where the functional-builder skill was not expected (a reusable-layer/model request that belongs to `physicsnemo-model-builder`; an out-of-scope request such as a datapipe or a "which op should I use" usage question).
+- Unlabeled tasks: 0.
+
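+For this skill, entries whose `expected_skill` is `physicsnemo-functional-builder` are treated as positive skill-activation cases; entries with `expected_skill: null`, or with `expected_skill` naming a different skill, are treated as negative activation cases.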
+
+## Results
+
+_Pending NVSkills-Eval run._
+
+| Dimension | Num | `claude-code` | `codex` |
+|---|---:|---:|---:|
+| Security | — | — | — |
+| Correctness | — | — | — |
+| Discoverability | — | — | — |
+| Effectiveness | — | — | — |
+| Efficiency | — | — | — |
+
+Score values show skill-assisted performance. Values in parentheses show uplift versus the no-skill baseline when baseline data is available.
+
+## Tier 1: Static Validation Summary
+
+_Pending NVSkills-Eval run._
+
+## Tier 2: Deduplication Summary
+
+_Pending NVSkills-Eval run._ Note: this skill is intentionally distinct from
+`physicsnemo-model-builder` (authoring `nn.functional` ops/backends vs. models and
+`nn.Module` layers); the negative eval tasks guard that routing boundary.
+
+## Publication Recommendation
+
+_Pending NVSkills-Eval run._ Refresh this file with the harness output (results
+table, Tier-1/Tier-2 findings, verdict) before publishing, and keep it with the
+skill; re-run when the evaluation dataset, skill behavior, or target agents
+materially change.
diff --git a/skills/physicsnemo-functional-builder/SKILL.md b/skills/physicsnemo-functional-builder/SKILL.md
new file mode 100644
index 0000000000..69c61aa73b
--- /dev/null
+++ b/skills/physicsnemo-functional-builder/SKILL.md
@@ -0,0 +1,205 @@
+---
+name: physicsnemo-functional-builder
+description: Official NVIDIA-authored workflow for adding a new functional op (or a new optimized backend for an existing op) to physicsnemo.nn.functional. Scaffolds a FunctionSpec with multi-backend dispatch (a torch reference plus optional Warp, cuML, or SciPy backends), wires re-exports, writes cross-backend equivalence tests, and runs the local CI gates (ruff, interrogate, pytest). Use when a contributor wants to add an op or a Warp/cuML/SciPy backend to physicsnemo.nn.functional. Do NOT use for complete models or reusable nn.Module layers (use physicsnemo-model-builder), datapipes, losses or metrics, training-recipe or example authoring, environment setup, or deciding which existing op to use.
+license: Apache-2.0
+metadata:
+  author: NVIDIA <agent-skills@nvidia.com>
+  tags:
+    - physicsnemo
+    - functional
+    - kernels
+    - contributing
+    - scaffolding
+---
+
+# PhysicsNeMo Functional Builder
+
+Drive a contributor from "I have an op (a kNN, an SDF query, a sampler, an
+interpolation, a geometry kernel)" — or "I have a faster backend for an
+existing op" — to a standards-compliant, tested, CI-green addition to
+`physicsnemo.nn.functional`. **You do the mechanical work**: the per-op package
+layout, the `FunctionSpec` shell, backend registration and dispatch, the
+`torch.library.custom_op` wrapping for accelerated backends, re-exports,
+cross-backend tests, and the gates. **The contributor brings the math** — the
+actual algorithm and any hand-written Warp/cuML kernel. Keep that division
+explicit: never invent their algorithm; scaffold everything around it.
+
+The audience is a researcher fluent in PyTorch but new to PhysicsNeMo, so
+**explain the "why"** at each step (name the rule, give the reason) rather than
+silently emitting files.
+
+This skill is standalone — it does not depend on any other PhysicsNeMo skill.
+
+## Core principles
+
+1. **`physicsnemo/core/function_spec.py` and `CODING_STANDARDS/` are ground
+   truth — read them, don't paraphrase from memory.** The dispatch/registration
+   machinery lives in `FunctionSpec` (`physicsnemo/core/function_spec.py`); the
+   tensor-annotation / docstring / import rules live in `CODING_STANDARDS/`
+   (`MOD-***`, `EXT-***`). Open the real class and the cited rule before relying
+   on them, and reference them by name when you justify a decision. The exact
+   `FunctionSpec` method names (`register`, `dispatch`, `make_function`,
+   `make_inputs_forward`, `compare_forward`, `warp_launch_context`) may evolve —
+   confirm against the source.
+2. **Study a live exemplar before scaffolding.** The house pattern is consistent
+   and best learned by reading one op end-to-end:
+   `physicsnemo/nn/functional/geometry/farthest_point_sampling/` (Warp + torch)
+   and `physicsnemo/nn/functional/neighbors/knn/` (cuML + SciPy + torch). Mirror
+   their structure for the new op.
+3. **Verify every path before you cite it.** Glob/Read the live repo; a path recalled
+   from memory or pattern-matched from a neighbor is unverified: drop it if absent.
+
+## Scope
+
+In scope: a **new functional op** in `physicsnemo/nn/functional/<category>/`, and
+**adding a backend** (Warp, cuML, SciPy) to an existing op. Both center on
+`FunctionSpec`.
+
+Out of scope — stop and redirect: complete models / reusable `nn.Module` layers
+(→ `physicsnemo-model-builder`), datapipes (`physicsnemo/datapipes/`), losses &
+metrics (`physicsnemo/metrics/`), training recipes (`examples/`), and "which op
+should I use" (a usage question, not an authoring one).
+
+## Key facts that differ from models/layers
+
+State these early — contributors coming from the model side get them wrong:
+
+- **Functionals live in the STABLE tree**, `physicsnemo/nn/functional/...` —
+  **not** `experimental/`. (Contrast `MOD-002a`, which sends new *models/layers*
+  to `experimental/`. There is no `experimental/nn/functional`.) See
+  `references/placement.md`.
+- **There is no `Module`, no parameters, no serialization, no `ModelMetaData`,
+  no checkpoint round-trip.** A functional is a stateless op. The testing story
+  is **cross-backend equivalence**, not `validate_checkpoint`.
+- **Accelerated backends must be wrapped in `torch.library.custom_op`** (plus a
+  `register_fake`), so they compose with `torch.compile`/autograd — even an
+  inline Warp kernel.
+
+## Workflow
+
+Run in order. Confirm the consequential choices (new-op vs new-backend, category,
+which backends); scaffold the rest.
+
+### 1. Intake & classify
+
+Ask only what you can't infer (≤4 questions). Resolve:
+
+- **Task:** a brand-new op, or a new backend for an existing op?
+- **Identity:** op name (snake_case) and `FunctionSpec` class (PascalCase); the
+  signature (inputs/outputs + tensor shapes via jaxtyping); the category
+  (`geometry`, `neighbors`, `interpolation`, …).
+- **Backends:** which ones to provide. **Always a torch reference (`baseline`)**;
+  then optionally Warp (CUDA kernels), cuML (CUDA, optional dep), SciPy (CPU,
+  optional dep). (`references/backends.md`.)
+
+### 2. Place it (and say why)
+
+Per-op package under `physicsnemo/nn/functional/<category>/<op_name>/`:
+`<op_name>.py` (the `FunctionSpec` subclass + `dispatch`), `_torch_impl.py`,
+`_warp_impl.py` + `kernels.py` (if Warp), `_cuml_impl.py` / `_scipy_impl.py` (if
+those deps), `utils.py` (shared validation), `__init__.py`. Re-export up the
+chain: op `__init__` → category `__init__` → `physicsnemo/nn/functional/__init__.py`.
+Say the rule: **stable tree, not experimental** (`references/placement.md`).
+
+### 3. Scaffold the `FunctionSpec` + dispatch
+
+From `references/dispatch.md` and the skeletons in `references/scaffolds.md`:
+
+- Subclass `FunctionSpec`; register each backend with
+  `@FunctionSpec.register(name=..., required_imports=(...), rank=..., baseline=...)`
+  (lower `rank` = preferred; **exactly one** `baseline=True`, the torch ref).
+- Implement `dispatch()` for backend selection: explicit `implementation=`
+  override → availability check; else auto-select (fast backend on CUDA, CPU
+  backend on CPU) with a one-time fallback warning.
+- Expose the public op via `OpClass.make_function("op_name")`.
+- Add `make_inputs_forward` (benchmark inputs) and `compare_forward`
+  (tie-aware equivalence) hooks.
+- jaxtyping on every tensor arg (`MOD-006`); NumPy `r"""` docstrings
+  (`Parameters`/`Returns`/`Raises`); shape validation in `utils.py` guarded by
+  `if not torch.compiler.is_compiling():` (`MOD-005`); upward-only imports
+  (`EXT-***`).
+
+### 4. Backends (torch always; Warp/cuML/SciPy as chosen)
+
+From `references/backends.md`:
+
+- **torch** reference impl in `_torch_impl.py` — the `baseline`, always present,
+  device-agnostic, the equivalence oracle.
+- **Warp:** pure `@wp.kernel`s in `kernels.py` (no torch import); `_warp_impl.py`
+  wraps the launch in `@torch.library.custom_op(...)` + `@<op>.register_fake`,
+  converts with `wp.from_torch(..., return_ctype=True)`, and uses
+  `FunctionSpec.warp_launch_context(tensor)` for device/stream. Warp kernels are
+  typically CUDA-only — raise clearly on CPU.
+- **cuML / SciPy:** gate the whole impl on
+  `check_version_spec(pkg, ver, hard_fail=False)`; inside, wrap with
+  `torch.library.custom_op` + `register_fake`, move data zero-copy via DLPack
+  (cuML) or numpy (SciPy), and **check `tensor.device.type` inside the impl**.
+  When the dep is missing, register a stub that raises a clear `ImportError`.
+
+### 5. Cross-backend tests
+
+From `references/testing.md` (mirror source path under `test/nn/functional/...`):
+
+- A **known-answer** test on a deterministic input (backend-independent truth).
+- **Per-backend** parametrization with `pytest.skip` for device (Warp/cuML are
+  CUDA-only) and for missing optional deps (`check_version_spec`).
+- A **backend-parity** test via `OpClass.compare_forward(...)` — and remember the
+  classic trap: for neighbor ops compare **distances, not indices** (equal-distance
+  ties order differently across backends); sort before comparing.
+- `torch.library.opcheck(...)` for each `custom_op` backend.
+
+### 6. Gates
+
+From the repo root, run and iterate to green (explain each):
+
+```bash
+make lint          # ruff format --check + ruff check
+make interrogate   # docstring coverage
+make pytest        # or: pytest test/nn/functional/<category>/... -q
+```
+
+Unlike `experimental/`, `nn/functional/` is **not** lint/interrogate-exempt —
+the new op must pass ruff and docstring coverage.
+
+### 7. Finish & review
+
+- Add a one-line `CHANGELOG.md` entry and SPDX Apache-2.0 headers to new files;
+  remind the contributor commits need `-s` (sign-off).
+- Do an independent **code-review pass over the diff** before opening the PR —
+  re-check it against `FunctionSpec`, the standards (`MOD-***`/`EXT-***`),
+  correctness, and backend parity, ideally with fresh eyes (a separate review
+  session/agent). If the host agent offers a built-in code-review command (for
+  example Claude Code's `/code-review`), use it; otherwise review the diff
+  directly. Then open the PR — CODEOWNERS review + CI re-run the gates.
+
+## Common gotchas
+
+Surface the relevant traps inline as you scaffold (full catalogue:
+`references/lessons.md`):
+
+- **Stable tree, not `experimental/`** — the opposite of the models/layers rule.
+- **Backend ties differ:** compare neighbor outputs by *distances* (sorted), not
+  indices; use `compare_forward` to encode the tie-aware comparison.
+- **Device checks belong inside the impl**, not only in `dispatch` (e.g. cuML
+  and Warp raise on CPU tensors; SciPy raises on CUDA tensors).
+- **`custom_op` + `register_fake` are mandatory** for Warp/cuML/SciPy backends,
+  or `torch.compile`/`opcheck` break.
+- **Optional deps are gated by `check_version_spec(..., hard_fail=False)`** with a
+  stub `ImportError` fallback — never a bare top-level `import cuml`.
+- **Exactly one `baseline=True`** (the torch reference); it's the benchmark and
+  equivalence oracle.
+
+## Related resources
+
+- `references/placement.md` — where functionals go (stable tree), per-op package
+  layout, re-exports, and what's *not* a functional.
+- `references/dispatch.md` — `FunctionSpec`: registration, `dispatch`,
+  `make_function`, benchmark/compare hooks.
+- `references/backends.md` — Warp (kernels + `custom_op` wrap), cuML & SciPy
+  (optional-dep gating, DLPack), and the torch reference.
+- `references/testing.md` — cross-backend equivalence, device/dep skips, `opcheck`.
+- `references/scaffolds.md` — copy-paste skeletons for the package, each backend,
+  and the test module.
+- `references/lessons.md` — gotchas distilled from real functional PRs.
+- `physicsnemo/core/function_spec.py`, `CODING_STANDARDS/` — the authoritative
+  source; read before relying on them.
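The tie-handling rule from step 5 and the gotchas list of SKILL.md can be illustrated without PhysicsNeMo at all. The following is a minimal pure-Python sketch, not the repository's `compare_forward`: `knn_backend_a`, `knn_backend_b`, and `outputs_match` are hypothetical names, and the two "backends" are the same brute-force search with opposite tie-breaking. It shows why index equality fails on equidistant neighbors while sorted distances still agree.

```python
import math


def _knn(points, query, k, prefer_low_index):
    """Brute-force kNN returning (indices, distances); tie order is configurable."""
    dists = [math.dist(p, query) for p in points]
    sign = 1 if prefer_low_index else -1
    # Sort by distance first, then by index (ascending or descending) to break ties.
    order = sorted(range(len(points)), key=lambda i: (dists[i], sign * i))[:k]
    return order, [dists[i] for i in order]


def knn_backend_a(points, query, k):
    """Hypothetical backend that breaks ties toward the lowest index."""
    return _knn(points, query, k, prefer_low_index=True)


def knn_backend_b(points, query, k):
    """Hypothetical backend that breaks ties toward the highest index."""
    return _knn(points, query, k, prefer_low_index=False)


def outputs_match(dist_a, dist_b, tol=1e-6):
    """Tie-aware comparison: compare sorted distances, never raw indices."""
    if len(dist_a) != len(dist_b):
        return False
    return all(
        math.isclose(a, b, abs_tol=tol)
        for a, b in zip(sorted(dist_a), sorted(dist_b))
    )


# Four points at distance 1 from the origin: every neighbor is a tie.
points = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
idx_a, dist_a = knn_backend_a(points, (0.0, 0.0), k=2)
idx_b, dist_b = knn_backend_b(points, (0.0, 0.0), k=2)

print(idx_a, idx_b)                   # the index lists differ between the backends
print(outputs_match(dist_a, dist_b))  # the distances still agree
```

A real `compare_forward` would apply the same idea to tensors, sorting the distance output along the neighbor axis before an approximate comparison.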
diff --git a/skills/physicsnemo-functional-builder/evals/evals.json b/skills/physicsnemo-functional-builder/evals/evals.json
new file mode 100644
index 0000000000..4826df2a7d
--- /dev/null
+++ b/skills/physicsnemo-functional-builder/evals/evals.json
@@ -0,0 +1,58 @@
+[
+  {
+    "id": "add-new-functional-with-warp-backend",
+    "question": "I want to add a new point-cloud sampling op to PhysicsNeMo's nn.functional with a fast CUDA path. How do I implement it so it follows the repository conventions and has both a reference and an accelerated backend?",
+    "expected_skill": "physicsnemo-functional-builder",
+    "expected_script": null,
+    "ground_truth": "A new functional op is added under physicsnemo/nn/functional/<category>/<op_name>/ in the STABLE tree (not experimental — unlike new models/layers). It is a FunctionSpec subclass that registers backends via @FunctionSpec.register(name=..., required_imports=..., rank=..., baseline=...), with exactly one baseline=True torch reference impl (_torch_impl.py) and an accelerated Warp backend (pure @wp.kernel in kernels.py, wrapped in _warp_impl.py with torch.library.custom_op + register_fake, launched via wp.from_torch/warp_launch_context). A dispatch() classmethod selects the backend by device (Warp on CUDA, torch on CPU) with an explicit implementation= override, and the public op is exposed via make_function(...). Tensor args use jaxtyping, shape validation lives in utils.py guarded by torch.compiler.is_compiling(), and the op is re-exported up through the category and nn.functional __init__ files. Tests mirror the source path and cover known-answer, cross-backend parity via compare_forward, device/dep skips, and torch.library.opcheck.",
+    "expected_behavior": [
+      "Loads the physicsnemo-functional-builder skill.",
+      "Places the op in the stable physicsnemo/nn/functional/ tree and explains it is NOT experimental (unlike models/layers).",
+      "Scaffolds a FunctionSpec subclass with a baseline torch impl and a Warp backend wrapped in torch.library.custom_op + register_fake.",
+      "Implements dispatch() selecting backend by device, plus make_function for the public op.",
+      "Proposes cross-backend equivalence tests (compare_forward), device/dependency skips, and opcheck.",
+      "Does not invent the op's algorithm or kernel; defers the math to the contributor.",
+      "Every absolute path cited in the final message exists on disk."
+    ]
+  },
+  {
+    "id": "add-cuml-scipy-backend-to-existing-op",
+    "question": "Our nn.functional kNN only has a torch implementation. How do I add an optional cuML (GPU) and SciPy (CPU) backend so it's faster when those libraries are installed?",
+    "expected_skill": "physicsnemo-functional-builder",
+    "expected_script": null,
+    "ground_truth": "A new backend is added as a dep-gated implementation module (_cuml_impl.py / _scipy_impl.py) and registered on the existing FunctionSpec via @FunctionSpec.register(name='cuml', required_imports=('cuml>=...','cupy>=...'), rank=...) (and 'scipy'). Each impl is gated on check_version_spec(pkg, ver, hard_fail=False) with an ImportError-raising stub when the dep is missing, is wrapped in torch.library.custom_op + register_fake, checks the tensor device inside the impl (cuML rejects CPU, SciPy rejects CUDA), moves data zero-copy (DLPack for cuML, numpy for SciPy), and casts unsupported dtypes (bf16->fp32) around the call. dispatch() is updated to prefer cuML on CUDA / SciPy on CPU with a one-time fallback to the torch baseline. Tests add per-backend parity using compare_forward that compares sorted DISTANCES not indices (ties differ across backends), with check_version_spec / device pytest.skips.",
+    "expected_behavior": [
+      "Loads the physicsnemo-functional-builder skill.",
+      "Adds dep-gated _cuml_impl.py / _scipy_impl.py registered on the existing FunctionSpec via @FunctionSpec.register with required_imports.",
+      "Gates optional deps with check_version_spec(hard_fail=False) and a stub ImportError fallback; checks device inside each impl.",
+      "Wraps backends in torch.library.custom_op + register_fake and uses DLPack/numpy zero-copy.",
+      "Updates dispatch() to prefer cuML/SciPy by device with fallback to the torch baseline.",
+      "Recommends cross-backend parity tests comparing sorted distances (not indices), with device/dep skips.",
+      "Every absolute path cited in the final message exists on disk."
+    ]
+  },
+  {
+    "id": "reusable-layer-defers-to-model-builder",
+    "question": "I have a new attention block (an nn.Module with learnable weights) I want to add to PhysicsNeMo so other models can reuse it. How do I add it correctly?",
+    "expected_skill": "physicsnemo-model-builder",
+    "expected_script": null,
+    "ground_truth": "A reusable, parameterized nn.Module building block is a LAYER, not a functional. It belongs under physicsnemo/nn/module/ (starting in experimental per MOD-002a) and is the territory of the physicsnemo-model-builder skill, not the functional-builder. The functional-builder is scoped to stateless tensor ops in physicsnemo.nn.functional (FunctionSpec + backends), so it should NOT activate here.",
+    "expected_behavior": [
+      "Does NOT load the physicsnemo-functional-builder skill.",
+      "Treats the request as a reusable layer (nn.Module) — physicsnemo-model-builder territory.",
+      "Does not scaffold a FunctionSpec or an nn.functional op."
+    ]
+  },
+  {
+    "id": "out-of-scope-datapipe",
+    "question": "How do I add a new datapipe for my custom point-cloud HDF5 dataset in PhysicsNeMo?",
+    "expected_skill": null,
+    "expected_script": null,
+    "ground_truth": "Datapipes are out of scope for the physicsnemo-functional-builder skill, which covers stateless ops and their backends in physicsnemo.nn.functional. A datapipe belongs under physicsnemo/datapipes/. The skill should not activate and should redirect rather than scaffold a functional op.",
+    "expected_behavior": [
+      "Does NOT load the physicsnemo-functional-builder skill.",
+      "Recognizes datapipes as out of scope and redirects toward physicsnemo/datapipes/.",
+      "Does not scaffold a FunctionSpec or nn.functional op."
+    ]
+  }
+]
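The 2 positive / 2 negative split reported in BENCHMARK.md can be recomputed from the dataset. This is a hedged sketch that assumes a task counts as positive for this skill only when `expected_skill` names it. `split_tasks` is a hypothetical helper, not part of NVSkills-Eval, and the inline JSON keeps only the `id` and `expected_skill` fields of the four entries above.

```python
import json

SKILL = "physicsnemo-functional-builder"

# Only the two fields needed for the split, copied from evals/evals.json.
EVALS_JSON = """
[
  {"id": "add-new-functional-with-warp-backend",
   "expected_skill": "physicsnemo-functional-builder"},
  {"id": "add-cuml-scipy-backend-to-existing-op",
   "expected_skill": "physicsnemo-functional-builder"},
  {"id": "reusable-layer-defers-to-model-builder",
   "expected_skill": "physicsnemo-model-builder"},
  {"id": "out-of-scope-datapipe",
   "expected_skill": null}
]
"""


def split_tasks(tasks, skill):
    """Return (positive_ids, negative_ids) from the point of view of `skill`."""
    positive = [t["id"] for t in tasks if t["expected_skill"] == skill]
    # A null expected_skill and a different skill both count as negative here.
    negative = [t["id"] for t in tasks if t["expected_skill"] != skill]
    return positive, negative


positive, negative = split_tasks(json.loads(EVALS_JSON), SKILL)
print(len(positive), len(negative))
```

Under that assumption the model-builder task lands on the negative side, which is what guards the routing boundary between the two skills.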
diff --git a/skills/physicsnemo-functional-builder/references/backends.md b/skills/physicsnemo-functional-builder/references/backends.md
new file mode 100644
index 0000000000..776b03d4e7
--- /dev/null
+++ b/skills/physicsnemo-functional-builder/references/backends.md
@@ -0,0 +1,154 @@
+# Backends — torch reference, Warp kernels, cuML & SciPy
+
+Every op ships a **torch reference** (the `baseline`). Accelerated backends are
+optional and **must be wrapped in `torch.library.custom_op`** so they compose
+with autograd and `torch.compile`. Study `farthest_point_sampling/` (Warp+torch)
+and `neighbors/knn/` (cuML+SciPy+torch) before writing.
+
+## 1. torch reference (`_torch_impl.py`) — always present
+
+- Pure PyTorch, device-agnostic (runs on CPU and CUDA).
+- The correctness oracle every other backend is tested against, and the
+  `baseline=True` benchmark reference.
+- No `custom_op` wrapper needed (it's already plain torch).
+- Prioritize clarity over speed — this is the spec, not the fast path.
+
+## 2. Warp backend (`kernels.py` + `_warp_impl.py`)
+
+**`kernels.py` — pure Warp, no torch import:**
+
+```python
+import warp as wp
+
+@wp.kernel
+def fps_fused(
+    points: wp.array3d(dtype=wp.float32),   # (B, N, D)
+    selected: wp.array2d(dtype=wp.int32),   # (B, K) output
+    num_points: wp.int32,
+    num_samples: wp.int32,
+    dim: wp.int32,
+):
+    b, t = wp.tid()
+    ...  # the contributor's algorithm; use wp.tile_* for block reductions
+```
+
+**`_warp_impl.py` — wrap the launch in a `custom_op`:**
+
+```python
+import torch
+import warp as wp
+from physicsnemo.core.function_spec import FunctionSpec
+from .kernels import fps_fused
+from .utils import validate_inputs
+
+wp.init()  # at module load
+
+@torch.library.custom_op("physicsnemo::farthest_point_sampling_warp", mutates_args=())
+def farthest_point_sampling(points: torch.Tensor, num_samples: int,
+                            random_start: bool = False) -> torch.Tensor:
+    points_b, was_unbatched = validate_inputs(points, num_samples)
+    if points_b.device.type != "cuda":
+        raise ValueError("The Warp farthest_point_sampling backend requires CUDA tensors.")
+    points_f = points_b.detach().to(torch.float32).contiguous()
+    batch, num_points, point_dim = points_f.shape   # (B, N, D) after validate_inputs
+    block_size = 256  # placeholder launch width (assumption); match the kernel's tiling
+    selected = torch.empty((batch, num_samples), dtype=torch.int32, device=points_f.device)
+
+    wp_device, wp_stream = FunctionSpec.warp_launch_context(points_f)   # device/stream
+    with wp.ScopedStream(wp_stream):
+        wp.launch(
+            fps_fused, dim=(batch, block_size), block_dim=block_size,
+            inputs=[
+                wp.from_torch(points_f, return_ctype=True),            # zero-copy
+                wp.from_torch(selected, return_ctype=True),
+                num_points, num_samples, point_dim,
+            ],
+            device=wp_device, stream=wp_stream,
+        )
+    selected = selected.to(torch.int64)
+    return selected.squeeze(0) if was_unbatched else selected
+
+@farthest_point_sampling.register_fake
+def _(points, num_samples, random_start=False):
+    # shape inference for torch.compile / opcheck — no compute
+    if points.ndim == 2:
+        return torch.empty((num_samples,), dtype=torch.int64, device=points.device)
+    return torch.empty((points.shape[0], num_samples), dtype=torch.int64, device=points.device)
+```
+
+Warp essentials:
+- `wp.init()` once at module load.
+- `wp.from_torch(t, return_ctype=True)` — zero-copy torch→warp; tensors must be
+  `.contiguous()` and a Warp-supported dtype (cast bf16→fp32 first).
+- `FunctionSpec.warp_launch_context(tensor)` → `(device, stream)`; launch under
+  `wp.ScopedStream(stream)` so it respects the active CUDA stream.
+- Warp kernels here are **CUDA-only** — raise a clear `ValueError` on CPU tensors
+  (the torch baseline covers CPU via `dispatch`).
+- `@<op>.register_fake` is **mandatory** — without it `torch.compile`/`opcheck`
+  can't infer output shapes.
+
+## 3. cuML / SciPy backends (`_cuml_impl.py` / `_scipy_impl.py`) — dep-gated
+
+Never `import cuml` at top level. Gate the whole impl on availability and
+register a clear-error stub otherwise:
+
+```python
+import importlib
+import torch
+from physicsnemo.core.version_check import check_version_spec   # confirm the exact path
+
+CUML_AVAILABLE = check_version_spec("cuml", "26.2.0", hard_fail=False)
+CUPY_AVAILABLE = check_version_spec("cupy", "13.6.0", hard_fail=False)
+
+if CUML_AVAILABLE and CUPY_AVAILABLE:
+    cuml = importlib.import_module("cuml")
+    cp = importlib.import_module("cupy")
+
+    @torch.library.custom_op("physicsnemo::knn_cuml", mutates_args=())
+    def knn_impl(points: torch.Tensor, queries: torch.Tensor, k: int = 3
+                 ) -> tuple[torch.Tensor, torch.Tensor]:
+        if points.device.type != "cuda":                 # device check INSIDE the impl
+            raise ValueError(f"`knn` cuml does not support CPU, got {points.device=}")
+        restore = points.dtype
+        if restore == torch.bfloat16:                    # bf16 is unsupported here; compute in fp32
+            points, queries = points.float(), queries.float()
+        points = cp.from_dlpack(points)                  # zero-copy via DLPack
+        queries = cp.from_dlpack(queries)
+        index = cuml.neighbors.NearestNeighbors(n_neighbors=k)
+        index.fit(points)
+        distance, indices = index.kneighbors(queries)
+        indices, distance = torch.from_dlpack(indices), torch.from_dlpack(distance)
+        if restore == torch.bfloat16:
+            distance = distance.to(restore)
+        return indices, distance
+
+    @knn_impl.register_fake
+    def _(points, queries, k=3):
+        return (torch.empty(queries.shape[0], k, device=queries.device, dtype=torch.int64),
+                torch.empty(queries.shape[0], k, device=queries.device, dtype=queries.dtype))
+else:
+    def knn_impl(*args, **kwargs):                        # stub: clear ImportError
+        raise ImportError("physicsnemo kNN: cuml/cupy not installed.")
+```
+
+SciPy mirrors this for **CPU**: gate on `check_version_spec("scipy", ...)`, check
+`points.device.type != "cpu"` inside, move data via `.detach().numpy()`, build
+the structure (e.g. `scipy.spatial.KDTree`), and convert results back with
+`torch.from_numpy(...)`. Cast bf16→fp32 the same way.
+
+Optional-dep rules:
+- `check_version_spec(pkg, ver, hard_fail=False)` is the availability gate
+  (verify the import path in the live repo).
+- Convert zero-copy where possible: **DLPack** for GPU (cuML/CuPy), numpy for CPU
+  (SciPy).
+- Keep the **hard device check inside the impl** — `dispatch` chooses by device,
+  but a directly-called backend must reject the wrong device itself.
+- Match the registered `required_imports` versions to what the impl actually gates
+  on.
+
+## Backend ↔ registration mapping
+
+The `_*_impl.py` functions are what the `@FunctionSpec.register(name=...)` methods
+in `<op_name>.py` call. Keep the registered `name`, `required_imports`, and `rank`
+consistent with the impl they wrap (e.g. `name="cuml"`,
+`required_imports=("cuml>=26.2.0","cupy>=13.6.0")`, `rank=0`).
diff --git a/skills/physicsnemo-functional-builder/references/dispatch.md b/skills/physicsnemo-functional-builder/references/dispatch.md
new file mode 100644
index 0000000000..22f3a91ea0
--- /dev/null
+++ b/skills/physicsnemo-functional-builder/references/dispatch.md
@@ -0,0 +1,122 @@
+# Dispatch — `FunctionSpec`, registration, backend selection
+
+Every functional is a subclass of **`FunctionSpec`**
+(`physicsnemo/core/function_spec.py`). It is the single mechanism for: declaring
+backends, choosing one at call time, and providing benchmark/equivalence hooks.
+**Read `core/function_spec.py` before scaffolding** — the method names below are
+the current convention but may evolve; treat the source as truth.
+
+## The shape of a functional
+
+```python
+class FarthestPointSampling(FunctionSpec):
+    """One-line contract + Parameters/Returns/Raises (NumPy docstring)."""
+
+    # Backends. Lower rank = preferred. EXACTLY ONE baseline=True (the oracle).
+    @FunctionSpec.register(name="warp", required_imports=("warp>=0.6.0",), rank=0)
+    def warp_forward(points, num_samples, random_start=False): ...
+
+    @FunctionSpec.register(name="torch", rank=1, baseline=True)
+    def torch_forward(points, num_samples, random_start=False): ...
+
+    @classmethod
+    def dispatch(cls, points, num_samples, random_start=False, implementation=None):
+        ...  # backend selection — see below
+
+    @classmethod
+    def make_inputs_forward(cls, device="cpu"):
+        ...  # yields (label, args_tuple, kwargs_dict) for benchmarking
+
+    @classmethod
+    def compare_forward(cls, output, reference):
+        ...  # tie-aware equivalence (see references/testing.md)
+
+# Public callable the rest of physicsnemo imports:
+farthest_point_sampling = FarthestPointSampling.make_function("farthest_point_sampling")
+```
+
+## `@FunctionSpec.register(...)`
+
+- `name`: backend id used by `implementation="..."` and in tests
+  (`"warp"`, `"torch"`, `"cuml"`, `"scipy"`).
+- `required_imports`: version specs gating availability, e.g.
+  `("cuml>=26.2.0", "cupy>=13.6.0")`. The registry marks the backend
+  unavailable (rather than crashing) when an import is missing.
+- `rank`: integer preference; the lowest-rank *available* backend wins
+  auto-selection.
+- `baseline=True`: marks the reference impl (the torch one). Exactly one. It is
+  the benchmark reference and the equivalence oracle.
+
+The registered functions take the **op's arguments directly** (no `self`).
+
+## `dispatch()` — the selection recipe
+
+Two paths: explicit override, then auto-select. The canonical shape:
+
+```python
+@classmethod
+def dispatch(cls, points, num_samples, random_start=False, implementation=None):
+    impls = cls._get_impls()
+    cls._check_impl(implementation, impls)            # validate the name
+
+    if implementation is not None:                    # explicit override
+        impl = impls[implementation]
+        if not impl.available:
+            raise ImportError(f"Implementation '{implementation}' is not available...")
+        return impl.func(points, num_samples, random_start)
+
+    # auto-select: Warp on CUDA when available, otherwise the torch baseline
+    warp_impl = impls.get("warp")
+    if points.is_cuda and warp_impl is not None and warp_impl.available:
+        return warp_impl.func(points, num_samples, random_start)
+    return impls["torch"].func(points, num_samples, random_start)
+```
+
+For an op with cuML (CUDA) + SciPy (CPU) + torch, prefer by device and warn once
+on fallback:
+
+```python
+preferred_name = "cuml" if points.is_cuda else "scipy"
+preferred = impls.get(preferred_name)
+impl = preferred if (preferred is not None and preferred.available) else None
+if impl is None:
+    impl = impls["torch"]
+    cls._warn_fallback(preferred, impl)               # one-time fallback warning
+return impl.func(points, queries, k)
+```
+
+Rules of thumb:
+- The auto path must always reach an **available** backend — the torch baseline
+  guarantees that.
+- Put the *device-appropriateness* decision here (CUDA→Warp/cuML, CPU→torch/SciPy),
+  but keep the *hard device check* inside each impl too (a cuML impl must reject a
+  CPU tensor itself — see `references/backends.md`).
+
+## `make_function(...)`
+
+`OpClass.make_function("op_name")` returns the public callable that runs
+`dispatch`. Bind it at module scope and re-export it (see
+`references/placement.md`). Users call `op_name(...)` and may pass
+`implementation="warp"` to force a backend (mainly for tests/benchmarks).
+
+## Benchmark + equivalence hooks
+
+- `make_inputs_forward(cls, device)` — a generator yielding
+  `(label, args_tuple, kwargs_dict)` covering representative sizes; powers the
+  benchmark harness. Keep cases small but realistic.
+- `compare_forward(cls, output, reference)` — how two backends' outputs are
+  judged equal. Encode tie-invariance here (sort indices; for neighbor ops
+  compare sorted **distances**, not indices). Optional `make_inputs_backward` /
+  `compare_backward` exist for differentiable ops.
+
+## Cross-cutting (applies to every backend signature)
+
+- **jaxtyping** on tensor args/returns (`MOD-006`):
+  `Float[torch.Tensor, "*batch num_points dim"]`, `Int[...]`.
+- **NumPy `r"""` docstrings** with `Parameters` / `Returns` / `Raises`.
+- **Shape/precondition validation** in `utils.py`, guarded by
+  `if not torch.compiler.is_compiling():` (`MOD-005`) so it's skipped under
+  compilation.
+- **Upward-only imports** (`EXT-***`): functional code may import from
+  `physicsnemo.core` (e.g. `FunctionSpec`, `check_version_spec`) but must not
+  reach into `models/`.
diff --git a/skills/physicsnemo-functional-builder/references/lessons.md b/skills/physicsnemo-functional-builder/references/lessons.md
new file mode 100644
index 0000000000..47d28c819b
--- /dev/null
+++ b/skills/physicsnemo-functional-builder/references/lessons.md
@@ -0,0 +1,59 @@
+# Common gotchas (functional ops)
+
+Distilled from real `physicsnemo.nn.functional` PRs. Surface the relevant one
+inline as you scaffold.
+
+- **Stable tree, not `experimental/`.** New functionals go directly into
+  `physicsnemo/nn/functional/...`. This is the *opposite* of the models/layers
+  rule (`MOD-002a`), and the mistake a contributor coming from the model side
+  makes first. There is no `experimental/nn/functional`.
+
+- **Compare neighbor outputs by distance, not index.** Equal-distance ties are
+  ordered differently across cuML / SciPy / torch, so parity tests that compare
+  indices fail spuriously. Compare **sorted distances**; for samplers compare
+  **sorted indices** (set-equality). Put this in `compare_forward` so every test
+  inherits it.
+
+- **`custom_op` + `register_fake` are mandatory** for Warp/cuML/SciPy backends.
+  Without the `custom_op` wrapper the op won't compose with autograd/`torch.compile`;
+  without `register_fake` shape inference (and `opcheck`) fails.
+
+- **Device check belongs inside the impl, not only in `dispatch`.** A cuML/Warp
+  impl must raise on a CPU tensor (and SciPy on CUDA) even when called directly
+  via `implementation="..."`. `dispatch` picks by device; the impl enforces it.
+
+- **Optional deps are gated, never bare-imported.** Use
+  `check_version_spec(pkg, ver, hard_fail=False)` and an `if available: … else:
+  stub-raising-ImportError` block. A top-level `import cuml` breaks import on
+  every CPU-only install.
+
+- **Cast unsupported dtypes before the backend.** Warp and cuML/cuPy generally
+  want fp32 — upcast bf16 (and often fp16) on the way in, and cast results back
+  to the caller's dtype on the way out.
+
+- **Warp needs contiguous, zero-copy tensors.** `wp.from_torch(t,
+  return_ctype=True)` requires `.contiguous()`; launch under
+  `wp.ScopedStream(stream)` from `FunctionSpec.warp_launch_context(...)` so it
+  honors the active CUDA stream. Call `wp.init()` once at module load.
+
+- **Exactly one `baseline=True`** — the torch reference. It's the benchmark
+  reference and the equivalence oracle, and it guarantees `dispatch` always has
+  an available fallback.
+
+- **Dynamic-shape backends don't `torch.compile` cleanly.** Ops whose output size
+  depends on data at runtime (e.g. radius search with an unbounded neighbor
+  count) are compile-incompatible — document it and keep a bounded/`max_*` path
+  for the compiled case.
+
+- **`nn/functional/` is linted.** Unlike `experimental/`, it is **not** exempt
+  from ruff/interrogate — the op needs jaxtyping, full NumPy docstrings
+  (`Parameters`/`Returns`/`Raises`), and clean formatting to pass CI.
+
+- **Validate under `if not torch.compiler.is_compiling():`** (`MOD-005`). Keep the
+  shape/precondition checks in `utils.py` and guard them so they're skipped under
+  compilation rather than tracing into the graph.
+
+- **Read `core/function_spec.py` for the live API.** `register` /
+  `dispatch` / `make_function` / `make_inputs_forward` / `compare_forward` /
+  `warp_launch_context` are the current names — confirm against source rather than
+  trusting a skeleton verbatim.
diff --git a/skills/physicsnemo-functional-builder/references/placement.md b/skills/physicsnemo-functional-builder/references/placement.md
new file mode 100644
index 0000000000..951e0847c2
--- /dev/null
+++ b/skills/physicsnemo-functional-builder/references/placement.md
@@ -0,0 +1,83 @@
+# Placement — where a functional lives
+
+Resolve two questions before writing: **is this actually a functional**, and
+**where in `nn/functional/` does it go**.
+
+## Is it a functional? (litmus test)
+
+A `physicsnemo.nn.functional` op is a **stateless tensor-in / tensor-out
+operation** — no learnable parameters, no `nn.Module`, no checkpoint. It usually
+has (or could have) more than one implementation (a torch reference plus an
+accelerated backend). Examples: nearest neighbors, radius search, signed-distance
+queries, farthest-point sampling, mesh voxelization, interpolation.
+
+If it has parameters / state / a `forward` users compose into a model → it's a
+**layer or model**, not a functional → redirect to `physicsnemo-model-builder`.
+If it's a loss/metric → `physicsnemo/metrics/`. If it's data loading →
+`physicsnemo/datapipes/`.
+
+## Where it goes — the STABLE tree (not experimental)
+
+```
+physicsnemo/nn/functional/<category>/<op_name>/
+  __init__.py          # re-export the FunctionSpec subclass + the public op
+  <op_name>.py         # FunctionSpec subclass + dispatch + make_function(...)
+  _torch_impl.py       # torch reference impl (the baseline)
+  kernels.py           # pure-Warp @wp.kernel definitions (if Warp backend)
+  _warp_impl.py        # Warp impl wrapped in torch.library.custom_op (if Warp)
+  _cuml_impl.py        # cuML backend, dep-gated (if provided)
+  _scipy_impl.py       # SciPy backend, dep-gated (if provided)
+  utils.py             # shared input validation / small helpers
+```
+
+> **Key difference from models/layers:** new functionals go straight into the
+> **stable** `physicsnemo/nn/functional/` tree — there is **no
+> `experimental/nn/functional`**. (`MOD-002a` routes new *models/layers* to
+> `experimental/`; it does **not** apply here.) Confirm by checking that existing
+> ops like `farthest_point_sampling` and `knn` live under `nn/functional/`, not
+> `experimental/`.
+
+A small op can be a **single file** (`<category>/<op_name>.py`) holding the
+`FunctionSpec` + inline `custom_op` + kernel (see `geometry/sdf.py`). Prefer the
+package layout once there's more than one backend or a kernel file.
+
+## Choosing the category
+
+Pick the existing category that fits; don't invent one without reason:
+
+```bash
+ls physicsnemo/nn/functional/                 # geometry, neighbors, interpolation, ...
+ls physicsnemo/nn/functional/<category>/      # see sibling ops for the pattern
+```
+
+- neighbor / search ops → `neighbors/`
+- geometry, SDF, sampling, voxelization → `geometry/`
+- resampling / gather-scatter interpolation → `interpolation/`
+
+## Re-exports (wire all the way up)
+
+After creating the op, export it at each level (verify the exact `__init__`
+contents in the live repo first):
+
+1. **op `__init__.py`** → `from .<op_name> import <OpClass>, <op_name>`
+2. **category `__init__.py`** (`<category>/__init__.py`) → re-export `<op_name>`
+   (and the class) and add to `__all__`.
+3. **`physicsnemo/nn/functional/__init__.py`** → re-export from the category so
+   users can `from physicsnemo.nn.functional import <op_name>`.
+
+Check whether siblings are also surfaced on `physicsnemo.nn` (`nn/__init__.py`);
+match the prevailing convention rather than guessing.
+
+## Tests mirror the source path
+
+```
+test/nn/functional/<category>/test_<op_name>.py
+```
+
+## Don't put these here (redirect)
+
+- A parameterized building block / `nn.Module` → `physicsnemo/nn/module/`
+  (→ `physicsnemo-model-builder`).
+- A loss or metric → `physicsnemo/metrics/`.
+- A datapipe / transform → `physicsnemo/datapipes/`.
+- "Which existing op should I use" → a usage question, not an authoring task.
diff --git a/skills/physicsnemo-functional-builder/references/scaffolds.md b/skills/physicsnemo-functional-builder/references/scaffolds.md
new file mode 100644
index 0000000000..0dd4ffb8d1
--- /dev/null
+++ b/skills/physicsnemo-functional-builder/references/scaffolds.md
@@ -0,0 +1,255 @@
+# Scaffolds — copy-paste skeletons
+
+Adapt names/shapes; **verify the `FunctionSpec` API and `check_version_spec`
+import path against the live repo** before trusting these verbatim. All files
+start with the SPDX Apache-2.0 header. Replace `myop` / `MyOp` / `<category>`.
+
+## Package layout
+
+```
+physicsnemo/nn/functional/<category>/myop/
+  __init__.py  myop.py  _torch_impl.py  utils.py
+  kernels.py  _warp_impl.py            # if Warp backend
+  _cuml_impl.py  _scipy_impl.py        # if those backends
+```
+
+## `utils.py` — shared validation
+
+```python
+from __future__ import annotations
+import torch
+
+def validate_inputs(points: torch.Tensor, num_samples: int) -> tuple[torch.Tensor, bool]:
+    """Validate and canonicalize inputs (skipped under torch.compile)."""
+    if not torch.compiler.is_compiling():
+        if points.ndim not in (2, 3):
+            raise ValueError(f"points must be rank 2 or 3, got {points.ndim}")
+        if num_samples <= 0:
+            raise ValueError(f"num_samples must be positive, got {num_samples}")
+    was_unbatched = points.ndim == 2
+    points_b = points.unsqueeze(0) if was_unbatched else points
+    return points_b, was_unbatched
+```
+
+## `_torch_impl.py` — the baseline
+
+```python
+from __future__ import annotations
+import torch
+from jaxtyping import Float, Int
+from .utils import validate_inputs
+
+def myop_torch(
+    points: Float[torch.Tensor, "*batch num_points dim"],
+    num_samples: int,
+    random_start: bool = False,
+) -> Int[torch.Tensor, "*batch num_samples"]:
+    points_b, was_unbatched = validate_inputs(points, num_samples)
+    # ... pure-torch reference algorithm (device-agnostic) ...
+    out = ...
+    return out.squeeze(0) if was_unbatched else out
+```
+
+## `kernels.py` — pure Warp (no torch import)
+
+```python
+import warp as wp
+
+@wp.kernel
+def myop_kernel(
+    points: wp.array3d(dtype=wp.float32),
+    out: wp.array2d(dtype=wp.int32),
+    num_points: wp.int32,
+    num_samples: wp.int32,
+    dim: wp.int32,
+):
+    b, t = wp.tid()
+    ...  # the algorithm; wp.tile_max / wp.tile_min for block reductions
+```
+
+## `_warp_impl.py` — Warp wrapped in `custom_op`
+
+```python
+from __future__ import annotations
+import torch
+import warp as wp
+from physicsnemo.core.function_spec import FunctionSpec
+from .kernels import myop_kernel
+from .utils import validate_inputs
+
+wp.init()
+
+@torch.library.custom_op("physicsnemo::myop_warp", mutates_args=())
+def myop(points: torch.Tensor, num_samples: int, random_start: bool = False) -> torch.Tensor:
+    points_b, was_unbatched = validate_inputs(points, num_samples)
+    if points_b.device.type != "cuda":
+        raise ValueError("The Warp myop backend requires CUDA tensors.")
+    points_f = points_b.detach().to(torch.float32).contiguous()
+    batch, num_points, dim = points_f.shape
+    out = torch.empty((batch, num_samples), dtype=torch.int32, device=points_f.device)
+    block = min(256, max(1, num_points))
+
+    wp_device, wp_stream = FunctionSpec.warp_launch_context(points_f)
+    with wp.ScopedStream(wp_stream):
+        wp.launch(
+            myop_kernel,
+            dim=(batch, block), block_dim=block,
+            inputs=[wp.from_torch(points_f, return_ctype=True),
+                    wp.from_torch(out, return_ctype=True),
+                    num_points, num_samples, dim],
+            device=wp_device, stream=wp_stream,
+        )
+    out = out.to(torch.int64)
+    return out.squeeze(0) if was_unbatched else out
+
+@myop.register_fake
+def _(points, num_samples, random_start=False):
+    if points.ndim == 2:
+        return torch.empty((num_samples,), dtype=torch.int64, device=points.device)
+    return torch.empty((points.shape[0], num_samples), dtype=torch.int64, device=points.device)
+```
+
+## `_cuml_impl.py` / `_scipy_impl.py` — dep-gated
+
+```python
+from __future__ import annotations
+import importlib
+import torch
+from physicsnemo.core.version_check import check_version_spec   # confirm path
+
+SCIPY_AVAILABLE = check_version_spec("scipy", "1.7.0", hard_fail=False)
+
+if SCIPY_AVAILABLE:
+    KDTree = importlib.import_module("scipy.spatial").KDTree
+
+    @torch.library.custom_op("physicsnemo::myop_scipy", mutates_args=())
+    def myop(points: torch.Tensor, queries: torch.Tensor, k: int = 3
+             ) -> tuple[torch.Tensor, torch.Tensor]:
+        if points.device.type != "cpu" or queries.device.type != "cpu":
+            raise ValueError(f"`myop` scipy requires CPU tensors, got {points.device=}, {queries.device=}")
+        restore = points.dtype
+        if restore == torch.bfloat16:
+            points, queries = points.float(), queries.float()
+        tree = KDTree(points.detach().numpy())
+        distance, indices = tree.query(queries.detach().numpy(), k=k)
+        indices = torch.from_numpy(indices).reshape(queries.shape[0], k)
+        distance = torch.from_numpy(distance).reshape(queries.shape[0], k)
+        return indices, distance.to(restore)  # SciPy returns float64; cast to match register_fake
+
+    @myop.register_fake
+    def _(points, queries, k=3):
+        return (torch.empty(queries.shape[0], k, device=queries.device, dtype=torch.int64),
+                torch.empty(queries.shape[0], k, device=queries.device, dtype=queries.dtype))
+else:
+    def myop(*args, **kwargs):
+        raise ImportError("physicsnemo myop: scipy is not installed.")
+```
+
+(cuML mirrors this: gate on `cuml`+`cupy`, check `device.type != "cuda"`, move via
+`cp.from_dlpack` / `torch.from_dlpack`.)
+
+## `myop.py` — the `FunctionSpec`
+
+```python
+from __future__ import annotations
+import torch
+from jaxtyping import Float, Int
+from physicsnemo.core.function_spec import FunctionSpec
+from ._torch_impl import myop_torch
+from ._warp_impl import myop as myop_warp        # if Warp
+
+class MyOp(FunctionSpec):
+    r"""One-line contract.
+
+    Parameters
+    ----------
+    points : Float[torch.Tensor, "*batch num_points dim"]
+        ...
+    Returns
+    -------
+    Int[torch.Tensor, "*batch num_samples"]
+        ...
+    Raises
+    ------
+    ValueError
+        If ...
+    """
+
+    @FunctionSpec.register(name="warp", required_imports=("warp>=0.6.0",), rank=0)
+    def warp_forward(points, num_samples, random_start=False):
+        return myop_warp(points, num_samples, random_start)
+
+    @FunctionSpec.register(name="torch", rank=1, baseline=True)
+    def torch_forward(points, num_samples, random_start=False):
+        return myop_torch(points, num_samples, random_start)
+
+    @classmethod
+    def dispatch(cls, points, num_samples, random_start=False, implementation=None):
+        impls = cls._get_impls()
+        cls._check_impl(implementation, impls)
+        if implementation is not None:
+            impl = impls[implementation]
+            if not impl.available:
+                raise ImportError(f"Implementation '{implementation}' is not available.")
+            return impl.func(points, num_samples, random_start)
+        warp_impl = impls.get("warp")
+        if points.is_cuda and warp_impl is not None and warp_impl.available:
+            return warp_impl.func(points, num_samples, random_start)
+        return impls["torch"].func(points, num_samples, random_start)
+
+    @classmethod
+    def make_inputs_forward(cls, device="cpu"):
+        device = torch.device(device)
+        for label, n, d, k in (("small", 256, 3, 16), ("large", 4096, 3, 256)):
+            yield (label, (torch.rand(n, d, device=device), k), {})
+
+    @classmethod
+    def compare_forward(cls, output, reference):
+        torch.testing.assert_close(output.sort(dim=-1).values, reference.sort(dim=-1).values)
+
+myop = MyOp.make_function("myop")
+```
+
+## `__init__.py` (op package) + re-exports
+
+```python
+# physicsnemo/nn/functional/<category>/myop/__init__.py
+from .myop import MyOp, myop
+__all__ = ["MyOp", "myop"]
+```
+
+Then add to `<category>/__init__.py` and `physicsnemo/nn/functional/__init__.py`
+(re-export `myop` + `MyOp`, extend `__all__`) — match the existing style there.
+
+## Test module
+
+```python
+# test/nn/functional/<category>/test_myop.py
+import pytest, torch
+from physicsnemo.nn.functional import myop
+from physicsnemo.nn.functional.<category>.myop.myop import MyOp
+from physicsnemo.core.version_check import check_version_spec
+
+@pytest.mark.parametrize("implementation", ["torch", "warp"])
+def test_myop_known_answer(device, implementation):
+    if implementation == "warp" and "cpu" in device:
+        pytest.skip("warp backend is CUDA-only")
+    points = ...  # deterministic input with a known result
+    out = myop(points, 3, implementation=implementation)
+    assert out.tolist() == [...]
+
+def test_myop_backend_parity(device):
+    if "cpu" in device:
+        pytest.skip("warp backend is CUDA-only")
+    points = ...  # tie-free, well-separated
+    MyOp.compare_forward(
+        myop(points, 40, implementation="warp"),
+        myop(points, 40, implementation="torch"),
+    )
+
+def test_myop_opcheck(device):
+    if "cpu" in device:
+        pytest.skip("warp backend is CUDA-only")
+    from physicsnemo.nn.functional.<category>.myop._warp_impl import myop as myop_warp_op
+    torch.library.opcheck(myop_warp_op, args=(..., 8), kwargs={"random_start": False})
+```
diff --git a/skills/physicsnemo-functional-builder/references/testing.md b/skills/physicsnemo-functional-builder/references/testing.md
new file mode 100644
index 0000000000..3a600c62f7
--- /dev/null
+++ b/skills/physicsnemo-functional-builder/references/testing.md
@@ -0,0 +1,111 @@
+# Testing — cross-backend equivalence
+
+Functionals have **no checkpoint round-trip** to test. The job instead is: every
+backend produces the right answer, agrees with the torch baseline, and skips
+cleanly when its device/dependency is absent. Tests mirror the source path:
+`test/nn/functional/<category>/test_<op_name>.py`.
+
+## The four test kinds
+
+### 1. Known-answer (backend-independent truth)
+
+Construct an input whose result is analytically known, and assert it per backend.
+
+```python
+@pytest.mark.parametrize("implementation", ["torch", "warp"])
+def test_fps_known_answer_collinear(device, implementation):
+    if implementation == "warp" and "cpu" in device:
+        pytest.skip("warp FPS backend is CUDA-only")
+    m = 9
+    xs = torch.arange(m, dtype=torch.float32, device=device).reshape(m, 1)
+    points = torch.cat([xs, torch.zeros(m, 2, device=device)], dim=1)
+    idx = farthest_point_sampling(points, 3, implementation=implementation)
+    assert idx.tolist() == [0, m - 1, (m - 1) // 2]
+```
+
+### 2. Output-contract checks
+
+Shape, dtype, value ranges, and ordering invariants — run on the torch backend
+across dtypes/`k`.
+
+```python
+def _assert_knn_outputs(points, queries, indices, distances, k):
+    assert indices.shape == (queries.shape[0], k)
+    assert (indices >= 0).all() and (indices < points.shape[0]).all()
+    assert (distances >= 0).all()
+    assert torch.all(distances[:, 1:] >= distances[:, :-1])   # sorted
+```
+
+### 3. Backend parity (the important one) — use `compare_forward`
+
+Run the accelerated backend and the torch baseline on the same input and compare
+via the op's `compare_forward` hook, which encodes **tie-invariance**:
+
+```python
+def test_knn_backend_forward_parity(device):
+    points = torch.randn(53, 3, device=device)
+    queries = torch.randn(21, 3, device=device)
+    k = 5
+    if "cuda" in device:
+        if not check_version_spec("cuml", "26.2.0", hard_fail=False):
+            pytest.skip("cuml not available")
+        out_a = knn(points, queries, k, implementation="cuml")
+    else:
+        if not check_version_spec("scipy", "1.7.0", hard_fail=False):
+            pytest.skip("scipy not available")
+        out_a = knn(points, queries, k, implementation="scipy")
+    out_b = knn(points, queries, k, implementation="torch")
+    KNN.compare_forward(out_a, out_b)
+```
+
+> **The classic trap:** equal-distance neighbors are ordered differently across
+> backends, so **never compare neighbor indices directly** — compare **sorted
+> distances**. That logic belongs in `compare_forward` so every parity test
+> inherits it:
+> ```python
+> @classmethod
+> def compare_forward(cls, output, reference):
+>     _, distances = output
+>     _, ref_distances = reference
+>     torch.testing.assert_close(
+>         torch.sort(distances, dim=1)[0], torch.sort(ref_distances, dim=1)[0],
+>         atol=1e-5, rtol=1e-5)
+> ```
+> For sampling ops (FPS), sort the selected indices before comparing
+> (set-equality, order-invariant).
+
+### 4. `opcheck` for each `custom_op` backend
+
+Validates the custom-op schema, `register_fake`, and autograd plumbing:
+
+```python
+def test_fps_opcheck(device):
+    if "cpu" in device:
+        pytest.skip("warp FPS backend is CUDA-only")
+    from physicsnemo.nn.functional.geometry.farthest_point_sampling._warp_impl import (
+        farthest_point_sampling as fps_warp_op,
+    )
+    points = _well_separated_cloud(device, n=40)
+    torch.library.opcheck(fps_warp_op, args=(points, 8), kwargs={"random_start": False})
+```
+
+## Skips — device and dependency
+
+- **Device:** the Warp and cuML backends are CUDA-only → `if "cpu" in device: pytest.skip(...)`.
+  The SciPy backend is CPU-only → skip on CUDA.
+- **Optional dep:** `if not check_version_spec(pkg, ver, hard_fail=False): pytest.skip(...)`.
+- Use the repo's `device` fixture (it parametrizes/serves cpu+cuda and auto-skips
+  CUDA when unavailable) rather than hard-coding a device.
+
+## Determinism
+
+Build inputs deterministically (fixed grids, seeded `torch.rand`, well-separated
+clouds so there are no ties) — parity tests must not be flaky. Tie-free inputs
+also let you compare indices directly when you want a stricter check.
+
+## Run
+
+```bash
+pytest test/nn/functional/<category>/test_<op_name>.py -q
+# CPU-only host: Warp/cuML tests self-skip; SciPy/torch still exercise parity.
+```
diff --git a/skills/physicsnemo-functional-builder/skill-card.md b/skills/physicsnemo-functional-builder/skill-card.md
new file mode 100644
index 0000000000..3137e6df9a
--- /dev/null
+++ b/skills/physicsnemo-functional-builder/skill-card.md
@@ -0,0 +1,83 @@
+## Description: <br>
+Official NVIDIA-authored workflow for adding a new functional op (or a new optimized backend for an existing op) to `physicsnemo.nn.functional`. Scaffolds a `FunctionSpec` with multi-backend dispatch (a torch reference plus optional Warp, cuML, or SciPy backends), wires re-exports, writes cross-backend equivalence tests, and runs the local CI gates. <br>
+
+This skill is ready for commercial/non-commercial use. <br>
+
+## Owner
+NVIDIA <br>
+
+### License/Terms of Use: <br>
+Apache-2.0 <br>
+## Use Case: <br>
+Contributors and researchers adding a new functional op to the PhysicsNeMo package, or adding an accelerated backend (Warp CUDA kernels, cuML, or SciPy) to an existing op, so it follows the repository's functional conventions (`FunctionSpec` registration and dispatch, `torch.library.custom_op` wrapping, placement, typing, validation, cross-backend tests) and passes CI. <br>
+
+### Deployment Geography for Use: <br>
+Global <br>
+
+## Known Risks and Mitigations: <br>
+Risk: The skill scaffolds and edits source files, including GPU kernels; generated code could be incorrect, incomplete, or place files in the wrong location if the live repository structure differs from assumptions. <br>
+Mitigation: The skill reads `physicsnemo/core/function_spec.py` and an existing exemplar op before scaffolding, verifies paths against the live repo, runs the CI gates (ruff, interrogate, pytest) and cross-backend equivalence tests, and runs an independent code-review pass over the diff before completion. It defers the op's algorithm and any hand-written kernel to the human. Review the diff, the cross-backend parity tests, and the CI result before merging. <br>
+
+## Reference(s): <br>
+- [placement.md](references/placement.md) <br>
+- [dispatch.md](references/dispatch.md) <br>
+- [backends.md](references/backends.md) <br>
+- [testing.md](references/testing.md) <br>
+- [scaffolds.md](references/scaffolds.md) <br>
+- [lessons.md](references/lessons.md) <br>
+- [PhysicsNeMo GitHub Repository](https://github.com/NVIDIA/physicsnemo) <br>
+
+
+## Skill Output: <br>
+**Output Type(s):** [Code scaffolding, File edits, Analysis] <br>
+**Output Format:** [Python, Markdown] <br>
+**Output Parameters:** [N/A] <br>
+**Other Properties Related to Output:** [Generated files are standards-compliant skeletons completed by the contributor; the skill does not author the op's algorithm or hand-written kernels.] <br>
+
+## Evaluation Agents Used: <br>
+- Claude Code (`claude-code`) <br>
+- Codex (`codex`) <br>
+
+
+
+## Evaluation Tasks: <br>
+Evaluation dataset: 4 internal tasks (2 positive skill-activation, 2 negative), 2 attempts per task, run via NVSkills-Eval. <br>
+
+## Evaluation Metrics Used: <br>
+Reported benchmark dimensions: <br>
+- Security: Checks whether skill-assisted execution avoids unsafe behavior such as secret leakage, destructive commands, or unauthorized access. <br>
+- Correctness: Checks whether the agent follows the expected workflow and produces the correct final output. <br>
+- Discoverability: Checks whether the agent loads the skill when relevant and avoids using it when irrelevant. <br>
+- Effectiveness: Checks whether the agent performs measurably better with the skill than without it. <br>
+- Efficiency: Checks whether the agent uses fewer tokens and avoids redundant work. <br>
+
+Underlying evaluation signals used in this run: <br>
+- `security`: Checks for unsafe operations, secret leakage, and unauthorized access. <br>
+- `skill_execution`: Verifies that the agent loaded the expected skill and workflow. <br>
+- `skill_efficiency`: Checks routing quality, decoy avoidance, and redundant tool usage. <br>
+- `accuracy`: Grades final-answer correctness against the reference answer. <br>
+- `goal_accuracy`: Checks whether the overall user task completed successfully. <br>
+- `behavior_check`: Verifies expected behavior steps, including safety expectations. <br>
+- `token_efficiency`: Compares token usage with and without the skill. <br>
+
+
+
+## Evaluation Results: <br>
+_Pending — populated by NVSkills-Eval prior to publication (see `BENCHMARK.md`)._ <br>
+
+| Dimension | Num | `claude-code` | `codex` |
+|---|---:|---:|---:|
+| Security | — | — | — |
+| Correctness | — | — | — |
+| Discoverability | — | — | — |
+| Effectiveness | — | — | — |
+| Efficiency | — | — | — |
+
+## Skill Version(s): <br>
+0.1.0 (source: pyproject.toml) <br>
+
+## Ethical Considerations: <br>
+NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal team to ensure this skill meets requirements for the relevant industry and use case and addresses unforeseen product misuse. <br>
+
+(For Release on NVIDIA Platforms Only) <br>
+Please report quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail). <br>