-
Notifications
You must be signed in to change notification settings - Fork 718
add model builder skill #1745
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
mnabian
wants to merge
8
commits into
NVIDIA:main
Choose a base branch
from
mnabian:model-builder-skill
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
add model builder skill #1745
Changes from all commits
Commits
Show all changes
8 commits
Select commit
Hold shift + click to select a range
f16f4f3
add model builder skill
mnabian 8d35b52
minor revisions
mnabian 7dd79a2
Merge branch 'main' into model-builder-skill
mnabian d205c7f
apply iter 1 eval comments - defalt to action, mandatory test helpers
mnabian 2703d9c
Merge branch 'model-builder-skill' of https://github.com/mnabian/phys…
mnabian ed30c40
repo instructions
mnabian 136fcfa
apply iter 2 eval comments - when not to use the skill, doneness crit…
mnabian b854abf
improved repo handling
mnabian File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,87 @@ | ||
| # Evaluation Report | ||
|
|
||
| Evaluation of the `physicsnemo-model-builder` skill before publication through NVSkills-Eval. | ||
|
|
||
| This benchmark summarizes 3-Tier Evaluation from NVSkills-Eval results for the skill. The goal is to document whether the skill is safe, discoverable, effective, and useful for agents before it is published for broader workflow use. | ||
|
|
||
| > **Status: pending.** The results, Tier-1/Tier-2 findings, and verdict below are | ||
| > populated by an NVSkills-Eval run prior to publication. The evaluation dataset | ||
| > (`evals/evals.json`) and target agents are committed; run the harness and | ||
| > refresh this file before publishing. | ||
|
|
||
| ## Evaluation Summary | ||
|
|
||
| - Skill: `physicsnemo-model-builder` | ||
| - Evaluation date: _pending_ | ||
| - NVSkills-Eval profile: `external` | ||
| - Environment: `local` | ||
| - Dataset: 4 evaluation tasks (`evals/evals.json`) | ||
| - Attempts per task: 2 | ||
| - Pass threshold: 50% | ||
| - Overall verdict: _pending_ | ||
|
|
||
| ## Agents Used | ||
|
|
||
| - `claude-code` | ||
| - `codex` | ||
|
|
||
| ## Metrics Used | ||
|
|
||
| Reported benchmark dimensions: | ||
|
|
||
| - Security: checks whether skill-assisted execution avoids unsafe behavior such as secret leakage, destructive commands, or unauthorized access. | ||
| - Correctness: checks whether the agent follows the expected workflow and produces the correct final output. | ||
| - Discoverability: checks whether the agent loads the skill when relevant and avoids using it when irrelevant. | ||
| - Effectiveness: checks whether the agent performs measurably better with the skill than without it. | ||
| - Efficiency: checks whether the agent uses fewer tokens and avoids redundant work. | ||
|
|
||
| Underlying evaluation signals used in this run: | ||
|
|
||
| - `security` (Security): checks for unsafe operations, secret leakage, and unauthorized access. | ||
| - `skill_execution` (Skill Execution): verifies that the agent loaded the expected skill and workflow. | ||
| - `skill_efficiency` (Efficiency): checks routing quality, decoy avoidance, and redundant tool usage. | ||
| - `accuracy` (Accuracy): grades final-answer correctness against the reference answer. | ||
| - `goal_accuracy` (Goal Accuracy): checks whether the overall user task completed successfully. | ||
| - `behavior_check` (Behavior Check): verifies expected behavior steps, including safety expectations. | ||
| - `token_efficiency` (Token Efficiency): compares token usage with and without the skill. | ||
|
|
||
| ## Test Tasks | ||
|
|
||
| The benchmark dataset contained 4 evaluation tasks: | ||
|
|
||
| - Positive tasks: 2 tasks where the skill was expected to activate (add a new model from scratch; wrap an external PyTorch model). | ||
| - Negative tasks: 2 tasks where the model-builder skill was not expected (a model-selection/discovery question that belongs to `physicsnemo-discover`; an out-of-scope datapipe request). | ||
| - Unlabeled tasks: 0. | ||
|
|
||
| Entries with `expected_skill` set are treated as positive skill-activation cases; entries with `expected_skill: null` are treated as negative activation cases. | ||
|
|
||
| ## Results | ||
|
|
||
| _Pending NVSkills-Eval run._ | ||
|
|
||
| | Dimension | Num | `claude-code` | `codex` | | ||
| |---|---:|---:|---:| | ||
| | Security | — | — | — | | ||
| | Correctness | — | — | — | | ||
| | Discoverability | — | — | — | | ||
| | Effectiveness | — | — | — | | ||
| | Efficiency | — | — | — | | ||
|
|
||
| Score values show skill-assisted performance. Values in parentheses show uplift versus the no-skill baseline when baseline data is available. | ||
|
|
||
| ## Tier 1: Static Validation Summary | ||
|
|
||
| _Pending NVSkills-Eval run._ | ||
|
|
||
| ## Tier 2: Deduplication Summary | ||
|
|
||
| _Pending NVSkills-Eval run._ Note: this skill is intentionally distinct from | ||
| `physicsnemo-discover` (authoring/porting models vs. selecting existing ones); | ||
| the negative eval tasks guard that routing boundary. | ||
|
|
||
| ## Publication Recommendation | ||
|
|
||
| _Pending NVSkills-Eval run._ Refresh this file with the harness output (results | ||
| table, Tier-1/Tier-2 findings, verdict) before publishing, and keep it with the | ||
| skill; re-run when the evaluation dataset, skill behavior, or target agents | ||
| materially change. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,223 @@ | ||
| --- | ||
| name: physicsnemo-model-builder | ||
| description: Official NVIDIA-authored workflow for adding a new model or reusable layer to PhysicsNeMo, or integrating an existing PyTorch model. Scaffolds a standards-compliant physicsnemo.Module (or a Module.from_torch wrapper for an external nn.Module), places it correctly, wires exports, writes tests against the house test helpers, and runs the local CI gates (ruff, interrogate, pytest). Use when a contributor wants to add or port a model or layer into the physicsnemo package. Do NOT use for datapipes, nn.functional ops/backends such as FunctionSpec, losses or metrics, training-recipe or example authoring, environment/installation setup, or merely deciding which existing model fits a task (use physicsnemo-discover for that). | ||
| license: Apache-2.0 | ||
| metadata: | ||
| author: NVIDIA <agent-skills@nvidia.com> | ||
| tags: | ||
| - physicsnemo | ||
| - models | ||
| - contributing | ||
| - scaffolding | ||
| - integration | ||
| --- | ||
|
|
||
| # PhysicsNeMo Model Builder | ||
|
|
||
| Drive a contributor from "I have a model (or layer, or an existing PyTorch | ||
| module)" to a standards-compliant, tested, CI-green addition to the | ||
| `physicsnemo` package. **You do the mechanical work** — placement, the | ||
| `physicsnemo.Module` shell, serialization wiring, docstrings, type | ||
| annotations, validation, exports, tests, and gates. **The contributor brings | ||
| the architecture** — the novel `forward` math. Keep that division explicit: | ||
| never invent their model; scaffold everything around it. | ||
|
|
||
| The audience is a researcher fluent in PyTorch but new to PhysicsNeMo, so | ||
| **explain the "why"** at each step (name the rule, give the reason) rather | ||
| than silently emitting files. | ||
|
|
||
| ## When NOT to use this skill | ||
|
|
||
| Stop and **redirect — do not activate** — when the request is to *pick*, *use*, | ||
| or *configure* something that already exists, or targets another surface: | ||
|
|
||
| - **"Which existing model should I use / which fits my data?"** — selection and | ||
| discovery, not authoring → `physicsnemo-discover`. | ||
| - **Datapipes** (`physicsnemo/datapipes/`), **losses or metrics** | ||
| (`physicsnemo/metrics/`), **functional ops / backends** | ||
| (`physicsnemo/nn/functional/`, `FunctionSpec`), **training recipes / examples** | ||
| (`examples/`). | ||
|
|
||
| This skill is only for **authoring or porting** a model or reusable layer into | ||
| the `physicsnemo` package. | ||
|
|
||
| ## Core principle | ||
|
|
||
| 1. **The written standards are ground truth — read them, don't paraphrase from | ||
| memory.** The authoritative rules live in `CODING_STANDARDS/` at the repo | ||
| root: `MODELS_IMPLEMENTATION.md` (rules `MOD-***`) and | ||
| `EXTERNAL_IMPORTS.md` (rules `EXT-***`). Open the cited rule before relying | ||
| on it; reference it by ID when you justify a decision. They evolve — a rule | ||
| recalled from memory may be stale. | ||
| 2. **Reuse before you build — discover, don't reinvent.** Half of a clean | ||
| integration is *not* writing code that already exists. Before scaffolding a | ||
| `forward`, enumerate what `physicsnemo.nn` already provides and tell the | ||
| contributor what to import (`references/reuse_map.md`). | ||
| 3. **Verify every path before you cite it.** Glob/Read the live repo; a path | ||
| recalled from memory or pattern-matched from a neighbor is disproof — drop | ||
| it. | ||
|
|
||
| ## Repo root resolution | ||
|
|
||
| Resolve the PhysicsNeMo repo root **first** (see `CONTRIBUTING.md §Repo root | ||
| resolution`); all `CODING_STANDARDS/…` and `physicsnemo/…` paths are rooted | ||
| there, and scaffolded files are written under it. **If no local clone is on the | ||
| path** (e.g. headless against the skills repo in an eval), shallow-clone the | ||
| canonical repo once and anchor to it — read its existing tree read-only for | ||
| standards / reuse / path verification, write new files under it: | ||
| `DEST="${TMPDIR:-/tmp}/physicsnemo-src"; [ -d "$DEST/physicsnemo" ] || git clone --depth 1 https://github.com/NVIDIA/physicsnemo "$DEST"`. | ||
| Use that URL verbatim; never interpolate one from user input. | ||
|
|
||
| ## Scope | ||
|
|
||
| In scope: **complete models** (`physicsnemo/experimental/models/`), **reusable | ||
| layers** (`physicsnemo/nn/module/`), and **wrapping an existing PyTorch | ||
| `nn.Module`** via `Module.from_torch`. | ||
|
|
||
| Out of scope — stop and redirect: datapipes (`physicsnemo/datapipes/`), | ||
| functional ops / custom backends (`physicsnemo/nn/functional/`, `FunctionSpec`), | ||
| losses & metrics (`physicsnemo/metrics/`), training recipes (`examples/`), and | ||
| "which model should I use" (→ `physicsnemo-discover`). | ||
|
|
||
| ## Workflow | ||
|
|
||
| Run in order, **after resolving the repo root** (§Repo root resolution). | ||
| Confirm the consequential choices with the contributor (artifact type, | ||
| placement, external-wrap vs from-scratch); scaffold the rest. | ||
|
|
||
| **Default to action.** When the repository is available, *create the actual | ||
| files* — `__init__.py`, `<name>.py`, and the test module — with a placeholder | ||
| `forward` that raises `NotImplementedError` (marked `# TODO: contributor's | ||
| forward math`), rather than only describing them in prose. Ask only the | ||
| questions that genuinely block placement; then produce files and iterate. The | ||
| skill's value is the working scaffold on disk, not a description of one. | ||
|
|
||
| ### 1. Intake & classify | ||
|
|
||
| Ask only what you can't infer (≤4 questions). Resolve: | ||
|
|
||
| - **Artifact type:** complete *model*, reusable *layer*, or *wrap* an existing | ||
| PyTorch module? (Decision tree + rationale: `references/placement.md`.) | ||
| - **Identity:** class name (PascalCase), one-line purpose, the forward | ||
| inputs/outputs and their tensor shapes, heavy deps. | ||
| - **For wrap:** the import path of their `nn.Module`, and whether its | ||
| `__init__` args are JSON-serializable — this picks the serialization path | ||
| (`references/serialization.md`). | ||
|
|
||
| ### 2. Place it (and say why) | ||
|
|
||
| - New **model** → `physicsnemo/experimental/models/<name>/` (`MOD-002a`: new | ||
| models start in `experimental`). Layout: `__init__.py` (re-exports) + | ||
| `<name>.py`. New **layer** → `physicsnemo/nn/module/<name>.py`, re-exported | ||
| from both `physicsnemo/nn/module/__init__.py` and `physicsnemo/nn/__init__.py` | ||
| (`MOD-000a`). Tests mirror the source path under `test/`. | ||
| - State the rule ID and the reason (experimental = API may change; layers are | ||
| shared building blocks). | ||
|
|
||
| ### 3. Reuse audit | ||
|
|
||
| Before writing `forward`, enumerate existing primitives the contributor would | ||
| otherwise reinvent — attention bases, embeddings, `Mlp`, neighbor ops | ||
| (`knn`, `radius_search`), the TE-aware `LayerNorm`. Use the live search | ||
| patterns in `references/reuse_map.md`; verify each path before citing it. Say | ||
| explicitly "import `X` from `physicsnemo.nn` instead of writing your own." Keep | ||
| genuinely novel, model-specific pieces local to the model. | ||
|
|
||
| ### 4. Scaffold the shell | ||
|
|
||
| Generate from the skeletons in `references/scaffolds.md`, adapting to the | ||
| contributor's shapes; explain what each enforced piece is for. | ||
|
|
||
| - **New model / layer:** subclass **`physicsnemo.Module`** (not | ||
| `torch.nn.Module` — `MOD-001`); a `ModelMetaData`; a **constructor taking | ||
| JSON-serializable config** (no splatted `**kwargs` — `MOD-010`; no | ||
| string-based class selection — `MOD-009`); a `forward` with jaxtyping on | ||
| every tensor arg (`MOD-006`), `if not torch.compiler.is_compiling():` shape | ||
| validation (`MOD-005`), and NumPy `r"""` docstrings with | ||
| `Parameters`/`Forward`/`Outputs` sections and `:math:` shapes (`MOD-003`). | ||
| Imports upward-only (`EXT-***`). | ||
| - **Wrap external:** `Module.from_torch(TheirModule, meta=...)`. **The | ||
| serialization gotcha lives here** — `physicsnemo.Module` save/from_checkpoint | ||
| requires `__init__` args to be JSON-serializable; nested `nn.Module` args | ||
| must each be converted via `Module.from_torch`. Walk them through | ||
| `references/serialization.md`, then prove it with the round-trip test below. | ||
|
|
||
| ### 5. Tests | ||
|
|
||
| Generate the test module from `references/scaffolds.md`: class-per-public-class, | ||
| the `device` fixture, parametrized constructor/attribute checks (≥2 configs — | ||
| `MOD-008a`), `validate_forward_accuracy` for non-regression (`MOD-008b`), and | ||
| `validate_checkpoint` for the save/load round-trip (`MOD-008c`). These helpers | ||
| are **mandatory and come from `test.common`** — write the import explicitly in | ||
| the generated test and name them in your summary; don't hand-roll what they | ||
| provide: | ||
|
|
||
| ```python | ||
| from test.common import validate_forward_accuracy, validate_checkpoint | ||
| ``` | ||
|
|
||
| ### 6. Gates | ||
|
|
||
| From the repo root, run and iterate to green (explain each): | ||
|
|
||
| ``` | ||
| make lint # ruff format --check + ruff check | ||
| make interrogate # docstring coverage | ||
| make pytest # or: pytest test/<mirrored/path> -q | ||
| ``` | ||
|
|
||
| `physicsnemo/experimental/` is exempt from ruff/interrogate, but **not** from | ||
| runtime contracts — the serialization round-trip test must still pass there. | ||
|
|
||
| ### 7. Finish & review | ||
|
|
||
| - Add a one-line `CHANGELOG.md` entry and SPDX Apache-2.0 headers to new files; | ||
| remind the contributor commits need `-s` (sign-off). | ||
| - Do an independent **code-review pass over the diff** before opening the PR — | ||
| re-check it against the standards (`MOD-***`/`EXT-***`), correctness, and the | ||
| reuse audit, ideally with fresh eyes (a separate review session/agent). If the | ||
| host agent offers a built-in code-review command (for example Claude Code's | ||
| `/code-review`), use it; otherwise review the diff directly. Then open the PR | ||
| — CODEOWNERS review + CI re-run the gates. | ||
|
|
||
| ### 8. Definition of done | ||
|
|
||
| Confirm each before declaring success; fix any miss before finishing: | ||
|
|
||
| - [ ] Repo root resolved; every cited path verified to exist (no memory/guesses). | ||
| - [ ] Placed right: model → `experimental/models/` (`MOD-002a`); layer → | ||
| `nn/module/` + both `__init__` re-exports (`MOD-000a`). | ||
| - [ ] Subclasses `physicsnemo.Module` with a `ModelMetaData` (`MOD-001`); | ||
| `__init__` is JSON-serializable — no splatted `**kwargs` (`MOD-010`), no | ||
| string class selection (`MOD-009`). | ||
| - [ ] `forward` has jaxtyping on every tensor arg (`MOD-006`) + | ||
| `is_compiling()`-guarded shape validation (`MOD-005`); NumPy `r"""` docstrings | ||
| with `:math:` shapes (`MOD-003`). | ||
| - [ ] Reuse audit done — nothing reimplemented that `physicsnemo.nn` provides. | ||
| - [ ] Tests use `test.common`: `validate_forward_accuracy` (`MOD-008b`) + | ||
| `validate_checkpoint` (`MOD-008c`), ≥2 constructor configs (`MOD-008a`). | ||
| - [ ] Gates green (`make lint`, `make interrogate`, `make pytest`); | ||
| `CHANGELOG.md` entry + SPDX headers added. | ||
|
|
||
| ## Common gotchas | ||
|
|
||
| Surface the relevant traps inline as you scaffold (full catalogue: | ||
| `references/lessons.md`): | ||
|
|
||
| - **`Module` serialization** is a common external-integration failure: raw | ||
| `nn.Module` submodule args break `from_checkpoint` (`references/serialization.md`). | ||
| - The **TE-aware `LayerNorm`** runs only on CUDA when Transformer Engine is | ||
| present; tests must skip the CPU case under TE. | ||
| - **`experimental/` skips lint, not runtime contracts.** | ||
| - Promote a model-specific layer to `physicsnemo.nn` only when a **second** | ||
| consumer appears — keep it local until then. | ||
|
|
||
| ## Related resources | ||
|
|
||
| - `references/placement.md` — artifact decision tree and where each kind goes. | ||
| - `references/reuse_map.md` — live search patterns for existing primitives. | ||
| - `references/serialization.md` — `physicsnemo.Module`, JSON args, `from_torch`. | ||
| - `references/scaffolds.md` — model / layer / external-wrap / test skeletons. | ||
| - `references/lessons.md` — gotchas distilled from real integrations. | ||
| - `CODING_STANDARDS/MODELS_IMPLEMENTATION.md`, `EXTERNAL_IMPORTS.md` — the | ||
| authoritative rules; read the cited rule before relying on it. | ||
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mnabian Something that can come up at this step is "layer X from
phyiscsnemo.nncan help me but it needs a minor change/extension Y -- do I make that change to X or should I make a new module to support my needs?". For example this is an important design decision point in nearly every Earth-2 NVR model integration I've done, and unfortunately it can end up quite subjective on what the best path is.Should we add some language here guiding that decision making (or redirect decision to user, possibly with some agent-generated proposals), or is this skill intended to be used strictly without making changes to the core library other than adding the new model?