Optimize the VC-SGC MC trial loop in the NEP path by dingxu2016 · Pull Request #1455 · brucefan1983/GPUMD

dingxu2016 · 2026-04-20T13:54:03Z

Summary

This PR adds two exact GPU-side optimizations to the VC-SGC MC trial loop in the NEP path. On an RTX 4090 with binary-alloy VC-SGC NEP4 benchmarks, the combined patch improves cycle time by about 16% to 20% over the pre-optimization baseline, with no change to the VC-SGC acceptance rule.

Modification

Fuse before/after local NEP energy evaluation into one dual-state path and dispatch the angular contribution by L_max.
Build one global-shell neighbor list at the start of each compute() call and reuse it when constructing local inputs, replacing the previous per-trial all-atom scan.

Combined speedup vs the pre-optimization baseline:

| Case | Cycle speedup |
| --- | --- |
| `N=4000, T=2000` | `+19.42%` |
| `N=4000, T=1500` | `+20.29%` |
| `N=4000, T=900` | `+16.32%` |
| `N=10000, T=2000, mc_trials=4000` | `+16.42%` |

Others

No new external dependencies are introduced.
Both changes are exact and keep the VC-SGC acceptance rule unchanged.
Fixed-seed checks matched total Delta E and accept/reject behavior over the first 128 MC trials.
For the input-construction change alone, paired RTX 4090 runs improved cycle time by about +6.81% to +9.33% against the same dual-state + L_max path without global-shell input reuse.
nsys shows the main benefit comes from local-input construction, while the dual-state NEP kernel stays broadly flat.
The global-shell path is rebuilt once per compute() call to avoid persistent cross-call cache invalidation logic and keep the upstream patch reviewable.
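
The fixed-seed check described above can be sketched as a small harness that replays the same trial sequence through two energy paths and reports the first trial at which they disagree. This is a minimal illustration, not the code used in the PR: `run_trials`, `compare_paths`, the trial-atom range, the tolerance, and the plain Metropolis test are all assumptions. In particular, the acceptance test is a stand-in and is not the VC-SGC acceptance rule.

```python
import math
import random


def run_trials(delta_e_fn, n_trials, seed, beta=1.0):
    """Run n_trials MC trials and return a list of (delta_e, accepted)."""
    rng = random.Random(seed)  # same seed -> same trial sequence
    log = []
    for _ in range(n_trials):
        i = rng.randrange(100)  # hypothetical trial-atom index
        delta_e = delta_e_fn(i)
        # Stand-in Metropolis test (assumption); NOT the VC-SGC rule.
        accepted = rng.random() < math.exp(min(0.0, -beta * delta_e))
        log.append((delta_e, accepted))
    return log


def compare_paths(baseline_fn, optimized_fn, n_trials=128, seed=12345, tol=1e-6):
    """Return the index of the first mismatching trial, or None if all match."""
    ref = run_trials(baseline_fn, n_trials, seed)
    new = run_trials(optimized_fn, n_trials, seed)
    for k, ((de_ref, acc_ref), (de_new, acc_new)) in enumerate(zip(ref, new)):
        if abs(de_ref - de_new) > tol or acc_ref != acc_new:
            return k
    return None
```

Returning the index of the first mismatch, rather than a bare pass/fail flag, makes it easier to locate the trial where two paths diverge.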

Evaluate before/after local energies in one dual-state pass so shared descriptor work is reused across MC trial states. Dispatch the dual kernel on L_max and use energy-only ANN evaluation to cut angular descriptor footprint without changing the acceptance rule.
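
The idea of a dual-state pass can be illustrated on a toy pair model, sketched here in Python rather than as the actual CUDA kernel. The assumption is that a trial changes only the species of the central atom, so the geometric work (distances and the cutoff function) is the same in both states and only the type-dependent coefficients differ. The names `cutoff`, `energy_single`, `energy_dual`, and the coefficient table `coeff` are hypothetical, and the toy energy is not the NEP descriptor or ANN.

```python
import math


def cutoff(r, rc):
    """Smooth cosine cutoff that goes to zero at rc."""
    return 0.5 * (1.0 + math.cos(math.pi * r / rc)) if r < rc else 0.0


def energy_single(type_i, neighbors, coeff, rc):
    """Baseline: evaluate one state per call (called twice per trial)."""
    return sum(coeff[type_i][t_j] * cutoff(r, rc) for r, t_j in neighbors)


def energy_dual(type_before, type_after, neighbors, coeff, rc):
    """Evaluate both trial states in one pass over the neighbors.

    neighbors is a list of (distance, neighbor_type) pairs.
    """
    e_before = 0.0
    e_after = 0.0
    for r, t_j in neighbors:
        fc = cutoff(r, rc)  # geometric work, computed once for both states
        e_before += coeff[type_before][t_j] * fc
        e_after += coeff[type_after][t_j] * fc
    return e_before, e_after
```

In this toy model the dual pass gives the same two energies as two separate calls to `energy_single`, while evaluating the cutoff function once per neighbor instead of twice.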

Build a one-time global neighbor list at the start of each MCMD call using max(rc_radial, rc_angular) as the cutoff. Replace the per-trial O(N x N_local) all-atom scan with an O(N_local x max_neighbors) lookup from the prebuilt shell. Also keep the small-box guard aligned with the shell cutoff.
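
The difference between the per-trial all-atom scan and the prebuilt-shell lookup can be sketched on the CPU as follows. This is a simplified illustration under stated assumptions: an orthogonal periodic box, a brute-force one-time shell build, and a small-box guard that requires every box length to be at least twice the shell cutoff. The function names are hypothetical, and the guard condition used in GPUMD may differ.

```python
import math


def min_image_dist(a, b, box):
    """Minimum-image distance in an orthogonal periodic box."""
    d2 = 0.0
    for x, y, length in zip(a, b, box):
        d = x - y
        d -= length * round(d / length)
        d2 += d * d
    return math.sqrt(d2)


def build_shell(pos, box, rc_radial, rc_angular):
    """Build the global neighbor list once, using the larger cutoff."""
    rc = max(rc_radial, rc_angular)
    # Small-box guard tied to the shell cutoff (assumed form).
    if min(box) < 2.0 * rc:
        raise ValueError("box is too small for the shell cutoff")
    n = len(pos)
    shell = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if min_image_dist(pos[i], pos[j], box) < rc:
                shell[i].append(j)
                shell[j].append(i)
    return shell


def local_inputs_scan(pos, box, center, rc):
    """Previous style: each local atom scans all N atoms, O(N x N_local)."""
    n = len(pos)

    def near(a):
        return [j for j in range(n)
                if j != a and min_image_dist(pos[a], pos[j], box) < rc]

    local = [center] + near(center)
    return {a: sorted(near(a)) for a in local}


def local_inputs_lookup(shell, center):
    """New style: read neighbors from the shell, O(N_local x max_neighbors)."""
    local = [center] + shell[center]
    return {a: sorted(shell[a]) for a in local}
```

Both functions return the same mapping from each local atom to its neighbors, so the lookup changes only how the local inputs are gathered, not what they contain.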

brucefan1983 · 2026-04-22T03:34:09Z

I am not clear about the cycle speedup. What is it?

Could you provide a test input and compare the whole computation time?

dingxu2016 added 2 commits April 18, 2026 16:27

brucefan1983 marked this pull request as draft April 23, 2026 13:10

dingxu2016 wants to merge 2 commits into brucefan1983:master from dingxu2016:prprep/vcsgc-upstream
