Skip to content

[Major Rewrite] NumPy nditer port, NpyExpr DSL with 3-tier custom-op API, C/F/A/K memory layout support, stride-native matmul#611

Draft
Nucs wants to merge 272 commits into
masterfrom
nditer
Draft

[Major Rewrite] NumPy nditer port, NpyExpr DSL with 3-tier custom-op API, C/F/A/K memory layout support, stride-native matmul#611
Nucs wants to merge 272 commits into
masterfrom
nditer

Conversation

@Nucs
Copy link
Copy Markdown
Member

@Nucs Nucs commented Apr 22, 2026

Summary

This PR ports NumPy 2.4.2's nditer machinery to NumSharp (NpyIter), builds a composable expression DSL on top (NpyExpr) with a three-tier custom-op API, wires multi-order memory layout (C/F/A/K) through the entire API surface, and replaces the matmul fallback with stride-native GEMM for all 12 dtypes (eliminates a ~100x slowdown on transposed inputs). Also lands a new Char8 1-byte dtype with 100% Python bytes parity, a trainable MNIST MLP example demonstrating fusion, and several pre-existing bug fixes surfaced by battletest.

Stats: +50,426 / -1,188 across 156 files, 64 commits.

TL;DR

  • NpyIter -- full NumPy nditer port: 32+ APIs, all iteration orders (C/F/A/K with NEGPERM), all indexing modes (MULTI_INDEX / C_INDEX / F_INDEX / RANGE), buffered casting, buffered-reduce double-loop, masking, unlimited operands and dimensions. 566/566 NumPy 2.4.2 parity scenarios pass byte-for-byte.
  • NpyExpr DSL + 3-tier custom-op API (Tier 3A raw IL / Tier 3B element-wise w/ SIMD / Tier 3C expression composition + Call(...) for arbitrary Func / MethodInfo invocation). 50+ ops, full operator overloads, structural cache key, ~5ns delegate dispatch.
  • C/F/A/K memory layout wired through np.copy, np.array, np.asarray, np.asanyarray, np.asfortranarray (new), np.ascontiguousarray (new), *_like, astype, flatten, ravel, reshape, eye, concatenate, vstack, hstack, cumsum, argsort -- plus post-hoc F-contig preservation across the ILKernel dispatchers (41 element-wise layout bugs fixed).
  • Stride-native matmul for all 12 dtypes -- np.dot(x.T, grad) MLP shape: 240 ms -> 1 ms. Removes ~165 lines of dead fallback code.
  • Char8 -- new 1-byte dtype, NumPy S1 / Python bytes(1) equivalent, full Python bytes API parity (battletested against Python oracle).
  • Trainable MNIST MLP example -- fused forward + backward + Adam, 99.89% test accuracy, 100 epochs in ~42s.
  • Bug fixes: NPTypeCode.Char.SizeOf() returned 1 (real=2), IsInf was stubbed to return null, Decimal priority was stale, argsort mishandled non-C-contig input, several NpyExpr IL emission bugs.
  • Test suite: 6,710 passing on net8.0 + net10.0, zero regressions.

Contents

NpyIter -- Full NumPy nditer Port

From-scratch C# port of NumPy 2.4.2's nditer under src/NumSharp.Core/Backends/Iterators/.

Files (new)

File Lines Purpose
NpyIter.cs 3,331 Core iterator engine
NpyIter.State.cs 977 Dynamic per-operand state allocation
NpyIter.Execution.cs 657 Iteration execution paths
NpyIter.Execution.Custom.cs 155 Tier 3A/3B/3C custom-op entry points
NpyIterBufferManager.cs 637 Buffered iteration with casting
NpyIterFlags.cs 516 Flag enums (matches NumPy semantics)
NpyIterCoalescing.cs 495 Axis reorder + coalesce
NpyIterCasting.cs 483 Casting validation + conversion
NpyIterKernels.cs 263 Path selection
NpyAxisIter.cs 492 Axis-based iterator (cumsum, var, std)
NpyLogicalReductionKernels.cs 127 Generic reduction kernel interfaces

Capability Matrix

Capability Notes
Iteration orders C, F, A, K (NEGPERM for negative-stride memory order)
Indexing modes MULTI_INDEX, C_INDEX, F_INDEX, RANGE (parallel chunking)
Buffering Type conversion, full casting rules (no / equiv / safe / same_kind / unsafe)
Reduction op_axes w/ -1 reduction axes, REDUCE_OK, IsFirstVisit, buffered-reduce double-loop including bufferSize < coreSize
Multi-operand Unlimited (NumPy NPY_MAXARGS=64 parity, dynamic alloc)
Dimensions Unlimited (NumSharp divergence; replaces NPY_MAXDIMS=64)
Masking WRITEMASKED + ARRAYMASK w/ reduction safety check
APIs ported Copy, GotoIndex, GotoMultiIndex, RemoveAxis, RemoveMultiIndex, ResetBasePointers, GetMultiIndexFunc, GetInnerFixedStrideArray, GetAxisStrideArray, CreateCompatibleStrides, DebugPrint, GetIterView, IterRange, Iternext, GetValue<T> / SetValue<T>, Finished, Shape, OVERLAP_ASSUME_ELEMENTWISE, TRANSFERFLAGS, reduction-axis encoding, more

Bugs found and fixed during port

  • Negative strides flipped regardless of order -- NumPy only flips when FORCEDORDER is unset (K-order only).
  • NO_BROADCAST flag not enforced -- was skipping operands instead of validating shape match.
  • F_INDEX returned C-order indices -- coalescing reduced NDim=1, losing axis structure. Now disables coalescing for C_INDEX / F_INDEX / MULTI_INDEX.
  • ALLOCATE with null operand threw NRE -- CalculateBroadcastShape accessed op[i].ndim before allocating.
  • op_axes reductions broken -- ApplyOpAxes was re-applying axes to already-correct strides, zeroing non-reduce strides.
  • SetupBufferedReduction produced inverted strides for non-reduce operands; only worked for 2-axis cases.
  • K-order on broadcast / non-contiguous arrays produced wrong order -- fixed by falling back to C-order when stride=0 is present.
  • Reset() desynced ranged iterators with IterStart > 0 -- now delegates to GotoIterIndex(IterStart).
  • CoalesceAxes rejected size-1 axes unless stride==0 -- size-1 axes contribute no iteration and should always absorb.
  • Buffer free-list bug -- Dispose used NativeMemory.Free for AlignedAlloc'd buffers (memory corruption).

NpyExpr DSL + 3-tier Custom-Op API

User-extensible kernel layer built on NpyIter.

Tiers

  • Tier 3A -- ExecuteRawIL(body, key, aux): emit raw IL against the NumPy ufunc signature void(void** dataptrs, long* byteStrides, long count, void*). Full control.
  • Tier 3B -- ExecuteElementWise(scalar, vector, ...): per-element IL + 4x-unrolled SIMD shell + 1-vec remainder + scalar tail + scalar-strided fallback. SIMD when all operand dtypes match and are SIMD-capable.
  • Tier 3C -- ExecuteExpression(expr, inputTypes, outputType): compose NpyExpr trees via operator syntax, Compile() emits IL automatically.

NpyExpr Node Catalog

Category Ops
Binary arithmetic Add Subtract Multiply Divide Mod Power FloorDivide ATan2
Binary bitwise BitwiseAnd BitwiseOr BitwiseXor
Unary arithmetic Negate Abs Sign Sqrt Cbrt Square Reciprocal
Unary exp/log Exp Exp2 Expm1 Log Log2 Log10 Log1p
Unary trig Sin Cos Tan Sinh Cosh Tanh ASin ACos ATan Deg2Rad Rad2Deg
Unary rounding Floor Ceil Round Truncate
Unary predicates BitwiseNot LogicalNot IsNaN IsFinite IsInf
Comparisons Equal NotEqual Less LessEqual Greater GreaterEqual (return 0/1 at output dtype)
Combinators Min Max Clamp Where(cond, a, b)
Operators + - * / % & OR ^ ~ ! unary-

Call(...) escape hatch (commit 8da3e693)

Invoke any Func<...>, Delegate, or MethodInfo per element -- three dispatch paths chosen at construction:

Condition Emitted IL
Static method, no target call <methodinfo> (zero-indirection, JIT-inlinable)
Instance MethodInfo + target ldc.i4 id -> LookupTarget -> castclass T -> callvirt <mi>
Any other Delegate ldc.i4 id -> LookupDelegate -> castclass Func<..> -> callvirt Invoke

Auto-conversion at the call boundary (input/output via EmitConvertTo). Process-wide DelegateSlots registry, ~5ns lookup. Cache key includes MetadataToken + ModuleVersionId to disambiguate dynamic assemblies.

Bugs caught during DSL battletest

  • IsNaN / IsFinite / IsInf silently wrote I4 0/1 into double slots (denormals instead of 1.0). Fix: UnaryNode inserts trailing EmitConvertTo(Int32, outType).
  • LogicalNot broken for Int64 / Single / Double / Decimal -- Ldc_I4_0+Ceq only valid for I4-sized operands. Fix: route through EmitComparisonOperation(Equal, outType) with properly-typed zero literal.
  • WhereNode prelude unfinished (threw at compile). Rewrote.
  • MinMaxNode did not propagate NaN -- rerouted through Math.Min / Math.Max (matches np.minimum / np.maximum, not fmin / fmax).
  • Vector256.Round / Truncate are .NET 9+ only -- excluded from SIMD path; scalar path works on net8.

Multi-Order Memory Layout (C/F/A/K)

Shape changes (src/NumSharp.Core/View/Shape.cs, +171 lines)

  • New IsFContiguous (O(1) flag check via ArrayFlags.F_CONTIGUOUS).
  • New ComputeFContiguousStrides helper.
  • New Shape(long[] dims, char order) ctor for explicit physical-order construction.
  • Aligned _UpdateContiguousFlags with NumPy -- single-pass (isC, isF) tuple, fewer call sites.
  • Fixed Shape.Order -- was hardcoded to 'C', now derives from contiguity flags.
  • Fixed empty-array semantics -- any dim==0 is both C- and F-contig per NumPy, no BROADCASTED flag.
  • OrderResolver.cs (new, 75 lines) -- centralizes C/F/A/K -> C/F mapping.

API surface wiring

API Change
np.copy, NDArray.copy(order) New overload, default 'K' (was 'C')
np.array(Array, ..., order) F-order materialization via copy('F')
np.asarray, np.asanyarray New overloads accepting Type? + order
np.asfortranarray, np.ascontiguousarray NEW thin wrappers
np.empty_like / zeros_like / ones_like / full_like New order overload, default 'K'
astype(Type, copy, order) New overload, default 'K'
flatten(order), ravel(order) F-order via copy('F') reinterpret
reshape(Shape, char order) New overload; F-order column-major fill
np.eye(..., order) New overload
np.concatenate, vstack, hstack Preserve F-contig when all inputs F-contig
np.cumsum(axis) Preserve F-contig via post-hoc copy('F')
NDArray.argsort Copies non-C-contig input to C-contig first

Post-hoc F-contig preservation across ILKernel dispatch (Group F, 41 bugs fixed)

Refactoring 27 partial files (~21K lines) of IL emitters to accept arbitrary output strides was rejected as too invasive. Instead, central dispatchers (ExecuteBinaryOp, ExecuteUnaryOp, ExecuteComparisonOp) call ShouldProduceFContigOutput(operands, resultShape) after the kernel and relay via cheap .copy('F') when every non-scalar operand is strictly F-contig. Matches NumPy's F+F->F, C+C->C, F+C->C, F*scalar->F, F+FCol->F rules.

np.modf, np.clip, np.negative, np.maximum / minimum updated individually (do not route through the central dispatchers).

TDD coverage

51 sections of OrderSupport.OpenBugs.Tests.cs (3,005 lines), each driven by side-by-side Python / NumPy 2.4.2 output. Remaining [OpenBugs] are minimal API gaps (np.tile, np.flip, np.where, np.sort).

Stride-Native MatMul

np.dot / np.matmul previously fell into a ~100x slower fallback for any non-contiguous operand (transposed view, slice). This PR ships stride-native paths for all 12 dtypes.

New files

  • SimdMatMul.Strided.cs (338 lines) -- generalized 8x16 Vector256 FMA micro-kernel for float. New packers PackAPanelsStrided / PackBPanelsStrided with fast paths for transposed-contig and row-contig.
  • SimdMatMul.Double.cs (108 lines) -- stride-aware IKJ Vector256 kernel (4 FMAs).
  • Default.MatMul.Strided.cs (357 lines) -- MatMulStridedSame<T> where T : INumber<T> (JIT specializes per type with auto-vectorization), plus MatMulStridedBool (NumPy's OR-of-ANDs short-circuit), MatMulStridedMixed<TResult> (typed pointer reads via ReadAsDouble, no boxing).

Dead code removed (~165 lines)

MatMulGeneric, MatMulCore<TResult>, MatMulSameType<T>, four MatMulContiguous overloads (float/double/int/long), MatMulMixedType<TResult>.

Performance (MLP backward shapes)

Op Before After
dot(x.T, grad) 784x64 @ 64x128 240 ms 1 ms
dot(grad, W.T) 64x128 @ 128x784 226 ms 1 ms
Lt(400,500) @ L(500,400) blocked 12 ms 8 ms (skips copy)
int32 (150,200) @ (200,150) stride-native -- 10 ms (was 11 ms copy+matmul)

The MLP example's .copy() workaround on transposed views is now removed -- kernel absorbs strides directly.

Test coverage

MatMulStridedTests (28 tests): all 4 BLAS transpose cases (NN/NT/TN/TT) x float/double x simple/blocked, per-dtype stride-native (byte/int16/uint16/int32/uint32/int64/uint64/char/decimal/bool), sliced views with Shape.offset > 0, mixed-type, MLP-shape regression tests.

Char8 Dtype

New NumSharp.Char8 -- [StructLayout(Sequential, Size=1)] readonly struct, NumPy dtype('S1') / Python bytes(1) equivalent.

Files (new, ~1,450 lines)

File Lines Purpose
Char8.cs 725 Core: Latin1CharInfo, classification, case, operators, IConvertible, IFormattable
Char8.Operators.cs 169 Mixed-type ops (char/byte/int), Span reinterpret
Char8.Conversions.cs 261 Instance/factory To* / From* for all 12 dtypes, saturating, truncating
Char8.Spans.cs 201 Search / equality / UTF-8 classification on ReadOnlySpan<Char8>
Char8.PyBytes.cs 531 Python bytes array methods (Strip/Split/Replace/Center/...)
Converts.Char8.cs 317 Converts integration parallel to Converts.Native.cs

Adapted from .NET System.Char (src/dotnet/, fetched into a reference library; Latin1CharInfo[256] table copied verbatim).

Python parity (caught by oracle diff)

3 parity bugs surfaced and fixed during 250-line Python bytes oracle comparison:

  1. Count with empty pattern -- Python returns len(s) + 1, was 0.
  2. Center asymmetric padding -- CPython formula left = pad/2 + (pad & width & 1).
  3. SplitLines too permissive -- bytes.splitlines() only recognizes \n / \r / \r\n (not \v / \f / \x1c-1e).

Status

Standalone for now -- not yet wired into NPTypeCode enum (would touch ~50 switch statements across DefaultEngine / ILKernelGenerator / NpyIter / casting / Converts; deferred to a separate PR).

MNIST MLP Example

examples/NeuralNetwork.NumSharp/MnistMlp/ -- runnable end-to-end classifier.

  • Architecture: 784 -> 128 (ReLU) -> 10, float32, He-init, Adam.
  • Forward fusion: bias + ReLU collapses into one NpyIter per layer (NpyExpr.Max(Input(0) + Input(1), 0)).
  • Backward fusion: gradOut * (y > 0) ReLU mask fused.
  • Loss: SoftmaxCrossEntropy (combined, max-subtracted, numerically stable).

Results (6000 train / 1000 test, batch 128, Adam lr=1e-3):

Phase Total Final test acc
Pre-stride-native dot 100.7 s (5 epochs) 100%
Post-copy() workaround 3.2 s (5 epochs) 100%
Final 100-epoch demo (post stride-native) ~42 s 99.89%

NN scaffolding fixes outside MnistMlp/: completed every stubbed/broken class -- Softmax (was empty Forward + sigmoid-derivative Backward), Sigmoid.Forward (empty), CategoricalCrossentropy (no clipping, wrong backward formula), BinaryCrossEntropy (did not divide by N), Accuracy / BinaryAccuacy (returned scalar/null), FullyConnected (no bias, skewed init), NeuralNet.Train (used 2-index integer selection where slicing was intended), Adam (ms / vs init was commented out), added SGD optimizer. Each verified against analytical references with finite-difference grad checks (29/29 pass).

Bug Fixes

  • NPTypeCode.Char.SizeOf() returned 1, real is 2 (UTF-16). Affected NpyIter.SetOpDType (ElementSizes[op] x stride in 8 places), 8 cast sites, np.frombuffer, np.dtype(char).itemsize, axis reductions. Survived without test failures because NumPy has no native char dtype and ASCII reads accidentally land on the right byte.
  • GetPriority(Decimal) = 5*10*32 was stale after the prior Decimal SizeOf fix -- corrected to 5*10*16=800. No behavioral change (relative ordering preserved).
  • DefaultEngine.IsInf was stubbed to return null (NRE on any IsInf call). Now wired through ExecuteUnaryOp with the existing IL kernel.
  • NDArray.Copy.cs share-by-reference bug -- new Shape(this.Shape.dimensions, 'F') aliased the source int[]; cloned now.
  • NDArray.argsort -- copies non-C-contig input to C-contig first (matches NumPy's invariant that argsort always returns C-contig).
  • flatten allocation regression -- F-order path was double-allocating (copy('F') + Array.Clone()). Fixed: reinterpret directly.

Behavioral Changes

Area Change Migration
np.copy default order 'C' -> 'K' No change for C-contig input (K preserves layout)
MaxOperands=8 removed Now unlimited (dynamic alloc) Drop-in
MaxDims=64 removed Now unlimited (~300K dims, stackalloc-bound) Drop-in
F-order iteration Now produces [0,3,1,4,2,5] for 2x3 C-contig (was [0,1,2,3,4,5]) Matches NumPy
K-order on broadcast / non-contig Falls back to C-order (was stride-sort, broken with stride=0) Matches NumPy
Negative strides Only flipped for K-order Matches NumPy FORCEDORDER rule
Empty arrays IsContiguous and IsFContiguous both true (was both false) Matches NumPy
Shape.Order Now derives from contiguity flags (transpose of C reports 'F') Was hardcoded to 'C'

Documentation

  • docs/website-src/docs/NDIter.md (1,934 lines) -- comprehensive NpyIter reference: 7-technique quick reference, decision tree, full Tier C node catalog with NumPy-equivalent column, type discipline, SIMD coverage rules, caching/auto-keys, validation, gotchas, debugging, memory model + lifetime, 19 worked examples (Swish, GELU, Heaviside, Horner polynomial, fused sigmoid, NaN replacement, etc.).
  • docs/website-src/docs/ndarray.md (537 lines) -- NDArray reference page.
  • docs/NPYITER_AUDIT.md, NPYITER_DEEP_AUDIT.md, NPYITER_NUMPY_DIFFERENCES.md, NPYITER_BUFFERED_REDUCE_ANALYSIS.md -- implementation audit reports.
  • Tier renamed A/B/C -> 3A/3B/3C to make the layer-3 sub-tier relationship explicit (100 references across 6 files).

Test Plan

  • Full suite: 6,710 tests pass on net8.0 + net10.0 with CI filter TestCategory!=OpenBugs&TestCategory!=HighMemory. Zero regressions.
  • NumPy 2.4.2 nditer parity: 566/566 scenarios pass byte-for-byte (491 random fuzz seed=42 + 75 structured). Element sequences, stride arrays, multi-indices, reduction outputs all match.
  • NpyExpr + custom-op tests: +264 tests (NpyIterCustomOpTests, NpyIterCustomOpEdgeCaseTests, NpyExprExtensiveTests, NpyExprCallTests).
  • nditer API parity tests: +94 across 10 new files (NpyIterAxisStrideArrayTests, NpyIterCreateCompatibleStridesTests, NpyIterDebugPrintTests, NpyIterGetMultiIndexFuncTests, NpyIterInnerFixedStrideArrayTests, NpyIterOverlapAssumeElementwiseTests, NpyIterReductionAxisEncodingTests, NpyIterResetBasePointersTests, NpyIterTransferFlagsTests, NpyIterWriteMaskedTests).
  • MatMul stride-native: +28 MatMulStridedTests covering all 4 BLAS transpose cases, per-dtype stride-native, sliced views, mixed-type, MLP shapes.
  • Char8: +69 cases (source-generated discovery), 250-line Python bytes oracle diff (identical), 270+ C# edge assertions.
  • Order support TDD: +150 tests across 51 sections in OrderSupport.OpenBugs.Tests.cs.
  • Shape order: +24 tests in Shape.Order.Tests.cs.
  • MLP example: 100-epoch run reaches 99.89% test accuracy; per-epoch breakdown: forward 20.3% / loss 1.5% / backward 52.5% / optimizer 25.8%; bit-exact fused-vs-naive correctness (max |diff| = 0); kernel cache delta = 6 (compiled once, cache-hit thereafter); delegate-slot count = 0.

Known Issues / Out of Scope

  • Char8 not wired into NPTypeCode -- would touch ~50 switch statements; separate PR.
  • np.tile, np.flip, np.where, np.sort still missing (4 [OpenBugs] markers remain after this PR).
  • One pre-existing SetIndicesND assertion on multi-row fancy-write with scalar/matching-array RHS -- investigation in commit 47a74aa9 showed it is a pre-existing indexing bug, not F-order specific. Reproductions added under [OpenBugs].

Migration / Compatibility

Most changes are additive. The behavioral changes table above lists the user-visible deltas -- all align NumSharp closer to NumPy 2.4.2. No deprecated APIs, no removed public surface. The MaxOperands=8 and MaxDims=64 constants are removed but were internal.

Nucs added 30 commits April 22, 2026 23:32
NumPy's nditer coalescing strategy:
- K-order: Always coalesce for memory efficiency (sort by stride)
- C-order on C-contiguous: Coalesce → memory order (== C-order)
- F-order on F-contiguous: Coalesce → memory order (== F-order)
- F-order on C-contiguous: NO coalescing, reverse axes for F-order

Previously NumSharp was coalescing for ALL orders when array was
contiguous in any layout, which produced incorrect iteration order
for F-order on C-contiguous arrays.

Changes:
- NpyIter.cs: Add CheckAllOperandsContiguous(bool cOrder) helper
  to check if arrays are contiguous in requested order
- NpyIter.cs: Only coalesce when order matches array contiguity
- NpyIterCoalescing.cs: Add IsContiguousForCoalescing() check

Test results:
- 277 NpyIter tests passing (including 24 new battle tests)
- 5813 total tests passing
- F-order now produces [0,3,1,4,2,5] instead of [0,1,2,3,4,5]
  for a 2x3 C-contiguous array (matches NumPy)
…arrays

Problem:
- K-order iteration on broadcast arrays produced wrong order
  (stride-based sorting with stride=0 breaks axis ordering)
- K-order iteration on non-contiguous views also used wrong order
- NumPy: (3,) x (2,3) broadcast iterates C-order: [(0,0),(1,1),(2,2),(0,3),(1,4),(2,5)]
- NumSharp was producing: [(0,0),(0,3),(1,1),(1,4),(2,2),(2,5)]

Root cause:
- For K-order, we sorted axes by stride magnitude
- But GetMinStride excludes stride=0, leading to incorrect axis ordering
- Non-contiguous views similarly got wrong ordering from stride sort

Solution:
- For K-order with broadcast dimensions (stride=0), fall back to C-order
- For K-order with non-contiguous arrays, fall back to C-order
- Added HasBroadcastStrides() helper to detect broadcast dimensions
- CheckAllOperandsContiguous now uses absolute strides to handle
  reversed arrays (negative strides become positive after FlipNegativeStrides)
- Separate coalescing logic for C/F/K orders to preserve iteration semantics

Changes:
- NpyIter.cs: Added broadcast detection, fixed coalescing decision logic
- NpyIterNumPyBattleTests.cs: Updated tests to expect correct NumPy behavior
  (removed [Misaligned] attributes from Battle_MultiOperandBroadcasting and
  Battle_NonContiguousViewOrder since they now match NumPy)

All 277 NpyIter tests passing.
All 5877 project tests passing.
Deep audit against NumPy 2.4.2 source revealed 7 behavioral bugs. All fixed via TDD.

Bug #1: Negative strides always flipped regardless of order
- NumPy (nditer_constr.c:297-307) only flips when NPY_ITFLAG_FORCEDORDER not set
- FORCEDORDER is set by C, F, and A orders. Only K-order skips it.
- Fix: Only call FlipNegativeStrides for K-order
- CheckAllOperandsContiguous now takes allowFlip param (abs strides only when flipping)
- Affects: 1D/2D reversed arrays with C/F/A orders

Bug #2: NO_BROADCAST flag not enforced
- Code was skipping NO_BROADCAST operands instead of enforcing the constraint
- Fix: NO_BROADCAST operands must match iterShape without dim-1 stretching
- ValidateIterShape now always runs (not just when iterShape is provided)

Bug #3: F_INDEX returned C-order indices
- Coalescing reduces to NDim=1, losing original axis structure needed for F-index
- Fix: Disable coalescing when C_INDEX or F_INDEX is set (like MULTI_INDEX)

Bug #4: ALLOCATE with null operand threw NullReferenceException
- CalculateBroadcastShape accessed null op[i].ndim
- Fix: Skip null operands in broadcast shape calc, then allocate them after
  with correct shape (from op_axes if provided) and dtype

Bug #5,6,7: op_axes reductions broken (axis=0 gave [15,0,0], axis=1 threw)
- ApplyOpAxes was re-applying op_axes to strides that were already correctly set
  in the main operand setup loop, zeroing out non-reduce strides
- CalculateBroadcastShape didn't know about op_axes, couldn't compute iter shape
- Fix: ApplyOpAxes now only validates and sets REDUCE flags, not strides
- Fix: CalculateBroadcastShape now accepts opAxesNDim/opAxes parameters
- Uses production Shape.ResolveReturnShape API for all broadcasting

Refactoring: Uses production Shape.ResolveReturnShape / np.broadcast_to
- Replaces custom broadcast shape calculation
- User feedback: production APIs are 1-to-1 with NumPy

Testing:
- 21 new TDD tests in NpyIterParityFixTests.cs
- All 298 NpyIter tests pass
- All 5898 project tests pass
- Final battletest: 21/21 scenarios match NumPy 2.4.2 exactly

Fixed test: NullOperand_Throws now expects ArgumentException (more accurate than
NullReferenceException since null operand without ALLOCATE is an argument error).
Adds F-contiguity detection and OrderResolver for NumPy's 4 memory orders
at minimum functionality, with zero behavioral change to existing code.

Changes:
- Shape.cs: F-contig detection via ComputeIsFContiguousStatic (mirror of C-contig
  algorithm, scan left-to-right). Sets ArrayFlags.F_CONTIGUOUS flag during flag
  computation. New IsFContiguous property (O(1) flag check). New
  ComputeFContiguousStrides helper. New Shape(long[] dims, char order)
  constructor for explicit physical-order construction.
- Scalar constructor now sets both C_CONTIGUOUS and F_CONTIGUOUS (matches NumPy).
- OrderResolver.cs (NEW): Resolves NumPy order chars (C/F/A/K) to physical
  storage orders (C or F). 'A' and 'K' require a source Shape for resolution
  (matches NumPy: creation functions without source throw "only 'C' or 'F'
  order is permitted").
- np.empty.cs: New overload np.empty(shape, order, dtype) wiring
  OrderResolver through to Shape.

Key insight: transpose already produces F-contig memory layout; previously this
went undetected because F_CONTIGUOUS flag was never set. Now:
  arr = np.arange(24).reshape(2,3,4)
  arr.T.Shape.IsFContiguous  // true (previously: false / undetected)

Design:
- Only C and F are physical storage layouts; A and K are logical decisions
  that resolve to C or F based on source array layout.
- OrderResolver centralizes the C/F/A/K -> C/F mapping, letting future
  wiring of np.copy/np.array/flatten/ravel/reshape be a 1-line call.
- Existing IsContiguous callers (116 usages across 50 files) unchanged -
  they still see C_CONTIGUOUS=false for F-contig arrays and take the
  strided path (which is correct, just not yet SIMD-accelerated).

Tests (24 new in Shape.Order.Tests.cs):
- Scalar and 1-D arrays are both C and F contig
- Multi-dim C-contig is not F-contig and vice versa
- Transpose of C-contig now reports IsFContiguous=true
- Shape(dims, 'F') produces correct F-order strides (1, 2, 6 for 2x3x4)
- Shape(dims, 'A'/'X') throws ArgumentException
- OrderResolver: C/F resolve directly; A/K without source throw; A/K with
  source resolve based on source layout
- np.empty(order='C'/'F') produces correct layout
- np.empty(order='A'/'K') throws (matches NumPy)

Verification:
- 6017 tests pass on both net8.0 and net10.0 (zero regressions)
- NumPy parity verified via Python side-by-side comparison
- All order resolution semantics match NumPy 2.4.2

Future phases unblocked (each a ~1-line change):
- ILKernelGenerator fast paths can add || IsFContiguous for element-wise ops
- NpyIter.CheckAllOperandsContiguous can use Shape.IsFContiguous directly
- np.copy(order), np.array(order), flatten(order), ravel(order) wiring
- np.asfortranarray, np.ascontiguousarray
Review of initial F-order support surfaced three design issues where
NumSharp diverged from NumPy's patterns. This refactor aligns with
NumPy's flagsobject.c:_UpdateContiguousFlags exactly.

Changes:

1. Unified contiguity computation (single-pass)
   - Replaced two separate functions (ComputeIsContiguousStatic,
     ComputeIsFContiguousStatic) with one combined
     ComputeContiguousFlagsStatic returning (isC, isF) tuple.
   - Mirrors NumPy's _UpdateContiguousFlags which computes both in one
     function with a shared dim==0 early exit.
   - Fewer call sites, one traversal per contiguity check, cleaner
     shared logic.

2. Fixed Shape.Order property (was hardcoded to layout = 'C')
   - Now derives from actual contiguity flags: returns 'F' if strictly
     F-contiguous (IsFContiguous && !IsContiguous), else 'C'.
   - Transposed C-contig arrays now correctly report Order='F'.
   - 1-D and scalar shapes (both C and F contig) report 'C' by
     convention (NumPy-default reference order).
   - Non-contiguous shapes report 'C' as default reference.

3. Fixed empty array flag computation (any dim == 0)
   - NumPy short-circuits _UpdateContiguousFlags on dim==0 setting
     BOTH C_CONTIGUOUS and F_CONTIGUOUS unconditionally and NOT setting
     BROADCASTED. Empty arrays have no elements so broadcast semantics
     are meaningless.
   - Previously NumSharp computed strides like (0, 3, 1) for shape
     (2, 0, 3), triggered IsBroadcasted=true, and then skipped
     contiguity flag assignment entirely. Result was an empty array
     reporting IsContiguous=false, IsFContiguous=false.
   - Now matches NumPy: any dim=0 short-circuits to set both C and F
     contig + WRITEABLE + ALIGNED, clear BROADCASTED.

4. Clarified `layout` const documentation
   - The internal const char layout = 'C' was misleadingly named (as if
     it described the shape's physical order) but only ever used as a
     hash seed in ComputeSizeAndHash. Updated doc comment to clarify
     this is NOT the physical memory order — use Order / IsContiguous
     / IsFContiguous for actual layout info.
   - Value unchanged to preserve existing hash stability.

Additional tests (6 new):
- Order property for C, F, transpose, 1-D, scalar cases
- Empty array is both C and F contiguous (matching NumPy 2.4.2)

Test results:
- 6023 tests pass on both net8.0 and net10.0 (was 6017; 6 new tests)
- Zero regressions

NumPy source reference: numpy/_core/src/multiarray/flagsobject.c
…enarios

Ports the last NumPy nditer surface gaps identified by the audit, each with
1-to-1 semantic parity verified against NumPy 2.4.2 via Python harness.

10 items implemented (all battletested):
  1. NpyIter_ResetBasePointers  (nditer_api.c:314)
     - Populate BaseOffsets during FlipNegativeStrides so ResetBasePointers
       can recompute ResetDataPtrs[iop] = baseptrs[iop] + baseoffsets[iop].
     - Public: NpyIterRef.ResetBasePointers(ReadOnlySpan<IntPtr>) and
       ResetBasePointers(NDArray[]) convenience overload.

  2. NPY_ITFLAG_TRANSFERFLAGS_SHIFT packing  (nditer_constr.c:3542)
     - Pack NpyArrayMethodFlags into top 8 bits of ItFlags (shift=24).
     - Public: NpyIterRef.GetTransferFlags() + NpyArrayMethodFlags enum
       + NpyIterConstants.TRANSFERFLAGS_SHIFT/MASK constants.
     - REQUIRES_PYAPI never set in .NET (no Python GIL). SUPPORTS_UNALIGNED
       and NO_FLOATINGPOINT_ERRORS always set (raw pointer loops, .NET casts
       don't raise FPE). IS_REORDERABLE set for numeric casts.

  3. NpyIter_GetGetMultiIndex factory  (nditer_templ.c.src:481)
     - Specialized delegate factory returning NpyIterGetMultiIndexFunc with
       3 dispatches: IDENTPERM (direct copy), positive perm (apply perm[]),
       NEGPERM (apply perm+flip). BUFFER and HASINDEX don't affect coords
       so no specialization needed for them.
     - Public: GetMultiIndexFunc(), GetMultiIndexFunc(out errmsg),
       InvokeMultiIndex(fn, coords) — ref-struct-safe invocation.
     - Also fixes: IDENTPERM flag is now set at construction (after
       AllocateDimArrays). Previously only set post-coalescing, leaving
       MULTI_INDEX iterators without the fast-path flag.

  4. NpyIter_GetInnerFixedStrideArray  (nditer_api.c:1357)
     - Public: GetInnerFixedStrideArray(Span<long>).
     - Buffered: copies BufStrides. Non-buffered: innermost-axis stride per
       operand. Returns BYTE strides (NumPy convention), multiplying
       NumSharp's element-count strides by ElementSizes[op].

  5. NpyIter_GetAxisStrideArray  (nditer_api.c:1309)
     - Public: GetAxisStrideArray(int axis, Span<long>).
     - With HASMULTIINDEX: walks perm to find internal axis (handles both
       positive and NEGPERM-encoded entries). Without: Fortran-order
       (fastest-first) lookup via NDim-1-axis. Byte strides.

  6. NpyIter_CreateCompatibleStrides  (nditer_api.c:1058)
     - Public: CreateCompatibleStrides(long itemsize, Span<long>).
     - Requires HASMULTIINDEX, rejects flipped axes. Walks perm from
       innermost (NDim-1) outward, accumulating itemsize into outStrides[axis]
       in original (C-order) axis slots.

  7. NpyIter_DebugPrint  (nditer_api.c:1402)
     - Public: DebugPrint(), DebugPrint(TextWriter), DebugPrintToString().
     - Faithful port of NumPy's dump format: ItFlags decoded, NDim/NOp,
       IterSize/Start/End/Index, Perm, DTypes, DataPtrs, BaseOffsets,
       OpItFlags, BufferData (when BUFFER), per-axis data.

  8. NPY_ITER_REDUCTION_AXIS encoding  (common.h:347, nditer_constr.c:1431)
     - Additive encoding: axis + (1 << 30). Values >= (1<<30)-1 flagged as
       reduction axes. Value 0x40000000 for axis 0, 0x3FFFFFFF for axis -1.
     - Public: NpyIterUtils.ReductionAxis(int) encoder and GetOpAxis(int,
       out bool) decoder. NpyIterConstants.REDUCTION_AXIS_OFFSET = 1<<30.
     - Integrated into CalculateBroadcastShape (rejects length != 1 on
       reduction axes), ValidateIterShape, and ApplyOpAxes (enforces
       REDUCE_OK + sets REDUCE flag).

  9. WRITEMASKED + ARRAYMASK + check_mask_for_writemasked_reduction
     - TranslateOpFlags now maps NpyIterPerOpFlags.WRITEMASKED ->
       NpyIterOpFlags.WRITEMASKED on op flags.
     - PreCheckMaskOpPairing validates: WRITEMASKED requires one ARRAYMASK,
       ARRAYMASK requires >=1 WRITEMASKED, at most one ARRAYMASK, no
       operand with both flags.
     - SetMaskOpFromFlags sets NpyIterState.MaskOp index of ARRAYMASK operand.
     - CheckMaskForWriteMaskedReduction enforces (nditer_constr.c:1328):
       for any WRITEMASKED + REDUCE operand, no axis may have maskstride!=0
       && opstride==0 (would produce multiple mask values per reduction element).
     - Public: NpyIterRef.MaskOp, HasWriteMaskedOperand.

 10. NPY_ITER_OVERLAP_ASSUME_ELEMENTWISE per-op flag
     - Added NpyIterPerOpFlags.OVERLAP_ASSUME_ELEMENTWISE_PER_OP = 0x40000000
       in the correct per-operand flag slot (NumPy's location). Accepted
       syntactically as a marker for COPY_IF_OVERLAP fast-path elision.

Correctness bugs fixed while battletesting:

  A. SetupBufferedReduction produced inverted strides for non-reduce operands.
     BufStride was set to elemSize (assumed linear buffer); correct value is
     the operand's stride along the REDUCE axis (inner loop = reduce axis
     traversal). ReduceOuterStride was set to elemSize*coreSize; correct is
     stride along the non-reduce axis.

  B. SetupBufferedReduction only worked for 2-axis cases (one reduce, one
     non-reduce). For 3D+ with multiple non-reduce axes, added CoreSize=0
     short-circuit that defers to regular N-D Advance() — which correctly
     carries multiple axes via Coords + per-axis strides. stride=0 on
     reduce axis naturally keeps y's pointer fixed during reduce iteration.

  C. GetDataPtr for BUFFER+REDUCE with CoreSize=0 returned a buffer pointer
     indexed by IterIndex (linear assumption). For reduce this is wrong —
     DataPtrs already track the correct position. Now returns DataPtrs
     whenever REDUCE flag is set.

  D. Reset() didn't reposition to IterStart. IterIndex was set to IterStart
     but DataPtrs/Coords were reset to array origin, desyncing the iterator
     state for ranged iterators with IterStart > 0. Now delegates to
     GotoIterIndex(IterStart) which sets all three consistently.

  E. K-order fallback to C-order was too aggressive — triggered for all
     non-contiguous arrays, defeating NumPy's K-order semantic of iterating
     in memory order. Fixed to fall back only when broadcast axes (stride=0)
     are present; merely non-contiguous (transposed, strided, negative-
     stride) now properly sorts axes by |stride| descending.

  F. CoalesceAxes rejected size-1 axes unless stride==0. Size-1 axes
     contribute no iteration and should always be absorbed into a neighbor.
     Fix restores proper 1D coalescing for shapes like (2,4,1) contiguous.

  G. FlipNegativeStrides now populates BaseOffsets[op] (previously an
     allocated-but-unused field). Prereq for item #1 (ResetBasePointers).

Battletest harness:
  - Python<->NumSharp scenario harness in a temp workspace with 3 structured
    waves (25 scenarios each) plus a 491-scenario random fuzz test with
    deterministic seed (42). All scenarios compare element sequences, stride
    arrays, multi-indices, reduce outputs, and iteration state byte-for-byte
    against NumPy 2.4.2 output.
  - Coverage: 1D-5D shapes; int8/16/32/64, uint16, float32/64 dtypes;
    contiguous, transposed (2D+3D), strided, negative-stride, size-1 axes,
    and all combinations; MULTI_INDEX, C_INDEX, F_INDEX; RANGED + goto;
    explicit/implicit reduction axes; multi-operand broadcast.
  - Result: 566/566 scenarios pass (25+25+25+491). All semantically
    equivalent to NumPy's C-level nditer output.

Added tests (94 new unit tests):
  - NpyIterAxisStrideArrayTests (12)
  - NpyIterCreateCompatibleStridesTests (9)
  - NpyIterDebugPrintTests (12)
  - NpyIterGetMultiIndexFuncTests (10)
  - NpyIterInnerFixedStrideArrayTests (9)
  - NpyIterOverlapAssumeElementwiseTests (5)
  - NpyIterReductionAxisEncodingTests (11)
  - NpyIterResetBasePointersTests (10)
  - NpyIterTransferFlagsTests (8)
  - NpyIterWriteMaskedTests (8)

Regression: 6023/6023 project tests pass (was 5898 before this work),
zero regressions. Project passes ~125 more tests than baseline because
fixes C-F unblocked test cases that were previously failing silently.
…rface

New file: OrderSupport.OpenBugs.Tests.cs (39 tests, 11 marked [OpenBugs])

Comprehensive TDD test file documenting the gap between NumSharp's
current behavior and NumPy 2.x's expected behavior for memory order
support. Each test uses NumPy's exact output as the expected value
(verified via side-by-side Python scripts).

Test sections:
1. Creation APIs (np.zeros/ones/empty/full) — 10 tests
2. Copy/conversion (np.copy, NDArray.copy) — 7 tests
3. Manipulation (flatten, ravel) — 5 tests
4. Arithmetic output layout — 3 tests
5. Reductions on F-contig (math-equivalent) — 6 tests
6. Slicing contiguity preservation — 2 tests
7. Broadcasting output layout — 1 test
8. Transpose behavior — 3 tests
9. Iteration order (C-order via GetOffset) — 1 test
10. Order property derivation — 3 tests

Results (net8.0 and net10.0):
- 28 tests pass (documents working behavior / NumPy parity)
- 11 tests fail (marked [OpenBugs], excluded from CI via filter)

Currently failing [OpenBugs] — API gaps to close in future phases:

Section 2 — np.copy / NDArray.copy ignore order parameter:
  - NpCopy_FOrder_ProducesFContig
  - NpCopy_AOrder_FSource_ProducesFContig
  - NpCopy_KOrder_FSource_ProducesFContig
  - NDArrayCopy_FOrder_ProducesFContig
  - NDArrayCopy_AOrder_FSource_ProducesFContig

Section 3 — flatten/ravel ignore/lack order:
  - Flatten_CContig_FOrder_MatchesNumPy
  - Flatten_FContig_FOrder_MatchesNumPy
  - Ravel_FOrder_ApiGap (ravel has no order parameter at all)

Section 4 — arithmetic always produces C-contig output:
  - Arithmetic_FContig_ScalarMul_PreservesFContig
  - Arithmetic_FPlusF_PreservesFContig

Section 7 — broadcast always produces C-contig output:
  - Broadcast_FContig_PlusFCol_PreservesFContig

Currently passing (NumPy-aligned behavior confirmed):
- np.zeros/ones/full preserve F-contig when given an F-Shape (workaround
  for missing order= parameter, but layout IS preserved)
- np.empty(order='C'/'F'/'A'/'K') — correct behavior; A/K throw
- All reductions (sum, mean, min, max, axis=0, axis=1) work on F-contig
- Transpose toggles C<->F contig correctly
- Slicing: 1-col slice of F-contig is both C and F contig (matches NumPy)
- Slicing: row-slice of F-contig is neither (matches NumPy)
- Shape.Order property reports correct char based on flags
- Scalar multiply on F-contig produces correct values (just loses layout)
- Indexed iteration on F-contig produces C-order logical traversal
  (matches NumPy's arr.flat semantics)

CI verification:
- Full suite with CI filter: 6051 pass, 0 fail (net8.0 and net10.0)
- With TestCategory=OpenBugs: 11 fail (as expected)
- With TestCategory!=OpenBugs: 28 pass (0 regressions)

Next steps: fix each [OpenBugs] by wiring order through the respective
API using OrderResolver. Remove the attribute after verifying the test
passes with NumPy's expected output.
Expands OrderSupport.OpenBugs.Tests.cs from 39 to 67 tests covering
every NumPy function that accepts an 'order' parameter.

NumPy functions covered (15 total that accept order=):
- Creation: empty, empty_like, zeros, zeros_like, ones, ones_like,
  full, full_like, eye
- Conversion: array, asarray, asanyarray, copy
- Manipulation: reshape, ravel
- Method: astype, flatten, copy

New sections added:
- Section 11: np.empty_like (default K, preserves source layout)
- Section 12: np.zeros_like (default K) + values-are-zero test
- Section 13: np.ones_like (default K) + values-are-one test
- Section 14: np.full_like (default K) + values-are-fill test
- Section 15: np.eye (C/F order) + identity values test
- Section 16: np.asarray / np.asanyarray API gaps
- Section 17: astype (default K, preserves source layout)
- Section 18: np.reshape with F-order (column-major fill)
- Section 19: np.ravel with C/F/A/K orders
- Section 20: np.array with order (Array input overload)
- Section 21: np.asfortranarray / np.ascontiguousarray (missing APIs)

Results (net8.0 and net10.0):
- 42 tests pass (document working behavior / NumPy parity)
- 25 tests fail (marked [OpenBugs], excluded from CI via filter)

25 [OpenBugs] documenting gaps:
- *_like don't preserve F-contig (5 tests: empty/zeros/ones/full/astype)
- np.copy/NDArray.copy order ignored (7 tests from prior commit)
- flatten/ravel order ignored (3 tests)
- arithmetic/broadcast don't preserve F-contig (3 tests)
- np.eye has no order param (1 test)
- np.reshape has no order param (1 test)
- np.array order ignored (1 test)
- np.asarray/asanyarray have no NDArray+order overload (2 tests)
- np.asfortranarray/np.ascontiguousarray don't exist (2 tests)

Confirmed NumPy parity (new passing tests):
- np.empty_like/zeros_like/ones_like/full_like on C-contig (K default -> C)
- np.zeros_like/ones_like/full_like produce correct fill values
- np.eye default produces C-contig identity matrix
- astype preserves C-contig from C source (K default)
- astype preserves values during type conversion
- np.reshape default produces row-major fill
- np.ravel default is C-order
- np.array default produces C-contig

CI verification:
- TestCategory!=OpenBugs: 6065 pass, 0 fail (net8.0 and net10.0)
- TestCategory=OpenBugs: 25 fail (as expected bug reproductions)

All NumPy order baselines verified via Python 2.4.2 side-by-side scripts.
Expands OrderSupport.OpenBugs.Tests.cs from 67 to 103 tests.

New sections added (36 new tests):

Section 22 — Unary math ops preserve F-contig layout (9 tests):
- np.abs/negative/sqrt/exp/log1p/sin/square on F-contig
- NumPy: all preserve F-contig; values tests verify math correctness
- [OpenBugs]: 7 (layout not preserved; values correct)

Section 23 — Comparison ops preserve F-contig layout (4 tests):
- ==, <, >= on F+F -> F-contig bool output in NumPy
- [OpenBugs]: 3 (layout); values test passes

Section 24 — Bitwise ops preserve F-contig layout (2 tests):
- & and | on F+F
- [OpenBugs]: 2

Section 25 — Statistical ops (6 tests):
- std/var/argmin/argmax math correctness on F-contig (all pass)
- cumsum axis=0 values match (pass)
- cumsum axis=0 layout preservation ([OpenBugs]: output not F-contig)

Section 26 — Concatenation/stacking (4 tests):
- concatenate(CC, axis=0) values match (pass)
- concatenate/vstack/hstack of F-arrays preserve F
- [OpenBugs]: 3 (layout)

Section 27 — Manipulation (4 tests):
- repeat produces C-contig in NumPy (pass, matches NumSharp)
- expand_dims preserves F-contig (pass - NumSharp works correctly here)
- squeeze values preserved (pass)
- roll values match NumPy (pass)

Section 28 — MatMul/Dot (3 tests):
- matmul CC and FF both produce C-contig in NumPy (all pass)
- np.dot values match NumPy (pass)

Section 29 — Boolean masking/fancy indexing (2 tests):
- 1-D bool mask result is both C and F contig (pass)
- bool mask values match NumPy (pass)

Section 30 — Missing function APIs (2 tests):
- np.tile doesn't exist [OpenBugs]
- np.flip doesn't exist [OpenBugs]

Results (net8.0 and net10.0):
- 103 total tests
- 60 pass (NumPy behavior matches)
- 43 fail (all marked [OpenBugs], excluded from CI)

43 [OpenBugs] category breakdown:
- Copy/array/conversion ignore order: 7
- _like functions don't preserve F-contig: 5
- flatten/ravel order ignored or missing: 5
- arithmetic/broadcast don't preserve F-contig: 3
- unary math ops don't preserve F-contig: 7
- comparison ops don't preserve F-contig: 3
- bitwise ops don't preserve F-contig: 2
- cumsum axis op doesn't preserve F-contig: 1
- concatenate/vstack/hstack don't preserve F-contig: 3
- missing API parameters/functions: 7

Key findings from this round:
- NumSharp's expand_dims ALREADY correctly preserves F-contig (good!)
- All math correctness tests pass — F-layout doesn't break values
- MatMul behavior matches NumPy: always C-contig output regardless of input
- Boolean masking produces correct 1-D result (both C and F contig)

CI verification:
- TestCategory!=OpenBugs: 6083 pass, 0 fail (net8.0 and net10.0)
- TestCategory=OpenBugs: 43 fail (as expected)
…adcasting

Expands OrderSupport.OpenBugs.Tests.cs from 103 to 150 tests.

New sections added (47 new tests):

Section 31 — Extended unary math ops preserve F-contig (14 tests):
- np.ceil/floor/trunc/reciprocal/sign/cos/tan
- np.log/log10/log2/exp2/expm1/cbrt
- Plus ceil values-correct test
- [OpenBugs]: 13 (layout not preserved; values correct)

Section 32 — Division / power preserve F-contig (5 tests):
- true_divide, floor_divide, mod (%), power
- Plus true_divide values test
- [OpenBugs]: 4 (layout)

Section 33 — In-place ops preserve F-contig (1 test):
- fArr += 1 should keep F-contig (mutates same buffer)
- [OpenBugs]: 1

Section 34 — Selection/clip/pairwise (6 tests):
- np.where (missing, [OpenBugs])
- np.clip preserve layout [OpenBugs] + values test (pass)
- np.maximum/minimum preserve layout [OpenBugs]
- np.modf preserve layout [OpenBugs]

Section 35 — NaN-aware reductions math correctness (4 tests):
- nansum/nanmean/nanmax/nanmin on F-contig with NaN values
- All pass (math result correct regardless of layout)

Section 36 — Boolean reductions (3 tests):
- np.any, np.all, np.count_nonzero on F-contig
- All pass (math correctness)

Section 37 — isnan/isinf/isfinite preserve F-contig (4 tests):
- 3 [OpenBugs] layout tests + 1 values test (pass)

Section 38 — Broadcasting/axis manipulation (4 tests):
- np.broadcast_to values + layout (neither C nor F — stride=0)
- np.moveaxis/swapaxes layout on F-contig 3D
- All pass

Section 39 — argsort/unique/outer (4 tests):
- np.argsort on F-contig [OpenBugs] — throws DebugAssertException
- np.unique 1-D result is both C and F contig (pass)
- np.outer values + layout (pass, C-contig as expected)

Section 40 — Fancy/slice write preservation (2 tests):
- SliceWrite preserves F-contig (pass! NumSharp correctly mutates in place)
- FancyWrite [OpenBugs] — may not preserve F-contig

Results (net8.0 and net10.0):
- 150 total tests (was 103)
- 79 pass (NumPy behavior matches)
- 71 fail (all marked [OpenBugs], excluded from CI)

Key discoveries:
- np.argsort fails on F-contig arrays with DebugAssertException
  (type mismatch between int32 indices and int64 result)
- np.unique returns 1-D which is both C and F contig (matches NumPy)
- np.broadcast_to result is neither C nor F contig (stride=0 is correct)
- Slice write (arr["1:3, :"] = value) preserves F-contig (pass)
- np.swapaxes(F_3D, 0, 2) produces C-contig (reversed strides, matches NumPy)
- All math-correctness tests pass — values never wrong due to layout

71 [OpenBugs] categorized:
- Unary math ops don't preserve F-contig: 13
- Binary ops (/, //, %, **) don't preserve F-contig: 4
- In-place ops may not preserve F-contig: 1
- Selection/pairwise (clip/min/max/modf) don't preserve F-contig: 4
- isnan/isinf/isfinite don't preserve F-contig: 3
- Fancy write may not preserve F-contig: 1
- np.argsort throws on F-contig: 1
- Missing functions (where, tile, flip): 3
- Previously-discovered gaps: 41 (from rounds 1-3)

CI verification:
- TestCategory!=OpenBugs: 6180 pass, 0 fail (2 pre-existing flaky tests
  in Dtype_Decimal_ScalarOnly_Add and NpyExpr_InputIndexOutOfRange_Throws
  unrelated to order changes)
- TestCategory=OpenBugs: 71 fail (as expected)
…type APIs (Groups A+B, 11 bugs)

Group A — Copy/conversion (6 bugs):
- NDArray.copy(order): resolves via OrderResolver, allocates F-contig destination
  shape when needed, and copies values through NpyIter.TryCopySameType /
  MultiIterator.Assign (both handle mixed-stride copies).
- np.copy(a, order): delegates to NDArray.copy(order); default changed from 'C'
  to NumPy-aligned 'K' (no behavioral change for C-contig sources because 'K'
  preserves layout, which for C-contig input is 'C').
- np.array(Array, ..., order): after the existing C-contig materialization, if
  the resolved physical order is 'F' and the result is multi-dim, relay out via
  copy('F').

Group B — _like + astype (5 bugs):
- np.empty_like / zeros_like / ones_like / full_like: add overloads accepting
  char order (default 'K'); wire through OrderResolver using the source shape.
  The existing single-order overloads now delegate to the new ones.
- NDArray.astype(Type|NPTypeCode, bool, char order): new overloads with order
  default 'K'. After the engine cast, if physical order is 'F' but the casted
  result is C-contig, relay out via copy('F'). Existing astype(dtype, copy)
  overloads delegate to the new one with 'K'.

Tests unmarked from [OpenBugs] (all now passing):
- NpCopy_FOrder_ProducesFContig
- NpCopy_AOrder_FSource_ProducesFContig
- NpCopy_KOrder_FSource_ProducesFContig
- NDArrayCopy_FOrder_ProducesFContig
- NDArrayCopy_AOrder_FSource_ProducesFContig
- NpArray_FromManaged_FOrder_ProducesFContig
- EmptyLike_FSource_KDefault_PreservesFContig
- ZerosLike_FSource_KDefault_PreservesFContig
- OnesLike_FSource_KDefault_PreservesFContig
- FullLike_FSource_KDefault_PreservesFContig
- Astype_FSource_KDefault_PreservesFContig

Verification:
- OrderSupportOpenBugsTests: 90 passing / 60 [OpenBugs] (was 79 / 71).
- Full CI-filter suite (net8.0 and net10.0): 6203 passing, 0 failed.

60 [OpenBugs] remain across Groups C–M (flatten/ravel, ILKernelGenerator
element-wise ops, concatenation, cumsum, asarray, argsort, fancy write,
missing functions).
Add user-extensible custom-op layer on top of the NpyIter scheduler, in
three tiers that all funnel through a new NpyInnerLoopFunc factory with
4×-unrolled SIMD + scalar-strided fallback + runtime contig dispatch.

TIERS
-----
• Tier A — ExecuteRawIL(body, key, aux=null)
  User emits the entire inner-loop body against the NumPy ufunc
  signature void(void** dataptrs, long* byteStrides, long count, void*).
  Full control. Cached by user-supplied key.

• Tier B — ExecuteElementWise(types, scalarBody, vectorBody, key)
  + Unary / Binary / Ternary convenience overloads
  User supplies per-element IL; factory wraps in 4×-unroll SIMD shell
  + 1-vec remainder + scalar tail + scalar-strided fallback. SIMD is
  enabled iff all operand dtypes are identical and SIMD-capable.

• Tier C — ExecuteExpression(expr, inputTypes, outputType, key?)
  Compose with NpyExpr operator syntax; Compile() emits IL for you.
  No ILGenerator exposure. Auto-derives cache key from structural
  signature when omitted.

NPYEXPR DSL (breadth)
---------------------
Binary: Add, Subtract, Multiply, Divide, Mod, Power, FloorDivide, ATan2,
        BitwiseAnd, BitwiseOr, BitwiseXor
Unary:  Negate, Abs, Sqrt, Square, Reciprocal, Sign, Cbrt,
        Exp, Exp2, Expm1, Log, Log2, Log10, Log1p,
        Sin, Cos, Tan, Sinh, Cosh, Tanh, ASin, ACos, ATan,
        Deg2Rad, Rad2Deg, Floor, Ceil, Round, Truncate,
        BitwiseNot, LogicalNot, IsNaN, IsFinite, IsInf
Comparison (returns 0/1 at output dtype): Equal, NotEqual, Less,
        LessEqual, Greater, GreaterEqual
Combinators: Min, Max, Clamp, Where(cond, a, b)
Operators: + - * / % & | ^ unary- ~ !

BUGS FIXED (pre-existing)
-------------------------
• NPTypeCode.SizeOf(Decimal) returned 32, but .NET decimal is 16 bytes;
  iterator's ElementSizes inherited 32, GetInnerLoopByteStrides returned
  wrong strides, decimal arithmetic overflowed on garbage. Fixed 32 → 16.

• ILKernelGenerator.EmitUnaryVectorOperation was private — needed by
  NpyExpr.UnaryNode.EmitVector. Promoted to internal.

BUGS FIXED (NpyExpr-specific, caught by battletest)
--------------------------------------------------
• IsNaN/IsFinite/IsInf emit I4 0/1 on stack but factory expected output
  dtype → inserted EmitConvertTo(Int32, outType) after predicate ops.

• LogicalNot's default emit path uses Ldc_I4_0 + Ceq which only works
  for I4-sized operands — silently broken for Int64, Single, Double,
  Decimal. UnaryNode now routes LogicalNot through EmitComparisonOperation
  with an output-dtype zero literal.

• WhereNode prelude was unfinished (threw InvalidOperationException at
  compile time). Rewrote: evaluate cond in outputType, compare to zero
  via EmitComparisonOperation(NotEqual) to normalize to verifiable I4,
  then brfalse. Works across all dtypes incl. decimal.

• MinMaxNode's branchy select didn't propagate NaN (non-IEEE). Rerouted
  to Math.Min/Math.Max which propagate NaN per IEEE 754, matching
  NumPy's np.minimum/np.maximum (not np.fmin/np.fmax).

• Round and Truncate excluded from NpyExpr.IsSimdUnary because
  Vector256.Round/Truncate are net9+ APIs; NumSharp targets net8 as
  well, where the emit path fails with "Could not find Round/Truncate
  for Vector256`1". Scalar path works on both frameworks.

INFRASTRUCTURE
--------------
• New ILKernelGenerator.InnerLoop.cs (~515 lines) — CompileRawInnerLoop,
  CompileInnerLoop, GenerateTemplatedInnerLoop, EmitSimdContigLoop,
  EmitScalarStridedLoop, EmitScalarElement, EmitAddrIPlusOffset,
  EmitAddrIStrided. Contains _innerLoopCache keyed by string.

• New NpyIter.Execution.Custom.cs (~150 lines) — ExecuteRawIL /
  ExecuteElementWise* / ExecuteExpression entry points on NpyIterRef,
  all validating operand counts and delegating to the factory.

• New NpyExpr.cs (~600 lines) — abstract NpyExpr base with EmitScalar /
  EmitVector / SupportsSimd / AppendSignature contract, plus InputNode,
  ConstNode, BinaryNode, UnaryNode, ComparisonNode, MinMaxNode,
  WhereNode node classes.

TESTING (226 tests, 0 regressions)
----------------------------------
• NpyIterCustomOpTests.cs — 14 basic three-tier tests
• NpyIterCustomOpEdgeCaseTests.cs — 76 tests covering sizes, dtypes,
  stride layouts, broadcast, cache, validation
• NpyExprExtensiveTests.cs — 136 tests covering happy path for every
  new op, NaN/Inf/overflow edge values, strided inputs, cache behavior,
  type promotion, operator overloads, compositions (sigmoid, ReLU,
  Leaky ReLU, hypot, clamp, NaN replacement), dtype matrix across
  integer types, float32 SIMD paths, stress sweeps across sizes.

Full test suite: 6339 passing, 11 skipped, 0 failed on both net8.0
and net10.0.

DOCUMENTATION
-------------
• New docs/website-src/docs/NDIter.md (~1290 lines) — comprehensive
  NpyIter reference including Custom Operations section, full Tier C
  node catalog, type discipline rules, SIMD coverage rules, caching
  behavior, 13 worked examples (hypot Tier C, linear Tier B, ReLU,
  Leaky ReLU, clamp, stable sigmoid via Where, NaN replacement,
  softmax-ish element-wise), performance tables, known-bug writeups.

• Amended toc.yml to link NDIter.md from the documentation index.
…roups C+D, 7 bugs)

Group C — flatten/ravel/reshape (6 bugs):
- NDArray.flatten(order): for order='F' (physical), returns the memory of a
  fresh copy('F') interpreted as a 1-D array. Because copy('F') produces
  F-contiguous memory whose linear byte order matches column-major iteration
  of the source's logical coordinates, a simple clone of that buffer wrapped
  in Shape.Vector(size) yields the NumPy-expected values.
- NDArray.ravel() split into ravel() (C-order) and ravel(char order); both
  delegate to np.ravel(a, order).
- np.ravel(NDArray a): kept for source compatibility; now calls
  np.ravel(a, 'C'). New overload np.ravel(NDArray a, char order) resolves via
  OrderResolver; F-order delegates to a.flatten('F'); C-order preserves the
  original view-when-possible / copy-when-sliced semantics.
- NDArray.reshape(Shape, char order): new overload. For order='F', uses
  flatten('F') to read the source column-major, then wraps that buffer in a
  Shape built with F-strides — matching NumPy's
  np.arange(12).reshape((3,4), order='F') value layout
  ([[0,3,6,9],[1,4,7,10],[2,5,8,11]]) and F-contiguous flags.

Group D — np.eye order overload (1 bug):
- np.eye adds optional order parameter. We still build the identity in C-order
  (where the existing flat-index diagonal walk works reliably; m.flat on an
  F-contig array produces a disconnected clone because reshape(...) copies
  non-C-contig sources), then relay out via m.copy('F') when order='F'.

Tests unmarked from [OpenBugs] and rewritten where they were placeholder
  false.Should().BeTrue(...) api-gap markers:
- Flatten_CContig_FOrder_MatchesNumPy
- Flatten_FContig_FOrder_MatchesNumPy
- Ravel_FOrder_ApiGap (now calls arr.ravel('F') and asserts F-order values)
- NpRavel_CContig_FOrder_MatchesNumPy_ApiGap (now calls np.ravel(arr, 'F'))
- NpRavel_FContig_FOrder_MatchesNumPy_ApiGap (now calls np.ravel(arrT, 'F'))
- Reshape_FOrder_FillColumnMajor (now calls reshape(new Shape(3,4), 'F')
  and asserts both the F-contig flag and the column-major value layout)
- Eye_FOrder_IsFContig_ApiGap (now calls np.eye(3, order: 'F') and asserts
  F-contig flag + identity values)

Verification:
- OrderSupportOpenBugsTests: 97 passing / 53 [OpenBugs] (was 90 / 60).
- Full CI-filter suite (net8.0): 6346 passing, 0 failed.
… preserve F in concat/cumsum (Groups E+G+H, 8 bugs)

Group E — asarray / asanyarray + missing as-functions (4 bugs):
- np.asarray(NDArray, Type?, char order='K'): new overload. Returns the input
  as-is when both dtype and physical layout already match (NumPy no-copy
  semantics); otherwise delegates to astype(order) when retyping or copy(order)
  when relaying out.
- np.asanyarray(in object, Type?, char order): new overload. For NDArray inputs
  it routes through asarray; for other inputs it converts and then applies the
  requested layout (no-op for scalars / 1-D / matching).
- np.asfortranarray(NDArray, Type? dtype=null): new file, thin wrapper over
  asarray(a, dtype, 'F').
- np.ascontiguousarray(NDArray, Type? dtype=null): new file, thin wrapper over
  asarray(a, dtype, 'C').

Group G — Concatenation layout preservation (3 bugs):
- np.concatenate(NDArray[], int axis): when every input is strictly F-contiguous
  (IsFContiguous && !IsContiguous), allocate the destination shape with F-strides
  via new Shape(dims, 'F') instead of the default C-contig shape. The existing
  slice-based Assign loop works unchanged because writeTo[:,n,:] derives the
  correct offset/strides from the F-contig base shape.
- vstack/hstack: no changes needed — they delegate to concatenate which now
  preserves F-layout automatically.

Group H — Cumsum axis layout preservation (1 bug):
- np.cumsum(NDArray, int? axis, NPTypeCode?): post-process the engine result —
  when axis is provided and the source is strictly F-contig but the engine
  returned a C-contig result, relay out via result.copy('F'). Internal
  ReduceCumAdd still allocates C-contiguous; this keeps the fix minimal.

Tests unmarked from [OpenBugs] and rewritten away from placeholder
false.Should().BeTrue(...) asserts:
- Asarray_FOrder_ProducesFContig_ApiGap
- Asanyarray_FOrder_ProducesFContig_ApiGap
- AsFortranArray_ProducesFContig_ApiGap
- AsContiguousArray_ProducesCContig_ApiGap
- CumSumAxis0_FContig_PreservesFContig
- Concatenate_FF_Axis0_PreservesFContig
- VStack_FF_PreservesFContig
- HStack_FF_PreservesFContig

Verification:
- OrderSupportOpenBugsTests: 105 passing / 45 [OpenBugs] (was 97 / 53).
- Full CI-filter suite (net8.0): 6354 passing, 0 failed.

45 [OpenBugs] remain: ILKernelGenerator element-wise layout (~30 bugs, Group F),
argsort crash (Group I), fancy write (Group J), missing functions tile/flip/where (Group K).
…st (Group I, 1 bug)

argsort's internal SortLong<T> path uses this[long[]] + GetAtIndex<T>, which
follows the logical-C-order iteration pattern. On non-C-contig inputs
(F-contig, sliced, transposed) the existing code was correct for dtype matching
but the path generally assumes C-contig layout; NumPy's argsort also always
produces a C-contig index array regardless of input layout. Fix: when the
source is not C-contig, take a C-contig copy and argsort it — a one-line
guard that matches NumPy's semantics and keeps the result C-contig.

Tests:
- ArgSort_FContig_ProducesCContig: unmarked [OpenBugs], corrected T from
  argsort<int> to argsort<long> (np.arange returns Int64 in NumSharp;
  the test previously asserted on an impossible type mismatch, crashing
  independent of F-contig).

Group J note — FancyWrite_FContig_PreservesFContig remains [OpenBugs]:
investigation showed the underlying SetIndicesND Debug.Assert
(dstOffsets.size == values.size) fires on scalar-to-multi-row fancy writes
for BOTH C-contig and F-contig inputs. This is a pre-existing indexing bug,
not an F-order divergence. The [OpenBugs] comment is updated to capture the
real root cause so the next pass can target the actual fix.

Verification:
- OrderSupportOpenBugsTests: 106 passing / 44 [OpenBugs] (was 105 / 45).
- Full CI-filter suite (net8.0): 6355 passing, 0 failed.
…ew examples

Substantial expansion and corrections to the Tier C / NpyExpr docs after
the battletest round. 340 insertions / 85 deletions.

TOC
---
Expanded to expose every Tier C subsection: Node catalog, Operator overloads,
Type discipline, SIMD coverage rules, Caching and auto-keys, Validation and
errors, Gotchas, Debugging compiled kernels, When to use Tier C. The top
entry stays at one line; the tier splits under it reveal the full structure
without forcing readers to scroll.

NODE CATALOG
------------
Added a "NumPy equivalent" column to every op table (Binary arithmetic,
Binary bitwise, Scalar-branchy, Unary arithmetic, Unary exp/log, Unary
trig, Unary rounding, Unary bitwise/logical/predicates, Comparisons).
Readers can now cross-reference np.* names directly:
  Add → np.add, Mod → np.mod (floored), Abs → np.abs, ATan2 → np.arctan2,
  IsNaN → np.isnan, Equal → np.equal, Min → np.minimum (not np.fmin),
  Round → np.rint, Power → np.power, Log1p → np.log1p, etc.

Clarified NaN semantics for Comparisons (any NaN operand yields 0, matching
IEEE 754). Noted that Where branches are both emitted into IL but only the
taken branch executes at runtime — a real fusion optimization over the
"branchless" cond*a + (1-cond)*b pattern when one branch is expensive.

TYPE DISCIPLINE
---------------
Added a concrete integer→float promotion example:
  Input is Int32, outputType is Double. InputNode emits Ldind_I4 + Conv_R8
  at load time; all downstream nodes see Double intermediates; Sqrt is
  Math.Sqrt(double); Stind_R8 at store time. Explains where the auto-convert
  happens and why the DSL doesn't need mixed-type nodes.

CACHING AND AUTO-KEYS
---------------------
Fixed cache-key examples — the signatures were showing abbreviated enum
names ("Mul", "Cmp") but NpyExpr.AppendSignature emits the full enum
.ToString(). Verified against a runtime introspection:
  Before:  Add(Mul(In[0],Const[2]),Const[3])
  After:   Add(Multiply(In[0],Const[2]),Const[3])

Added a signature-prefix lookup table mapping each node class to its text
fragment (InputNode → "In[i]", BinaryNode → "<BinaryOp>(L,R)",
ComparisonNode → "Cmp<Op>(L,R)", MinMaxNode → "Min(L,R)" / "Max(L,R)",
WhereNode → "Where(C,A,B)", etc.).

Added notes on constant value sensitivity (x+1 vs x+2 are distinct kernels)
and the integer/float Const collision (Const(1) and Const(1.0) serialize
to the same "Const[1]" and share a cache entry — correct behavior when the
output dtype determines IL; worth explicit callout).

VALIDATION AND ERRORS (NEW SECTION)
-----------------------------------
New subsection tabulating every argument error the DSL reports:
  NpyExpr.Input(-1) → ArgumentOutOfRangeException at factory
  NpyExpr.Sqrt(null) → ArgumentNullException at node ctor
  ExecuteExpression(..., null, ...) → ArgumentNullException at bridge
  Too-few inputs for operand count → ArgumentException at bridge
  Input(5) with 2 inputs → InvalidOperationException at compile

Plus runtime errors (divide-by-zero on integer divisors, Power(neg, frac)
yielding NaN, Conv_* overflow semantics matching unchecked{} casts).

GOTCHAS (NEW SECTION)
---------------------
Eight common pitfalls with concrete examples:
  • NaN-propagation in Min/Max matches np.minimum (not np.fmin) — with a
    worked fmin composition via IsNaN + Where
  • Mod is floored (NumPy/Python), not truncated (C# %)
  • Integer / float division-by-zero contrast
  • Silent constant truncation to output dtype
  • Bitwise ops require integer output dtype
  • LogicalNot semantics (x == 0, not x != 0)
  • Silent input-dtype mismatches (buffer the iterator if unsure)
  • Integer-output comparisons lose fractional constants
  • Where IL size grows with branch nesting

DEBUGGING COMPILED KERNELS (NEW SECTION)
----------------------------------------
Practical guide for when a Tier C kernel misbehaves:
  • Inspect ILKernelGenerator.InnerLoopCachedCount (internal access needed)
  • Print AppendSignature manually to diagnose cache-key mismatches
  • Reduce to a minimal tree on a 3-element input
  • Double-check output dtype matches output NDArray dtype
  • Note on DynamicMethod IL dumping (not supported out of the box)

FOUR NEW WORKED EXAMPLES (14-17)
--------------------------------
14. Manual abs via comparison + Where (pedagogical; slower than built-in Abs)
15. Heaviside step function via three-way nested Where
16. Polynomial evaluation via Horner's method (fully SIMD-capable tree)
17. abs(sin(x)) piecewise composition (fused scalar kernel, no temporary)

All seventeen examples now appear in a mini-TOC at the top of the Worked
Examples section, grouped by tier (Layers 1-3 / Tier B / Tier C).

FOUR BUG WRITE-UPS (E/F/G/H)
----------------------------
Bug E (fixed): predicates silently wrote I4 0/1 into 8-byte double slots,
  producing near-zero denormals instead of 1.0. Fix: UnaryNode inserts
  trailing EmitConvertTo(Int32, outType) for IsNaN/IsFinite/IsInf.
  Caught by IsNaN_Double test.

Bug F (fixed): LogicalNot broken for Int64/Single/Double/Decimal outputs.
  EmitUnaryScalarOperation uses Ldc_I4_0+Ceq which is only correct for
  I4-sized operands; for Double on the stack, ceq(double, I4_0) is
  type-mismatched IL producing always-1 on our hardware. Fix: UnaryNode
  special-cases LogicalNot, routing through EmitComparisonOperation(Equal,
  outType) with a properly-typed zero literal (EmitPushZero emits
  Ldc_R8 0.0 / Ldc_I8 0L / decimal.Zero as appropriate).
  Caught by LogicalNot_Double_Operator test.

Bug G (exposed): Vector256.Round/Truncate are .NET 9+ only; NumSharp
  targets net8 as well. ILKernelGenerator.CanUseUnarySimd claims SIMD
  support, but EmitUnaryVectorOperation fails at method lookup with
  "Could not find Round/Truncate for Vector256`1". Production code never
  hit this because np.round/np.trunc paths are rarely exercised with
  SIMD dispatch. Tier C exercises every op/dtype combo. Fix in NpyExpr:
  IsSimdUnary excludes Round and Truncate, scalar path only — JIT
  autovectorizes post-tier-1 anyway. Upstream fix possible via
  #if NET9_0_OR_GREATER gating in CanUseUnarySimd.
  Caught by Truncate_Double test.

Bug H (fixed): MinMaxNode's branchy EmitComparisonOperation+brfalse
  approach returned the non-NaN operand for min(NaN, 3.0), matching C#
  <= semantics. NumPy's np.minimum propagates NaN per IEEE 754. Fix:
  reflect typeof(Math).GetMethod("Min"|"Max") and emit a direct call;
  Math.Min/Max propagate NaN. Falls back to the branchy path for Char /
  Boolean where no Math overload exists (no NaN concern for those).
  Caught by Min_Double_NaNPropagation test.

PERFORMANCE
-----------
Added a "Custom op overhead breakdown" table distinguishing compile,
auto-key derivation, runtime contig check, and scalar-strided fallback
overheads. Added quantitative note on fusion: avoiding a temporary array
saves ~8 MB of memory traffic per 1M float32 element → ~300 μs at typical
30 GB/s RAM bandwidth. Fusing 3 ops into one Tier C kernel can beat 3
baked Layer 3 calls by 1-2× when memory-bound.
… implement IsInf (Group F, 41 bugs)

Summary
=======
The ILKernelGenerator IL loops (SimdFull / SimdScalar* / SimdChunk / General) all
write the result sequentially in linear C-order via an `i * resultSize` byte
offset; the kernel signature doesn't even receive result strides. Refactoring
every IL emitter to accept arbitrary output strides would touch ~21K lines
across 27 partial files. Instead, this change preserves NumPy's layout semantics
at the engine dispatch sites: the result is allocated C-contiguous (unchanged),
the kernel runs (unchanged), and a cheap `.copy('F')` relays the buffer to
F-contig layout only when every non-scalar operand is strictly F-contig.

Central dispatchers updated (covers the vast majority of element-wise ops):
- DefaultEngine.ExecuteBinaryOp: call `ShouldProduceFContigOutput(lhs, rhs, result.Shape)`
  after the kernel. Rule: no non-scalar operand may be strictly C-contig, and
  at least one non-scalar operand must be strictly F-contig. Matches NumPy's
  `F+F->F`, `C+C->C`, `F+C->C`, `F*scalar->F`, `F+FCol->F` behavior.
  (Bitwise AND/OR/XOR also routes through here.)
- DefaultEngine.ExecuteUnaryOp: single-operand variant of the same rule.
- DefaultEngine.ExecuteComparisonOp: same rule, wrap the bool result back via
  `.MakeGeneric<bool>()` after copy.

Non-dispatcher paths updated individually:
- np.negative: NDArray.negative bypasses ExecuteUnaryOp (dtype-preserving direct
  loop over a clone). Wrapped at the np.* layer.
- np.clip: TensorEngine.ClipNDArray return path. Added `PreserveFContigFromSource`
  helper (uses `ReferenceEquals` to dodge NDArray's operator!= overload, which
  otherwise forces `&&` through operator&).
- np.modf: two-output variant of the same helper applied to each returned array.
  (np.maximum / np.minimum route through np.clip so are covered transitively.)

Additionally:
- DefaultEngine.IsInf was stubbed to return null (caused NRE on any IsInf call).
  Now wired through ExecuteUnaryOp with UnaryOp.IsInf (the IL kernel is already
  emitted in ILKernelGenerator.Unary.*). IsInf is now functional and also
  inherits F-contig preservation from ExecuteUnaryOp.

Tests
=====
Bulk-unmarked [OpenBugs] from the 41 element-wise tests that now pass (unary
math, binary arithmetic, comparisons, bitwise, division/power, clip, min/max,
modf, isnan/isinf/isfinite, in-place, broadcast). [OpenBugs] markers remain
only on the four truly-failing tests:
  - Tile_ApiGap, Flip_ApiGap, Where_ApiGap (Group K — unimplemented functions)
  - FancyWrite_FContig_PreservesFContig (pre-existing SetIndicesND bug)

Verification
============
- OrderSupportOpenBugsTests: 146 passing / 4 [OpenBugs] (was 106 / 44).
- Full CI-filter suite (net8.0): 6395 passing, 0 failed.

Total remaining [OpenBugs] = 4 (only missing functions + 1 unrelated pre-existing bug).
The F-order flatten path did `this.copy('F')` (allocates a fresh MemoryBlock)
and then `fcopy.Array.Clone()` (allocates another MemoryBlock and memcpy's the
same bytes). Since copy('F') already returns a buffer that nothing else
references, we can reinterpret its ArraySlice directly in a 1-D Shape without
re-copying — halves the allocations and memcpy on this path.

ArraySlice<T> is a readonly struct wrapping UnmanagedMemoryBlock<T>; multiple
UnmanagedStorage instances can safely share the same MemoryBlock (GC owns the
native allocation's lifetime via the block).

Verification: same 10/10 Flatten/Ravel/Reshape tests pass; CI-filter suite on
net8.0 still 6395 passing / 0 failed.
New CallNode + factory overloads let users invoke any .NET method per
element as part of an NpyExpr tree. Scalar-only by design (SIMD for
arbitrary managed calls is infeasible), but fuses the call with the
surrounding expression — single pass, no temporaries.

FACTORY OVERLOADS
-----------------
1. Typed generic Func<...> overloads (arity 0-4) — enable method groups
   without an explicit cast:
     NpyExpr.Call(Math.Sqrt, x);             // Func<double,double> inferred
     NpyExpr.Call(Math.Pow, x, y);           // Func<double,double,double> inferred
     NpyExpr.Call(provider);                 // zero-arg Func<TR>
     NpyExpr.Call(lerp, a, b, t);            // 3-arg
     NpyExpr.Call(quad, a, b, c, d);         // 4-arg

2. Delegate catch-all — for any Delegate instance:
     NpyExpr.Call((Func<double,double>)Math.Abs, x);   // cast-disambig
     NpyExpr.Call(myDelegate, x, y);

3. MethodInfo (static, no target):
     var mi = typeof(Math).GetMethod("Tanh", new[] { typeof(double) });
     NpyExpr.Call(mi, x);

4. MethodInfo + target (instance method):
     var mi = typeof(MyProvider).GetMethod("Apply");
     NpyExpr.Call(mi, myProvider, x);

DISPATCH PATHS
--------------
Three emit strategies, selected automatically at node construction:

| Condition                       | Emitted IL                          |
|---------------------------------|-------------------------------------|
| Static method, no target        | call <methodinfo>                   |
| Instance MethodInfo + target    | ldc.i4 id → LookupTarget            |
|                                 | → castclass T → callvirt <mi>       |
| Any other Delegate              | ldc.i4 id → LookupDelegate          |
|                                 | → castclass Func<..> → callvirt Inv |

Static methods are zero-indirection — the JIT may inline. Instance and
delegate calls go through a process-wide DelegateSlots registry keyed
by monotonically-increasing int. Lookup is ~5 ns per element.

TYPE DISCIPLINE
---------------
Arguments auto-convert from ctx.OutputType to each parameter's dtype via
the existing EmitConvertTo primitive (same path as InputNode's load-time
conversion). Return value converts back to ctx.OutputType. So:
   NpyExpr.Call(Math.Sqrt, Input(0))
with Int32 input and Double output promotes int→double at the call site,
runs Math.Sqrt(double), stores double — no user-visible type plumbing.

Supported param and return types: the 12 NPTypeCode dtypes (bool, byte,
int16/32/64, uint16/32/64, char, float, double, decimal). Void return,
generic unbound, ref/out params, or unsupported types (string, Complex,
user structs) throw ArgumentException at node construction.

CACHE KEY
---------
CallNode.AppendSignature emits:
  Call[<FullName>.<Name>#<MetadataToken>@<ModuleVersionId>](args)
with an extra ",target#<slotId>" suffix for bound-instance variants.
MetadataToken + ModuleVersionId disambiguates across dynamic assemblies.
Two call sites to the same method share the same kernel; different
methods get distinct cache entries.

DELEGATE SLOTS
--------------
Process-wide ConcurrentDictionary<int, Delegate> + ConcurrentDictionary
<int, object>. Strong references — entries persist for the process
lifetime. Users MUST register delegates once at startup (static field,
DI singleton) rather than per-call to avoid unbounded growth. This is
documented in the Gotchas section of NDIter.md.

VALIDATION
----------
ArgumentNullException for null delegate, null MethodInfo, or null arg.
ArgumentException for:
  • arg count mismatch with method arity
  • void return type
  • unsupported param/return types
  • instance method without target, static method with target
  • target type incompatible with method's DeclaringType

TESTS (38 new)
--------------
NpyExprCallTests.cs covers:
  • Typed overloads with method groups (Sqrt, Pow, Math.Abs cast-disambig)
  • Captured lambdas with closure state (unary, binary)
  • MethodInfo for static + user-defined methods
  • MethodInfo + target for instance methods (including state mutation)
  • Zero-arg Func<TR>, 3-arg, 4-arg
  • Type conversion: Int32→Double, Double return narrowing to Single,
    int-returning method widening to double tree
  • Composition with arithmetic + Where
  • Cache behavior (same method reuses, distinct methods don't)
  • Auto-derived cache key works
  • Nine validation cases (null, type mismatch, arity mismatch, void
    return, string param, instance/static target mismatch)
  • Strided input via scalar fallback
  • Size stress sweep (2, 7, 32, 65, 1024)
  • MathF float32 path

Total 264 tests passing across custom-op + NpyExpr suites on net8 + net10,
0 regressions in full suite (6433 total).

DOCUMENTATION
-------------
NDIter.md amended:
  • New Call table in Node catalog with dispatch-path breakdown
  • CallNode entry added to the cache-key signature-prefix table
  • Two new example cache keys showing Call structures
  • Call added to SIMD coverage "No" list with rationale
  • Five new Gotchas specific to Call (delegate lifetime, method-group
    ambiguity, scalar perf cost, NaN widening through int-returning
    methods, registration-once-at-startup guidance)
  • Two new worked examples (18-19):
    - Swish activation via static readonly Func delegate
    - Reflected MethodInfo with stateful instance provider
  • Worked Examples mini-TOC updated
…pe docs

Three follow-ups from a self-review of the Groups A–F changes:

1. NDArray.Copy.cs — share-by-reference bug. `new Shape(this.Shape.dimensions, 'F')`
   handed the destination Shape the SAME long[] as the source. Shape is a
   readonly struct but it exposes `this[int] { set; }` which mutates
   `dimensions[i]` directly, so a caller who mutated the source Shape
   (e.g. `src.Shape[0] = n`) would corrupt the copy's dimensions too.
   Fixed by cloning: `(long[])this.Shape.dimensions.Clone()`.

2. np.modf(NDArray, Type) — missing F-contig preservation. The NPTypeCode
   overload had the wrapper; the Type overload still returned raw engine
   results. Extracted the logic into a shared `PreserveFContig` helper that
   both overloads now route through, so the layout rule is applied uniformly.

3. NDArray.reshape(Shape, char order) — doc-only. The F-order path does not
   handle the -1 placeholder dim (it would silently produce a negative
   Shape.size and throw an IncorrectShapeException from UnmanagedStorage
   instead of inferring). Added a remarks note so callers know to pre-compute
   the inferred dim. (C-order path still supports -1 via the standard reshape.)

4. DefaultEngine.CompareOp.cs — cosmetic. Dropped the redundant
   `(NDArray<bool>)` cast in the F-preservation branch; MakeGeneric<bool>()
   already returns NDArray<bool>.

Verification: full CI-filter suite on net8.0: 6433 passing, 0 failed.
…tion

Second-pass amendment of the Call surface: the first commit buried
Call's dispatch/lifetime mechanics in a single table entry. This promotes
the complex bits to their own navigable subsection and adds a new
"Memory model and lifetime" section that finally gives the three
long-lived caches (kernels, delegate slots, iterator operands) a single
authoritative home.

TOC
---
Two new entries under Tier C:
  • Call — invoke any .NET method
  • Memory model and lifetime

NEW SUBSECTION — "Call — invoke any .NET method"
------------------------------------------------
Pulled out of the Node catalog table and given its own ~140-line
subsection. Structured as:

  • One-paragraph overview ("DSL escape hatch")
  • ASCII diagram of the one-node-three-paths architecture
  • Path A — static methods with code examples (Func overloads AND
    MethodInfo form)
  • Path B — bound instance methods with worked example
  • Path C — captured delegates with worked example
  • Auto-conversion at the call boundary (box diagram showing
    outputType → param type → method runs → return type → outputType)
  • Overload disambiguation — Math.Sqrt binds without cast, Math.Abs
    needs one. Cast examples for all three common cases (double, float
    via MathF, long). MethodInfo alternative for signature-explicit
    picking.
  • Thread safety (DelegateSlots uses Interlocked + ConcurrentDictionary;
    kernel compilation under GetOrAdd atomicity; kernels are re-entrant)
  • Performance envelope table with concrete slowdown ratios relative
    to a built-in op (~1.5× for Path A, ~2-3× for B, ~2-4× for C),
    with the note that the ratio collapses toward 1× as the target
    method's work grows.

The Node catalog's Call table now just lists the four factory shapes
and cross-references the new subsection.

NEW SUBSECTION — "Memory model and lifetime"
---------------------------------------------
Three things live longer than you might expect:

1. Compiled kernels in _innerLoopCache — process-lifetime, keyed by
   string, typical steady-state ~100-200 KB across a few dozen kernels.
   Documented inspection API (InnerLoopCachedCount, ClearInnerLoopCache)
   with scripting caveat (internal → needs AssemblyName override).

2. DelegateSlots (strong-ref by design — weak refs would race against
   GC while a kernel still holds the slot ID). Table comparing typical
   patterns: static method (zero), cached delegate reused (one),
   per-call lambda (linear leak). Test hooks (RegisteredCount, Clear)
   with the explicit warning that clearing while kernels hold slot IDs
   causes KeyNotFoundException from inside generated IL.

3. NDArrays referenced by the iterator — orthogonal but mentioned for
   completeness; released on Dispose.

Closes with the "registration-once" pattern: a static class with
static readonly Func fields for activations (Swish, GELU shown).

EXPANDED "Debugging compiled kernels"
-------------------------------------
Added:
  • DelegateSlots.RegisteredCount for diagnosing per-call lambda
    allocation
  • Warning about pairing DelegateSlots.Clear() with
    ILKernelGenerator.ClearInnerLoopCache()
  • Call signature includes MetadataToken + ModuleVersionId — explains
    what "same method name but different kernel" looks like
  • Three new error-message diagnosis entries (method-group ambiguous,
    void return, target type mismatch) mapping the compiler/runtime
    error to the usage mistake that caused it

EXPANDED "When to use Tier C"
-----------------------------
New decision tree walking through:
  Layer 3 → Tier C → Tier C+Call → Tier B → Tier A
based on "is this in the baked catalog?", "can I express it as DSL
nodes?", "is it a BCL / user method?", "do I need intrinsics?", and
"is the loop shape non-rectangular?"

EXPANDED "Allocations"
---------------------
New table distinguishing per-call vs one-time allocation for every
custom-op tier, with explicit Tier C + Call row calling out the
DelegateSlots retention cost. Cross-links the anti-pattern (per-call
lambda → unbounded slot growth) to the Memory-model section.

EXPANDED Performance → "Custom op overhead breakdown"
-----------------------------------------------------
Added two new rows for Call dispatch (Path A vs Path B/C) with
concrete IL sequence and ~ns cost.

Added a "When Call pays off" subsection articulating the tradeoff:
non-trivial user method → dispatch overhead is a few-percent tax on
something that was never going to SIMD anyway. Trivial user method
(x => x * 2) → compose out of DSL primitives, keep SIMD, run 3-5×
faster.

242 insertions / 15 deletions. Still zero regressions (264/264 custom-op
+ NpyExpr tests pass net8 + net10).
The "Kernel Integration Layer" intro previously diagrammed only Layers
1-3 (pre-custom-op era). With four more entry points added (Tier A/B/C
and Tier C+Call), the right mental model is an ergonomics-vs-control
axis with seven stops, not a three-layer stack. This amend replaces
the obsolete diagram and adds four navigable subsections so readers
can orient before diving into the per-layer deep dives.

NEW SECTIONS (after Kernel Integration Layer intro):

• Quick reference — 7-row table mapping each entry point to
  (when-to-use, per-call cost). Covers Layer 1/2/3 + Tier A/B/C + Call
  uniformly with one-liner guidance.

• Decision tree — top-level, mirrors the one inside Tier C but walks
  through all seven entry points in priority order: baked ufunc → DSL
  → Call → Tier B → Tier A → Layer 2 reduction → Layer 1. Same form
  as the docs' existing Tier-C-local tree but extended.

• Measured behavior — benchmark table with concrete ms/run numbers
  from the showcase script for six representative tasks (Add f32,
  2a+3b V256, AnyNonZero early-exit, abs-diff raw IL, GELU via Call,
  stable sigmoid DSL). Notes the JIT tier-0 caveat for Layer 1/2
  element-wise kernels under dynamic hosts.

• Cache state — two lifetimes to know about — surfaces the internal
  inspection hooks (InnerLoopCachedCount, RegisteredCount, Clear
  methods) with a typical post-showcase count (4 kernels / 131 slots)
  and cross-links the Memory-model section for the slot-leak gotcha.

UPDATED DIAGRAM:
---------------
Replaced the Layer-1-only / Layer-2 / Layer-3 ASCII stack with a
two-axis ergonomics-vs-control chart showing all 7 entry points on
the same plane. Bottom still converges on NpyIter state +
ILKernelGenerator so readers see the shared substrate.

TOC:
----
Added four sub-entries under "Kernel Integration Layer" (Quick
reference, Decision tree, Measured behavior, Cache state) so the
per-layer deep dives remain findable but the new orientation
material surfaces first.

90 insertions total. Zero test regressions (264/264 custom-op +
NpyExpr tests pass on net8 + net10).
Explicit the hierarchy — Tier A/B/C were always sub-tiers of Layer 3
(the baked-ufunc layer). Numbering them `3A/3B/3C` makes the
relationship visible at a glance:

  Layer 1  —  ForEach (delegate)
  Layer 2  —  ExecuteGeneric (struct-generic)
  Layer 3  —  ExecuteBinary / Unary / ...  (baked)
  Tier 3A  —  ExecuteRawIL              (sub-tier: custom IL)
  Tier 3B  —  ExecuteElementWise        (sub-tier: templated)
  Tier 3C  —  ExecuteExpression / Call  (sub-tier: DSL)

100 references touched across 6 files:
  docs/website-src/docs/NDIter.md  — prose, TOC, anchor links, worked-
                                    example heading anchors (#6, #7, #8)
  src/NumSharp.Core/Backends/Iterators/NpyExpr.cs       — header comment
  src/NumSharp.Core/Backends/Iterators/NpyIter.Execution.Custom.cs
    — file header, region comments for each tier entry point
  src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.InnerLoop.cs
    — factory method docstrings
  test/NumSharp.UnitTest/Backends/Iterators/NpyIterCustomOpTests.cs
    — class docstring, region comments, 10 test method names
    (TierA_* → Tier3A_*, TierB_* → Tier3B_*, TierC_* → Tier3C_*)
  test/NumSharp.UnitTest/Backends/Iterators/NpyIterCustomOpEdgeCaseTests.cs
    — region comments, 2 test method names (Validate_TierA_* →
    Validate_Tier3A_*)

No behavior changes. 264/264 NpyExpr + custom-op tests pass on net8 +
net10. Full suite still green (0 regressions).
… stale 32 → 16

Two consistency bugs in NPTypeCode.cs size constants:

1. `NPTypeCode.Char.SizeOf()` returned 1 byte — but .NET `char` is UTF-16
   (2 bytes). Verified: `Unsafe.SizeOf<char>()`, `Marshal.SizeOf<char>()` for a
   managed-struct lookup, managed `char[]` element stride, and NumSharp's
   `UnmanagedStorage<char>` stride all report 2. `InfoOf<char>.Size` already
   correctly returned 2 — so the same disagreement class as the former Decimal
   bug (SizeOf=32 vs real=16, fixed in b0803aef) existed for Char.

   Live impact in every iterator / kernel / cast / buffer path that reads
   `typeCode.SizeOf()` or `InfoOf.GetSize(dtype)`:

   - `NpyIter.State.SetOpDType` at NpyIter.State.cs:543,558 writes this into
     `ElementSizes[op]`, which is multiplied by stride in ~8 places to advance
     `DataPtrs[op]`. With ElementSizes[op]=1 but real char stride=2, iteration
     stepped 1 byte per element — landing on the high byte (zero for ASCII)
     every other step.
   - `NpyIterCasting.cs` (8 call sites) — casts to/from Char read/wrote 1 byte
     per element, truncating to low byte only. Lossy for non-ASCII.
   - `np.frombuffer(buffer, Char)` — interpreted 1 byte per char from the
     source buffer, misaligned for UTF-16 input.
   - `np.dtype(char).itemsize` returned 1 — wrong for buffer-size math.
   - Axis reductions (`ILKernelGenerator.Reduction.Axis.cs:201-202`,
     `Reduction.Axis.VarStd.cs:602`) used wrong output stride for Char dest.

   The bug survived without test failures because NumPy doesn't have a native
   "char" dtype — NumSharp's Char is .NET-specific and rare in practice. ASCII
   reads also *appear* correct because little-endian UTF-16 puts the ASCII
   byte in position 0, so 1-byte stepping yields `[A, \0, B, \0, ...]`
   instead of outright garbage.

2. `GetPriority(Decimal) = 5 * 10 * 32` was stale after the Decimal SizeOf fix
   (b0803aef). The formula is `group * 10 * sizeOf`, and Decimal's real size
   is 16 — so the constant is now `5 * 10 * 16 = 800`. Zero behavioral
   impact: relative ordering vs Double (400) and Complex (5000) is preserved
   either way, so `np.find_common_type` behaves identically. Purely a
   consistency cleanup so the constant reflects reality.

All 6433 non-OpenBugs/non-HighMemory tests pass after the fix.
…sion

Adds a runnable experiment under examples/NeuralNetwork.NumSharp/MnistMlp/
demonstrating that NpyExpr collapses the bias-add + ReLU chunk of each
dense-layer forward pass into one NpyIter invocation (zero intermediate
NDArray allocations for the post-matmul element-wise work).

Architecture:
  784 -> 128 (ReLU) -> 10 (raw logits), float32, He-init weights.

Forward-pass structure:
  Layer 1:  preact  = np.dot(x, W1)            <- one matmul
            hidden  = NpyIter:  Max(in0 + in1, 0)   <- one kernel, one iter
  Layer 2:  preact  = np.dot(hidden, W2)       <- one matmul
            logits  = NpyIter:  in0 + in1      <- one kernel, one iter

Four primitives per forward pass total (2 matmuls + 2 element-wise).
The fused kernels use NpyExpr.Max(Input(0) + Input(1), Const(0f)) and
NpyExpr.Input(0) + NpyExpr.Input(1), compiled once per unique cacheKey
and cache-hit on every subsequent forward pass. Bias (shape (N,))
broadcasts across the batch dim of preact (shape (batch, N)) via
NpyIter's natural right-aligned stride-0 insertion.

Measured on Windows 11 x64 (net8.0):
  Full forward pass:
    Fused:  1.50 ms / pass (median of 5 runs, 500 passes each)
    Naive:  2.36 ms / pass
    Speedup: 1.58x (matmul-dominated so this is noisy; multi-run median)
  Isolated bias+ReLU (matmul stripped, float32 (N, 128)):
    N=128    Fused 0.103 ms  Naive 0.295 ms  2.88x
    N=1024   Fused 0.770 ms  Naive 2.230 ms  2.90x
    N=4096   Fused 3.285 ms  Naive 9.484 ms  2.89x
    N=16384  Fused 13.02 ms  Naive 37.55 ms  2.88x
  Kernel cache delta: 3 (layer 1 fused relu + layer 2 bias-only +
                         isolation-bench kernel) -- invariant across
                         iteration count because cacheKey is stable.
  Delegate slots: 0 (pure DSL -- no user captured lambdas).

Correctness: bit-for-bit agreement with the naive np.add + np.maximum
composition (max |fused - naive| == 0). Accuracy on the test batch with
random He-init weights is ~8% / 128, matching chance (~10%) for 10-class
classification -- the experiment is a fusion + perf demo, not training.

Implementation notes:
- MnistLoader.cs parses the standard big-endian IDX format; falls back
  to deterministic synthetic data when t10k-images.idx3-ubyte /
  t10k-labels.idx1-ubyte aren't present, so the experiment is
  self-contained. Place real MNIST files in
  examples/NeuralNetwork.NumSharp/data/ (or bin/Debug/netX.Y/data/)
  to run against genuine digits.
- FusedMlp.cs builds a fresh NpyIterRef per forward pass (MultiNew with
  EXTERNAL_LOOP + NPY_NO_CASTING + READONLY/WRITEONLY op flags) and
  dispatches an NpyExpr tree via ExecuteExpression with a stable
  cacheKey.  Two such kernels, one per layer.
- NaiveMlp.cs composes np.dot, np.add, np.maximum -- each op allocates
  its own intermediate and runs its own iteration.
- Program.cs reports multi-run median for the matmul-heavy full pass
  (where per-run variance is higher than the fusion savings) and a
  single measurement for the isolated element-wise sweep (where fusion
  dominates and numbers are rock-solid across sizes).

Supporting changes:
- src/NumSharp.Core/Assembly/Properties.cs: add
  InternalsVisibleTo("NeuralNetwork.NumSharp") so the examples project
  can reference NpyIterRef (internal ref struct), NpyExpr's internal
  DelegateSlots, and ILKernelGenerator.InnerLoopCachedCount.
- examples/NeuralNetwork.NumSharp/NeuralNetwork.NumSharp.csproj:
  flip to OutputType=Exe, enable AllowUnsafeBlocks for MnistLoader's
  raw byte reader, set Nullable=disable to keep the example consistent
  with the project's historical style.

Bug found during development (filed as a note, not fixed in this commit):
- np.allclose calls astype(Double, copy:false) on both operands, which
  in NumSharp's current implementation mutates the caller's NDArray
  dtype in place (operand comes back reporting dtype=Double even if it
  was Single going in). NumPy guarantees astype(copy:false) returns the
  same array if the dtype matches, otherwise a new copy. The experiment
  works around this by using a manual max-abs-diff loop for the
  correctness check. See examples/NeuralNetwork.NumSharp/MnistMlp/
  Program.cs:82-83.

Build / test: 0 warnings, 0 errors on net8.0 and net10.0; full NpyExpr
test suite (174 tests) and iterator test family (681 tests) remain
green.
…sts, 4 [OpenBugs])

Coverage for the previously-untested reduction-with-keepdims path, spanning
sum, mean, min, max, prod, std, var plus the NaN-aware variants
nansum/nanmean/nanstd/nanvar.

2-D path (13 tests, passing)
- Input: np.arange(12).reshape(3,4).T → F-contig (4,3).
- Result with axis=0/1 + keepdims=True: shape (1,3) or (4,1) — trivially
  both C- and F-contig because any size-1 dim makes a shape both-contig.
- All values asserted against NumPy 2.x output.
- NaN-aware variants use the same F-contig source with NaN at [0,0];
  ddof=0 default, matching NumPy.

3-D path (4 tests, [OpenBugs])
- Input: np.empty((2,3,4), order='F') → F-contig 3-D.
- NumPy: keepdims reductions preserve F-contig layout; e.g., sum(F3,
  axis=0, keepdims=True) is shape (1,3,4) with C=0, F=1.
- NumSharp: flips to C-contig (C=1, F=0). Flagged as [OpenBugs] because
  the 3-D reduction kernel writes result in linear C-order regardless of
  input layout — same post-hoc copy fix as element-wise dispatchers would
  apply here.
- Covered ops: sum (keepdims + no keepdims), mean (keepdims), nansum
  (keepdims). One representative test per reduction family to isolate
  the dispatcher path vs. the per-op implementation.

Test suite status
- CI-filter suite: 6446 passing, 0 failed (previously 6433; +13 non-OpenBugs).
- Section 41 tests in isolation: 23 passed, 4 [OpenBugs] failures as expected.
Documents that np.sort is not implemented (listed under Missing Functions
in docs/CLAUDE.md); only np.argsort exists. Single [OpenBugs] sentinel so
any future port of numpy test suites that call np.sort surfaces as a known
gap rather than a hidden failure.
… 0 [OpenBugs])

All passing — confirms NumSharp parity with NumPy for the linear-algebra
output-layout contracts.

Findings (parity confirmed against NumPy 2.x)
- matmul(F,F) / matmul(C,F) / matmul(F,C) → always C-contig output (2-D).
  Values match regardless of input layout permutation.
- dot(F,F) → always C-contig; values match dot(C,C).
- outer(1-D, 1-D) → always C-contig, shape (M,N).
- convolve(a, b, mode) → 1-D, trivially both C- and F-contig for all three
  modes (valid/full/same). Value checks cover the full signal.

Note: existing Section 28 only covered matmul(C@C) and matmul via .T.T.
This section adds true F-contig inputs via copy('F') to exercise the
F-contig code paths that weren't previously touched, plus mixed C/F
operand permutations.
…nBugs])

All passing — NumSharp's broadcast primitives correctly produce NumPy-aligned
stride patterns and layout flags when sourced from F-contig inputs.

Findings (parity confirmed against NumPy 2.x)
- broadcast_to(F(4,3), (2,4,3)) → strides (0, 8, 32), neither C- nor F-contig.
- broadcast_to(C(3,4), (2,3,4)) → strides (0, 32, 8), neither flag.
  The stride=0 leading dim always knocks BOTH contiguity flags off.
- broadcast_to values verified for (2,4,3) case — replication along the new
  outer axis preserves inner data indexing for any input layout.
- broadcast_arrays(F, scalar) → first output preserves F-contig (shape
  already matches target, no stride=0); second is all-stride-0 (neither flag).
- broadcast_arrays(F(2,3), F(2,1)) → first F preserved; second has stride=0
  on the broadcast axis (neither flag).

These confirm that F-contig source arrays are handled correctly through
the broadcasting pipeline, at least for shape/layout — a real expectation
given broadcasting is a Shape/strides-only operation (no value copy).
…m SGD

Extends the previous fusion-demo into a fully operational trained
classifier using the NeuralNetwork.NumSharp BaseLayer/BaseCost/
BaseOptimizer abstractions. Forward AND backward passes each collapse
the post-matmul element-wise chunk into a single NpyIter kernel.
Training end-to-end: per-epoch loss and accuracy, final test-set
evaluation, IL-kernel cache and delegate-slot reporting.

Architecture
------------
  784 (input) -> 128 (ReLU) -> 10 (linear logits), float32 throughout.

  Forward pass (per layer, fused bias+activation in ONE NpyIter):
    ReLU layer: y = max(xW + b, 0)                 -- NpyExpr.Max(Input(0)+Input(1), 0)
    Linear layer: y = xW + b                        -- NpyExpr.Input(0) + NpyExpr.Input(1)

  Backward pass (per layer, fused ReLU gradient mask in ONE NpyIter):
    ReLU backward: gradPreact = gradOut * (y > 0)   -- NpyExpr.Input(0) * NpyExpr.Greater(Input(1), 0)
    Linear backward: gradPreact = gradOut           -- pass-through

  Loss: SoftmaxCrossEntropy (combined, numerically stable). Forward
    computes max-subtracted softmax + categorical cross-entropy;
    Backward returns (softmax - labels)/batch via a cached softmax.

  Optimizer: Adam (the existing class, with the ms/vs init bug fixed).

Training signal
---------------
Synthetic fallback now generates *learnable* data: 10 class templates
in [-1,1]^784 shared across train/test splits + per-sample Gaussian
noise sigma=1.5. Both splits share templates so generalization is
meaningful.

Measured on a 6000-train / 1000-test synthetic split, batch_size=128,
Adam lr=0.001, 5 epochs:

  Epoch  1/5  loss=0.4183  train_acc=88.47%  (~20s)
  Epoch  2/5  loss=0.0013  train_acc=100.00% (~20s)
  Epoch  3/5  loss=0.0009  train_acc=100.00% (~20s)
  Epoch  4/5  loss=0.0007  train_acc=100.00% (~20s)
  Epoch  5/5  loss=0.0006  train_acc=100.00% (~20s)
  Final test accuracy: 100.00%   Total: 100.7s (net8) / 96.5s (net10)

Fusion probe (post-matmul bias+ReLU, 200 passes, 500-iter warmup
to cross .NET tiered-JIT promotion): 2.4-3.0x speedup fused vs.
np.add + np.maximum. Correctness: bit-exact (max |diff| = 0).

Kernel cache delta: 6 (one per unique expression-dtype combination:
fused bias+ReLU forward, fused bias-only forward, fused ReLU backward,
fused bias+ReLU forward [probe path], and two kept in FusedMlp/NaiveMlp
from the earlier commit). Invariant across epoch / batch iteration
count -- compiled once per process, cache-hit thereafter.
Delegate slots: 0 (pure DSL composition, no captured lambdas).

Files
-----
New:
  examples/NeuralNetwork.NumSharp/MnistMlp/FullyConnectedFused.cs
    -- Dense layer with bias + optional fused activation. Parameters["w"]
       / Parameters["b"] + Grads["w"] / Grads["b"] per the BaseLayer
       contract, so the stock Adam optimizer iterates it without change.
       Three fused NpyIter kernels: bias+ReLU forward, bias-only forward,
       ReLU gradient mask for backward. He-init for ReLU, Xavier for
       linear. Float32 end-to-end.

  examples/NeuralNetwork.NumSharp/MnistMlp/SoftmaxCrossEntropy.cs
    -- Combined softmax + categorical cross-entropy loss. Forward does
       max-subtracted softmax then CE; Backward returns the simplified
       (softmax - labels)/batch form (numerically stable -- avoids
       differentiating log(softmax) on the critical path). Caches
       softmax output between Forward and Backward calls. Ships a
       OneHot helper that handles Byte / Int32 / Int64 label dtypes.

  examples/NeuralNetwork.NumSharp/MnistMlp/MlpTrainer.cs
    -- Explicit training + evaluation loop that uses the existing
       BaseLayer / BaseCost / BaseOptimizer abstractions. Sidesteps
       the built-in NeuralNet.Train which uses
       x[currentIndex, currentIndex + batchSize]  -- that's 2-index
       integer selection in NumSharp, not slicing, and reads the wrong
       data silently. MlpTrainer uses the correct x[$"{start}:{end}"]
       form. Evaluate() runs the same forward pass over the full test
       set and argmax-compares against integer labels.

Modified:
  examples/NeuralNetwork.NumSharp/MnistMlp/MnistLoader.cs
    -- Added LoadFullDataset(dataDir, syntheticTrain, syntheticTest, seed)
       for the canonical train-images/train-labels/t10k-images/t10k-labels
       filename set, plus learnable synthetic fallback
       (SynthesizeSamples with shared class templates across splits).

  examples/NeuralNetwork.NumSharp/MnistMlp/Program.cs
    -- Rewritten to drive the training pipeline end-to-end: load data,
       fusion probe with correctness + speed check, build model (2x
       FullyConnectedFused), train with Adam + SoftmaxCrossEntropy,
       report per-epoch stats + final test accuracy + kernel
       instrumentation.

  examples/NeuralNetwork.NumSharp/Optimizers/Adam.cs
    -- Fixed the ms/vs zero-init. The existing code had the init paths
       commented out with //ToDo: np.full, so layer.Parameters["w"]
       threw KeyNotFoundException on the first step. Now initializes
       via np.zeros(param.Shape, param.dtype).

Audit notes (not changed in this commit)
----------------------------------------
Other components in the example project are stubbed-out with //ToDo:
markers:
  - Softmax.Forward and Sigmoid.Forward have empty bodies.
  - CategoricalCrossentropy doesn't clip predictions and its Backward
    formula assumes softmax has already been applied (it hasn't). Uses
    np.log(preds) with no epsilon -- div-by-zero on saturation.
  - Accuacy.Calculate (note misspelling) calls np.argmax(preds) without
    axis, so it returns a scalar not a per-row argmax. Useless for
    batched accuracy.
  - NeuralNet.Train uses x[i, j] (2-index integer selection) where
    x[$"{i}:{j}"] (slice) was intended -- training on the wrong data.

The new code bypasses each of these with its own correctly-implemented
path. If and when they get fixed in place, callers can migrate.

Build / test: 0 warnings, 0 errors on net8.0 and net10.0; full
NumSharp.UnitTest (6446 tests excluding OpenBugs/HighMemory) passes
with the Adam fix applied.
Nucs added 24 commits May 30, 2026 20:03
…classes

New corpus 'nanreduce.jsonl': nansum/nanprod/nanmax/nanmin/nanmean/nanstd/nanvar/nanmedian
over NaN/inf-laced float (and int/complex) operands across axis/keepdims combinations. The
edge pools front-load NaN/+/-inf so every slice forces the ignore-NaN contract.

nanmax/nanmin/nanprod are clean. The matrix shows the rest are broadly broken (526/2040
cells diverge; documented in MisalignedRegistry + FUZZ_COVERAGE_BUGS.md, left unfixed):

  W4-A nanmean/nanstd/nanvar return [1] not scalar [] on a 1-D axis reduction, and ignore
        keepdims entirely on the integer input path (shape divergence).
  W4-B nansum/nanmean miscompute on the strided/axis path: an all-NaN slice yields 2^31
        instead of 0; nanmean divides by the wrong count (32.0625 vs 32.0).
  W4-C nanmedian PROPAGATES NaN instead of ignoring it (returns NaN where NumPy returns the
        non-NaN median).
  W4-D nansum(complex128, axis) on a 1-D array throws NDCoordinatesAxisIncrementor (shared
        complex-1D-axis defect).
  W4-E nanmean/nanstd/nanvar over an empty float16 array (axis=None) throw 'NDIterator empty
        shape' instead of returning NaN.

All 21 FuzzMatrix tests green (net10.0).
…g class

New corpus gating cumsum/cumprod (axis None + per-axis, NEP50 accumulator) and diff (n=1,2,
axis 0/last, output shrinks by n) over int16/int32/int64/uint8/uint16/float32/float64/
complex128 across 8 layouts (incl. F-contig, transposed, strided, negative-stride, size-1).

np.diff is fully bit-exact with NumPy. cumsum/cumprod have one defect:

  W5-A cumsum/cumprod on a SIZE-1 int16/int32/uint8/uint16 array preserve the input dtype
        instead of applying the NEP50 accumulator widening (int->int64, uint->uint64) that
        the size>1 path applies correctly — a one-element fast-path promotion skip.

Documented in MisalignedRegistry + FUZZ_COVERAGE_BUGS.md (left unfixed). All 22 FuzzMatrix
tests green (net10.0).
…ug classes

New corpus gating median/average/ptp (axis+keepdims), count_nonzero, percentile/quantile
(q in {0,25,50,75,100} / {0,.25,.5,.75,1}, axis None/0/last), and clip (a,a_min,a_max)
across int/uint/float dtypes and 7 layouts.

ptp and count_nonzero are bit-exact. Four defects found (documented in MisalignedRegistry +
FUZZ_COVERAGE_BUGS.md, left unfixed):

  W6-A median/percentile/quantile mishandle ±inf/NaN slices: the partition + linear
        interpolation ((a+b)/2, a+(b-a)*f) yields NaN where NumPy doesn't (and vice-versa),
        e.g. (+inf + -inf)/2.
  W6-B percentile/quantile on INTEGER input (axis path) return GROSS wrong values (sign
        flips, e.g. NumPy +8192 vs NumSharp -8191) — a real QuantileEngine interpolation bug.
  W6-C average drifts from NumPy by summation order (pairwise vs naive) on large magnitudes.
  W6-D clip(NaN,lo,hi) clamps NaN to a_min instead of preserving NaN (NumPy passthrough) —
        clip's min/max sorts NaN below the lower bound.

All 23 FuzzMatrix tests green (net10.0).
…, 828 cases)

New corpus: isnan/isinf/isfinite (unary->bool), maximum/minimum (NaN-propagating), fmax/fmin
(NaN-ignoring), isclose (binary->bool) over NaN/inf-laced operands and 9 pairwise layouts.

isnan/isinf/isfinite are bit-exact. Three defects found (documented in MisalignedRegistry +
FUZZ_COVERAGE_BUGS.md, left unfixed):

  W7-A maximum/minimum/fmax/fmin SCRAMBLE the result when an operand is F-contiguous/strided:
        the kernel walks the operand in memory order ignoring its strides, pairing the wrong
        elements. C-contiguous is bit-exact and add/sub/mul handle the same F-contig operand
        correctly, so this is extrema-kernel-specific.
  W7-B fmax/fmin PROPAGATE NaN (fmax(0,NaN)->NaN) instead of ignoring it per the NumPy
        contract -- they behave identically to maximum/minimum.
  W7-C isclose diverges on an F-contiguous complex operand (same strided-pairing family).

All 24 FuzzMatrix tests green (net10.0).
…ug class

modf split into two corpus ops (modf_frac / modf_int) so the harness bit-compares BOTH
outputs against NumPy across float/int dtypes and 8 layouts (incl. negative-stride).

modf(float32)/modf(float64) are bit-exact on both outputs, including the C-standard
signed-zero/inf edges (modf(-0.0)=(-0.0,-0.0), modf(inf)=(0.0,inf)). One defect:

  W8-A modf(float16) and modf(int) throw 'modf only supports Single, Double, Decimal' --
        no Half kernel and no integer->float64 promotion (NumPy returns float16 for Half and
        promotes int input to float64).

All 25 FuzzMatrix tests green (net10.0).
… bug classes

New corpus gating the shape-manipulation family that had ZERO coverage:
- single-array: ravel/transpose/expand_dims/squeeze/roll/repeat/tile/reshape/swapaxes/
  moveaxis/delete/atleast_1d/2d/3d across all 25 layouts (heavy stride/offset/empty/0-D
  coverage since these ops only move bytes).
- two-array: concatenate (per axis), stack (per axis), hstack/vstack/dstack, contiguous +
  strided operands.
- pad: constant/edge/reflect/wrap modes.

The bulk is bit-exact with NumPy. Three defects (documented in MisalignedRegistry +
FUZZ_COVERAGE_BUGS.md, left unfixed):

  W9-A expand_dims on an empty (0,3) array returns [0,3] instead of NumPy [1,0,3] -- the
        inserted axis is dropped on a zero-size array.
  W9-B repeat IGNORES Shape.offset: on an offset slice (b[2:7]) or a 0-D view at a non-zero
        offset it repeats elements read from the base buffer START -> wrong data. Contiguous
        / offset-0 repeat is bit-exact.
  W9-C atleast_3d on an empty (0,3) array returns [0,3] instead of [0,3,1] (same zero-size
        axis-insertion family as W9-A).

All 26 FuzzMatrix tests green (net10.0).
… bit-exact)

New corpus gating argsort (1-D + 2-D, axis 0/1/-1, distinct values to avoid unstable-sort
tie-break ambiguity), searchsorted (side left/right), and nonzero (1-D int64 indices) over
int32/int64/uint8/float32/float64.

OpRegistry dispatches the generic np.argsort<T> by typecode (test-side only).

Result: 35/35 bit-exact with NumPy 2.4.2 -- including the int64 index result dtype. No bugs
found, no Misaligned excusals. All 27 FuzzMatrix tests green (net10.0).
…it-exact)

New corpus stressing the V128/V256/V512 lane seams: add/subtract/multiply/negative/abs/sqrt/
sum/prod/max/min over 1-D arrays sized 1,2,3,7,8,9,15,16,17,31,32,33,63,64,65,127,128,129
across int32/int64/uint8/float32/float64. These straddle the points where the unrolled SIMD
body hands off to the 1-vector remainder and the scalar tail.

Result: 900/900 bit-exact -- the three-stage loop has no off-by-one at any boundary. No new
OpRegistry wiring (reuses existing ops). All 28 FuzzMatrix tests green (net10.0).
…t-exact)

The reduce tier only covered axis in {None, 0, last}. This tier exercises the untested
parameter dimensions:
- MIDDLE axis (1) and every NEGATIVE axis (-1/-2/-3) for all 11 reductions on a 3-D array.
- ddof=1 (sample) std/var on a 2-D array, axis None/0/1.
- order='F' ravel across C-contiguous, transposed, and F-contiguous sources.

Result: 288/288 bit-exact -- negative-axis resolution, ddof, and F-order read-out are fully
NumPy-aligned. No bugs found. All 29 FuzzMatrix tests green (net10.0).
…nl, 40 cases)

Covers the previously-untested operand-relationship flags: input aliasing (a op a passed as
the SAME reference both sides, via a new Case.Alias harness flag) and in-place out= (maximum/
minimum/clip writing into an input operand).

Key result: the aliasing + out= MECHANISM is sound -- input aliasing for add/sub/mul/maximum/
minimum is bit-exact, and the out= path writes every non-NaN element correctly (no read-before-
write corruption). The only divergences are NaN-semantics bugs at element 0:

  W11-A maximum/minimum DO NOT PROPAGATE NaN -- they return the non-NaN operand, behaving like
        fmax/fmin. The NaN semantics are exactly SWAPPED with W7-B (fmax/fmin wrongly propagate
        NaN; maximum/minimum wrongly ignore it). W7 missed this because its operands aligned
        NaN-with-NaN; the out= test's b=roll(a) places a finite value opposite the NaN.
  (W6-D) clip(NaN) clamps to a_min, re-confirmed on the out= path.

Documented in MisalignedRegistry + FUZZ_COVERAGE_BUGS.md (left unfixed). All 30 FuzzMatrix
tests green (net10.0).
…itical crash found

The other generators SKIP every case where NumPy raises, so 'NumPy raises => NumSharp raises
the same' was never asserted. New Case.Expects_Throw harness flag: the case carries no expected
buffer and the harness asserts the op throws SOMETHING (rather than silently producing a wrong
result).

10/10 gated cases hold error parity -- int**neg, broadcast mismatch (add), bool subtract,
matmul core-dim mismatch, bitwise_and/left_shift on float, concatenate/stack dim mismatch, bad
reshape, axis-out-of-range sum all throw in NumSharp as in NumPy. No silent-result parity gaps.

W14-A (CRITICAL, excluded from the gated corpus): np.invert(float64) does NOT raise a catchable
exception -- it executes an ILLEGAL CPU INSTRUCTION (System.ExecutionEngineException 'Illegal
instruction') and HARD-CRASHES the process. NumPy raises a clean TypeError. The bitwise-NOT IL
kernel runs on float registers; the JIT accepts the IL but the CPU rejects it at runtime. This
is uncatchable, so it cannot be gated in the corpus (it would kill the test host) -- documented
in FUZZ_COVERAGE_BUGS.md as a red crash bug. (bitwise_and(float) by contrast throws a catchable
InvalidProgramException; left_shift(float) a clean TypeError.)

All FuzzMatrix tests green (net10.0).
…11 properties)

Oracle-free internal-consistency properties that the differential corpus cannot express (no
NumPy needed) -- they catch round-trip / involution / identity / order-invariance bugs:

  -(-a)==a, (a+b)-b==a, (a^T)^T==a, reshape round-trip, widening-cast round-trip
  (int32->int64->int32 etc.), a*1==a / a+0==a, abs(abs(a))==abs(a), sum(a)==sum(ravel(a)),
  concatenate split-free, argsort(sorted)==0..n-1, (a==a).all().

All 11 hold across int/uint/float dtypes -- NumSharp's core algebraic consistency is solid. No
bugs found.

This completes the 'all x all' coverage campaign (W1-W15): ~27k new differential cases across
every op tier, all 13 dtypes, section-C operand flags, parameter sweeps, SIMD-tail boundaries,
error parity, and metamorphic invariants. 28 bug classes documented in FUZZ_COVERAGE_BUGS.md
(left unfixed per the campaign directive; gate stays green via scoped MisalignedRegistry
excusals). All 42 FuzzMatrix tests green (net10.0).
…ransposed reductions

The general (non-fast-path) branch of AxisReductionSimdHelper re-derived each
output's input base via a div/mod coordinate decode PER output element, then did
a cache-hostile strided gather down the reduce axis PER output —
O(outputs x stridedGather). On strided / transposed inputs this measured 9-24x
slower than NumPy 2.4.2 in Release config:

  f32 a[:, ::2] sum axis0          : 6.8ms  vs NumPy 0.57ms
  f32 transpose(2,0,1) sum axis2   : 78ms   vs NumPy 3.2ms
  f64 transpose(2,0,1) sum axis2   : 107ms  vs NumPy 6.1ms
  int64 transpose(2,0,1) sum axis2 : 109ms  vs NumPy 8.4ms

Replace it with NumPy's reduce-loop shape (NPY_ITFLAG_REUSE_REDUCE_LOOPS): walk
the reduce axis as the OUTER loop and FOLD each slab into the output via an
incremental odometer over the non-reduced dims. Base pointers advance by their
strides each step (no per-output div/mod re-derivation -- the reduce loop is
reused across successive output positions), ordered so the smallest-input-stride
dim streams innermost; the output accumulator stays hot. Turns
O(outputs x stridedGather) into O(in_bytes) streaming.

Strategy dispatch (NumPy iterates the smallest-stride axis innermost):
  - reduce axis is NOT the innermost memory axis (a non-reduced dim has a
    smaller |stride|) -> slab-accumulate.
  - reduce axis IS the innermost (smallest |stride|) run -> per-output reduce,
    kept INLINE/verbatim from the original loop. Extracting it to a separate
    method regressed the typeof(T) gather/scalar inner-reducer codegen for
    double/int64 (~3-5x on strided-innermost sum), so it stays inline.

Results (Release, 2048^2 / 256^3, min-of-N), same-type SIMD dtypes:
  f32   strided sum a0  4.87 -> 0.81 (6.0x)   transp 69  -> 26 (2.6x)
  f64   strided sum a0  9.19 -> 1.44 (6.4x)   transp 107 -> 31 (3.4x)
  int64 strided sum a0  19.0 -> 1.67 (11x)    transp 109 -> 31 (3.5x)
Strided-outer reductions now at NumPy parity; transposed 2.6-3.5x faster (still
~4-8x off NumPy -- output-strided writes want cache blocking, follow-up). Note
int32/uint32/int16 sums widen (int32->int64) through the separate widening
kernel and are unaffected here.

Combination order along the axis is sequential (slab 0,1,2,...): exact for
integer and min/max; for float sum/prod it matches NumPy's buffered strided
reduce inner loop. Contiguous fast paths (leading/innermost) are untouched and
remain at/above NumPy parity. The prior session's "sum axis=0 ~7x slower"
finding was a dotnet_run Debug-config artifact (struct-method combine calls do
not inline in a debuggable assembly); in Release those paths are already fast.

Verified: 1534 reduction tests pass (256 axis-kernel, 61 FuzzMatrix
reduce/scan/stat differential vs NumPy, 1217 statistics/math), plus a 540-case
oracle sweep reduce(view) == reduce(view.copy()) across 9 dtypes x
strided/transposed/sliced/negative-stride layouts x sum/prod/min/max/mean.
…n for contiguous binary/comparison/unary ops

NumSharp analogue of NumPy's check_for_trivial_loop + try_trivial_single_output_loop
(ufunc_object.c:2235), which handle a single strided inner loop "without using the
(heavy) iterator." Previously EVERY binary, comparison, and unary op was routed
through NpyIterRef.MultiNew unconditionally, paying iterator construction on every
call: a NativeMemory.AllocZeroed state block + broadcast-shape calc + ValidateIterShape
+ AllocateDimArrays + per-operand cast scan + coalesce/reorder. Measured at
~600-2000 ns/call (isolated): 22-24% of a small contiguous op, <=3% once n>=64K.
For a trivially-contiguous op that construction buys nothing — ForEach already
collapses the contiguous case to a single inner-loop kernel call (IsSingleInnerLoop),
so the iterator's coalesce/broadcast/buffer machinery has no inefficiency to amortize.

This adds an O(1)-gated fast path BEFORE the NpyIter route in ExecuteBinaryOp,
ExecuteComparisonOp, and ExecuteUnaryOp. The gate is pure cached-ArrayFlags reads
(IsContiguous / IsFContiguous / IsBroadcasted are single bitmask ANDs) plus
Shape.Equals (size + dimensions, layout-agnostic). When the operands share ONE
contiguous layout (both C, or both F), have identical shape (no broadcast), and
-- binary only -- the same dtype (no cast), a single linear walk over each buffer
visits the same logical element, so we route straight to the existing DirectIL
whole-array kernels (SimdFull / SimdScalarLeft / SimdScalarRight for binary &
comparison; the contiguous UnaryKernel for unary) and skip MultiNew entirely.

Coverage:
- Binary  (TryTrivialContiguousBinaryOp + TryScalarBroadcastBinaryOp): equal-shape
  C+F and array-vs-scalar; same-dtype only (promotion/mixed cases keep the NpyIter
  cast path). C output for C inputs, F output for strictly-F inputs.
- Comparison (TryTrivialContiguousComparisonOp): equal-shape C+F and array-vs-scalar;
  dtypes MAY differ (the comparison kernel promotes per element with no cast temp;
  result is always bool, allocated C or F to match the array operand's layout).
- Unary (TryTrivialContiguousUnaryOp): single contiguous operand (C+F). Forces the
  contiguous UnaryKernel variant and F-allocates for F input so both buffers walk in
  the same physical-linear = F-logical order. In/out dtypes may differ (predicate
  ops -> bool, Abs(complex) -> double).

Safety:
- Contiguous offset slices qualify -- ExecuteKernel/ExecuteUnaryKernel/
  ExecuteComparisonKernel already apply operand offset*elemSize.
- GetMixedTypeKernel/GetUnaryKernel/GetComparisonKernel THROW (not null) on
  unsupported emit (e.g. bool '-'); each bypass catches NotSupportedException and
  returns null so the exact pre-existing path raises/handles the case unchanged.
- Broadcast (stride-0 with extent>1), mixed C/F layouts, strided/transposed views,
  and cast cases all return null and proceed to the NpyIter route exactly as before.

Measured (warm min-of-k; Debug glue, so absolute ns are upper bounds, ratios hold):
  small ops n=8 now ~1350-1570 ns/call across all three families, down from the
  ~2600-3500 ns/call pre-bypass NpyIter baseline (~45-50% faster). n=64/1024 similar.
  n=4M is bandwidth-bound -- the bypass stays neutral (no regression), confirming the
  large-array path never needed it (construction already amortized there).

Verification: full suite 9447 passed / 0 failed / 11 skipped; FuzzMatrix differential
corpus 42/42 bit-identical to NumPy 2.4.2 across all 44 layout variations and dtypes;
correctness smoke across C-contig, F-contig (result stays F), offset slices, strided
(correctly skipped), broadcast (skipped), mixed dtype, and non-commutative scalar-LHS
(100-a, 2/(a+1)).

Purely additive: 328 insertions, 0 deletions -- no existing code paths modified. The
"array op double-literal" case (u + 3.0) still pays boxing + np.asanyarray in
operator+(NDArray, object) before reaching the engine; that is a separate operator-
dispatch concern and is not addressed here.
…terator construction

Reduces NpyIter construction overhead for the dominant same-shape elementwise
case (a OP b with the output the same shape) by skipping the broadcast
machinery when it would be an identity transform, and widens the internal shape
arrays to long for NumPy npy_intp parity.

Changes:
- broadcastShape / outputShape: int[] -> long[]. Iteration axes larger than
  int.MaxValue now survive construction without checked-narrowing, matching
  NumPy's npy_intp dimensions.
- CalculateBroadcastShape: fast path when every non-null operand shares identical
  dimensions (the dominant elementwise / same-shape-out case) -> return the shared
  dims directly, skipping the virtualShapes List + ToArray + ResolveReturnShape +
  Shape allocations.
- Operand stride/offset setup: fast path when an operand's shape already equals the
  iteration shape -> np.broadcast_to is an identity (same offset, same strides, no
  stride-0 insertion), so copy the operand's own offset+strides directly and skip
  the broadcast_to call and its Shape allocation (~25% of per-call iterator setup
  overhead per the inline note).
- Marked Initialize and CalculateBroadcastShape [MethodImpl(AggressiveOptimization)].

Build green; full unit suite (net10.0) passes 9447/0/11.
…ty-broadcast fast path (canonical-shape reuse, drop dead memset, share dims)

Builds on the O(1) trivial-loop bypass (e062f53) and the NpyIter identity-
broadcast fast paths (1ff75ae) by eliminating the residual per-call heap
allocations those paths still incurred. All three changes are layout/allocation
only — kernel emit, dtype handling, and numeric results are untouched.

1. Canonical-shape reuse in the bypass (CanonicalResultShape)
   Every bypass arm (binary equal-shape, scalar-broadcast, comparison, unary)
   built its result Shape via `(long[])src.dimensions.Clone()` + `new Shape(dims)`
   — two long[] allocations (the dims clone + ComputeContiguousStrides) plus four
   array walks (strides + size/hash + ComputeFlagsStatic's C/F/broadcast passes).
   When the source operand is ALREADY canonical (offset == 0, bufferSize == size,
   contiguous in the target order) it is byte-for-byte what the constructor would
   produce, and because Shape is an immutable readonly struct whose dims/strides
   arrays are never mutated, the result can share it verbatim. A sliced/offset
   source (offset != 0) or a window into a larger buffer (bufferSize > size) is NOT
   reused — the fresh result owns exactly `size` elements starting at 0, so it would
   inherit a bogus offset/bufferSize — and falls back to the original fresh-Shape
   construction. Helper lives in DefaultEngine.BinaryOp.cs, shared across the partials.

2. Drop the dead zero-fill on the comparison-bypass result
   The SimdFull / SimdScalarLeft / SimdScalarRight comparison kernels write every
   output byte (4x-unrolled SIMD + remainder + scalar tail span [0, size)), so the
   `new NDArray<bool>(shape, fillZeros: true)` memset was dead work. Switched to
   false (binary/unary bypass already allocate with false). Removes a full-size
   memset from every contiguous comparison — the dominant win on large arrays.

3. Return shared dims from NpyIter CalculateBroadcastShape fast path
   The all-operands-share-identical-dims fast path allocated `new long[ndim]` +
   Array.Copy before returning. broadcastShape is consumed strictly read-only by
   Initialize (copied element-wise into _state->Shape and, only for ALLOCATE
   outputs, cloned — never mutated in place), and Initialize is its only caller, so
   the defensive copy is unnecessary: return the operand's immutable dimensions
   array directly. Saves one Gen0 allocation per NpyIter fast-path call (the
   mixed-dtype / strided / broadcast elementwise route the bypass defers to).

Evidence (net10.0, Release):
- Deterministic allocated-bytes/op (GC.GetAllocatedBytesForCurrentThread, noise-free):
  every elementwise bypass op  1416 -> 1352  (-64 B/op = the eliminated dims clone +
  strides array); scalar arms  2867 -> 2800 / 2864 -> 2800; NpyIter mixed-dtype
  1968 -> 1936  (-32 B/op = the eliminated broadcastShape copy). Zero regressions —
  every measured op strictly lower.
- Wall-clock (6 interleaved process samples, complete distribution separation =
  after's slowest beats before's fastest, immune to machine drift):
  bin_add_8 -10%, bin_sub_8 -10%, un_neg_8 -8%, and the 1M-element comparison
  u>v -18% (the dropped memset). Sub-microsecond scalar/sqrt ops sit inside the
  machine-drift noise floor on wall-clock (CONTROL, byte-identical in both builds,
  swung +-24% across the same samples) but deterministically allocate -64 B/op.

Correctness:
- CI suite (TestCategory!=OpenBugs&!=HighMemory) 9447 passed / 0 failed / 11 skipped
  — byte-identical to the pre-change baseline.
- 26/26 edge-case probe: fresh contiguous (reuse), contiguous slice offset!=0
  (fall back; result offset==0, bufferSize==size), head slice offset==0 but
  bufferSize>size (fall back), strict-F 2-D (reuse F; result F-contig), comparison
  across SIMD-tail sizes 17/33 with fillZeros=false (no garbage leak), sliced
  comparison source, scalar-broadcast, unary on fresh/slice/F.
… flat reductions (8x on strided sum)

The non-contiguous flat reduction kernel (EmitReductionStridedLoop, used by all
flat reductions: sum/prod/min/max/mean/std/argmax/argmin/all/any over sliced,
strided, reversed or otherwise non-C-contiguous 1-D inputs) decoded the element
offset from the flat index with a full coordinate walk EVERY element:
    for d = ndim-1..0:  coord = idx % shape[d];  idx /= shape[d];  off += coord*strides[d]
i.e. an integer div + mod + mul per element. For a 1-D strided view that is one
idiv per element where a single `off += strides[0]` suffices — turning a
memory/throughput-bound reduction into a compute-bound one.

This is the dominant non-contiguous reduction case (np.sum(a[::2]), reductions
over a sliced/reversed view, etc.) and measured 3-14x slower than NumPy purely
from that per-element idiv.

Fix: add NumPy's incremental-advance fast path for ndim==1 — cache strides[0]
and walk the element offset by it each step (one add, no div/mod). The
accumulate/compare logic (EmitLoadIdentity / EmitReductionCombine /
EmitArgReductionStep) is byte-for-byte the same as the general path; only the
address computation changes. ndim>1 keeps the existing coordinate-decode loop
(correct; rarer; can be odometer-ized later).

Reductions route here via NpyIterRef.ExecuteReduction for non-contiguous flat
inputs (DefaultEngine.ReductionOp -> TryExecuteElementReductionViaNpyIter), so
this is the kernel NpyIter drives for 1-D strided reductions.

Benchmark (net10.0 Release, DOTNET_TieredCompilation=0, min of 3 process
samples, matched warmup; NumPy 2.4.2 baseline):
  sum|262144|strided   843277 -> 102003 ns   8.27x faster   (11.77x -> 1.42x vs NumPy)
  sum|4096|strided      13683 ->   1962 ns   6.97x faster   ( 8.57x -> 1.23x vs NumPy)
  sum|4194304|strided 14661438 -> 3370065 ns 4.35x faster   ( 3.50x -> 0.81x: now BEATS NumPy)
  sum|64|strided          857 ->    622 ns   1.38x faster
The hand-written tight scalar strided sum measures 100812 ns at 262144; the
fixed kernel (102003 ns) is now at that scalar floor — the idiv overhead is
fully gone. Contiguous reductions are unaffected (separate SIMD path).

Correctness: full CI suite (TestCategory!=OpenBugs&!=HighMemory) 9447 passed /
0 failed / 11 skipped (unchanged). Dedicated probe verified strided == its
contiguous copy for {sum,prod,min,max,mean,std,argmax,argmin} x {double,single,
int32,int64} x {stride2, stride3, reversed/negative, offset+stride} plus bool
all/any, N-D transposed (general path intact) and contiguous (dispatcher intact).
…guous tile->SIMD kernel) — reaches NumPy parity on strided sqrt

A non-contiguous (stepped / reversed / sliced-with-step) 1-D unary input could
not feed the SIMD inner loop: the elementwise NpyIter route fell to a scalar
per-element walk (e.g. Math.Sqrt one element at a time), making strided sqrt
2-6x slower than NumPy. NumPy's answer is buffering — copy a cache-sized chunk
of the strided source into a contiguous tile, run the vectorized loop on the
tile, repeat. This adds exactly that as a new fast path.

TryBufferedStridedUnaryOp (DefaultEngine.UnaryOp), inserted between the trivial
contiguous bypass and the NpyIter route:
  - Gates to genuinely strided 1-D inputs (contiguous, including offset slices
    with stride 1, are handled by the bypass) whose CONTIGUOUS unary kernel is
    SIMD-accelerated (CanUseUnarySimd). Scalar-only ops (exp/sin/... with no
    Vector intrinsic) are skipped — there the gather would be pure overhead over
    the scalar walk, so they keep the existing route.
  - Walks the source in 2048-element tiles: GatherStrided copies the strided
    chunk into a contiguous stack buffer (typed per element width — 1/2/4/8/16B
    — so the JIT emits a tight incremental load/store loop; handles negative
    byte strides for reversed views), then the proven contiguous unary kernel
    runs the SIMD body over the tile straight into the contiguous output.
  - Result is a fresh contiguous 1-D array — the shape/layout NumPy returns for
    a unary over a strided 1-D view.

GatherStrided: strided->contiguous element copy, switched on in-memory element
size (NPTypeCode.SizeOf, not Marshal-based dtypesize which is 4 for bool).

Benchmark (net10.0 Release, DOTNET_TieredCompilation=0, min of samples; NumPy
2.4.2):
  sqrt|262144|strided  1167121 -> 568086 ns   2.05x faster   (2.12x -> 1.03x vs NumPy: PARITY)
  sqrt|4096|strided      16972 ->   8452 ns   2.01x faster   (5.92x -> 2.95x vs NumPy)
At 262144 the gather+SIMD matches NumPy; at 4096 the remaining gap is fixed
alloc/dispatch overhead on a tiny op, not the loop.

Correctness: 60/60 dedicated checks — {sqrt,neg,abs,square} x {stride2, stride3,
reversed (negative stride), offset+stride, n=7} x {double,single,int32},
including tiles that cross the 2048 chunk boundary (n=6000), each equal to the
op applied to the contiguous copy. Full CI suite (TestCategory!=OpenBugs&
!=HighMemory) 9447 passed / 0 failed / 11 skipped (unchanged).

Scope: unary only (the measured worst case — strided sqrt 5.9x). Strided binary
already wins at large sizes; extending the same buffered path to binary (two
gathers) and to N-D-coalesces-to-1-D is a follow-up.
…y — 0.55 ns/elem flat, beats NumPy at every size

Replace the gather-to-scratch + contiguous-kernel two-step (TryBufferedStridedUnaryOp)
with a single custom-IL kernel for the common same-width float/double case. Each SIMD
vector is assembled DIRECTLY from `lanes` strided scalar loads via Vector{W}.Create, the
unary op is applied in-register, and the result is stored contiguously — one pass, no
scratch tile, no per-chunk delegate dispatch.

New StridedUnaryKernel(void* src, long srcByteStride, void* dst, long count):
- DirectILKernelGenerator.Unary.Strided.cs — _stridedUnaryCache + GetStridedUnaryKernel +
  EmitStridedUnaryBody. Three-stage loop (2x-unrolled SIMD + 1-vector remainder + scalar
  tail), width-adaptive via VectorBits. The vector op body REUSES EmitUnaryVectorOperation
  (Sqrt/Negate/Abs/Square/Floor/Ceil/Round/Truncate/Reciprocal/Deg2Rad/Rad2Deg) and the
  tail reuses EmitUnaryScalarOperation — zero per-op duplication; one emit covers every
  SIMD unary op. srcByteStride may be negative (reversed views).
- VectorMethodCache.CreateElements(simdBits, elem) — reflects the lane-count
  Vector{N}.Create(T ... T) overload (uniquely: non-generic, >1 param, all params type T),
  cached per (width, elem). The one new reflection helper.

Routing (DefaultEngine.UnaryOp.cs): TryStridedSimdUnaryOp runs BEFORE the buffered path.
Gate: NDim==1 && !IsContiguous && !IsBroadcasted, inputType==outputType, Single/Double,
CanUseUnarySimd. Byte strides via NPTypeCode.SizeOf() (not Marshal-based dtypesize);
base = Address + offset*size. The buffered path is retained as the fallback for the
promoting (in!=out) SIMD cases it still owns (sqrt(int32)->double, Abs(complex)->double).

Correctness: 391,272 bit-exact checks (BitConverter bits, NaN-aware) — {sqrt,neg,abs,
square} x {stride2,stride3,reversed,offset+stride} x {double,single} x sizes
{7,17,64,257,6000,12001}, each == a per-element scalar reference. Full CI net10.0
9447/0/11 (no regressions).

Benchmark (DOTNET_TieredCompilation=0, 3000 warmup, min-of-rounds), np.sqrt(a[::2]):

  isolated kernel ns/elem        N=64      N=4096    N=262144
    fused                        0.547     0.549     0.557
    NumPy 2.4.2                  4.322     0.660     1.419
    => kernel vs NumPy           7.9x      1.20x     2.55x  faster

  integrated np.sqrt ns/call     N=64      N=4096    N=262144
    before (buffered, 9cebc83)  1888.6    10417.8   552000
    after  (fused)               1163.3     6847.9   321900
    => before->after             1.62x      1.52x    1.71x  faster
    => after vs NumPy            4.21x      2.53x    0.87x (1.16x faster @262K)

The kernel itself is 0.55 ns/elem flat and beats NumPy at every size (1.20-7.9x). The
integrated op improves 1.5-1.7x over the buffered baseline and overtakes NumPy at 262K;
the residual small/mid-size gap is the result-NDArray allocation (#42), orthogonal to
this kernel — the N=64 kernel is 35 ns but the np.sqrt wrapper pays ~1.1 us to allocate.
…easible Full-rigor run

Extend the benchmark/ 3-tier suite (BenchmarkDotNet + numpy_benchmark.py + merge) into an
official, reusable, cross-platform NumSharp-vs-NumPy comparison covering ~all supported
np.* ops at three cache-tier sizes (1K / 100K / 10M), with the full per-(op, dtype, N)
ratio matrix.

Pipeline foundation
- BenchmarkConfig.cs: new OfficialBenchmarkConfig — InProcessEmit toolchain + a 50-iteration
  job whose iteration time is capped at 25 ms.
    * InProcessEmit avoids BenchmarkDotNet's out-of-process project search, which fails in
      this repo ("project names need to be unique") because sibling git worktrees under
      .claude/worktrees/ hold same-named copies of the benchmark project. In-process also
      matches the warm long-lived Python/NumPy process, making the cross-language ratio fair.
    * The iteration-time cap is the key feasibility fix: BDN's default Throughput strategy
      ramps to ~8192 invocations/iteration (≈0.5 s/iter) for nanosecond microbenchmarks, so
      a 10M-element op at 50 iterations took ~25 s PER case and the full matrix would take
      days. Capping iteration time lets the pilot pick a per-op invocation count that fits
      25 ms — fast ops still get hundreds–thousands of invocations (tight mean), slow ops
      (argsort@10M) drop to 1/iteration. Measured: a 30-case set went 18 min -> 70 s (~15x)
      with all 50 iterations preserved. Restores the console logger ManualConfig drops.
- Program.cs: apply OfficialBenchmarkConfig as the base config; CLI passes only --filter.
- run_benchmark.py (new): cross-platform orchestrator. Builds C#, runs each suite via BDN
  (per-class JSON = resumable), sweeps NumPy across the 3 sizes, merges, archives to
  results/<timestamp>/. The single reusable entry point.
- numpy_benchmark.py: --size all / --cache-sizes sweep all three sizes in one invocation
  (each result tagged with its n); suite execution extracted into run_suites() and looped.
- merge-results.py: keep ALL sizes (was hard-dropping everything but N=10M), key the join by
  (op, dtype, N), add the N column to every table, and emit a C#-only coverage check (ops
  measured with no NumPy match).

New op coverage (C# benchmark classes + name-matched NumPy suites)
- Comparison: == != < > <= >=
- Bitwise: & | ^ invert left_shift right_shift
- Logic: isnan isinf isfinite maximum minimum isclose allclose array_equal; all any
- NaN reductions: nansum nanmean nanmax nanmin nanstd nanvar nanprod nanmedian
  nanpercentile nanquantile
- Statistics: median percentile quantile average ptp count_nonzero
- Sorting/searching: argsort nonzero searchsorted
- Linear algebra: dot, outer, matmul (matmul dim capped so O(M^3) stays bounded)
- Selection: where(cond,x,y), where(cond)
- Unary extras: cbrt reciprocal square negative positive trunc
- Cumulative: cumprod

Bit-rot repair: SimdVsScalarBenchmarks/MinMaxBenchmarks updated to the current Core API
(ILKernelGenerator->DirectILKernelGenerator.Enabled; np.argmin/argmax now return long).

Verified end-to-end on the comparison suite at 3 sizes x 4 dtypes: 72/72 cases matched
(0 no-data), ratios computed (parity at 10M ~0.9-1.1x, ~2-2.8x slower at 100K — the
small/medium alloc/dispatch overhead). Generated reports archived under results/.
- run_benchmark.py: copy each C# suite's BenchmarkDotNet JSON to the results dir immediately
  after that suite runs. BDN cleans its artifacts dir on every invocation, so collecting once
  at the end kept only the last suite's reports; copying per-suite preserves all of them.
- merge-results.py: add a "Summary by size" table — per-N op count, status histogram, and the
  geomean NumSharp/NumPy ratio — the headline view for "all ops at 3 sizes".
…ral op-name join

Official run: BenchmarkDotNet Full (50 iterations, InProcessEmit, iteration-time-capped) ×
all comparison suites × {1K, 100K, 10M} vs NumPy 2.x — 1813 C# measurements, 1111 matched
op×dtype×size comparisons.

Headline (geomean NumSharp ÷ NumPy across ~409 ops):
    N = 1,000        1.96x   (102 faster / 53 close / 128 slower / 84 much)
    N = 100,000      1.83x   (109 / 66 / 121 / 75)
    N = 10,000,000   1.00x   (166 faster / 171 close / 20 slower / 16 much)  <- PARITY

NumSharp reaches parity with NumPy at the memory-bound 10M size across the whole op surface
(166 ops faster, only 36 slower); at small sizes the per-element dispatch + result-allocation
overhead (issue #42) dominates (~2x). The 676 C#-only rows are decimal/char/unsigned dtypes
NumSharp measures but NumPy has no peer for.

merge-results.py: replace the hand-maintained C#↔Python name mapping with a structural
canonicalizer applied to both sides — strip the dtype tag and "[...]" annotations, fold
"(a, axis=k)" into " axis=k", and strip identifier-only argument lists while keeping numeric
args (percentile/shift). The two np.where forms are disambiguated up front. This recovered
~320 matches that the verbose-vs-short naming gap had been dropping (no-data 444 -> 122)
without re-running the 2.4-hour benchmark — the merge re-runs on the saved per-suite JSON.

benchmark-report.md / README.md: the regenerated 3-size ratio report (per-size geomean
summary + full per-(op, dtype, N) matrix). Bulky run artifacts (results/<ts>/, intermediate
JSON/CSV) are gitignored; the markdown report is the tracked deliverable. benchmark/CLAUDE.md
documents run_benchmark.py as the official cross-platform entry point.
Durable provenance snapshot of the official NumSharp-vs-NumPy run, keyed by date + HEAD
short-hash under benchmark/history/2026-06-05_6038990f/:
  MANIFEST.md            run timestamp, commit context, environment (i9-13900K, .NET 10.0.101,
                         Python 3.12.12, NumPy 2.4.2), methodology, headline + per-suite geomean
  benchmark-report.md    3-size ratio matrix (human-readable)
  benchmark-report.json  unified results (1,233 rows, machine-readable)
  benchmark-report.csv   spreadsheet form
  numpy-results.json     raw NumPy timings (merge input)

Benchmarked NumSharp.Core was at d01f1d6. Headline: geomean NumSharp/NumPy 1.96x @1k,
1.83x @100k, 1.00x (parity) @10m across ~409 ops. The ~34 MB of raw BenchmarkDotNet
per-class JSON is intentionally not committed (regenerable via run_benchmark.py).
…-run)

The official run measured 12 dtypes; SByte (int8), Half (float16), and Complex were never
benchmarked (absent from every type-source) — a real coverage hole. Add them to the
benchmark code so the NEXT run covers the full 15. Placement is op-aware (verified against
NumSharp via a construction/op probe), so no benchmark will throw:

TypeParameterSource:
- SByte  -> ArithmeticTypes, IntegerTypes, AllNumericTypes  (full integer support)
- Half   -> ArithmeticTypes, FloatingTypes, TranscendentalTypes, AllNumericTypes (full float)
- Complex-> ArithmeticTypes, AllNumericTypes only. Complex supports +,-,* and (via magnitude)
  sum/min/max, but not order-dependent float-suite ops (median/maximum/floor) — so it stays
  out of FloatingTypes/TranscendentalTypes. divide/modulo benchmarks restrict to CommonTypes,
  so Complex never hits modulo.
- AllNumericTypes is now the true all-15 set; GetDtypeName maps the three to int8/float16/complex128.

BenchmarkBase: CreateRandomArray / CreatePositiveArray / GetScalar now construct SByte, Half,
Complex (randint(-100,100)->SByte; (rand*100-50)->Half/Complex; (Half)value;
Complex(value,0)).

NumPy side (numpy_benchmark.py): DTYPES += int8/float16/complex128; ARITHMETIC_DTYPES +=
int8/float16/complex128 (divide/modulo already gated to COMMON_DTYPES, so complex only sees
+,-,* and sum/mean/min/max); TRANSCENDENTAL/FLOAT += float16; BITWISE += int8.
merge-results.py dtype_map: sbyte->int8, half->float16, complex->complex128.

Result (next run): SByte↔int8 and Half↔float16 compare cleanly across their op domains;
Complex↔complex128 compares on arithmetic + sum/mean/min/max. Verified: C# Release build
clean, Python compiles, the three NumPy dtypes construct and do arithmetic. No benchmark
re-run performed — the 2026-06-05_6038990f snapshot stands as the recorded 12-dtype run.
@Nucs
Copy link
Copy Markdown
Member Author

Nucs commented Jun 5, 2026

📊 Benchmark & performance — nditer

Two performance items on this branch: a fused strided-SIMD unary kernel, and an official NumSharp-vs-NumPy benchmark across ~all op categories at three sizes (benchmarked Core state = d01f1d63).


1. Fused strided-SIMD unary IL kernel (d01f1d63)

New whole-array kernel for unary ops over non-contiguous 1-D inputs: strided gather → Vector{W}.Create → unary op → contiguous store, single pass, no scratch buffer, no per-tile dispatch.

Isolated kernel, np.sqrt(a[::2]), ns/elem:

N fused NumPy 2.4.2 NumSharp speedup
64 0.547 4.322 7.9×
4,096 0.549 0.660 1.20×
262,144 0.557 1.419 2.55×

The kernel is size-invariant (~0.55 ns/elem at every size) while NumPy degrades 2–6× as data spills out of cache.

All 11 ops on this path — speedup vs NumPy @262K (f64):

abs        3.37×   negate  3.15×   floor   3.07×   trunc   3.03×   round  3.00×
sqrt       2.55×   rad2deg 2.41×   deg2rad 2.22×   square  2.18×   reciprocal 1.72×

Verified 22,000 bit-exact checks (fused == contiguous kernel); full unit suite 9447/0/11.

Note: this is a DirectILKernelGenerator whole-array kernel that bypasses NpyIter by design — the fusion (gather folded into Vector.Create) is incompatible with NpyIter's gather/kernel separation, which is exactly the (slower) buffered path it replaces.


2. Official NumSharp-vs-NumPy benchmark (6038990f)

Methodology: BenchmarkDotNet Full — 50 iterations, InProcessEmit toolchain, iteration-time capped at 25 ms — × {1K / 100K / 10M} vs NumPy 2.4.2. i9-13900K · .NET 10.0.101 · Python 3.12.12. 1,813 C# measurements → 1,111 matched comparisons.

The iteration-time cap is what makes a Full run feasible: BDN's default Throughput strategy ramps to ~8192 invocations/iteration, so a 10M-element op at 50 iters took ~25 s per case. Capping it ⇒ ~15× faster (a 30-case set went 18 min → 70 s) with all 50 iterations preserved.

Headline — geomean (NumSharp ÷ NumPy, lower = better):

        slower ◄───────── 1.0 (parity) ─────────► faster
1K    ████████████████████  1.96×   (102 win / 212 lose)
100K  ██████████████████▎   1.83×   (109 win / 196 lose)
10M   ██████████▏ ........  1.00×   (166 win /  36 lose)   ◄ PARITY

At the memory-bound 10M size NumSharp is at parity across ~409 ops (166 faster, only 36 slower). Small-size cost is the per-element dispatch + result-allocation tax (~2×).

Per-suite geomean by size:

suite 1K 100K 10M
Statistics 0.19× 0.68× 0.48×
Sorting 0.41× 1.13× 0.45×
Comparison 1.27× 2.22× 0.50×
Bitwise 8.16× 1.16× 0.61×
Reduction 0.48× 0.94× 0.91×
Arithmetic 3.09× 2.62× 1.25× 🟡
Unary 3.50× 4.44× 1.53× 🟡
Creation 12.26× 2.92× 2.24× 🟠
LinearAlgebra 2.76× 1.66× 4.02× 🔴

🏆 Biggest wins (@10m, real ms):

op dtype NumPy NumSharp speedup
average f32 9.60 0.94 10.2×
nansum f32 14.35 1.49 10.0×
nanprod f32 18.52 1.90 9.7×
var f32 16.96 2.60 6.5×
count_nonzero f64 22.61 3.74 6.0×
nanmean f64 33.47 5.69 5.9×

🎯 Biggest gaps (@10m) — optimization targets:

op dtype NumPy NumSharp gap
sum axis=1 uint8 3.12 49.74 16.0×
dot f64 1.23 16.46 13.4×
matmul f64 0.72 4.26 5.9×
argsort int32 369 2162 5.9×

→ three fronts: narrow-int axis reductions (no widening-SIMD), linear algebra (no BLAS), sort.

Per-dtype @10m (geomean):

int64 0.91  uint64 0.92  f32 0.93  f64 0.98  uint8 1.00  uint32 0.99   ◄ strong
int32 1.11  int16 1.14   uint16 1.24   bool 1.60                       ◄ weak (bool, narrow-uint)

Dtype coverage: 10 dtypes compared vs NumPy; char/decimal measured but have no NumPy peer (C#-only). SByte/Half/Complex were uncovered and have since been added to the benchmark code (48e85528) — the next run produces the full 15-dtype matrix.


Reproducibility

  • Reusable cross-platform runner: python benchmark/run_benchmark.py (builds C#, runs BDN per-suite, sweeps NumPy at 3 sizes, merges, archives).
  • Full report: benchmark/benchmark-report.md (1,311 rows).
  • Provenance snapshot keyed by date+hash: benchmark/history/2026-06-05_6038990f/ (manifest + report + NumPy timings).

Nucs added 5 commits June 5, 2026 19:21
…, fix NpyIter EXLOOP iternext

Remove the dead axis-reduction cluster TryExecuteAxisReductionSimd + sum_/prod_/max_/min_axis_simd from DefaultEngine.ReductionOp.cs (~127 lines). These were unreferenced outside the file; the live axis path is Default.Reduction.Add.cs:ExecuteAxisReduction, which uses DirectILKernelGenerator.TryGetAxisReductionKernel directly and throws if no kernel is available. The dead methods' 'falls back to iterator-based approach' doc comments never described real behavior and were misleading.

Fix NpyIter.Iternext() external-loop over-iteration: it called _state->Advance() (a one-element ripple over ALL axes) unconditionally. On an EXTERNAL_LOOP iterator the kernel processes the innermost axis, so Advance() over-stepped by NDim-1 positions per call and read past the inner-loop buffer. Iternext() now routes the non-buffered-reduce path through GetIterNext(), which selects the correct advancer (ExternalLoopNext for EXLOOP, SingleIterationNext for ONEITERATION, StandardNext otherwise). StandardNext is byte-for-byte the old Advance() path for the common non-EXLOOP case, so this is a strict correction with no behavior change there.

Verified: NumSharp.Core and test project build with 0 errors; full unit suite (filter TestCategory!=OpenBugs&TestCategory!=HighMemory, net10.0) 9447 passed / 0 failed / 11 skipped.

Deferred (not in this commit): the buffered-cast stride bug in NpyIter.State.Advance() (advances DataPtrs by Strides*ElementSizes; after a buffered cast ElementSizes holds the buffer dtype size while Strides hold source strides -> wrong byte delta). It is not tripped by the current suite and needs a multi-fill buffered-cast reproduction test before a safe fix.
…me_kind)

Aligns NpyIterCasting.CanCast with NumPy 2.x byte-for-byte across the full
13x13 dtype matrix under both the 'safe' and 'same_kind' rules. Verified
programmatically against np.can_cast: 338/338 cells identical.

Motivation: np.copyto(int32_array, float64_dst) — and other valid casts —
threw InvalidCastException, because the casting rule tables diverged from
NumPy. The default copyto casting is 'same_kind', and int32->float64 is a safe
(hence same_kind) cast in NumPy, but NumSharp rejected it. Conversely some
casts NumPy forbids were being allowed.

same_kind (IsSameKindCast rewritten): now a strict superset of safe, matching
NumPy's NPY_SAME_KIND_CASTING:
  - int -> float is now ALLOWED (was rejected) — e.g. int32->float64.
  - signed -> unsigned is now REJECTED (was allowed) — e.g. int32->uint32.
  - int/float -> bool is now REJECTED (was allowed for int).
  Rule = safe || float->float || int->int(except signed->unsigned)
              || int->float || real->complex.

safe (IsSafeCast, pre-existing divergences closed):
  - unsigned -> strictly-wider signed is now safe: uint8->int16/int32/int64,
    uint16->int32/int64, uint32->int64 (the full unsigned range fits).
  - int64/uint64 -> complex128 is now safe (NumPy treats real->complex128 as
    safe across the board, consistent with int64->float64 being safe).
  - bool/uint8/int8 -> float16 is now safe (float16's 11-bit mantissa holds
    integers exactly to +-2048; wider ints and float narrowing stay unsafe).

Test: Cast_SameKindCasting_IntToFloat_Throws asserted the opposite of NumPy
("same-kind should not allow int -> float"). Renamed to ..._Allowed and
rewritten to assert the cast succeeds AND the buffered int32->double iterator
yields [1.0, 2.0, 3.0] — also covering the buffered-cast execution path.

Verification: cast matrix 338/338 == NumPy 2.x; full unit suite 9447 passed,
0 failed (CI filter, net10.0). Confirmed repro: np.copyto(zeros(4),
int32[1,2,3,4]) -> [1,2,3,4].
Replace the 1-D·1-D branch of DefaultEngine.Dot (numpy.dot vector·vector)
with a fused single-pass kernel that computes sum(a[i]*b[i]) directly,
instead of materializing a full n-element product array (`left * right`)
and then walking it again in ReduceAdd.

Motivation
----------
The previous path did `var product = left * right; return ReduceAdd(product, ...)`.
That allocates an n-element temporary every call and traverses the data twice
(write product, then read it back to reduce) — roughly 2× the memory traffic
plus heavy gen0 GC churn under repeated calls.

Implementation
--------------
- SimdDot.cs (Backends/Kernels): SIMD multiply-accumulate for contiguous
  float/double, 4 independent Vector256 accumulators + FMA (falls back to
  mul+add when FMA is absent) + scalar tail. Accumulates in the element type
  so the result dtype mirrors numpy.
- Default.Dot.Fused.cs (DotInner1D): same-type dispatch, all stride-aware
  (reads strides[0]+offset, so sliced / reversed `a[::-1]` / stepped `a[::2]`
  views are consumed in place — no copy):
    * float/double  -> SimdDot (contiguous) or scalar strided loop
    * int/uint8..64, Half, Decimal -> INumber<T> scalar accumulator
    * bool    -> OR over k of (a[k] AND b[k]), short-circuiting
    * Complex -> Complex accumulator (no conjugation)
  Mixed dtypes fall through to the original `left*right` + ReduceAdd, which
  already applies NEP50 promotion. Char (no INumber<char>) also uses the
  fallback.

NumPy 2.4.2 parity (verified)
-----------------------------
- same-type result PRESERVES the input dtype (int32·int32 -> int32, NOT the
  widened int64 that np.sum yields; float16·float16 -> float16);
- integer products wrap in the element dtype before accumulating
  (int8 [100,100]·[100,100] -> 32);
- bool dot -> bool (OR-of-ANDs); complex dot has no conjugation;
- empty -> scalar 0 of the input dtype (not the widened sum dtype);
- shape mismatch -> "shapes (n,) and (m,) not aligned: ..." (was a Debug.Assert
  that vanished in Release).

Benchmark (net10.0 Release, double, same harness before/after)
--------------------------------------------------------------
  n            before            after           speedup   GC(before->after)
  1,000        760.0 ms / 54 GC  219.2 ms / 15   3.5x      54  -> 15
  100,000      445.3 ms / 446    57.4 ms / 0     7.8x      446 -> 0
  1,000,000    1001.9 ms / 498   111.3 ms / 0    9.0x      498 -> 0
  10,000,000   1218.6 ms / 60    328.9 ms / 0    3.7x      60  -> 0
For n>=100k the fused np.dot is within ~3% of the raw SIMD ceiling; the
remaining n=1000 cost is the per-call result-scalar NDArray allocation
(inherent to returning an array). Large-n speedup is memory-bandwidth bound
(the old path moved ~2x the bytes); small-n speedup is fixed-overhead bound.

Tests
-----
- New np.dot.FusedTests.cs: 11 NumPy-parity cases (dtype preservation, int
  wrap, bool, complex no-conjugation, decimal, empty, strided/reversed,
  mixed-type promotion, shape-mismatch message).
- Full LinearAlgebra namespace: 139 passed, 0 failed (2-D / N-D / batched
  matmul paths unchanged).
…and per-chunk kernel

Move np.where's broadcast/strided path off the scalar NpyExpr.Where fallback and
onto a dedicated multi-operand per-chunk kernel driven by NpyIterRef.ForEach — the
"selection" item on the NpyIter migration priority list (NPYITER_PERF_HANDOVER §8).

WHY THE OLD PATH WAS SLOW
  WhereImpl already iterated on NpyIter, but compiled its inner loop through
  NpyExpr.Where, which is structurally wrong for where:
    * WhereNode.SupportsSimd == false  -> scalar only, and
    * NpyExpr gates SIMD on AllEqual(inputTypes, outputType), which is ALWAYS
      false for where (cond is Boolean, x/y are the output dtype). The DSL's
      "load every input at the output dtype" rule therefore casts cond -> T on
      EVERY element, then does a float compare-to-zero, before the branch.

WHAT CHANGED
  * New Backends/Kernels/ILKernelGenerator.Where.cs (the target per-chunk class):
    GetWhereInnerLoop(outType) emits a cached NpyInnerLoopFunc that, per chunk,
    runtime-dispatches on the inner strides:
      - cond stride==1 && x/y/result stride==elemSize  -> 4x-unrolled SIMD
        Vector.ConditionalSelect over an expanded bool mask + 1-vec remainder;
      - otherwise (broadcast cond/x/y, transpose, ::k, and all non-SIMD dtypes
        Bool/Char/Half/Decimal/Complex) -> raw-bool scalar walk (Ldind_U1, no cast).
    The SIMD mask expansion reuses the proven IL from the whole-array kernel, so
    SIMD output is bit-identical to the contiguous Direct WhereKernel.
  * APIs/np.where.cs: WhereImpl drives the existing 4-operand iterator with
    ForEach(GetWhereInnerLoop(dtype)) instead of NpyExpr.Where + ExecuteExpression
    (method marked unsafe for the ForEach default void* arg).
  * Direct/DirectILKernelGenerator.Where.cs: EmitInlineMaskCreation promoted
    private -> internal so the per-chunk kernel reuses one source of the
    bool-mask-expansion IL (EmitLoadIndirect/EmitStoreIndirect were already
    internal and cover all 15 dtypes for the scalar path).

  The contiguous and scalar-operand fast paths (DirectILKernelGenerator.WhereExecute
  and WhereScalarX/Y/XY) are deliberately UNTOUCHED — they already hit a fused
  whole-array SIMD kernel; routing them through NpyIter only ties at large N and
  risks a small-N setup-tax regression (HANDOVER §4.1/§4.7).

MEASURED (clean same-binary A/B; ms/call + GC.GetAllocatedBytesForCurrentThread)
  Every non-contiguous shape faster, GC down, small-N improved (no setup-tax hit):
    cond-row-bcast  f32 2.06x  i32 1.67x  f64 1.23x      (now hits SIMD select)
    cond-col-bcast  f32 1.54x  i32 1.34x  f64 1.39x      (faster raw-bool scalar)
    strided-transp  f32 1.34x  i32 1.52x  f64 1.26x
    strided-1d ::2  f64 1.31x  (GC -48%)
    small (1K) row  f32 1.38x  i32 1.46x  f64 1.19x      (cached kernel, no NpyExpr alloc)
  GC bytes/call reduced 7-48% (NpyExpr tree + ExecuteExpression machinery removed).

CORRECTNESS
  * Focused matrix: 7023/7023 checks across all 15 dtypes x {row-broadcast SIMD,
    col-broadcast scalar, transpose strided} + vector-aligned/tail sizes + NaN/Inf.
  * Full suite (CI filter): 9458 passed / 0 failed / 11 skipped.
  No public API or behavioral change; NumPy parity preserved.

Follow-ups documented in docs/MIGRATE_NPYITER.md: cond-broadcast SIMD-copy path
(closes the col-broadcast gap), np.place/masked-assign reuse, and full unification
(after the setup-tax phase).
…n large vectors

Add an opt-in global multithreading switch and a parallel path for the fused
1-D inner product (numpy.dot vector·vector) on contiguous float/double.

API
---
  np.multithreading(bool enabled, int max_threads = 8)
    np.multithreading(true);       // enable, up to 8 threads
    np.multithreading(true, 16);   // raise the cap
    np.multithreading(false);      // back to single-threaded (default)

Backends/MultiThread.cs holds the global state (Enabled, MaxThreads) and the
work-size gate:
  DegreeOfParallelism(n) = (enabled && n >= MinTotalWork)
                           ? min(MaxThreads, ProcessorCount, n / MinWorkPerThread)
                           : 1
with MinTotalWork=50k, MinWorkPerThread=32k. So tiny/medium dots stay on one
thread (thread fan-out costs more than it saves — the POC showed 32 threads
*regressing* at n=100k), and only large vectors are split.

Parallel dot (Default.Dot.Fused.cs)
-----------------------------------
DotContiguousF64/F32 consult DegreeOfParallelism; when >1 they partition [0,n)
into p contiguous chunks, run SimdDot on each via Parallel.For, and sum the
partials in chunk order (deterministic). Partials are padded to a cache line to
avoid false sharing. Summation regrouping means results can differ from the
single-threaded path in the last few ULPs — the same reordering NumPy's threaded
BLAS shows; tests assert agreement to tolerance.

DISABLED BY DEFAULT: default behavior and exact summation order are unchanged
unless the caller opts in. Only contiguous float/double dot is parallelized;
strided views and the other dtypes stay single-threaded.

Benchmark (net10.0, double, 32-core box)
----------------------------------------
Clean POC scaling (single-thread vs best multithreaded):
  n=100k :  10.2 -> 6.2 us   (1.7x, 8 threads)
  n=1M   :  172  -> 60  us   (2.9x)  -> ~2x faster than NumPy default (117us)
  n=10M  : 5350  -> 2783 us  (1.9x)  -> ~matches NumPy default (2527us); RAM-bw bound
At 1M (cache-resident) throughput scales with per-core cache bandwidth and beats
NumPy; at 10M it converges with NumPy at the memory-bandwidth wall (~55-63 GB/s).
Net effect: np.dot goes from ~2x behind NumPy-default (single-thread, prior
commit) to parity-or-better on large vectors.

Tests (test/.../np.multithreading.Tests.cs, +6)
-----------------------------------------------
- API sets/clamps state (max_threads >= 1);
- gate: <50k and disabled -> 1 thread, large -> >1, respects max_threads;
- parallel dot == sequential to tolerance (double and float);
- exact case full(2)·full(3) over 200k = 1,200,000.
Full LinearAlgebra namespace: 145 passed, 0 failed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant