con · asmacdo · Apr 29, 2026 · Apr 29, 2026 · Apr 30, 2026 · Apr 30, 2026
diff --git a/README.md b/README.md
@@ -36,6 +36,10 @@ Try it out using either `duct` or `con-duct run`:
 
 `duct` is most useful when the report-interval is less than the duration of the script.
 
+## Documentation
+
+- [Interpreting duct's resource statistics](docs/resource-statistics.md) — what `pcpu`, `rss`, and the `con-duct plot` chart actually measure, and where the numbers can mislead.
+
 ## Reference
 
 ### con-duct
@@ -207,7 +211,8 @@ options:
                         running process. Sample interval must be less than or
                         equal to report interval, and it achieves the best
                         results when sample is significantly less than the
-                        runtime of the process. (default: 1.0)
+                        runtime of the process. Values below 1.0 behave
+                        erratically. (default: 1.0)
   --report-interval REPORT_INTERVAL, --r-i REPORT_INTERVAL
                         Interval in seconds at which to report aggregated
                         data. (default: 60.0)

diff --git a/docs/resource-statistics.md b/docs/resource-statistics.md
@@ -0,0 +1,244 @@
+# Interpreting duct's resource statistics
+
+duct records resource usage in two places:
+
+- **`usage.jsonl`**: one JSON record per report interval (default: every 60 seconds), capturing per-process and session-total stats aggregated over that window.
+- **`execution_summary`** (printed at exit and stored in `info.json`): a whole-run summary of peak and average values across the full execution.
+
+The numbers in both come from the same sampling pipeline.
+This document explains what those numbers actually measure, how `con-duct plot` renders them, and where they're trustworthy vs misleading.
+
+## How duct samples and aggregates
+
+Duct polls the monitored process tree on two independent intervals:
+
+- **`--sample-interval` (default 1.0s)**: how often duct reads per-pid stats via `ps -s <session_id>`.
+  Each read is a *sample*.
+- **`--report-interval` (default 60.0s)**: how often duct writes an aggregated record to `usage.jsonl`.
+  Each record summarizes all the samples taken during that report window.
+
+Aggregation within a report window uses **max reduction**:
+
+- For each per-pid metric, the reported value is the maximum observed across all samples of that pid in the window.
+- For each session-total metric (`totals.rss`, `totals.pcpu`, etc.), the reported value is the maximum observed across all samples' totals in the window.
+
+Consequences worth knowing:
+
+1. **Short spikes between samples are not recorded.**
+   A process that briefly allocates 10GB and frees it within a single sample interval is invisible to duct.
+2. **Per-pid and session-total peaks may come from different sample moments.**
+   Per-pid max-reduction and total max-reduction are independent.
+   The same record can have `stats[A].rss = X` (A's peak from one sub-sample) and `totals.rss = Y` (the peak simultaneous total from another sub-sample).
+
+---
+
+## CPU — `pcpu`
+
+### What it measures
+
+On Linux, `ps -o pcpu` is computed per process as:
+
+```
+pcpu = ((utime + stime) / (now - process_start_time)) × 100
+```
+
+- `utime + stime` is **cumulative CPU time consumed by the process since it started** (kernel ticks from `/proc/[pid]/stat`).
+- `now - process_start_time` is **wall-clock time elapsed since the process started**.
+
+So `pcpu` is *the fraction of wall time the process has spent on CPU, averaged from birth until the moment of sampling*.
+It is a **lifetime average**, not an instantaneous rate.
+This differs from `top(1)`, which shows instantaneous-over-refresh-interval.
+
+### `etime` is integer seconds
+
+`ps -o etime` reports elapsed time as an integer count of seconds (formatted `[[DD-]HH:]MM:SS`).
+This has consequences for short-lived and freshly-spawned pids:
+
+- During a pid's first second of life, `etime` reads as `00:00`.
+  ps's `pcpu` calculation divides by this `etime`, and the result during sub-second life is unstable.
+- A pid sampled at sub-second age that has accumulated meaningful CPU work across multiple threads can yield extreme `pcpu` readings.
+  Issue [#399](https://github.com/con/duct/issues/399) included a single pid reporting 5347% `pcpu` at `etime=3` on a 20-core machine, which is physically impossible: it came from a sub-second-young sub-sample where ps's calculation was racy.
+
+This is why sample intervals shorter than `1.0s` behave erratically.
+Consecutive samples of the same pid often see the same integer `etime`, so derived measurements (like the `--cpu ps-cpu-timepoint` view, below) discard those points because `Δetime = 0`.
+
+### Three scenarios to build intuition
+
+#### Scenario A: long-running steady-state process at 100% CPU
+
+```
+t = 1s:   cumulative CPU = 1.0s,  elapsed = 1s   → pcpu = 100%
+t = 10s:  cumulative CPU = 10.0s, elapsed = 10s  → pcpu = 100%
+t = 60s:  cumulative CPU = 60.0s, elapsed = 60s  → pcpu = 100%
+```
+
+For steady-state workloads, lifetime-average converges to instantaneous.
+*This is why the mental model "pcpu = current CPU usage" works most of the time.*
+
+#### Scenario B: brief burst, then idle
+
+A process that does 1 second of 100% CPU work, then sits idle:
+
+```
+t = 1s:   cumulative CPU = 1.0s, elapsed = 1s    → pcpu = 100%
+t = 2s:   cumulative CPU = 1.0s, elapsed = 2s    → pcpu =  50%
+t = 10s:  cumulative CPU = 1.0s, elapsed = 10s   → pcpu =  10%
+t = 100s: cumulative CPU = 1.0s, elapsed = 100s  → pcpu =   1%
+```
+
+After the burst, `pcpu` decays toward 0.
+The process "remembers" past CPU work and slowly forgets as its elapsed time grows.
+Counterintuitive if you expected a real-time number.
+
+#### Scenario C: the pathological summation case
+
+Many short-lived, multi-threaded native child processes, as happens under tox when pip compiles C extensions:
+
+```
+Child 1 runs for 200ms on 4 cores, observed by sample at t=150ms:
+  cumulative CPU = 600ms, elapsed = 150ms
+  → pcpu reported = 400%
+
+…30 such children observed during a single sample…
+
+sum across children at sample time = 30 × 400% = 12,000%
+system physical ceiling (20 cores)  =  2,000%
+```
+
+Each individual child's number is correct for what `ps` is answering ("fraction of wall time spent on CPU, averaged from start").
+The problem is that *summing* lifetime-averages across processes that took turns on the CPU produces a total claiming work the system didn't have the cores to do.
+The children ran sequentially, but the sum over the report window treats the spikes as simultaneous.
+
+### When `pcpu` is reliable
+
+| Workload shape                                | `pcpu` reliability                                      |
+|-----------------------------------------------|---------------------------------------------------------|
+| Single long-running steady-state process      | Accurate                                                |
+| Few long-running processes at steady state    | Accurate                                                |
+| Bursty processes that are long-running        | Accurate at the average; misses burst structure         |
+| Many short-lived (few second) child processes | **Unreliable: can inflate dramatically when summed**    |
+| Multi-threaded native code bursts             | Per-process `pcpu` correct; summed totals may overshoot |
+
+---
+
+## Memory — `rss` and `pmem`
+
+`ps -o rss` reports per-process **resident set size**: physical memory currently mapped into the process's address space, in kilobytes.
+This counts:
+
+- Private pages the process has allocated and touched.
+- **Shared pages** (libraries, copy-on-write memory after `fork()`) that the process has mapped, counted independently in **each** process that maps them.
+
+`ps -o pmem` is derived: `rss` divided by total system RAM, expressed as a percentage.
+It inherits every property of `rss` and adds a host-dependent denominator.
+
+### The shared-page issue
+
+When multiple processes share the same physical page, that page appears in **each process's RSS**, but the physical page exists only once.
+
+Example: a Python parent process with 100MB RSS forks 10 child workers.
+Immediately after fork:
+
+```
+Parent RSS:  100MB
+Child 1 RSS: 100MB
+…
+Child 10 RSS: 100MB
+
+Sum of RSS across processes: 1100MB
+Actual physical memory used: ~100MB (all shared with parent)
+```
+
+As children write to their copy of each page, copy-on-write triggers and the page becomes private.
+At that point physical use genuinely grows.
+So `sum(rss)` is a **loose upper bound** on actual usage: never less than true usage, often much more.
+
+For a duct-monitored Python test suite with `pytest-xdist` spawning 8 workers, expect `sum(rss)` to overstate physical memory by 3-5×.
+
+---
+
+## What `con-duct plot` renders
+
+`con-duct plot <usage>` renders, per report-interval record:
+
+- **Per-pid traces**: one faint dotted line per pid.
+  CPU is on the primary y-axis.
+  RSS is on a secondary axis (`twinx`) so the two scales don't fight.
+  Color encodes metric, not pid identity.
+- **Envelope lines**: summarize the per-pid cloud at each timestamp (one solid lower bound + one dashed upper bound).
+- **Optional host-memory annotation**: when `info.json` is alongside the usage file, the rss legend label includes total host RAM (e.g. `rss (host: 256.0GB)`).
+  Useful for SLURM contexts.
+  Without `info.json`, plain `rss`.
+
+### `--cpu` mode flag
+
+duct stores `pcpu` (lifetime average from ps) per pid in `usage.jsonl`.
+The plot can render this two ways:
+
+- **`--cpu ps-pcpu` (default)**: plot the raw lifetime ratio untransformed.
+  "Lossless" view: every point on the chart is an unaltered ps reading.
+  Useful when you want to see exactly what the sampler captured.
+- **`--cpu ps-cpu-timepoint`**: at plot time, derive a per-interval estimate from consecutive `(pcpu, etime)` pairs: `(curr_pcpu × curr_etime − prev_pcpu × prev_etime) / Δetime`.
+  This inverts ps's lifetime-average formula to extract an approximate instantaneous CPU rate.
+  Motivated by [Scenario C](#scenario-c-the-pathological-summation-case): lifetime averages of short-lived bursty processes overstate "current" usage by orders of magnitude.
+
+Both modes have caveats:
+
+- The raw `ps-pcpu` mode shows what ps reported, including lifetime-average inflation.
+  A pid that ran on 4 cores for 150ms and went idle peaks at 400% in the first report interval that observed it, then decays toward its true average as `etime` grows in subsequent intervals.
+- The derived `ps-cpu-timepoint` mode is approximate (delta math on max-reduced samples).
+  It discards each pid's first observation (no prior point to delta against), so short-lived pids that appear in only one record drop out entirely.
+  CPU bursts from those pids are not visible in the timepoint view, but remain visible in the `ps-pcpu` view via `totals.pcpu`.
+
+### Envelope semantics
+
+The plot draws two envelopes over the per-pid trace cloud:
+
+- **Lower bound (solid)**: max-across-pids at each timestamp.
+  Reads as "at least this much was in use."
+- **Upper bound (dashed)**: depends on what's being plotted.
+  - **RSS, and CPU in `ps-pcpu` mode**: `totals.*` from the record.
+    duct computes this as the peak simultaneous total observed across the report window's sub-samples.
+  - **CPU in `ps-cpu-timepoint` mode**: sum-across-pids of the derived (instantaneous) values at each timestamp.
+    Used here because `totals.pcpu` is a peak of *lifetime averages* and doesn't share units with the derived instantaneous values.
+
+## Common questions
+
+### Why is the raw `pcpu` line in `ps-pcpu` mode so much higher than `ncores × 100%`?
+
+Two compounding reasons, either of which can do it alone:
+
+1. **Single-pid extremes from ps.**
+   For pids sampled at sub-second age, ps's `cputime / etime` calculation is unstable (see [`etime` is integer seconds](#etime-is-integer-seconds)).
+   Individual pids can briefly report thousands of percent.
+2. **Summed lifetime-averages across many short-lived pids.**
+   Even if each pid's `pcpu` is finite, summing lifetime averages across processes that took turns on the cores produces a total claiming work the cores couldn't have done.
+   See [Scenario C](#scenario-c-the-pathological-summation-case).
+   Most common in workloads that spawn many short-lived child processes involving native/multi-threaded code: pip install compiling C extensions, `make -j`, tox, any CI/build workflow.
+
+### Why is the `ps-cpu-timepoint` line lower than the `ps-pcpu` line?
+
+`ps-pcpu` plots the lifetime average from ps.
+A burst captured early in a pid's life pulls the reported `pcpu` high, and that pid's trace decays slowly as `etime` grows.
+`ps-cpu-timepoint` instead estimates an instantaneous rate per report interval, so a burst contributes only to the interval that contained it.
+
+Example: a pid that did 600ms of CPU on 4 cores in its first 150ms and was idle thereafter.
+`ps-pcpu` shows ~400% in the first report interval and a decaying trace in subsequent intervals (until the pid dies or the trace falls off the chart).
+`ps-cpu-timepoint` shows ~400% only in the burst interval and ~0% thereafter.
+
+The timepoint view is more "honest" about current usage but loses the cumulative-effort information that `ps-pcpu` carries.
+
+### Why does `totals.*` not equal `sum(per-pid max)` in a record?
+
+duct max-reduces per-pid stats and session totals independently within a report window.
+A pid's reported `rss` is its max across sub-samples in the window; `totals.rss` is the max of the *simultaneous total* across those sub-samples.
+The per-pid peaks may have happened at different moments, so summing them counts moments that never coexisted.
+`totals.*` is the actual peak simultaneous footprint and is the right number for sizing.
+
+### My RSS chart grew a lot when I added more worker processes. Did memory usage really grow proportionally?
+
+Probably not.
+If the workers are forked children of a common parent, each child's RSS counts the shared pages it inherited.
+Per-pid traces and their max envelope grow roughly linearly with child count even when physical memory grows much less.
+The dashed `totals.rss` upper bound is closer to actual physical use, but still over-counts shared libraries linked by independent processes.
+See [The shared-page issue](#the-shared-page-issue).
diff --git a/src/con_duct/_duct_main.py b/src/con_duct/_duct_main.py
@@ -58,6 +58,12 @@ def execute(
         raise ValueError(
             "--report-interval must be greater than or equal to --sample-interval."
         )
+    if sample_interval < 1.0:
+        lgr.warning(
+            "--sample-interval=%.3f is below 1.0s and may behave erratically. "
+            "See docs/resource-statistics.md for details.",
+            sample_interval,
+        )
 
     log_paths = LogPaths.create(output_prefix, pid=os.getpid())
     try:

diff --git a/src/con_duct/_formatter.py b/src/con_duct/_formatter.py
@@ -9,14 +9,31 @@
 lgr = logging.getLogger("con-duct")
 
 
+# Decimal (SI) byte units, single source of truth for byte humanization in
+# duct. Used by SummaryFormatter.naturalsize for run-summary output and by
+# the plot axis formatter for tick labels, so a "kB" means the same thing
+# in both places. A future opt-in could add a 1024-base + IEC-suffix
+# variant (KiB/MiB/GiB/...) and let callers pick.
+FILESIZE_UNITS: list[tuple[str, int]] = [
+    ("B", 1),
+    ("kB", 1000**1),
+    ("MB", 1000**2),
+    ("GB", 1000**3),
+    ("TB", 1000**4),
+    ("PB", 1000**5),
+    ("EB", 1000**6),
+    ("ZB", 1000**7),
+    ("YB", 1000**8),
+]
+
+
 class SummaryFormatter(string.Formatter):
     OK = "OK"
     NOK = "X"
     NONE = "-"
     BLACK, RED, GREEN, YELLOW, BLUE, MAGENTA, CYAN, WHITE = range(30, 38)
     RESET_SEQ = "\033[0m"
     COLOR_SEQ = "\033[1;%dm"
-    FILESIZE_SUFFIXES = (" kB", " MB", " GB", " TB", " PB", " EB", " ZB", " YB")
 
     def __init__(self, enable_colors: bool = False) -> None:
         self.enable_colors = enable_colors
@@ -34,7 +51,7 @@ def naturalsize(
             >>> formatter.naturalsize(3000000)
             '3.0 MB'
             >>> formatter.naturalsize(3000, "%.3f")
-            '2.930 kB'
+            '3.000 kB'
             >>> formatter.naturalsize(10**28)
             '10000.0 YB'
             ```
@@ -46,24 +63,23 @@ def naturalsize(
         Returns:
             str: Human readable representation of a filesize.
         """
-        base = 1000
         bytes_ = float(value)
         abs_bytes = abs(bytes_)
 
         if abs_bytes == 1:
             return "%d Byte" % bytes_
 
-        if abs_bytes < base:
+        if abs_bytes < 1000:
             return "%d Bytes" % bytes_
 
-        for i, _s in enumerate(self.FILESIZE_SUFFIXES):
-            unit = base ** (i + 2)
-
-            if abs_bytes < unit:
-                break
-
-        ret: str = format % (base * bytes_ / unit) + _s
-        return ret
+        # Pick the largest unit where the value is at least 1 of that unit.
+        # FILESIZE_UNITS is ordered ascending; iterate to keep updating until
+        # one further step would underflow.
+        name, divisor = FILESIZE_UNITS[1]  # default to "kB" (loop entry value)
+        for n, d in FILESIZE_UNITS[1:]:
+            if abs_bytes / d >= 1:
+                name, divisor = n, d
+        return f"{format % (bytes_ / divisor)} {name}"
 
     def color_word(self, s: str, color: int) -> str:
         """Color `s` with `color`.