Add plot --cpu modes, including ps-pcpu-timepoint for derived instantaneous CPU #424
Open: asmacdo wants to merge 16 commits into con:main from asmacdo:plot-pcpu-fix
Commits (all by asmacdo):
- 690da58 plot: render per-pid pdcpu/pmem/rss instead of summed totals
- d9e51ca plot: switch to envelope-based chart shape
- 7eb086b plot: rss on secondary axis, drop pmem
- 2c8cffe formatter,plot: unify byte-humanization on base 1000
- 41efb05 plot: drop per-pid notability filter
- 0c3188e plot: use totals.rss for rss upper bound
- 2c2922d plot: rename per-pid series key pdcpu -> cpu
- 38fa9d7 plot: add --cpu mode flag (ps-pcpu | ps-cpu-timepoint)
- 346cdb5 plot: drop sum envelope in --cpu ps-pcpu mode
- bc21738 plot: use totals.pcpu as upper bound in --cpu ps-pcpu mode
- 3febf6e duct: warn when --sample-interval < 1.0s
- 8e6dedb docs: add resource-statistics reference + link from README
- 6752c41 Merge branch 'main' into plot-pcpu-fix
- bf6c62a [DATALAD RUNCMD] ./.update-readme-help.py
- 87346d1 docs: use nested totals.* schema in resource-statistics
- f284fb9 plot: refresh module docstring to match implementation
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,244 @@ | ||
| # Interpreting duct's resource statistics | ||
|
|
||
| duct records resource usage in two places: | ||
|
|
||
| - **`usage.jsonl`**: one JSON record per report interval (default: every 60 seconds), capturing per-process and session-total stats aggregated over that window. | ||
| - **`execution_summary`** (printed at exit and stored in `info.json`): a whole-run summary of peak and average values across the full execution. | ||
|
|
||
| The numbers in both come from the same sampling pipeline. | ||
| This document explains what those numbers actually measure, how `con-duct plot` renders them, and where they're trustworthy vs misleading. | ||
|
|
||
| ## How duct samples and aggregates | ||
|
|
||
| Duct polls the monitored process tree on two independent intervals: | ||
|
|
||
| - **`--sample-interval` (default 1.0s)**: how often duct reads per-pid stats via `ps -s <session_id>`. | ||
| Each read is a *sample*. | ||
| - **`--report-interval` (default 60.0s)**: how often duct writes an aggregated record to `usage.jsonl`. | ||
| Each record summarizes all the samples taken during that report window. | ||
|
|
||
| Aggregation within a report window uses **max reduction**: | ||
|
|
||
| - For each per-pid metric, the reported value is the maximum observed across all samples of that pid in the window. | ||
| - For each session-total metric (`totals.rss`, `totals.pcpu`, etc.), the reported value is the maximum observed across all samples' totals in the window. | ||
|
|
||
| Consequences worth knowing: | ||
|
|
||
| 1. **Short spikes between samples are not recorded.** | ||
| A process that briefly allocates 10GB and frees it within a single sample interval is invisible to duct. | ||
| 2. **Per-pid and session-total peaks may come from different sample moments.** | ||
| Per-pid max-reduction and total max-reduction are independent. | ||
| The same record can have `stats[A].rss = X` (A's peak from one sub-sample) and `totals.rss = Y` (the peak simultaneous total from another sub-sample). | ||
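
The independence of the two reductions can be illustrated with a minimal sketch. The sample values and the `reduce_window` helper are hypothetical and only mimic the max reduction described above; they are not duct's actual code or schema.

```python
from typing import Dict, List, Tuple

# Hypothetical report window of three samples; each maps pid -> rss.
samples: List[Dict[str, int]] = [
    {"A": 900, "B": 100},
    {"A": 200, "B": 300},
    {"A": 100, "B": 800},
]


def reduce_window(window: List[Dict[str, int]]) -> Tuple[Dict[str, int], int]:
    """Max-reduce per-pid values and per-sample totals independently."""
    per_pid_max: Dict[str, int] = {}
    for sample in window:
        for pid, rss in sample.items():
            per_pid_max[pid] = max(per_pid_max.get(pid, 0), rss)
    # The total is summed within each sample first, then max-reduced.
    totals_max = max(sum(sample.values()) for sample in window)
    return per_pid_max, totals_max


per_pid_max, totals_max = reduce_window(samples)
print(per_pid_max)                # per-pid peaks, taken from different samples
print(totals_max)                 # peak simultaneous total
print(sum(per_pid_max.values()))  # sum of peaks that never coexisted
```

Here A peaks in the first sample and B in the last, so the sum of per-pid peaks exceeds the largest total that was ever observed at one moment.
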
|
|
||
| --- | ||
|
|
||
| ## CPU — `pcpu` | ||
|
|
||
| ### What it measures | ||
|
|
||
| On Linux, `ps -o pcpu` is computed per process as: | ||
|
|
||
| ``` | ||
| pcpu = ((utime + stime) / (now - process_start_time)) × 100 | ||
| ``` | ||
|
|
||
| - `utime + stime` is **cumulative CPU time consumed by the process since it started** (kernel ticks from `/proc/[pid]/stat`). | ||
| - `now - process_start_time` is **wall-clock time elapsed since the process started**. | ||
|
|
||
| So `pcpu` is *the fraction of wall time the process has spent on CPU, averaged from birth until the moment of sampling*. | ||
| It is a **lifetime average**, not an instantaneous rate. | ||
| This differs from `top(1)`, which reports usage averaged over its most recent refresh interval. | ||
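
As a small sketch of the formula above, the lifetime average can be computed directly from cumulative CPU time and elapsed wall time. The function name `lifetime_pcpu` is made up for illustration and is not part of ps or duct.

```python
def lifetime_pcpu(cpu_seconds: float, elapsed_seconds: float) -> float:
    """Percent of wall time spent on CPU, averaged since process start.

    cpu_seconds is utime + stime; elapsed_seconds is now - process_start_time.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return 100.0 * cpu_seconds / elapsed_seconds


# Two threads busy for the whole 10 seconds: more than 100% is expected.
print(lifetime_pcpu(cpu_seconds=20.0, elapsed_seconds=10.0))
# One second of work followed by 9 seconds idle: the average is diluted.
print(lifetime_pcpu(cpu_seconds=1.0, elapsed_seconds=10.0))
```

The guard on `elapsed_seconds` is a choice made for this sketch; it sidesteps the sub-second case rather than modelling how ps handles it.
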
|
|
||
| ### `etime` is integer seconds | ||
|
|
||
| `ps -o etime` reports elapsed time as an integer count of seconds (formatted `[[DD-]HH:]MM:SS`). | ||
| This has consequences for short-lived and freshly-spawned pids: | ||
|
|
||
| - During a pid's first second of life, `etime` reads as `00:00`. | ||
| ps's `pcpu` calculation divides by this `etime`, and the result during sub-second life is unstable. | ||
| - A pid sampled at sub-second age that has accumulated meaningful CPU work across multiple threads can yield extreme `pcpu` readings. | ||
| Issue [#399](https://github.com/con/duct/issues/399) included a single pid reporting 5347% `pcpu` at `etime=3` on a 20-core machine, which is physically impossible: it came from a sub-second-young sub-sample where ps's calculation was racy. | ||
|
|
||
| This is why sample intervals shorter than `1.0s` behave erratically. | ||
| Consecutive samples of the same pid often see the same integer `etime`, so derived measurements (like the `--cpu ps-cpu-timepoint` view, below) discard those points because `Δetime = 0`. | ||
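
To make the integer resolution concrete, here is a minimal sketch of converting an `etime` string in the `[[DD-]HH:]MM:SS` format to seconds. The helper name `etime_to_seconds` is hypothetical and this is not necessarily how duct parses the field.

```python
def etime_to_seconds(etime: str) -> int:
    """Convert a ps etime string ([[DD-]HH:]MM:SS) to whole seconds."""
    rest = etime.strip()
    days = 0
    if "-" in rest:
        day_part, rest = rest.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in rest.split(":")]
    if not 2 <= len(parts) <= 3:
        raise ValueError(f"unexpected etime format: {etime!r}")
    if len(parts) == 2:
        parts.insert(0, 0)  # no hours field present
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


# Two samples taken 0.5s apart can both read "00:03", so the delta is zero.
prev, curr = etime_to_seconds("00:03"), etime_to_seconds("00:03")
print(curr - prev)
```

Because the result is always a whole number of seconds, any sub-second sampling interval will regularly produce pairs of samples with identical `etime`.
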
|
|
||
| ### Three scenarios to build intuition | ||
|
|
||
| #### Scenario A: long-running steady-state process at 100% CPU | ||
|
|
||
| ``` | ||
| t = 1s: cumulative CPU = 1.0s, elapsed = 1s → pcpu = 100% | ||
| t = 10s: cumulative CPU = 10.0s, elapsed = 10s → pcpu = 100% | ||
| t = 60s: cumulative CPU = 60.0s, elapsed = 60s → pcpu = 100% | ||
| ``` | ||
|
|
||
| For steady-state workloads, lifetime-average converges to instantaneous. | ||
| *This is why the mental model "pcpu = current CPU usage" works most of the time.* | ||
|
|
||
| #### Scenario B: brief burst, then idle | ||
|
|
||
| A process that does 1 second of 100% CPU work, then sits idle: | ||
|
|
||
| ``` | ||
| t = 1s: cumulative CPU = 1.0s, elapsed = 1s → pcpu = 100% | ||
| t = 2s: cumulative CPU = 1.0s, elapsed = 2s → pcpu = 50% | ||
| t = 10s: cumulative CPU = 1.0s, elapsed = 10s → pcpu = 10% | ||
| t = 100s: cumulative CPU = 1.0s, elapsed = 100s → pcpu = 1% | ||
| ``` | ||
|
|
||
| After the burst, `pcpu` decays toward 0. | ||
| The process "remembers" past CPU work and slowly forgets as its elapsed time grows. | ||
| This is counterintuitive if you expect a real-time number. | ||
|
|
||
| #### Scenario C: the pathological summation case | ||
|
|
||
| Many short-lived, multi-threaded native child processes, as happens under tox when pip compiles C extensions: | ||
|
|
||
| ``` | ||
| Child 1 runs for 200ms on 4 cores, observed by sample at t=150ms: | ||
| cumulative CPU = 600ms, elapsed = 150ms | ||
| → pcpu reported = 400% | ||
|
|
||
| …30 such children observed during a single sample… | ||
|
|
||
| sum across children at sample time = 30 × 400% = 12,000% | ||
| system physical ceiling (20 cores) = 2,000% | ||
| ``` | ||
|
|
||
| Each individual child's number is correct for what `ps` is answering ("fraction of wall time spent on CPU, averaged from start"). | ||
| The problem is that *summing* lifetime-averages across processes that took turns on the CPU produces a total claiming work the system didn't have the cores to do. | ||
| The children ran sequentially, but the sum over the report window treats the spikes as simultaneous. | ||
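
The arithmetic of Scenario C can be reproduced in a few lines. The 60-second window is an assumption (it matches the default report interval), as is the simplification that each child is fully busy on 4 cores for its whole 200ms lifetime.

```python
N_CHILDREN = 30
CORES_PER_CHILD = 4
CHILD_LIFETIME_S = 0.2
HOST_CORES = 20
WINDOW_S = 60.0  # assumed window length, not taken from the scenario

# Each child's lifetime average: busy on 4 cores for its entire life.
per_child_pcpu = CORES_PER_CHILD * 100.0
# Summing the per-child readings as if they were simultaneous.
summed_pcpu = N_CHILDREN * per_child_pcpu
physical_ceiling = HOST_CORES * 100.0

# CPU time actually consumed, spread over the assumed window.
cpu_seconds = N_CHILDREN * CORES_PER_CHILD * CHILD_LIFETIME_S
window_average_pcpu = 100.0 * cpu_seconds / WINDOW_S

print(summed_pcpu, physical_ceiling, window_average_pcpu)
```

Under these assumptions the summed figure is several times the physical ceiling, while the CPU time actually consumed averages to well under one core across the window.
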
|
|
||
| ### When `pcpu` is reliable | ||
|
|
||
| | Workload shape | `pcpu` reliability | | ||
| |-----------------------------------------------|---------------------------------------------------------| | ||
| | Single long-running steady-state process | Accurate | | ||
| | Few long-running processes at steady state | Accurate | | ||
| | Bursty processes that are long-running | Accurate at the average; misses burst structure | | ||
| | Many short-lived (few-second) child processes | **Unreliable: can inflate dramatically when summed** | | ||
| | Multi-threaded native code bursts | Per-process `pcpu` correct; summed totals may overshoot | | ||
|
|
||
| --- | ||
|
|
||
| ## Memory — `rss` and `pmem` | ||
|
|
||
| `ps -o rss` reports per-process **resident set size**: physical memory currently mapped into the process's address space, in kilobytes. | ||
| This counts: | ||
|
|
||
| - Private pages the process has allocated and touched. | ||
| - **Shared pages** (libraries, copy-on-write memory after `fork()`) that the process has mapped, counted independently in **each** process that maps them. | ||
|
|
||
| `ps -o pmem` is derived: `rss` divided by total system RAM, expressed as a percentage. | ||
| It inherits every property of `rss` and adds a host-dependent denominator. | ||
|
|
||
| ### The shared-page issue | ||
|
|
||
| When multiple processes share the same physical page, that page appears in **each process's RSS**, but the physical page exists only once. | ||
|
|
||
| Example: a Python parent process with 100MB RSS forks 10 child workers. | ||
| Immediately after fork: | ||
|
|
||
| ``` | ||
| Parent RSS: 100MB | ||
| Child 1 RSS: 100MB | ||
| … | ||
| Child 10 RSS: 100MB | ||
|
|
||
| Sum of RSS across processes: 1100MB | ||
| Actual physical memory used: ~100MB (all shared with parent) | ||
| ``` | ||
|
|
||
| As children write to their copy of each page, copy-on-write triggers and the page becomes private. | ||
| At that point physical use genuinely grows. | ||
| So `sum(rss)` is a **loose upper bound** on actual usage: never less than true usage, often much more. | ||
|
|
||
| For a duct-monitored Python test suite with `pytest-xdist` spawning 8 workers, expect `sum(rss)` to overstate physical memory by 3-5×. | ||
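
The fork example can be turned into a toy model. It assumes the parent never writes after forking, the children allocate nothing beyond the inherited image, and each child dirties the same amount; the function name and parameters are made up for illustration.

```python
from typing import Tuple


def rss_vs_physical(
    parent_mb: int, n_children: int, dirtied_mb_per_child: int
) -> Tuple[int, int]:
    """Return (sum of RSS, physical memory) in MB for a simplified fork model."""
    if not 0 <= dirtied_mb_per_child <= parent_mb:
        raise ValueError("a child cannot dirty more than it inherited")
    # Every process maps the full image, so each one reports parent_mb of RSS.
    sum_rss = parent_mb * (1 + n_children)
    # Only pages a child has written to get their own physical copy.
    physical = parent_mb + n_children * dirtied_mb_per_child
    return sum_rss, physical


print(rss_vs_physical(100, 10, 0))   # immediately after fork
print(rss_vs_physical(100, 10, 20))  # after each child has written to 20MB
```

In this model `sum(rss)` stays constant while physical use grows with copy-on-write, which is why the ratio between the two depends so heavily on how much each worker writes.
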
|
|
||
| --- | ||
|
|
||
| ## What `con-duct plot` renders | ||
|
|
||
| `con-duct plot <usage>` renders, per report-interval record: | ||
|
|
||
| - **Per-pid traces**: one faint dotted line per pid. | ||
| CPU is on the primary y-axis. | ||
| RSS is on a secondary axis (`twinx`) so the two scales don't fight. | ||
| Color encodes metric, not pid identity. | ||
| - **Envelope lines**: summarize the per-pid cloud at each timestamp (one solid lower bound + one dashed upper bound). | ||
| - **Optional host-memory annotation**: when `info.json` is alongside the usage file, the rss legend label includes total host RAM (e.g. `rss (host: 256.0GB)`). | ||
| This is useful in SLURM contexts. | ||
| Without `info.json`, the label is plain `rss`. | ||
|
|
||
| ### `--cpu` mode flag | ||
|
|
||
| duct stores `pcpu` (lifetime average from ps) per pid in `usage.jsonl`. | ||
| The plot can render this two ways: | ||
|
|
||
| - **`--cpu ps-pcpu` (default)**: plot the raw lifetime ratio untransformed. | ||
| "Lossless" view: every point on the chart is an unaltered ps reading. | ||
| Useful when you want to see exactly what the sampler captured. | ||
| - **`--cpu ps-cpu-timepoint`**: at plot time, derive a per-interval estimate from consecutive `(pcpu, etime)` pairs: `(curr_pcpu × curr_etime − prev_pcpu × prev_etime) / Δetime`. | ||
| This inverts ps's lifetime-average formula to extract an approximate instantaneous CPU rate. | ||
| Motivated by [Scenario C](#scenario-c-the-pathological-summation-case): lifetime averages of short-lived bursty processes overstate "current" usage by orders of magnitude. | ||
|
|
||
| Both modes have caveats: | ||
|
|
||
| - The raw `ps-pcpu` mode shows what ps reported, including lifetime-average inflation. | ||
| A pid that ran on 4 cores for 150ms and went idle peaks at 400% in the first report interval that observed it, then decays toward its true average as `etime` grows in subsequent intervals. | ||
| - The derived `ps-cpu-timepoint` mode is approximate (delta math on max-reduced samples). | ||
| It discards each pid's first observation (no prior point to delta against), so short-lived pids that appear in only one record drop out entirely. | ||
| CPU bursts from those pids are not visible in the timepoint view, but remain visible in the `ps-pcpu` view via `totals.pcpu`. | ||
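
The delta formula used by `ps-cpu-timepoint` can be sketched as a standalone function. The name `derive_timepoint_cpu` and the choice to return `None` for discarded points are assumptions for illustration, not necessarily how `con-duct plot` implements it.

```python
from typing import Optional


def derive_timepoint_cpu(
    prev_pcpu: float, prev_etime: int, curr_pcpu: float, curr_etime: int
) -> Optional[float]:
    """Estimate CPU percent over the interval between two (pcpu, etime) readings.

    pcpu * etime recovers cumulative CPU time (in percent-seconds), so the
    difference divided by the elapsed time is the average rate in between.
    """
    delta_etime = curr_etime - prev_etime
    if delta_etime <= 0:
        return None  # same integer etime: the point is discarded
    return (curr_pcpu * curr_etime - prev_pcpu * prev_etime) / delta_etime


# Scenario B: 100% at t=1s, lifetime average 50% at t=2s -> idle in between.
print(derive_timepoint_cpu(100.0, 1, 50.0, 2))
# Steady 100% between t=10s and t=60s stays at 100%.
print(derive_timepoint_cpu(100.0, 10, 100.0, 60))
```

Because the inputs are max-reduced per report window, `pcpu` and `etime` in a record may not come from the same sample, which is one reason the result is only approximate.
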
|
|
||
| ### Envelope semantics | ||
|
|
||
| The plot draws two envelopes over the per-pid trace cloud: | ||
|
|
||
| - **Lower bound (solid)**: max-across-pids at each timestamp. | ||
| Reads as "at least this much was in use." | ||
| - **Upper bound (dashed)**: depends on what's being plotted. | ||
| - **RSS, and CPU in `ps-pcpu` mode**: `totals.*` from the record. | ||
| duct computes this as the peak simultaneous total observed across the report window's sub-samples. | ||
| - **CPU in `ps-cpu-timepoint` mode**: sum-across-pids of the derived (instantaneous) values at each timestamp. | ||
| Used here because `totals.pcpu` is a peak of *lifetime averages* and doesn't share units with the derived instantaneous values. | ||
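
The envelope rules can be summarized in a short sketch for a single timestamp. The `envelopes` helper and its arguments are hypothetical; they only restate the rules above and are not the plot module's API.

```python
from typing import Dict, Optional, Tuple


def envelopes(
    per_pid: Dict[str, float], recorded_total: Optional[float] = None
) -> Tuple[float, float]:
    """Return (lower, upper) bounds for one timestamp.

    lower: max across pids ("at least this much was in use").
    upper: the record's totals.* value when given (RSS, CPU in ps-pcpu mode),
           otherwise the sum across pids (CPU in ps-cpu-timepoint mode).
    """
    lower = max(per_pid.values())
    upper = recorded_total if recorded_total is not None else sum(per_pid.values())
    return lower, upper


# RSS: the upper bound comes from the recorded total.
print(envelopes({"A": 900.0, "B": 800.0}, recorded_total=1000.0))
# Derived timepoint CPU: the upper bound is the sum of derived values.
print(envelopes({"A": 100.0, "B": 50.0}))
```
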
|
|
||
| ## Common questions | ||
|
|
||
| ### Why is the raw `pcpu` line in `ps-pcpu` mode so much higher than `ncores × 100%`? | ||
|
|
||
| Two compounding reasons, either of which can do it alone: | ||
|
|
||
| 1. **Single-pid extremes from ps.** | ||
| For pids sampled at sub-second age, ps's `cputime / etime` calculation is unstable (see [`etime` is integer seconds](#etime-is-integer-seconds)). | ||
| Individual pids can briefly report thousands of percent. | ||
| 2. **Summed lifetime-averages across many short-lived pids.** | ||
| Even if each pid's `pcpu` is finite, summing lifetime averages across processes that took turns on the cores produces a total claiming work the cores couldn't have done. | ||
| See [Scenario C](#scenario-c-the-pathological-summation-case). | ||
| Most common in workloads that spawn many short-lived child processes involving native/multi-threaded code: pip install compiling C extensions, `make -j`, tox, any CI/build workflow. | ||
|
|
||
| ### Why is the `ps-cpu-timepoint` line lower than the `ps-pcpu` line? | ||
|
|
||
| `ps-pcpu` plots the lifetime average from ps. | ||
| A burst captured early in a pid's life pulls the reported `pcpu` high, and that pid's trace decays slowly as `etime` grows. | ||
| `ps-cpu-timepoint` instead estimates an instantaneous rate per report interval, so a burst contributes only to the interval that contained it. | ||
|
|
||
| Example: a pid that did 600ms of CPU on 4 cores in its first 150ms and was idle thereafter. | ||
| `ps-pcpu` shows ~400% in the first report interval and a decaying trace in subsequent intervals (until the pid dies or the trace falls off the chart). | ||
| `ps-cpu-timepoint` shows ~400% only in the burst interval and ~0% thereafter. | ||
|
|
||
| The timepoint view is more "honest" about current usage but loses the cumulative-effort information that `ps-pcpu` carries. | ||
|
|
||
| ### Why does `totals.*` not equal `sum(per-pid max)` in a record? | ||
|
|
||
| duct max-reduces per-pid stats and session totals independently within a report window. | ||
| A pid's reported `rss` is its max across sub-samples in the window; `totals.rss` is the max of the *simultaneous total* across those sub-samples. | ||
| The per-pid peaks may have happened at different moments, so summing them counts moments that never coexisted. | ||
| `totals.*` is the actual peak simultaneous footprint and is the right number for sizing. | ||
|
|
||
| ### My RSS chart grew a lot when I added more worker processes. Did memory usage really grow proportionally? | ||
|
|
||
| Probably not. | ||
| If the workers are forked children of a common parent, each child's RSS counts the shared pages it inherited. | ||
| Per-pid traces and their max envelope grow roughly linearly with child count even when physical memory grows much less. | ||
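| The dashed `totals.rss` upper bound is the peak simultaneous sum of per-pid RSS, so it is tighter than summing per-pid peaks, but it still counts every shared page (fork-inherited memory as well as shared libraries) once per process that maps it. | ||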
| See [The shared-page issue](#the-shared-page-issue). | ||