feat: [DSM-142] Use CanisterStates in ReplicatedState#10287

Open
alin-at-dfinity wants to merge 22 commits into master from alin/DSM-142-canister-states-integration

@alin-at-dfinity
Contributor

Switches ReplicatedState::canister_states from a flat BTreeMap<CanisterId, Arc<CanisterState>> to CanisterStates, exposing the hot/cold partition to the rest of the system and migrating every caller.

ReplicatedState changes:

  • canister_states field is now CanisterStates.
  • Drop canisters_iter_mut(). Round-level callers move to hot_canisters_iter_mut() (skips the long tail of cold canisters); bulk callers move to canisters_for_each_mut / canisters_try_for_each_mut, which iterate every canister and re-establish the partition afterwards.
  • Add hot_canisters_iter() for read-only hot-only iteration.
  • Add repartition_canister_states(), called from StateManager::commit_and_certify after flush_checkpoint_ops_and_page_maps to drive canisters that went quiet during the round back into cold before checkpointing, so that replicas continuing through a checkpoint and replicas (re)starting from it agree on the partition.
  • take_canister_states / put_canister_states now exchange the CanisterStates directly instead of going through a flat BTreeMap round-trip.
  • Aggregator delegations: total_compute_allocation, memory_taken, total_canister_memory_usage, guaranteed_response_message_memory_taken, best_effort_message_memory_taken, callback_count now delegate to CanisterStates and run in O(|hot|).
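To illustrate what "iterate every canister and re-establish the partition afterwards" could look like, here is a minimal sketch. The `Canister`, `Pools`, and `for_each_mut` names below are hypothetical stand-ins, the `is_cold` predicate is reduced to a single field, and merging the two maps is only one possible strategy; the actual `CanisterStates` implementation may well do this differently.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for `CanisterState` (assumption, not the real type).
#[derive(Clone, Debug, Default, PartialEq)]
pub struct Canister {
    pub pending_messages: usize,
}

impl Canister {
    /// Simplified idle predicate: only looks at pending messages.
    pub fn is_cold(&self) -> bool {
        self.pending_messages == 0
    }
}

/// Illustrative two-pool container; every id lives in exactly one pool.
#[derive(Default)]
pub struct Pools {
    pub hot: BTreeMap<u64, Canister>,
    pub cold: BTreeMap<u64, Canister>,
}

impl Pools {
    /// Applies `f` to every canister in id order, then rebuilds the split.
    pub fn for_each_mut(&mut self, mut f: impl FnMut(u64, &mut Canister)) {
        // Merge both pools; ids are disjoint, so nothing is overwritten.
        let mut all = std::mem::take(&mut self.hot);
        all.append(&mut self.cold);

        for (id, canister) in all.iter_mut() {
            f(*id, canister);
        }

        // Re-establish the partition based on the post-mutation state.
        for (id, canister) in all {
            if canister.is_cold() {
                self.cold.insert(id, canister);
            } else {
                self.hot.insert(id, canister);
            }
        }
    }
}
```

The point of the sketch is that a bulk mutation can both wake up a cold canister and quiet a hot one, so the split has to be recomputed from the resulting state rather than assumed.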

state_manager:

  • commit_and_certify calls state.repartition_canister_states() after flush_checkpoint_ops_and_page_maps and before tip handover.
  • validate_eq_canister_states calls CanisterStates::validate_strict_split on the reference state to verify that the persisted partition matches what CanisterStates::new would produce on a fresh load.
  • flush_checkpoint_ops_and_page_maps and switch_to_checkpoint switch from canisters_iter_mut to canisters_for_each_mut / canisters_try_for_each_mut.
  • Bench: bench_traversal likewise.
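As a rough illustration of the strict-split check, the sketch below verifies that a partition is exactly what a fresh classification would produce: nothing idle in the hot pool, nothing active in the cold pool, and no id in both. This is one reading of the invariant described above; the types and the free function are hypothetical and do not reflect the real signature of `CanisterStates::validate_strict_split`.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for `CanisterState` (assumption, not the real type).
pub struct Canister {
    pub pending_messages: usize,
}

impl Canister {
    /// Simplified idle predicate used only for this sketch.
    pub fn is_cold(&self) -> bool {
        self.pending_messages == 0
    }
}

/// Returns `Ok(())` if the split matches a from-scratch classification.
pub fn validate_strict_split(
    hot: &BTreeMap<u64, Canister>,
    cold: &BTreeMap<u64, Canister>,
) -> Result<(), String> {
    for (id, canister) in hot {
        if cold.contains_key(id) {
            return Err(format!("canister {} is in both pools", id));
        }
        if canister.is_cold() {
            return Err(format!("canister {} is idle but in hot", id));
        }
    }
    for (id, canister) in cold {
        if !canister.is_cold() {
            return Err(format!("canister {} is active but in cold", id));
        }
    }
    Ok(())
}
```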

execution_environment:

  • scheduler.rs: use hot-only iteration where appropriate (add_heartbeat_and_global_timer_tasks, purge_expired_ingress_messages, the ongoing_long_install_code check); migrate charge_canisters_for_resource_allocation_and_usage and the log-memory-store migration loop to canisters_for_each_mut.
  • round_schedule.rs: partition_canisters_to_cores now takes / returns a CanisterStates; idle canisters are dropped before the main hot-canister iteration.
  • query_handler.rs, execution_environment.rs: callers updated.
  • canister_manager/tests.rs, scheduler tests (scheduling.rs, metrics.rs, dts.rs, ecdsa.rs, round_schedule/tests.rs, test_utilities.rs, tests.rs) updated.
  • benches/scheduler.rs: updated.

canonical_state:

  • lazy_tree_conversion.rs: new CanisterStatesFork<'_> that presents a CanisterStates as a LazyFork over the merged hot+cold pools in CanisterId order.

messaging:

  • stream_builder/tests.rs, state_machine/tests.rs, tests/common/mod.rs: caller updates.

replicated_state queues and system_state:

  • CanisterQueues / SystemState local_canisters parameter type flips from &BTreeMap<CanisterId, Arc<CanisterState>> to &CanisterStates (no behavioral change; queues only need contains_key).

metrics.rs:

  • check_dts walks hot_canisters_iter() (only hot canisters can have non-empty task queues).
  • check_subnet_memory_usage switches to CanisterStates::memory_taken() for O(|hot|) aggregation.

test_utilities and state_tool:

  • test_utilities/execution_environment, test_utilities/state, and state_tool/src/commands/canister_metrics.rs updated to use the new iteration APIs.

alin-at-dfinity and others added 5 commits May 22, 2026 07:23
Lays the foundation for splitting `ReplicatedState::canister_states` into
"hot" (potentially active) and "cold" (definitely idle) pools, so that
per-round operations can skip the long tail of idle canisters.

This PR is intentionally a no-op for the running replica: it only adds
the new types and predicates. The integration into `ReplicatedState` and
the migration of all consumers follow in subsequent PRs.

Specifically:

  * `CanisterState::is_cold()` — pure predicate that classifies a canister
    as "definitely idle": no input/output, no task queue entries, no
    heartbeat method, inactive global timer, not `Stopping`, no
    unexpired best-effort callbacks, and no scheduler debits.
  * `CallContextManager::has_unexpired_callbacks()` and the matching
    `SystemState::has_unexpired_callbacks()` accessor, used by `is_cold`.
  * `CanisterStates`, a hot/cold-partitioned container with eager
    promotion (mutations land in `hot`) and lazy demotion (via
    `try_cool`/`try_cool_all`), plus the common map operations
    (`get`/`get_mut`/`insert`/`remove`/`contains_key`/`len`/`is_empty`/
    `retain`), per-pool iterators (`hot_iter`/`hot_values`/
    `hot_values_mut`), merged iterators in `CanisterId` order
    (`all_iter`/`all_keys`/`all_values`), and bulk mutation
    (`for_each_mut`/`try_for_each_mut`).
  * `CanisterStates::validate_strict_split()` for the canonical-partition
    invariant used in checkpoint validation.
  * `debug_assert_invariants()` runs on every mutating operation in
    debug builds.
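
The eager-promotion / lazy-demotion behaviour listed above can be sketched as follows. This is a simplified model under stated assumptions: `Canister` and `Partitioned` are hypothetical types, ids are plain `u64`, and the idle predicate checks only two of the conditions that the real `is_cold()` considers.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for `CanisterState` (assumption, not the real type).
#[derive(Clone, Debug, Default, PartialEq)]
pub struct Canister {
    pub pending_messages: usize,
    pub has_heartbeat: bool,
}

impl Canister {
    /// Reduced version of the "definitely idle" predicate.
    pub fn is_cold(&self) -> bool {
        self.pending_messages == 0 && !self.has_heartbeat
    }
}

/// Illustrative container: every id lives in exactly one of the two pools.
#[derive(Default)]
pub struct Partitioned {
    hot: BTreeMap<u64, Canister>,
    cold: BTreeMap<u64, Canister>,
}

impl Partitioned {
    /// Inserts always land in `hot`; demotion happens later.
    pub fn insert(&mut self, id: u64, canister: Canister) {
        self.cold.remove(&id);
        self.hot.insert(id, canister);
    }

    /// Read-only access does not change the partition.
    pub fn get(&self, id: u64) -> Option<&Canister> {
        self.hot.get(&id).or_else(|| self.cold.get(&id))
    }

    /// Eager promotion: the entry moves to `hot` before `&mut` is handed
    /// out, because the caller may make the canister active.
    pub fn get_mut(&mut self, id: u64) -> Option<&mut Canister> {
        if let Some(canister) = self.cold.remove(&id) {
            self.hot.insert(id, canister);
        }
        self.hot.get_mut(&id)
    }

    /// Lazy demotion: move every hot entry that satisfies `is_cold()`.
    pub fn try_cool_all(&mut self) {
        let idle: Vec<u64> = self
            .hot
            .iter()
            .filter(|(_, canister)| canister.is_cold())
            .map(|(id, _)| *id)
            .collect();
        for id in idle {
            if let Some(canister) = self.hot.remove(&id) {
                self.cold.insert(id, canister);
            }
        }
    }

    pub fn hot_len(&self) -> usize {
        self.hot.len()
    }

    pub fn cold_len(&self) -> usize {
        self.cold.len()
    }
}
```

With this shape, `hot` may temporarily hold idle canisters (a safe over-approximation), while `cold` never holds an active one.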

`ColdStats` and the aggregate accessors (`total_compute_allocation`,
`total_canister_memory_usage`, `memory_taken`, `callback_count`, ...)
are intentionally **not** part of this PR — they will be added once the
struct is in place.

Co-authored-by: Cursor <cursoragent@cursor.com>
Maintains a small `ColdStats` aggregate over the canisters in the
`cold` pool, updated incrementally on every transition into / out of
`cold`. This lets the "touch every canister" aggregate queries —
`total_compute_allocation`, `total_canister_memory_usage`,
`memory_taken`, `callback_count`,
`guaranteed_response_message_memory_taken`,
`best_effort_message_memory_taken` — run in `O(|hot|)` instead of
`O(|all canisters|)`, which is the primary motivation for the
hot/cold split on subnets with a long tail of idle canisters.

The aggregates are derived (not persisted) and are reconstructed by
`CanisterStates::new` on checkpoint load. `debug_assert_invariants`
now also recomputes the aggregate in `O(|cold|)` and compares it
against the live value, so debug builds check that every mutating
method keeps the two in sync. The `ColdStats` struct stays
module-private: callers always reach the totals through the public
aggregator methods on `CanisterStates`.
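
A minimal sketch of the incremental bookkeeping described above, assuming a hypothetical `ColdTotals` struct with just two fields (the real `ColdStats` tracks more and is module-private). Totals are adjusted on every move into or out of the cold pool, so a subnet-wide query only needs to walk the hot pool.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-canister figures being aggregated (assumption).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Canister {
    pub compute_allocation: u64,
    pub memory_bytes: u64,
}

/// Illustrative running totals over the cold pool only.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct ColdTotals {
    pub compute_allocation: u64,
    pub memory_bytes: u64,
}

impl ColdTotals {
    fn add(&mut self, c: &Canister) {
        self.compute_allocation += c.compute_allocation;
        self.memory_bytes += c.memory_bytes;
    }

    fn sub(&mut self, c: &Canister) {
        self.compute_allocation -= c.compute_allocation;
        self.memory_bytes -= c.memory_bytes;
    }
}

#[derive(Default)]
pub struct Pools {
    pub hot: BTreeMap<u64, Canister>,
    pub cold: BTreeMap<u64, Canister>,
    pub cold_totals: ColdTotals,
}

impl Pools {
    /// Moves `id` from hot to cold and adds it to the running totals.
    pub fn demote(&mut self, id: u64) {
        if let Some(c) = self.hot.remove(&id) {
            self.cold_totals.add(&c);
            self.cold.insert(id, c);
        }
    }

    /// Moves `id` from cold to hot and subtracts it from the running totals.
    pub fn promote(&mut self, id: u64) {
        if let Some(c) = self.cold.remove(&id) {
            self.cold_totals.sub(&c);
            self.hot.insert(id, c);
        }
    }

    /// O(|hot|): one pass over the hot pool plus the cached cold total.
    pub fn total_memory_bytes(&self) -> u64 {
        self.cold_totals.memory_bytes
            + self.hot.values().map(|c| c.memory_bytes).sum::<u64>()
    }

    /// O(|cold|) recompute, the kind of check a debug assertion could run.
    pub fn recompute_cold_totals(&self) -> ColdTotals {
        let mut totals = ColdTotals::default();
        self.cold.values().for_each(|c| totals.add(c));
        totals
    }
}
```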

`MemoryTaken`'s fields are bumped from private to `pub(crate)` so that
`CanisterStates::memory_taken` can construct the struct directly,
keeping `MemoryTaken` in its current home in `replicated_state.rs`.
`CanisterStates::memory_taken` itself is `pub(crate)` and will be
wired up to `ReplicatedState::memory_taken` in the next PR; an
`#[allow(dead_code)]` keeps the build warning-free until then.

Aggregator behaviour is exercised by two new tests
(`memory_aggregators_combine_hot_and_cold`,
`callback_count_combines_hot_and_cold`) and the bookkeeping
discipline is exercised by an extended set of `*_updates_cold_stats*`
tests covering `insert`, `remove`, `try_cool*`, `for_each_mut`,
`try_for_each_mut`, and `retain`.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ry. Rename raw_memory to execution_memory, so it better matches the equivalent MemoryTaken field. Update documentation and tests.
A canister can satisfy `CanisterState::is_cold()` while still holding a
guaranteed-response slot reservation: `is_cold()` only requires empty
input/output *messages* (the pool count) and no unexpired best-effort
callback, both of which are independent of whether the canister has
in-flight guaranteed-response requests. A canister that has pushed a
guaranteed-response request that's already been moved to an outgoing
stream still keeps the input-slot reservation for the eventual response,
which contributes `MAX_RESPONSE_COUNT_BYTES` to its
`guaranteed_response_message_memory_usage()`.

The previous commit dropped this field from `ColdStats` on the
assumption it was always zero. It isn't, and the consequence is that
`guaranteed_response_message_memory_taken()` quietly under-reports
subnet-wide memory: promoting a cold canister with a reservation to
`hot` (e.g. on the next `get_mut`) makes the subnet total jump up out
of nowhere, breaking conservation invariants in downstream code
(stream handler `debug_assert!`s, in particular).

Restore the field and the corresponding `add`/`sub` bookkeeping, fold
it into `guaranteed_response_message_memory_taken`,
`total_canister_memory_usage`, and `memory_taken`, and add a focused
test (`cold_canister_with_guaranteed_response_reservation_is_aggregated`)
exercising the case via `push_output_request` followed by draining the
output queue.

Best-effort message memory remains hot-only: an unexpired best-effort
callback forces the canister into `hot`, and any expired best-effort
callback shows up as a pending input which also forces `hot`.
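
The under-reporting can be shown with a toy model. Everything below is hypothetical: `RESERVATION_BYTES` is a placeholder rather than the real `MAX_RESPONSE_COUNT_BYTES`, and the two functions only contrast a total that tracks the cold pool with one that assumes cold canisters contribute zero.

```rust
/// Placeholder size per reserved slot; NOT the real `MAX_RESPONSE_COUNT_BYTES`.
const RESERVATION_BYTES: u64 = 1024;

/// Hypothetical canister holding only a count of reserved response slots.
#[derive(Clone, Copy)]
pub struct Canister {
    pub reserved_slots: u64,
}

impl Canister {
    pub fn guaranteed_response_memory(&self) -> u64 {
        self.reserved_slots * RESERVATION_BYTES
    }
}

fn sum(pool: &[Canister]) -> u64 {
    pool.iter().map(Canister::guaranteed_response_memory).sum()
}

/// Subnet total when the cold aggregate tracks the field.
pub fn total_tracking_cold(hot: &[Canister], cold: &[Canister]) -> u64 {
    sum(hot) + sum(cold)
}

/// Subnet total when the cold pool is assumed to contribute zero.
pub fn total_assuming_cold_is_zero(hot: &[Canister], _cold: &[Canister]) -> u64 {
    sum(hot)
}
```

Moving a canister with one reserved slot from cold to hot leaves `total_tracking_cold` unchanged, whereas `total_assuming_cold_is_zero` changes even though no memory was actually allocated, which is the conservation violation described above.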

Co-authored-by: Cursor <cursoragent@cursor.com>
Switches `ReplicatedState::canister_states` from a flat
`BTreeMap<CanisterId, Arc<CanisterState>>` to `CanisterStates`,
exposing the hot/cold partition to the rest of the system and
migrating every caller.

`ReplicatedState` changes:

  * `canister_states` field is now `CanisterStates`.
  * Drop `canisters_iter_mut()`. Round-level callers move to
    `hot_canisters_iter_mut()` (skips the long tail of cold
    canisters); bulk callers move to `canisters_for_each_mut` /
    `canisters_try_for_each_mut`, which iterate every canister and
    re-establish the partition afterwards.
  * Add `hot_canisters_iter()` for read-only hot-only iteration.
  * Add `repartition_canister_states()`, called from
    `StateManager::commit_and_certify` after
    `flush_checkpoint_ops_and_page_maps` to drive canisters that went
    quiet during the round back into `cold` before checkpointing, so
    that replicas continuing through a checkpoint and replicas
    (re)starting from it agree on the partition.
  * `take_canister_states` / `put_canister_states` now exchange the
    `CanisterStates` directly instead of going through a flat
    `BTreeMap` round-trip.
  * Aggregator delegations: `total_compute_allocation`, `memory_taken`,
    `total_canister_memory_usage`,
    `guaranteed_response_message_memory_taken`,
    `best_effort_message_memory_taken`, `callback_count` now delegate
    to `CanisterStates` and run in `O(|hot|)`.

`state_manager`:

  * `commit_and_certify` calls `state.repartition_canister_states()`
    after `flush_checkpoint_ops_and_page_maps` and before tip
    handover.
  * `validate_eq_canister_states` calls
    `CanisterStates::validate_strict_split` on the reference state to
    verify that the persisted partition matches what
    `CanisterStates::new` would produce on a fresh load.
  * `flush_checkpoint_ops_and_page_maps` and
    `switch_to_checkpoint` switch from `canisters_iter_mut` to
    `canisters_for_each_mut` / `canisters_try_for_each_mut`.
  * Bench: `bench_traversal` likewise.

`execution_environment`:

  * `scheduler.rs`: use hot-only iteration where appropriate
    (`add_heartbeat_and_global_timer_tasks`,
    `purge_expired_ingress_messages`, the
    `ongoing_long_install_code` check); migrate
    `charge_canisters_for_resource_allocation_and_usage` and the
    log-memory-store migration loop to `canisters_for_each_mut`.
  * `round_schedule.rs`: `partition_canisters_to_cores` now takes /
    returns a `CanisterStates`; idle canisters are dropped before the
    main hot-canister iteration.
  * `query_handler.rs`, `execution_environment.rs`: callers updated.
  * `canister_manager/tests.rs`, scheduler tests
    (`scheduling.rs`, `metrics.rs`, `dts.rs`, `ecdsa.rs`,
    `round_schedule/tests.rs`, `test_utilities.rs`, `tests.rs`)
    updated.
  * `benches/scheduler.rs`: updated.

`canonical_state`:

  * `lazy_tree_conversion.rs`: new `CanisterStatesFork<'_>` that
    presents a `CanisterStates` as a `LazyFork` over the merged
    hot+cold pools in `CanisterId` order.
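
Presenting two disjoint sorted pools as one ordered sequence amounts to a two-way merge. The sketch below shows that idea with plain `BTreeMap<u64, V>` pools; `MergedIter` and `all_iter` here are hypothetical illustrations, not the actual `CanisterStatesFork` or `CanisterStates::all_iter` code.

```rust
use std::collections::btree_map::Iter;
use std::collections::BTreeMap;
use std::iter::Peekable;

/// Two-way merge over two sorted maps with disjoint keys.
pub struct MergedIter<'a, V> {
    hot: Peekable<Iter<'a, u64, V>>,
    cold: Peekable<Iter<'a, u64, V>>,
}

impl<'a, V> Iterator for MergedIter<'a, V> {
    type Item = (&'a u64, &'a V);

    fn next(&mut self) -> Option<Self::Item> {
        // Decide which side holds the smaller key before advancing.
        let take_hot = match (self.hot.peek(), self.cold.peek()) {
            (Some((h, _)), Some((c, _))) => h <= c,
            (Some(_), None) => true,
            (None, _) => false,
        };
        if take_hot {
            self.hot.next()
        } else {
            self.cold.next()
        }
    }
}

/// Iterates both pools in ascending key order.
pub fn all_iter<'a, V>(
    hot: &'a BTreeMap<u64, V>,
    cold: &'a BTreeMap<u64, V>,
) -> MergedIter<'a, V> {
    MergedIter {
        hot: hot.iter().peekable(),
        cold: cold.iter().peekable(),
    }
}
```

Because each underlying map is already sorted, the merge needs no allocation and visits each entry once.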

`canister_sandbox`:

  * `sandboxed_execution_controller.rs`: switch
    `evict_sandbox_processes` to per-id `state.canister_state(id)` /
    `state.canister_priority(id)` lookups (also enables removing the
    bulk `canister_accumulated_priorities` method). This duplicates
    the standalone "perf: Look up sandbox scheduler priorities per
    canister" PR; whichever lands first, the other becomes a no-op.

`messaging`:

  * `stream_handler/tests.rs`: pre-heat `LOCAL_CANISTER` in the
    `out_of_memory` reject-signal test so that the expected and
    inducted states share the same hot/cold partition.
  * `stream_builder/tests.rs`, `state_machine/tests.rs`,
    `tests/common/mod.rs`: caller updates.

`replicated_state` queues and system_state:

  * `CanisterQueues` / `SystemState` `local_canisters` parameter type
    flips from `&BTreeMap<CanisterId, Arc<CanisterState>>` to
    `&CanisterStates` (no behavioural change; queues only need
    `contains_key`).
  * `replicated_state.rs` deletes the now-unused
    `canister_accumulated_priorities` method.

`metrics.rs`:

  * `check_dts` walks `hot_canisters_iter()` (only hot canisters can
    have non-empty task queues).
  * `check_subnet_memory_usage` switches to
    `CanisterStates::memory_taken()` for `O(|hot|)` aggregation.

`test_utilities` and `state_tool`:

  * `test_utilities/execution_environment`, `test_utilities/state`,
    and `state_tool/src/commands/canister_metrics.rs` updated to use
    the new iteration APIs.

Co-authored-by: Cursor <cursoragent@cursor.com>
@alin-at-dfinity alin-at-dfinity requested a review from a team as a code owner May 22, 2026 10:25
@github-actions github-actions Bot added the feat label May 22, 2026
pull Bot pushed a commit to bit-cook/ic that referenced this pull request May 27, 2026
dfinity#10288)

Move the "drop idle canisters with 0-100 AP from the subnet schedule"
logic out of the `NextExecution::None` branch of the main per-canister
loop and into a dedicated pre-loop at the top of `start_iteration`.
Behavior is unchanged: the same set of idle canisters with priorities in
the 0-100 AP range get dropped.

Also clarify the doc comment for
`IterationSchedule::partition_canisters_to_cores`.

This is a small standalone refactor extracted from dfinity#10287, where the
main per-canister loop will switch from iterating all canisters to
iterating only hot canisters (at which point hoisting becomes a
correctness requirement: cold canisters would otherwise no longer be
visited by the main loop and their idle entries would not be dropped).
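
Why hoisting matters once the main loop is hot-only can be sketched like this. The types are hypothetical, the schedule is modelled as a plain map from id to accumulated priority, and the inclusive `0..=100` bound is an assumption about the range mentioned above.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical schedule: canister id -> accumulated priority (AP).
pub type Schedule = BTreeMap<u64, i64>;

/// Pre-loop: drop idle entries whose AP falls in the assumed range 0..=100.
/// It walks the schedule itself, so it does not depend on which canisters
/// the main loop visits afterwards.
pub fn drop_idle_entries(schedule: &mut Schedule, idle: &BTreeSet<u64>) {
    schedule.retain(|id, ap| !(idle.contains(id) && *ap >= 0 && *ap <= 100));
}

/// Main loop restricted to hot canisters; returns the ids it visited.
pub fn visit_hot(schedule: &Schedule, hot: &BTreeSet<u64>) -> Vec<u64> {
    schedule
        .keys()
        .filter(|id| hot.contains(*id))
        .copied()
        .collect()
}
```

If the drop stayed inside the main loop, an idle cold canister would never be visited and its schedule entry would linger; running the pre-loop first removes it regardless of pool membership.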

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: IDX GitHub Automation <infra+github-automation@dfinity.org>
…lementation; also apply pub(crate) to best_effort_message_memory_taken() and guaranteed_response_message_memory_taken(), as they are also potentially dangerous to use directly.
…d() test, so that all stats are covered; and for both hot and cold canisters.
Base automatically changed from alin/DSM-142-canister-states-cold-stats to master May 29, 2026 11:24