feat(cachet): structured telemetry with spans, events, and handler API#460

Draft
schgoo wants to merge 8 commits into main from u/schgoo/coalescemetrics
@schgoo
Collaborator

@schgoo schgoo commented May 28, 2026

Summary

Replaces cachet's event-only telemetry with a span + event model that provides per-tier timing, request correlation, and a callback API for consumers to build custom telemetry pipelines.

Motivation

The previous telemetry emitted standalone tracing events with no correlation between tiers. Consumers couldn't distinguish which tier events belonged to which cache operation, and there was no way to subscribe to structured telemetry without parsing tracing fields via the visitor pattern.

Changes

Tracing spans + events

  • Each public Cache method (get, insert, invalidate, clear, get_or_insert, etc.) creates a parent span via CacheTelemetry
  • Each CacheWrapper tier operation creates a child span, producing a nested trace: cache.get → cache.tier
  • Events are emitted inside spans at the appropriate severity level (debug for hits/misses, info for inserts/expirations, error for failures)
  • Durations are recorded on both span fields (for trace subscribers) and event fields (for log subscribers)

CacheEventHandler callback API

  • New CacheEventHandler trait with on_tier_event and on_operation_complete callbacks
  • Registered via CacheBuilder::event_handler(handler)
  • Receives typed CacheTierEvent and CacheOperationEvent structs — no tracing visitor boilerplate
  • Works independently of the logs feature flag
  • Designed as the semi-stable consumer API that hopefully survives a future migration to emit
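A minimal, std-only sketch of what a consumer of this callback API might look like. The PR text names the trait, its two callbacks, the two event structs, and the `request_id` and fallback flag; everything else here (the `tier`, `hit`, `operation`, and `duration` fields, the `&self` signatures, the `Send + Sync` bound, and the `HitMissCounter` type) is an assumption made for illustration, so the real definitions in cachet may differ.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

// Hypothetical event shapes. Only `request_id` and `fallback` are named in
// the PR description; the remaining fields are assumptions for this sketch.
#[derive(Debug, Clone)]
pub struct CacheTierEvent {
    pub request_id: u64,
    pub tier: &'static str,
    pub hit: bool,
    pub fallback: bool,
    pub duration: Duration,
}

#[derive(Debug, Clone)]
pub struct CacheOperationEvent {
    pub request_id: u64,
    pub operation: &'static str,
    pub duration: Duration,
}

// Assumed trait shape: one callback per tier event, one per finished operation.
pub trait CacheEventHandler: Send + Sync {
    fn on_tier_event(&self, event: &CacheTierEvent);
    fn on_operation_complete(&self, event: &CacheOperationEvent);
}

// Example handler (hypothetical): counts hits, misses, and completed operations.
#[derive(Default)]
pub struct HitMissCounter {
    hits: AtomicU64,
    misses: AtomicU64,
    operations: AtomicU64,
}

impl CacheEventHandler for HitMissCounter {
    fn on_tier_event(&self, event: &CacheTierEvent) {
        // Typed fields are read directly; no tracing visitor is involved.
        let counter = if event.hit { &self.hits } else { &self.misses };
        counter.fetch_add(1, Ordering::Relaxed);
    }

    fn on_operation_complete(&self, _event: &CacheOperationEvent) {
        self.operations.fetch_add(1, Ordering::Relaxed);
    }
}

impl HitMissCounter {
    /// Returns (hits, misses, completed operations).
    pub fn snapshot(&self) -> (u64, u64, u64) {
        (
            self.hits.load(Ordering::Relaxed),
            self.misses.load(Ordering::Relaxed),
            self.operations.load(Ordering::Relaxed),
        )
    }
}
```

In the PR's design such a handler would be passed to CacheBuilder::event_handler(handler); that call is not reproduced here because its exact signature is not shown in the description.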

Request correlation

  • Each cache operation gets a unique request_id: u64 from a process-wide atomic counter
  • WithRequestId<F> future wrapper restores the request ID into a thread-local on every poll(), surviving task migration across threads/cores (same pattern as tracing::Instrument)
  • Both CacheTierEvent and CacheOperationEvent carry the request_id for grouping
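The correlation mechanism described above can be sketched with the standard library alone. This is an illustration of the general pattern (set a thread-local on each poll, restore the previous value afterwards), not the PR's actual code: the names `NEXT_REQUEST_ID`, `CURRENT_REQUEST_ID`, `next_request_id`, `current_request_id`, and `poll_once` are hypothetical, and boxing the inner future is a simplification chosen to avoid pin projection.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Process-wide counter handing out unique request IDs (hypothetical name).
static NEXT_REQUEST_ID: AtomicU64 = AtomicU64::new(1);

thread_local! {
    // Request ID of the operation currently being polled on this thread.
    static CURRENT_REQUEST_ID: Cell<Option<u64>> = Cell::new(None);
}

pub fn next_request_id() -> u64 {
    NEXT_REQUEST_ID.fetch_add(1, Ordering::Relaxed)
}

pub fn current_request_id() -> Option<u64> {
    CURRENT_REQUEST_ID.with(|slot| slot.get())
}

/// Wraps a future so the request ID is visible in the thread-local while
/// the inner future is being polled, whichever thread performs the poll.
pub struct WithRequestId<F> {
    request_id: u64,
    // Boxed so this wrapper is `Unpin` and needs no unsafe pin projection.
    inner: Pin<Box<F>>,
}

impl<F: Future> WithRequestId<F> {
    pub fn new(request_id: u64, inner: F) -> Self {
        Self { request_id, inner: Box::pin(inner) }
    }
}

impl<F: Future> Future for WithRequestId<F> {
    type Output = F::Output;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let id = self.request_id;
        // Install our ID for the duration of this poll only.
        let previous = CURRENT_REQUEST_ID.with(|slot| slot.replace(Some(id)));
        let result = self.inner.as_mut().poll(cx);
        // Restore whatever was there so nested or unrelated work is unaffected.
        CURRENT_REQUEST_ID.with(|slot| slot.set(previous));
        result
    }
}

// Tiny driver for the sketch: polls a future once with a waker that does nothing.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

pub fn poll_once<F: Future + Unpin>(future: &mut F) -> Poll<F::Output> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    Pin::new(future).poll(&mut cx)
}
```

Because the ID is reinstalled at the start of every poll rather than once at creation, it does not matter if an executor moves the task to another thread between polls. A production wrapper would more likely use pin projection instead of a heap allocation.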

Fallback as a flag

  • cache.fallback is now a boolean flag on tier events, not a separate event type
  • Indicates whether a tier was consulted as a fallback

Removed

  • telemetry/ext.rs (ClockExt, Timed, TimedResult) — replaced by clock.stopwatch() directly
  • EVENT_FALLBACK and EVENT_REQUEST_MERGED attribute constants
  • CacheTelemetryInner — span creation moved into CacheTelemetry directly

New attributes

  • FIELD_COALESCED — boolean flag for stampede protection
  • FIELD_FALLBACK — boolean flag for fallback tier consultation

Examples

  • telemetry_subscriber — shows span + event output with tracing_subscriber::fmt
  • telemetry_accumulator — demonstrates accumulating tier events into a single summary per operation using CacheEventHandler + DashMap, mirroring a TVS-style consumer pattern
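A rough sketch of the accumulation pattern that the telemetry_accumulator example is described as demonstrating: tier events are grouped by `request_id` until the operation completes, then folded into one summary. This is not the example's source. It substitutes `Mutex<HashMap>` for DashMap to stay within the standard library, uses inherent methods in place of the CacheEventHandler trait callbacks, and every type and field name (`TierEvent`, `OperationSummary`, `Accumulator`, and so on) is hypothetical.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::Duration;

// Simplified stand-in for a tier event (field names are assumptions).
#[derive(Debug, Clone, Copy)]
pub struct TierEvent {
    pub request_id: u64,
    pub tier: &'static str,
    pub hit: bool,
    pub duration: Duration,
}

// One summary per cache operation, built from all of its tier events.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OperationSummary {
    pub request_id: u64,
    pub tiers: Vec<&'static str>,
    pub hit_tier: Option<&'static str>,
    pub tier_time: Duration,
    pub total_time: Duration,
}

// State collected for an operation that has not completed yet.
#[derive(Default)]
struct Partial {
    tiers: Vec<&'static str>,
    hit_tier: Option<&'static str>,
    tier_time: Duration,
}

#[derive(Default)]
pub struct Accumulator {
    in_flight: Mutex<HashMap<u64, Partial>>,
    finished: Mutex<Vec<OperationSummary>>,
}

impl Accumulator {
    /// Would be driven by the per-tier callback: file the event under its request ID.
    pub fn on_tier_event(&self, event: TierEvent) {
        let mut in_flight = self.in_flight.lock().unwrap();
        let partial = in_flight.entry(event.request_id).or_default();
        partial.tiers.push(event.tier);
        partial.tier_time += event.duration;
        if event.hit && partial.hit_tier.is_none() {
            partial.hit_tier = Some(event.tier);
        }
    }

    /// Would be driven by the operation-complete callback: emit one summary.
    pub fn on_operation_complete(&self, request_id: u64, total_time: Duration) {
        let partial = self
            .in_flight
            .lock()
            .unwrap()
            .remove(&request_id)
            .unwrap_or_default();
        self.finished.lock().unwrap().push(OperationSummary {
            request_id,
            tiers: partial.tiers,
            hit_tier: partial.hit_tier,
            tier_time: partial.tier_time,
            total_time,
        });
    }

    /// Drains the summaries produced so far.
    pub fn take_finished(&self) -> Vec<OperationSummary> {
        std::mem::take(&mut *self.finished.lock().unwrap())
    }
}
```

Keying on `request_id` is what keeps interleaved operations apart: events from a concurrent request land in a different map entry and stay there until that request completes.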

Performance

Benchmarked in release mode (MockCache get, single tier):

| Configuration | Time |
| --- | --- |
| No telemetry | 471ns |
| Telemetry enabled, no subscriber | 481ns |
| With stampede protection | 1,122ns |

With no active subscriber, enabling telemetry adds ~10ns of overhead (481ns vs. 471ns). The WithRequestId wrapper adds ~300ns on the stampede protection path.

@github-actions

github-actions Bot commented May 28, 2026

⚠️ Breaking Changes Detected


--- failure pub_module_level_const_missing: pub module-level const is missing ---

Description:
A public const is missing or renamed
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/pub_module_level_const_missing.ron

Failed in:
  EVENT_FALLBACK in file /home/runner/work/oxidizer/oxidizer/target/semver-checks/git-origin_main/f41130f5e3579781a04ee224211f1f2188d65394/crates/cachet/src/telemetry/attributes.rs:59

If the breaking changes are intentional then everything is fine - this message is merely informative.

Remember to apply a version number bump with the correct severity when publishing a version with breaking changes (1.x.x -> 2.x.x or 0.1.x -> 0.2.x).

@codecov

codecov Bot commented May 28, 2026

Codecov Report

❌ Patch coverage is 98.85642% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 99.9%. Comparing base (b15b152) to head (b02b890).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| crates/cachet/src/telemetry/cache.rs | 98.0% | 9 Missing ⚠️ |

Additional details and impacted files
@@           Coverage Diff            @@
##             main    #460     +/-   ##
========================================
- Coverage   100.0%   99.9%   -0.1%     
========================================
  Files         307     306      -1     
  Lines       23903   24432    +529     
========================================
+ Hits        23903   24423    +520     
- Misses          0       9      +9     
