fix: make OOM kill count cumulative in container metrics timeseries #377
debot-macmini1 wants to merge 2 commits into
oomMu    sync.Mutex
oomByKey map[string]oomState
⚠️ Performance: oomByKey map grows unbounded (memory leak)
getAndUpdateOomCount inserts an entry into sm.oomByKey for every unique (namespace/pod/container) it ever observes, but nothing ever removes entries. The key includes the pod name, which is ephemeral — every Deployment rollout, CronJob run, or pod reschedule creates a brand-new pod name and therefore a new permanent map entry. Unregister only cleans up the clients map (mpa_server.go:259-266) and there is no pod-deletion hook into the streaming layer. In a long-running server within a high-churn cluster this map grows without bound, causing steadily increasing memory usage. Consider tracking a lastSeen time.Time per entry and periodically evicting stale keys (e.g., via a background ticker), or pruning keys when the corresponding pod is no longer reported by the collector.
Suggested fix: track a last-seen time per entry and evict stale keys with a periodic pruner:
type oomState struct {
	count       int64
	lastRestart int64
	lastSeen    time.Time
}

// in getAndUpdateOomCount, after computing st:
st.lastSeen = time.Now()
sm.oomByKey[key] = st

// add a periodic pruner (started in NewSubscriptionManager):
func (sm *SubscriptionManager) pruneOomState(ttl time.Duration) {
	sm.oomMu.Lock()
	defer sm.oomMu.Unlock()
	cutoff := time.Now().Add(-ttl)
	for k, st := range sm.oomByKey {
		if st.lastSeen.Before(cutoff) {
			delete(sm.oomByKey, k)
		}
	}
}
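One way the suggested pruner might be driven is sketched below: a background goroutine on a time.Ticker that exits when its context is cancelled. The runOomPruner and demoPrune names, the interval and TTL values, and passing the current time in as a parameter (so the eviction logic can be exercised without sleeping) are assumptions for illustration and are not taken from the PR.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

type oomState struct {
	count       int64
	lastRestart int64
	lastSeen    time.Time
}

type SubscriptionManager struct {
	oomMu    sync.Mutex
	oomByKey map[string]oomState
}

// pruneOomState evicts entries last seen before now-ttl and reports how many
// were removed. Taking now as a parameter is an assumption made for testability.
func (sm *SubscriptionManager) pruneOomState(now time.Time, ttl time.Duration) int {
	sm.oomMu.Lock()
	defer sm.oomMu.Unlock()
	cutoff := now.Add(-ttl)
	evicted := 0
	for k, st := range sm.oomByKey {
		if st.lastSeen.Before(cutoff) {
			delete(sm.oomByKey, k) // deleting while ranging over a map is safe in Go
			evicted++
		}
	}
	return evicted
}

// runOomPruner (hypothetical name) prunes on every tick until ctx is cancelled.
func (sm *SubscriptionManager) runOomPruner(ctx context.Context, interval, ttl time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case now := <-ticker.C:
			sm.pruneOomState(now, ttl)
		}
	}
}

// demoPrune builds one stale and one fresh entry and prunes with a 1h TTL.
func demoPrune() (evicted, remaining int) {
	now := time.Now()
	sm := &SubscriptionManager{oomByKey: map[string]oomState{
		"ns/old-pod/app": {count: 2, lastRestart: 3, lastSeen: now.Add(-2 * time.Hour)},
		"ns/new-pod/app": {count: 1, lastRestart: 1, lastSeen: now},
	}}
	evicted = sm.pruneOomState(now, time.Hour)
	return evicted, len(sm.oomByKey)
}

func main() {
	evicted, remaining := demoPrune()
	fmt.Println("evicted:", evicted, "remaining:", remaining)
}
```

If a context is available where the manager is constructed, the loop could be started with something like go sm.runOomPruner(ctx, time.Minute, 10*time.Minute). The TTL should comfortably exceed the longest expected gap between samples for a live container, otherwise an evicted key would restart its count from zero.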
if lastReason == collector.ReasonOOMKilled {
	if restartCount > st.lastRestart {
		st.count++
		st.lastRestart = restartCount
	}
} else if restartCount > st.lastRestart {
	// Keep lastRestart moving forward even when not OOMKilled.
	st.lastRestart = restartCount
}
💡 Edge Case: OOM count may under-count when restartCount jumps >1
The increment rule in getAndUpdateOomCount only adds 1 per call when OOMKilled is observed and restartCount > lastRestart, regardless of how far restartCount advanced. If the OOM-bearing samples for several restarts are all dropped/missed and the next observed OOM sample shows restartCount advanced by more than one (e.g. from 1 to 3), only a single increment is recorded even though multiple OOM kills occurred. Given the stated goal is a cumulative count that survives dropped samples, this is an approximation worth documenting; if exactness matters, increment by restartCount - st.lastRestart when OOMKilled. Note this would over-count if intervening restarts were non-OOM, so the trade-off should be intentional.
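The trade-off can be made concrete with a small sketch that replays the same observed samples through both rules. The sample type, the replay helper, and the zero initial lastRestart are assumptions for illustration; updateByOne is intended to mirror the rule in the diff, and updateByDelta is the alternative described above.

```go
package main

import "fmt"

// sample is a simplified stand-in for one observed container status.
type sample struct {
	restartCount int64
	oomKilled    bool // true when the last termination reason was OOMKilled
}

type oomState struct {
	count       int64
	lastRestart int64
}

// updateByOne adds at most 1 per observed OOM sample, however far
// restartCount has advanced since the last processed sample.
func updateByOne(st oomState, s sample) oomState {
	if s.restartCount > st.lastRestart {
		if s.oomKilled {
			st.count++
		}
		st.lastRestart = s.restartCount
	}
	return st
}

// updateByDelta attributes every restart since the last processed sample
// to OOM whenever the latest termination was an OOM kill.
func updateByDelta(st oomState, s sample) oomState {
	if s.restartCount > st.lastRestart {
		if s.oomKilled {
			st.count += s.restartCount - st.lastRestart
		}
		st.lastRestart = s.restartCount
	}
	return st
}

// replay feeds the samples through update and returns the final count.
func replay(update func(oomState, sample) oomState, samples []sample) int64 {
	var st oomState
	for _, s := range samples {
		st = update(st, s)
	}
	return st.count
}

func main() {
	// Restart 1 (OOM) is observed, restart 2 is never sampled,
	// restart 3 (OOM) is observed.
	observed := []sample{{restartCount: 1, oomKilled: true}, {restartCount: 3, oomKilled: true}}
	fmt.Println("by one:", replay(updateByOne, observed))
	fmt.Println("by delta:", replay(updateByDelta, observed))
}
```

For this input the by-one rule yields 2 and the by-delta rule yields 3. If the unobserved restart 2 was an OOM kill, the true count is 3 and the by-one rule under-counts; if it was a non-OOM restart, the true count is 2 and the by-delta rule over-counts. Neither rule can recover the truth from the observed samples alone.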
Problem
OOM events are visible via the new container_oom_event resource path, but the "normal" container metrics timeseries (MPA stream ContainerMetricItem) does not reliably reflect OOMs.

Root cause in current implementation:
- SubscriptionManager.Broadcast sets oom_kill_count to 0/1 based only on the current sample's LastTerminationReason.
- [...] oom_kill_count to be cumulative if available.

Fix
Make oom_kill_count a cumulative, sticky count per (namespace/pod/container) inside SubscriptionManager:
- Track {count, lastRestart} in-memory.
- Increment count only when LastTerminationReason == OOMKilled AND RestartCount advances beyond the last processed restartCount.
- Emit the current oom_kill_count on every subsequent utilization sample.

This means even if the OOM-bearing sample is dropped due to backpressure, the counter persists and later samples still reflect the OOM.
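The scheme described above can be sketched as a small standalone tracker. The oomTracker type, the getAndUpdate signature, the reasonOOMKilled constant (a stand-in for collector.ReasonOOMKilled), and the zero initial state are assumptions for illustration; the PR's actual getAndUpdateOomCount may differ in shape.

```go
package main

import (
	"fmt"
	"sync"
)

// reasonOOMKilled stands in for collector.ReasonOOMKilled from the PR.
const reasonOOMKilled = "OOMKilled"

type oomState struct {
	count       int64
	lastRestart int64
}

// oomTracker is a hypothetical, minimal version of the per-container state
// that the PR keeps inside SubscriptionManager.
type oomTracker struct {
	mu    sync.Mutex
	byKey map[string]oomState
}

func newOomTracker() *oomTracker {
	return &oomTracker{byKey: make(map[string]oomState)}
}

// getAndUpdate folds one sample into the state for its container and returns
// the cumulative OOM kill count to emit with that sample.
func (t *oomTracker) getAndUpdate(namespace, pod, container, lastReason string, restartCount int64) int64 {
	key := namespace + "/" + pod + "/" + container
	t.mu.Lock()
	defer t.mu.Unlock()
	st := t.byKey[key]
	if restartCount > st.lastRestart {
		if lastReason == reasonOOMKilled {
			st.count++
		}
		// Advance even for non-OOM restarts so each restart is processed once.
		st.lastRestart = restartCount
	}
	t.byKey[key] = st
	return st.count
}

// demoCounts replays five samples for one container and collects the counts.
func demoCounts() []int64 {
	t := newOomTracker()
	return []int64{
		t.getAndUpdate("ns", "pod", "app", "", 0),              // no restarts yet
		t.getAndUpdate("ns", "pod", "app", reasonOOMKilled, 1), // first OOM kill
		t.getAndUpdate("ns", "pod", "app", reasonOOMKilled, 1), // same restart seen again
		t.getAndUpdate("ns", "pod", "app", "Error", 2),         // non-OOM restart, count sticks
		t.getAndUpdate("ns", "pod", "app", reasonOOMKilled, 3), // second OOM kill
	}
}

func main() {
	fmt.Println(demoCounts()) // prints [0 1 1 1 2]
}
```

Because the count lives in the tracker rather than in any single sample, a consumer that misses the sample carrying the OOM still sees the updated value on the next sample it receives.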
Tests (negative/robustness)
Added unit tests in internal/server/mpa_server_test.go:
- oom_kill_count is cumulative and sticky across non-OOM samples.
- [...] oom_kill_count=1.

Notes
go test ./... runs e2e (test/e2e) which can hang locally; unit tests can be run with:
go test ./internal/server -run TestSubscriptionManager -count=1