Skip to content

fix: make OOM kill count cumulative in container metrics timeseries#377

Open
debot-macmini1 wants to merge 2 commits into
mainfrom
fix/oom-timeseries-cumulative
Open

fix: make OOM kill count cumulative in container metrics timeseries#377
debot-macmini1 wants to merge 2 commits into
mainfrom
fix/oom-timeseries-cumulative

Conversation

@debot-macmini1

Copy link
Copy Markdown
Contributor

Problem

OOM events are visible via the new container_oom_event resource path, but the "normal" container metrics timeseries (MPA stream ContainerMetricItem) does not reliably reflect OOMs.

Root cause in current implementation:

  • SubscriptionManager.Broadcast sets oom_kill_count to 0/1 based only on the current sample's LastTerminationReason.
  • The stream send is non-blocking and drops metrics when the per-client channel is full.
  • OOM is a rare, single-sample signal, so a single drop can cause the timeseries to miss it.
  • Proto comment expects oom_kill_count to be cumulative if available.

Fix

Make oom_kill_count a cumulative, sticky count per (namespace/pod/container) inside SubscriptionManager:

  • Track {count, lastRestart} in-memory.
  • Increment when we observe LastTerminationReason == OOMKilled AND RestartCount advances beyond the last processed restartCount.
  • Always emit the cumulative oom_kill_count on every subsequent utilization sample.

This means even if the OOM-bearing sample is dropped due to backpressure, the counter persists and later samples still reflect the OOM.

Tests (negative/robustness)

Added unit tests in internal/server/mpa_server_test.go:

  • Verifies oom_kill_count is cumulative and sticky across non-OOM samples.
  • Verifies no double-count when the same restartCount is observed repeatedly.
  • Negative test: fills the client channel to force a dropped OOM sample, then confirms the next sample still carries oom_kill_count=1.

Notes

  • go test ./... runs e2e (test/e2e) which can hang locally; unit tests can be run with:
    • go test ./internal/server -run TestSubscriptionManager -count=1

@gitar-bot

gitar-bot Bot commented May 7, 2026

Copy link
Copy Markdown

Important

Your team uses Gitar, but you don't have an assigned seat yet. Ask a team admin to add your seat so Gitar can review your code. Learn more

Comment on lines +187 to +188
oomMu sync.Mutex
oomByKey map[string]oomState

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Performance: oomByKey map grows unbounded (memory leak)

getAndUpdateOomCount inserts an entry into sm.oomByKey for every unique (namespace/pod/container) it ever observes, but nothing ever removes entries. The key includes the pod name, which is ephemeral — every Deployment rollout, CronJob run, or pod reschedule creates a brand-new pod name and therefore a new permanent map entry. Unregister only cleans up the clients map (mpa_server.go:259-266) and there is no pod-deletion hook into the streaming layer. In a long-running server within a high-churn cluster this map grows without bound, causing steadily increasing memory usage. Consider tracking a lastSeen time.Time per entry and periodically evicting stale keys (e.g., via a background ticker), or pruning keys when the corresponding pod is no longer reported by the collector.

Track last-seen time per entry and evict stale keys with a periodic pruner.:

type oomState struct {
	count       int64
	lastRestart int64
	lastSeen    time.Time
}

// in getAndUpdateOomCount, after computing st:
st.lastSeen = time.Now()
sm.oomByKey[key] = st

// add a periodic pruner (started in NewSubscriptionManager):
func (sm *SubscriptionManager) pruneOomState(ttl time.Duration) {
	sm.oomMu.Lock()
	defer sm.oomMu.Unlock()
	cutoff := time.Now().Add(-ttl)
	for k, st := range sm.oomByKey {
		if st.lastSeen.Before(cutoff) {
			delete(sm.oomByKey, k)
		}
	}
}
  • Apply fix

Check the box to apply the fix or reply for a change | Was this helpful? React with 👍 / 👎

Comment on lines +233 to 241
if lastReason == collector.ReasonOOMKilled {
if restartCount > st.lastRestart {
st.count++
st.lastRestart = restartCount
}
} else if restartCount > st.lastRestart {
// Keep lastRestart moving forward even when not OOMKilled.
st.lastRestart = restartCount
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Edge Case: OOM count may under-count when restartCount jumps >1

The increment rule in getAndUpdateOomCount only adds 1 per call when OOMKilled is observed and restartCount > lastRestart, regardless of how far restartCount advanced. If the OOM-bearing samples for several restarts are all dropped/missed and the next observed OOM sample shows restartCount advanced by more than one (e.g. from 1 to 3), only a single increment is recorded even though multiple OOM kills occurred. Given the stated goal is a cumulative count that survives dropped samples, this is an approximation worth documenting; if exactness matters, increment by restartCount - st.lastRestart when OOMKilled. Note this would over-count if intervening restarts were non-OOM, so the trade-off should be intentional.

Was this helpful? React with 👍 / 👎

@gitar-bot

gitar-bot Bot commented Jun 4, 2026

Copy link
Copy Markdown
Code Review ⚠️ Changes requested 0 resolved / 2 findings

Updates OOM kill count metrics to be cumulative and sticky to prevent data loss during network drops, but the implementation introduces an unbounded memory leak in the tracking map and potentially under-counts when restart counts increment significantly.

⚠️ Performance: oomByKey map grows unbounded (memory leak)

📄 internal/server/mpa_server.go:187-188 📄 internal/server/mpa_server.go:225-239 📄 internal/server/mpa_server.go:259-266

getAndUpdateOomCount inserts an entry into sm.oomByKey for every unique (namespace/pod/container) it ever observes, but nothing ever removes entries. The key includes the pod name, which is ephemeral — every Deployment rollout, CronJob run, or pod reschedule creates a brand-new pod name and therefore a new permanent map entry. Unregister only cleans up the clients map (mpa_server.go:259-266) and there is no pod-deletion hook into the streaming layer. In a long-running server within a high-churn cluster this map grows without bound, causing steadily increasing memory usage. Consider tracking a lastSeen time.Time per entry and periodically evicting stale keys (e.g., via a background ticker), or pruning keys when the corresponding pod is no longer reported by the collector.

Track last-seen time per entry and evict stale keys with a periodic pruner.
type oomState struct {
	count       int64
	lastRestart int64
	lastSeen    time.Time
}

// in getAndUpdateOomCount, after computing st:
st.lastSeen = time.Now()
sm.oomByKey[key] = st

// add a periodic pruner (started in NewSubscriptionManager):
func (sm *SubscriptionManager) pruneOomState(ttl time.Duration) {
	sm.oomMu.Lock()
	defer sm.oomMu.Unlock()
	cutoff := time.Now().Add(-ttl)
	for k, st := range sm.oomByKey {
		if st.lastSeen.Before(cutoff) {
			delete(sm.oomByKey, k)
		}
	}
}
💡 Edge Case: OOM count may under-count when restartCount jumps >1

📄 internal/server/mpa_server.go:233-241

The increment rule in getAndUpdateOomCount only adds 1 per call when OOMKilled is observed and restartCount > lastRestart, regardless of how far restartCount advanced. If the OOM-bearing samples for several restarts are all dropped/missed and the next observed OOM sample shows restartCount advanced by more than one (e.g. from 1 to 3), only a single increment is recorded even though multiple OOM kills occurred. Given the stated goal is a cumulative count that survives dropped samples, this is an approximation worth documenting; if exactness matters, increment by restartCount - st.lastRestart when OOMKilled. Note this would over-count if intervening restarts were non-OOM, so the trade-off should be intentional.

🤖 Prompt for agents
Code Review: Updates OOM kill count metrics to be cumulative and sticky to prevent data loss during network drops, but the implementation introduces an unbounded memory leak in the tracking map and potentially under-counts when restart counts increment significantly.

1. ⚠️ Performance: oomByKey map grows unbounded (memory leak)
   Files: internal/server/mpa_server.go:187-188, internal/server/mpa_server.go:225-239, internal/server/mpa_server.go:259-266

   `getAndUpdateOomCount` inserts an entry into `sm.oomByKey` for every unique `(namespace/pod/container)` it ever observes, but nothing ever removes entries. The key includes the pod name, which is ephemeral — every Deployment rollout, CronJob run, or pod reschedule creates a brand-new pod name and therefore a new permanent map entry. `Unregister` only cleans up the `clients` map (mpa_server.go:259-266) and there is no pod-deletion hook into the streaming layer. In a long-running server within a high-churn cluster this map grows without bound, causing steadily increasing memory usage. Consider tracking a `lastSeen time.Time` per entry and periodically evicting stale keys (e.g., via a background ticker), or pruning keys when the corresponding pod is no longer reported by the collector.

   Fix (Track last-seen time per entry and evict stale keys with a periodic pruner.):
   type oomState struct {
   	count       int64
   	lastRestart int64
   	lastSeen    time.Time
   }
   
   // in getAndUpdateOomCount, after computing st:
   st.lastSeen = time.Now()
   sm.oomByKey[key] = st
   
   // add a periodic pruner (started in NewSubscriptionManager):
   func (sm *SubscriptionManager) pruneOomState(ttl time.Duration) {
   	sm.oomMu.Lock()
   	defer sm.oomMu.Unlock()
   	cutoff := time.Now().Add(-ttl)
   	for k, st := range sm.oomByKey {
   		if st.lastSeen.Before(cutoff) {
   			delete(sm.oomByKey, k)
   		}
   	}
   }

2. 💡 Edge Case: OOM count may under-count when restartCount jumps >1
   Files: internal/server/mpa_server.go:233-241

   The increment rule in `getAndUpdateOomCount` only adds 1 per call when `OOMKilled` is observed and `restartCount > lastRestart`, regardless of how far restartCount advanced. If the OOM-bearing samples for several restarts are all dropped/missed and the next observed OOM sample shows restartCount advanced by more than one (e.g. from 1 to 3), only a single increment is recorded even though multiple OOM kills occurred. Given the stated goal is a cumulative count that survives dropped samples, this is an approximation worth documenting; if exactness matters, increment by `restartCount - st.lastRestart` when OOMKilled. Note this would over-count if intervening restarts were non-OOM, so the trade-off should be intentional.

Was this helpful? React with 👍 / 👎 | Gitar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants