fix: make OOM kill count cumulative in container metrics timeseries by debot-macmini1 · Pull Request #377 · devzero-inc/zxporter

debot-macmini1 · 2026-05-07T07:43:51Z

Problem

OOM events are visible via the new container_oom_event resource path, but the "normal" container metrics timeseries (MPA stream ContainerMetricItem) does not reliably reflect OOMs.

Root cause in current implementation:

SubscriptionManager.Broadcast sets oom_kill_count to 0/1 based only on the current sample's LastTerminationReason.
The stream send is non-blocking and drops metrics when the per-client channel is full.
OOM is a rare, single-sample signal, so a single drop can cause the timeseries to miss it.
Proto comment expects oom_kill_count to be cumulative if available.

Fix

Make oom_kill_count a cumulative, sticky count per (namespace/pod/container) inside SubscriptionManager:

Track {count, lastRestart} in-memory.
Increment when we observe LastTerminationReason == OOMKilled AND RestartCount advances beyond the last processed restartCount.
Always emit the cumulative oom_kill_count on every subsequent utilization sample.

This means even if the OOM-bearing sample is dropped due to backpressure, the counter persists and later samples still reflect the OOM.

Tests (negative/robustness)

Added unit tests in internal/server/mpa_server_test.go:

Verifies oom_kill_count is cumulative and sticky across non-OOM samples.
Verifies no double-count when the same restartCount is observed repeatedly.
Negative test: fills the client channel to force a dropped OOM sample, then confirms the next sample still carries oom_kill_count=1.

Notes

go test ./... runs e2e (test/e2e) which can hang locally; unit tests can be run with:
- go test ./internal/server -run TestSubscriptionManager -count=1

gitar-bot · 2026-05-07T07:43:56Z

Important

Your team uses Gitar, but you don't have an assigned seat yet. Ask a team admin to add your seat so Gitar can review your code. Learn more

gitar-bot · 2026-06-04T15:15:37Z

+	oomMu    sync.Mutex
+	oomByKey map[string]oomState


⚠️ Performance: oomByKey map grows unbounded (memory leak)

getAndUpdateOomCount inserts an entry into sm.oomByKey for every unique (namespace/pod/container) it ever observes, but nothing ever removes entries. The key includes the pod name, which is ephemeral — every Deployment rollout, CronJob run, or pod reschedule creates a brand-new pod name and therefore a new permanent map entry. Unregister only cleans up the clients map (mpa_server.go:259-266) and there is no pod-deletion hook into the streaming layer. In a long-running server within a high-churn cluster this map grows without bound, causing steadily increasing memory usage. Consider tracking a lastSeen time.Time per entry and periodically evicting stale keys (e.g., via a background ticker), or pruning keys when the corresponding pod is no longer reported by the collector.

Track last-seen time per entry and evict stale keys with a periodic pruner.:

type oomState struct { count int64 lastRestart int64 lastSeen time.Time } // in getAndUpdateOomCount, after computing st: st.lastSeen = time.Now() sm.oomByKey[key] = st // add a periodic pruner (started in NewSubscriptionManager): func (sm *SubscriptionManager) pruneOomState(ttl time.Duration) { sm.oomMu.Lock() defer sm.oomMu.Unlock() cutoff := time.Now().Add(-ttl) for k, st := range sm.oomByKey { if st.lastSeen.Before(cutoff) { delete(sm.oomByKey, k) } } }

Apply fix

_{Check the box to apply the fix or reply for a change | Was this helpful? React with 👍 / 👎}

gitar-bot · 2026-06-04T15:15:38Z

+	if lastReason == collector.ReasonOOMKilled {
+		if restartCount > st.lastRestart {
+			st.count++
+			st.lastRestart = restartCount
+		}
+	} else if restartCount > st.lastRestart {
+		// Keep lastRestart moving forward even when not OOMKilled.
+		st.lastRestart = restartCount
 	}


💡 Edge Case: OOM count may under-count when restartCount jumps >1

The increment rule in getAndUpdateOomCount only adds 1 per call when OOMKilled is observed and restartCount > lastRestart, regardless of how far restartCount advanced. If the OOM-bearing samples for several restarts are all dropped/missed and the next observed OOM sample shows restartCount advanced by more than one (e.g. from 1 to 3), only a single increment is recorded even though multiple OOM kills occurred. Given the stated goal is a cumulative count that survives dropped samples, this is an approximation worth documenting; if exactness matters, increment by restartCount - st.lastRestart when OOMKilled. Note this would over-count if intervening restarts were non-OOM, so the trade-off should be intentional.

_{Was this helpful? React with 👍 / 👎}

gitar-bot · 2026-06-04T15:15:39Z

Code Review ⚠️ Changes requested 0 resolved / 2 findings

Updates OOM kill count metrics to be cumulative and sticky to prevent data loss during network drops, but the implementation introduces an unbounded memory leak in the tracking map and potentially under-counts when restart counts increment significantly.

⚠️

Performance: oomByKey map grows unbounded (memory leak)

📄 internal/server/mpa_server.go:187-188 📄 internal/server/mpa_server.go:225-239 📄 internal/server/mpa_server.go:259-266

getAndUpdateOomCount inserts an entry into sm.oomByKey for every unique (namespace/pod/container) it ever observes, but nothing ever removes entries. The key includes the pod name, which is ephemeral — every Deployment rollout, CronJob run, or pod reschedule creates a brand-new pod name and therefore a new permanent map entry. Unregister only cleans up the clients map (mpa_server.go:259-266) and there is no pod-deletion hook into the streaming layer. In a long-running server within a high-churn cluster this map grows without bound, causing steadily increasing memory usage. Consider tracking a lastSeen time.Time per entry and periodically evicting stale keys (e.g., via a background ticker), or pruning keys when the corresponding pod is no longer reported by the collector.

Track last-seen time per entry and evict stale keys with a periodic pruner.

type oomState struct {
	count       int64
	lastRestart int64
	lastSeen    time.Time
}

// in getAndUpdateOomCount, after computing st:
st.lastSeen = time.Now()
sm.oomByKey[key] = st

// add a periodic pruner (started in NewSubscriptionManager):
func (sm *SubscriptionManager) pruneOomState(ttl time.Duration) {
	sm.oomMu.Lock()
	defer sm.oomMu.Unlock()
	cutoff := time.Now().Add(-ttl)
	for k, st := range sm.oomByKey {
		if st.lastSeen.Before(cutoff) {
			delete(sm.oomByKey, k)
		}
	}
}

💡 Edge Case: OOM count may under-count when restartCount jumps >1

📄 internal/server/mpa_server.go:233-241

The increment rule in getAndUpdateOomCount only adds 1 per call when OOMKilled is observed and restartCount > lastRestart, regardless of how far restartCount advanced. If the OOM-bearing samples for several restarts are all dropped/missed and the next observed OOM sample shows restartCount advanced by more than one (e.g. from 1 to 3), only a single increment is recorded even though multiple OOM kills occurred. Given the stated goal is a cumulative count that survives dropped samples, this is an approximation worth documenting; if exactness matters, increment by restartCount - st.lastRestart when OOMKilled. Note this would over-count if intervening restarts were non-OOM, so the trade-off should be intentional.

🤖 Prompt for agents

Code Review: Updates OOM kill count metrics to be cumulative and sticky to prevent data loss during network drops, but the implementation introduces an unbounded memory leak in the tracking map and potentially under-counts when restart counts increment significantly.

1. ⚠️ Performance: oomByKey map grows unbounded (memory leak)
   Files: internal/server/mpa_server.go:187-188, internal/server/mpa_server.go:225-239, internal/server/mpa_server.go:259-266

   `getAndUpdateOomCount` inserts an entry into `sm.oomByKey` for every unique `(namespace/pod/container)` it ever observes, but nothing ever removes entries. The key includes the pod name, which is ephemeral — every Deployment rollout, CronJob run, or pod reschedule creates a brand-new pod name and therefore a new permanent map entry. `Unregister` only cleans up the `clients` map (mpa_server.go:259-266) and there is no pod-deletion hook into the streaming layer. In a long-running server within a high-churn cluster this map grows without bound, causing steadily increasing memory usage. Consider tracking a `lastSeen time.Time` per entry and periodically evicting stale keys (e.g., via a background ticker), or pruning keys when the corresponding pod is no longer reported by the collector.

   Fix (Track last-seen time per entry and evict stale keys with a periodic pruner.):
   type oomState struct {
   	count       int64
   	lastRestart int64
   	lastSeen    time.Time
   }
   
   // in getAndUpdateOomCount, after computing st:
   st.lastSeen = time.Now()
   sm.oomByKey[key] = st
   
   // add a periodic pruner (started in NewSubscriptionManager):
   func (sm *SubscriptionManager) pruneOomState(ttl time.Duration) {
   	sm.oomMu.Lock()
   	defer sm.oomMu.Unlock()
   	cutoff := time.Now().Add(-ttl)
   	for k, st := range sm.oomByKey {
   		if st.lastSeen.Before(cutoff) {
   			delete(sm.oomByKey, k)
   		}
   	}
   }

2. 💡 Edge Case: OOM count may under-count when restartCount jumps >1
   Files: internal/server/mpa_server.go:233-241

   The increment rule in `getAndUpdateOomCount` only adds 1 per call when `OOMKilled` is observed and `restartCount > lastRestart`, regardless of how far restartCount advanced. If the OOM-bearing samples for several restarts are all dropped/missed and the next observed OOM sample shows restartCount advanced by more than one (e.g. from 1 to 3), only a single increment is recorded even though multiple OOM kills occurred. Given the stated goal is a cumulative count that survives dropped samples, this is an approximation worth documenting; if exactness matters, increment by `restartCount - st.lastRestart` when OOMKilled. Note this would over-count if intervening restarts were non-OOM, so the trade-off should be intentional.

_{Was this helpful? React with 👍 / 👎 | Gitar}

fix: make oom_kill_count cumulative in MPA timeseries

9b72ad1

Merge branch 'main' into fix/oom-timeseries-cumulative

6ba7126

gitar-bot Bot reviewed Jun 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: make OOM kill count cumulative in container metrics timeseries#377

fix: make OOM kill count cumulative in container metrics timeseries#377
debot-macmini1 wants to merge 2 commits into
mainfrom
fix/oom-timeseries-cumulative

debot-macmini1 commented May 7, 2026

Uh oh!

gitar-bot Bot commented May 7, 2026

Uh oh!

gitar-bot Bot Jun 4, 2026

Uh oh!

gitar-bot Bot Jun 4, 2026

Uh oh!

gitar-bot Bot commented Jun 4, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

debot-macmini1 commented May 7, 2026

Problem

Fix

Tests (negative/robustness)

Notes

Uh oh!

gitar-bot Bot commented May 7, 2026

Uh oh!

gitar-bot Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

gitar-bot Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

gitar-bot Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

gitar-bot Bot commented Jun 4, 2026 •

edited

Loading