Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1886       +/-   ##
===========================================
- Coverage   69.51%   58.33%   -11.18%
===========================================
  Files         277      277
  Lines       33491    34230     +739
===========================================
- Hits        23280    19969    -3311
- Misses       8972    13233    +4261
+ Partials     1239     1028     -211
```

Flags with carried forward coverage won't be shown.
**MrAlias** left a comment:
Thanks for putting this together. This is useful internal documentation, especially the overview of the service, the code pointers, and the deployment guidance. I left a few comments where the doc appears to describe behavior or requirements more strongly than the current implementation supports.
> Clients send a `FromTimestampEpoch` on `Subscribe`. On reconnect, OBI sends the
> timestamp of the last event it successfully processed so the cache can skip
> anything older and avoid a full snapshot replay.
This reads like reconnects can replay a true delta of what was missed, but the current implementation is narrower than that. FromTimestampEpoch is only used to filter the current in-memory snapshot in meta.Informers.sortAndCut; there is no persisted event log. That means a client can still miss deletes, and anything that disappeared before reconnect will not be replayed. I think this section should be softened so it does not over-promise recovery behavior.
> | Option | Environment variable | Default | Description |
> |--------|----------------------|---------|-------------|
> | `log_level` | `OTEL_EBPF_K8S_CACHE_LOG_LEVEL` | `info` | `debug`/`info`/`warn`/`error`. |
> | `port` | `OTEL_EBPF_K8S_CACHE_PORT` | `50055` | gRPC listen port. |
> | `max_connections` | `OTEL_EBPF_K8S_CACHE_MAX_CONNECTIONS` | `150` | Max concurrent subscribing OBI clients. |
This wording sounds like max_connections is a total cap on subscribing OBI clients, but the server currently wires it into grpc.MaxConcurrentStreams, which limits streams per HTTP/2 transport rather than acting as a global client limit. Since each OBI instance creates its own gRPC connection in cache_svc_client.connect, I think this should be clarified.
> | `max_connections` | `OTEL_EBPF_K8S_CACHE_MAX_CONNECTIONS` | `150` | Max concurrent subscribing OBI clients. |
> | `profile_port` | `OTEL_EBPF_K8S_CACHE_PROFILE_PORT` | `0` (disabled) | If non-zero, starts a `net/http/pprof` listener. |
> | `informer_resync_period` | `OTEL_EBPF_K8S_CACHE_INFORMER_RESYNC_PERIOD` | `30m` | Full informer resync interval. Increase to lower API load. |
> | `informer_send_timeout` | `OTEL_EBPF_K8S_CACHE_INFORMER_SEND_TIMEOUT` | `10s` | Drops a subscriber that does not drain an event in time. |
I think this is documenting behavior that is not implemented yet. pkg/kube/kubecache/service/service.go stores sendTimeout on the connection, but handleMessagesQueue never uses it and never calls MessageTimeout(). As written, readers will expect slow subscribers to be dropped after a per-message deadline, and the metrics section later suggests the same. This should either be removed for now or rewritten to match the current behavior.
Yeah, you are right. I will remove this sentence from the doc; I have opened a separate issue so the behavior can be fixed separately.
> ```yaml
> rules:
>   - apiGroups: [ "apps" ]
>     resources: [ "replicasets" ]
> ```
The minimum-RBAC section currently includes replicasets, but pkg/kube/kubecache/meta.InitInformers only creates Pod, Node, and Service informers. Since this section is framed as the minimum required permissions, I think it should avoid granting access that the service does not currently use.
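For concreteness, a hedged sketch of what the minimum rules could look like if trimmed to the informers actually created (the resource list follows the comment above; the verbs are an assumption, not confirmed by the source):

```yaml
# Hypothetical minimal RBAC matching meta.InitInformers
# (Pod, Node, and Service informers only; no replicasets).
# The verbs below are assumed; verify against the informer requirements.
rules:
  - apiGroups: [ "" ]   # core API group covers pods, nodes, services
    resources: [ "pods", "nodes", "services" ]
    verbs: [ "list", "watch" ]
```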
> the event schema must stay backwards-compatible with already-deployed OBI
> instances that connect to a newer cache (and vice versa).

> ## How to deploy
I'd add a first paragraph saying something like:

> If you are using our OBI Helm chart, you just have to provide a non-zero value for the `k8sCache` > `replicas` configuration option in `values.yaml`.
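To make the suggestion concrete, a sketch of the corresponding `values.yaml` fragment (the key name is taken from the comment above; the exact layout depends on the chart and is an assumption here):

```yaml
# Hypothetical values.yaml fragment enabling the k8s cache deployment;
# verify the exact key layout against the OBI Helm chart.
k8sCache:
  replicas: 2   # any non-zero value enables the cache service
```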
Summary
Part of #1330
This is mostly internal documentation for developers; more high-level documentation for opentelemetry.io will follow shortly.
Validation