
document k8s-cache #1886

Open
NimrodAvni78 wants to merge 2 commits into open-telemetry:main from coralogix:nimrodavni78/document-k8s-cache

Conversation

Contributor

@NimrodAvni78 NimrodAvni78 commented Apr 21, 2026

Summary

Part of #1330
This is internal documentation for developers; higher-level documentation for opentelemetry.io will follow shortly.

Validation

@NimrodAvni78 NimrodAvni78 requested a review from a team as a code owner April 21, 2026 10:47

codecov Bot commented Apr 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.33%. Comparing base (28d8e60) to head (b29d247).
⚠️ Report is 8 commits behind head on main.

❗ There is a different number of reports uploaded between BASE (28d8e60) and HEAD (b29d247).

HEAD has 43 fewer uploads than BASE:

| Flag | BASE (28d8e60) | HEAD (b29d247) |
|-------------------------------------|----|---|
| oats-test | 7 | 0 |
| k8s-integration-test | 15 | 0 |
| integration-test-arm | 4 | 0 |
| integration-test-vm-x86_64-5.15.152 | 3 | 0 |
| integration-test-vm-x86_64-6.10.6 | 4 | 0 |
| integration-test | 10 | 0 |
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1886       +/-   ##
===========================================
- Coverage   69.51%   58.33%   -11.18%     
===========================================
  Files         277      277               
  Lines       33491    34230      +739     
===========================================
- Hits        23280    19969     -3311     
- Misses       8972    13233     +4261     
+ Partials     1239     1028      -211     
| Flag | Coverage Δ |
|-------------------------------------|---|
| integration-test | ? |
| integration-test-arm | ? |
| integration-test-vm-x86_64-5.15.152 | ? |
| integration-test-vm-x86_64-6.10.6 | ? |
| k8s-integration-test | ? |
| oats-test | ? |
| unittests | 58.33% `<ø>` (-0.05%) ⬇️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@MrAlias MrAlias added the documentation Improvements or additions to documentation label Apr 21, 2026
Contributor

@MrAlias MrAlias left a comment


Thanks for putting this together. This is useful internal documentation, especially the overview of the service, the code pointers, and the deployment guidance. I left a few comments where the doc appears to describe behavior or requirements more strongly than the current implementation supports.

Comment thread: devdocs/k8s-cache.md (Outdated)

Clients send a `FromTimestampEpoch` on `Subscribe`. On reconnect, OBI sends the
timestamp of the last event it successfully processed so the cache can skip
anything older and avoid a full snapshot replay.
Contributor


This reads like reconnects can replay a true delta of what was missed, but the current implementation is narrower than that. FromTimestampEpoch is only used to filter the current in-memory snapshot in meta.Informers.sortAndCut; there is no persisted event log. That means a client can still miss deletes, and anything that disappeared before reconnect will not be replayed. I think this section should be softened so it does not over-promise recovery behavior.

Contributor Author


fixed

Comment thread: devdocs/k8s-cache.md (Outdated)
|--------------------------|--------------------------------------------------------|----------------|------------------------------------------------------------|
| `log_level` | `OTEL_EBPF_K8S_CACHE_LOG_LEVEL` | `info` | `debug`/`info`/`warn`/`error`. |
| `port` | `OTEL_EBPF_K8S_CACHE_PORT` | `50055` | gRPC listen port. |
| `max_connections` | `OTEL_EBPF_K8S_CACHE_MAX_CONNECTIONS` | `150` | Max concurrent subscribing OBI clients. |
Contributor


This wording sounds like max_connections is a total cap on subscribing OBI clients, but the server currently wires it into grpc.MaxConcurrentStreams, which limits streams per HTTP/2 transport rather than acting as a global client limit. Since each OBI instance creates its own gRPC connection in cache_svc_client.connect, I think this should be clarified.

Comment thread: devdocs/k8s-cache.md (Outdated)
| `max_connections` | `OTEL_EBPF_K8S_CACHE_MAX_CONNECTIONS` | `150` | Max concurrent subscribing OBI clients. |
| `profile_port` | `OTEL_EBPF_K8S_CACHE_PROFILE_PORT` | `0` (disabled) | If non-zero, starts a `net/http/pprof` listener. |
| `informer_resync_period` | `OTEL_EBPF_K8S_CACHE_INFORMER_RESYNC_PERIOD` | `30m` | Full informer resync interval. Increase to lower API load. |
| `informer_send_timeout` | `OTEL_EBPF_K8S_CACHE_INFORMER_SEND_TIMEOUT` | `10s` | Drops a subscriber that does not drain an event in time. |
Contributor


I think this is documenting behavior that is not implemented yet. pkg/kube/kubecache/service/service.go stores sendTimeout on the connection, but handleMessagesQueue never uses it and never calls MessageTimeout(). As written, readers will expect slow subscribers to be dropped after a per-message deadline, and the metrics section later suggests the same. This should either be removed for now or rewritten to match the current behavior.

Contributor Author

@NimrodAvni78 NimrodAvni78 Apr 23, 2026


Yeah, you are right; will remove this comment. I opened a separate issue so this can be fixed separately.

Comment thread: devdocs/k8s-cache.md (Outdated)
```yaml
rules:
- apiGroups: [ "apps" ]
resources: [ "replicasets" ]
Contributor


The minimum-RBAC section currently includes replicasets, but pkg/kube/kubecache/meta.InitInformers only creates Pod, Node, and Service informers. Since this section is framed as the minimum required permissions, I think it should avoid granting access that the service does not currently use.
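Based on the reviewer's note that `pkg/kube/kubecache/meta.InitInformers` only creates Pod, Node, and Service informers, a tightened minimum-RBAC fragment might look like the following. The verbs are assumed here; verify them against the actual deployment manifests:

```yaml
rules:
- apiGroups: [ "" ]            # Pods, Nodes, and Services are all in the core API group
  resources: [ "pods", "nodes", "services" ]
  verbs: [ "list", "watch" ]   # assumed: the verbs informers need
```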

@NimrodAvni78 NimrodAvni78 requested a review from MrAlias April 23, 2026 07:47
Comment thread: devdocs/k8s-cache.md
the event schema must stay backwards-compatible with already-deployed OBI
instances that connect to a newer cache (and vice versa).

## How to deploy
Contributor


I'd add a first paragraph saying something like:

If you are using our OBI Helm chart, you just have to provide a non-zero value for the
k8sCache > replicas configuration option in values.yaml.
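Assuming the chart exposes the option under those keys (the key names below are taken from the reviewer's wording and are unverified against the chart), the suggested values.yaml fragment would be:

```yaml
# values.yaml (hypothetical keys; verify against the OBI Helm chart)
k8sCache:
  replicas: 2   # any non-zero value enables the k8s-cache deployment
```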


Labels

documentation Improvements or additions to documentation



3 participants