Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1886       +/-   ##
===========================================
- Coverage   69.51%   58.33%   -11.18%
===========================================
  Files         277      277
  Lines       33491    34230     +739
===========================================
- Hits        23280    19969    -3311
- Misses       8972    13233    +4261
+ Partials     1239     1028     -211
```

Flags with carried forward coverage won't be shown.
**MrAlias** left a comment:
Thanks for putting this together. This is useful internal documentation, especially the overview of the service, the code pointers, and the deployment guidance. I left a few comments where the doc appears to describe behavior or requirements more strongly than the current implementation supports.
> Clients send a `FromTimestampEpoch` on `Subscribe`. On reconnect, OBI sends the
> timestamp of the last event it successfully processed so the cache can skip
> anything older and avoid a full snapshot replay.
This reads like reconnects can replay a true delta of what was missed, but the current implementation is narrower than that. FromTimestampEpoch is only used to filter the current in-memory snapshot in meta.Informers.sortAndCut; there is no persisted event log. That means a client can still miss deletes, and anything that disappeared before reconnect will not be replayed. I think this section should be softened so it does not over-promise recovery behavior.
> | Option | Environment variable | Default | Description |
> |--------|----------------------|---------|-------------|
> | `log_level` | `OTEL_EBPF_K8S_CACHE_LOG_LEVEL` | `info` | `debug`/`info`/`warn`/`error`. |
> | `port` | `OTEL_EBPF_K8S_CACHE_PORT` | `50055` | gRPC listen port. |
> | `max_connections` | `OTEL_EBPF_K8S_CACHE_MAX_CONNECTIONS` | `150` | Max concurrent subscribing OBI clients. |
This wording sounds like max_connections is a total cap on subscribing OBI clients, but the server currently wires it into grpc.MaxConcurrentStreams, which limits streams per HTTP/2 transport rather than acting as a global client limit. Since each OBI instance creates its own gRPC connection in cache_svc_client.connect, I think this should be clarified.
> | `max_connections` | `OTEL_EBPF_K8S_CACHE_MAX_CONNECTIONS` | `150` | Max concurrent subscribing OBI clients. |
> | `profile_port` | `OTEL_EBPF_K8S_CACHE_PROFILE_PORT` | `0` (disabled) | If non-zero, starts a `net/http/pprof` listener. |
> | `informer_resync_period` | `OTEL_EBPF_K8S_CACHE_INFORMER_RESYNC_PERIOD` | `30m` | Full informer resync interval. Increase to lower API load. |
> | `informer_send_timeout` | `OTEL_EBPF_K8S_CACHE_INFORMER_SEND_TIMEOUT` | `10s` | Drops a subscriber that does not drain an event in time. |
I think this is documenting behavior that is not implemented yet. pkg/kube/kubecache/service/service.go stores sendTimeout on the connection, but handleMessagesQueue never uses it and never calls MessageTimeout(). As written, readers will expect slow subscribers to be dropped after a per-message deadline, and the metrics section later suggests the same. This should either be removed for now or rewritten to match the current behavior.
Yeah, you are right. I will remove this sentence from the doc; I have opened a separate issue so the behavior can be fixed separately.
> ```yaml
> rules:
>   - apiGroups: [ "apps" ]
>     resources: [ "replicasets" ]
> ```
The minimum-RBAC section currently includes replicasets, but pkg/kube/kubecache/meta.InitInformers only creates Pod, Node, and Service informers. Since this section is framed as the minimum required permissions, I think it should avoid granting access that the service does not currently use.
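For concreteness, a hedged sketch of what the minimum rules could look like if trimmed to the informers actually created (the resource list follows the comment above; the verbs are an assumption, not confirmed by the source):

```yaml
# Hypothetical minimal RBAC matching meta.InitInformers
# (Pod, Node, and Service informers only; no replicasets).
# The verbs below are assumed; verify against the informer requirements.
rules:
  - apiGroups: [ "" ]   # core API group covers pods, nodes, services
    resources: [ "pods", "nodes", "services" ]
    verbs: [ "list", "watch" ]
```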
> the event schema must stay backwards-compatible with already-deployed OBI
> instances that connect to a newer cache (and vice versa).

> ## How to deploy
I'd add a first paragraph saying something like:

> If you are using our OBI Helm chart, you just have to provide a non-zero value for the `k8sCache` > `replicas` configuration option in `values.yaml`.
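To make the suggestion concrete, a sketch of the corresponding `values.yaml` fragment (the key name is taken from the comment above; the exact layout depends on the chart and is an assumption here):

```yaml
# Hypothetical values.yaml fragment enabling the k8s cache deployment;
# verify the exact key layout against the OBI Helm chart.
k8sCache:
  replicas: 2   # any non-zero value enables the cache service
```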
Summary
Part of #1330
This is mostly internal documentation for developers; more high-level documentation for opentelemetry.io will follow shortly.
Validation