Conversation
Proposes that OpenTelemetry distributions enable only stable components by default, decouple instrumentation stability from semantic convention stability, and establish expanded stability criteria.

Key proposals:
- Stable by default: distributions should only enable stable components
- Decouple instrumentation/semconv stability: let instrumentation stabilize independently when the API surface is stable
- Expanded stability criteria: docs, benchmarks, tested integrations
- Unified component metadata schema extending the Collector's metadata.yaml

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
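For reference, the Collector's existing per-component metadata.yaml carries fields along these lines. The first half approximates the real schema; the extension fields in the second half are illustrative assumptions about what an expanded schema could add, not part of any settled design:

```yaml
# Roughly the shape of the Collector's existing per-component metadata.yaml
type: otlp
status:
  class: receiver
  stability:
    stable: [traces, metrics]
    beta: [logs]

# Hypothetical extensions sketched from the expanded stability criteria
# (docs, benchmarks, tested integrations); these field names are invented.
documentation: https://example.com/docs/otlp-receiver
benchmarks: ./benchmarks/baseline.json
tested_integrations: [kubernetes, prometheus]
```

A unified schema of this kind is what would make stability status machine-discoverable across repositories rather than Collector-only.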
jsuereth left a comment:
I actually think the work detailed in this OTEP is large, sweeping and likely needs to be divided up.
I suggest you keep the goals of the OTEP in this, and set up workstreams / requirements that can be tackled in further OTEPs. I.e. This is bigger than one person or one "design".
E.g.
- Enabling experimental features - A workstream we can ask the configuration SIG to drive.
- Federated Schema and declaring stability of a schema independently of semantic conventions - You can give this to Weaver / Semconv Tooling SIG (@lmolkova has an OTEP already to continue making progress here)
- Distributions / Releasing - A workstream around defining what a distribution is, and gate-keeping its default features to those that are stable
- Profiling - A workstream around profiling overhead and providing features / capabilities / infrastructure to allow Maintainers to set up these tests if they don't have them and meet the requirements listed here.
That's just top of mind, but I think we could refactor this OTEP to call out each workstream and find owners for those.
PTAL @open-telemetry/dotnet-instrumentation-maintainers
reyang left a comment:
Before going into the details, I have the same question as what @cijothomas mentioned here.
"This OTEP proposes that OpenTelemetry distributions enable only stable components by default" - what does this mean? If, say, Company XYZ released their distribution of the OpenTelemetry Java SDK and included an unstable component, what would the OpenTelemetry community do? Do we send attorneys to them?
mx-psi left a comment:
This looks good to me, though I would like to see approvals from the mentioned SIGs before approving myself.
jsuereth left a comment:
Thanks for fixing the wording here!
I forgot to come back and re-review; I was working on specific proposals w/ @lmolkova for the Federated Semconv piece, but this now looks good to me.
I think we'll need to kick off specific projects for areas which don't have SIG owners. Ideally, I think we should get a "single person who feels responsible" for each workstream, but we can sort out that detail later.
> Many instrumentation libraries are stuck on pre-release because they depend on experimental semantic conventions, even when the instrumentation API surface itself is mature and battle-tested. The "batteries not included" philosophy means users must assemble many components before achieving basic functionality. Documentation assumes expertise, and newcomers describe the experience as "overwhelming" with "no discoverability." Auto-instrumentation can add significant resource consumption that only becomes apparent at scale, with reports of "four times the CPU usage" compared to simpler alternatives. Users evaluating OpenTelemetry for production deployment need confidence in CVE response timelines, dependency hygiene, and supply chain security—areas where commitments are not well documented.
> These all stem from the same problem: OpenTelemetry's default configuration prioritizes feature completeness over production readiness. This OTEP establishes the goals and workstreams needed to address this.
> These all stem from the same problem: OpenTelemetry's default configuration prioritizes feature completeness over production readiness.

This seems a bit of an oversimplification given some of the examples above.
To adapt a snowclone I'm sure we all see frequently these days: it's not just stability, it's reliability. And scalability, adaptability, extensibility, configurability, etc. All of the quality attributes we want our software and systems to have.
> - Stability information should be visible and consistent. Users should be able to easily determine the stability status of any component before adopting it, and this information should be presented consistently across all OpenTelemetry projects.
> - Instrumentation should be able to stabilize based on production readiness. The bar for a stable instrumentation library should be whether the instrumentation code itself is production-ready, not whether the semantic conventions it depends on have been finalized. However, once an instrumentation library stabilizes, any breaking change to its telemetry output must be treated as a breaking change requiring a major version bump.
This seems like an unreasonable burden to place on things like auto instrumentation. Consider the example where an http client library is directly instrumented using OpenTelemetry APIs, and it is using the currently stable semantic conventions for http client calls. All auto instrumentation needs to do to enable capturing that telemetry is to subscribe to that telemetry (ActivitySource or Meter in dotnet, for example). The instrumentation version is directly coupled to the version of the http client library and completely outside the control of auto instrumentation.
- Does this mean that there is an expectation that auto instrumentation implementations need to perform proactive testing to detect changes in the telemetry output for new library versions?
- Does auto instrumentation need a new major version whenever we want to support a new major version of 3rd party library that is natively instrumented?
- Will library authors consistently do a major version bump if the telemetry signal changes?
- Do we need something in this proposal specifically for auto instrumentation to call out how default instrumentations need to be managed?
My takeaway from this: we should embrace major version numbers. Individual instrumentation libraries should have their own major semantic version numbers, and users should have a choice for the sake of stability.
My concern is for the case where no "instrumentation library" is required. The instrumentation itself is part of the actual library used by the end user. We have no control or influence over either the versioning of that library or which version of it the end user chooses to use.
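The distinction being debated here can be sketched in a few lines. This is a toy model with all names invented for illustration: when a library ships its own native instrumentation, an auto-instrumentation layer merely subscribes to events whose shape the library controls, so the subscriber cannot guarantee that shape across library versions.

```python
class HttpClientLib:
    """Stands in for a third-party library that is natively instrumented:
    it owns both its own version and the shape of the telemetry it emits."""
    VERSION = "2.0"

    def __init__(self):
        self.listeners = []

    def get(self, url):
        # The event shape is defined *here*, by the library, and may change
        # in any release of the library itself.
        event = {"name": "http.client.request", "url.full": url}
        for listener in self.listeners:
            listener(event)


# The auto-instrumentation layer only subscribes; it does not produce the
# telemetry, so it cannot version-guarantee the event's contents.
captured = []
lib = HttpClientLib()
lib.listeners.append(captured.append)
lib.get("https://example.com")
```

Under the proposal's wording, a new `HttpClientLib` release that renames `url.full` would change the captured telemetry without any change to the subscribing instrumentation, which is exactly the coupling problem raised above.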
jmacd left a comment:
@austinlparker The TC circled around to this OTEP. There is a perception that it is not very actionable, and that we need at the very least to get buy in from more maintainers. Our request is that you try to rally more approvals from more maintainers.
> This OTEP aims to achieve six outcomes:
> - Users should be able to trust default installations. Someone who installs an OpenTelemetry SDK, agent, or Collector distribution without additional configuration should receive production-ready functionality that will not break between minor versions.
I'm concerned this document's scope is creeping into reliability. "Reliable by default" is a good goal. I see us preferring stability over reliability. Using the "silent failure" example: the exporterhelper now has an option called `wait_for_result`, which is false by default. When you don't wait for the result, the result is a lie to the client. The client sees silent failure. The Collector might log something, depending on sample rate. This is a reliability issue, but if we fix it, stability suffers.
I think this also highlights something that's often implicit in how the collector SIG operates, and potentially others. There is an understanding that, regardless what major version is attached to our releases, many users have deployed production systems that rely on what we deliver and expect some level of stability. Indeed, the Collector's coding guidelines specify how to make compatibility-breaking changes in ways that minimize the likelihood of disruption to users.
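The trade-off described can be modeled in a few lines. This is a toy sketch, not the Collector's actual exporterhelper API; the class and method names are invented:

```python
import threading


class ToyExporter:
    """Toy model of a pipeline with an optional wait-for-result mode.
    Not the Collector's real exporterhelper; names are illustrative."""

    def __init__(self, wait_for_result: bool):
        self.wait_for_result = wait_for_result

    def _send_downstream(self, item) -> bool:
        # Pretend items named "bad" are rejected by the backend.
        return item != "bad"

    def export(self, item) -> bool:
        if self.wait_for_result:
            # Client sees the real outcome: reliability wins, but the call
            # now blocks on the downstream result.
            return self._send_downstream(item)
        # Fire-and-forget: report success immediately, even if the
        # downstream send later fails -- the "silent failure".
        threading.Thread(target=self._send_downstream, args=(item,)).start()
        return True


assert ToyExporter(wait_for_result=True).export("bad") is False
assert ToyExporter(wait_for_result=False).export("bad") is True  # the lie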
> Users report unexpected performance overhead with OpenTelemetry, sometimes discovering issues only at scale. Maintainers lack consistent tooling to detect performance regressions.
> This workstream should address how users understand performance overhead and how maintainers detect regressions. Benchmarks will take different forms depending on the component.
> Each implementation SIG should own this work with coordination from the TC.
I have concerns about this requirement. I just think it's unreasonable to expect every SIG to do its own performance testing, otherwise we will end up with a dozen relatively weak performance tests and a lot of wasted effort. I would support an effort to centralize performance testing in which each SDK SIG builds a synthetic benchmark subject following a specification. For example, the benchmark subject will start with a YAML file, the YAML file will give a port to listen on, then the benchmark apparatus will send the subject commands like "with N threads: create 1 span and then perform 1 microsecond of busy work". (Reference.)
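The apparatus/subject split suggested here could look roughly like the following. The command vocabulary and function names are invented for illustration; a real SDK subject would plug its span creation into `make_span`:

```python
import threading
import time


def busy_work(micros: float) -> None:
    """Spin for roughly `micros` microseconds of CPU work."""
    deadline = time.perf_counter() + micros / 1e6
    while time.perf_counter() < deadline:
        pass


def run_command(n_threads: int, iterations: int, make_span) -> float:
    """Execute 'with N threads: create 1 span then 1us busy work',
    `iterations` times per thread; return elapsed wall-clock seconds."""
    def worker():
        for _ in range(iterations):
            make_span()      # hook supplied by the benchmark subject
            busy_work(1.0)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


# A no-op subject measures apparatus overhead; comparing against a real
# SDK's make_span would isolate the instrumentation cost.
baseline = run_command(n_threads=4, iterations=100, make_span=dict)
```

The point of the centralized design is that only `make_span` differs per SDK, so the apparatus, command set, and reporting can be shared.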
> Users evaluating OpenTelemetry for production need confidence in security practices, but commitments around CVE response timelines, dependency updates, and supply chain security are not well documented.
>
> This workstream should result in documented, consistent security commitments across OpenTelemetry projects.
>
> The Security SIG, GC, and TC should own this work.
Why is this part of this document? This structure is already established.
> ### On Instrumentation Libraries
> Instrumentation library maintainers will be able to stabilize based on the production readiness of their code, without waiting for all upstream semantic conventions to stabilize. Once stable, they own the stability of their telemetry output—any breaking change to emitted telemetry requires a major version bump. They will need to clearly document which semantic conventions they use and provide migration guidance when conventions evolve.
I suspect these authors will require help from the project. How do semantic conventions and instrumentation libraries evolve independently of SDK versions, across the project?
I want to take this moment where I have your attention to call out the bad actors in our ecosystem that contribute to instability due to poor SBOM practices. If OTel is required to be stable by default, we have to break our dependency on gRPC-Go because it has an unstable-by-default policy. In other words, I do not see how OTel projects based in Golang can ever become stable while we have this kind of behavior among our dependencies: https://github.com/grpc/grpc-go/blob/master/Documentation/versioning.md#versioning-policy. Can we use this moment to request gRPC-Go to change their behavior as well? I've come to see gRPC-Go's behavior as extremely harmful to the project.
Sorry, I'm trying to understand how their behavior would break anything here? I suppose their deprecation rules for minor releases are strange, albeit signposted.
I'm also gonna just lay it out here -- I'm pretty discouraged at this point. I don't see how we can square the feedback I'm seeing on this proposal with the feedback we've been given by users. I think we're probably just using words in different ways (e.g., 'stability' to a global 2000 CISO or CTO looks very different than 'stability' to a 500 person midmarket consultancy, etc.) but we should probably figure out whose definition we care more about. I really do not think it's unreasonable for users to expect the behavior of major parts of the OpenTelemetry ecosystem (e.g., the collector, or an instrumentation library that we provide) to not completely change between minor releases. I do not think it's unreasonable to suggest that we should be better, as a project, about cutting major releases and committing to them. @jmacd To your point above, I agree that this document blends 'reliability' and 'stability', because users do not see the difference, in my experience. Observability, like it or not, is not really something that people want to spend a lot of innovation tokens on. They want to know what their software is doing. You could replace the title of this entire project with 'OpenTelemetry Production-Readiness' and it'd be the same effort with the same desired outcome, but unfortunately we don't have a mechanism to ensure that SIG releases are reliable via the spec, nor do I think we really want to get into the business of defining specific performance targets. To your point about "there should be a central benchmark", this idea was specifically rejected by maintainers of other SIGs as 'too burdensome'. This is kinda the central tension that we have to balance here. Users do not view a difference between otel-ruby, otel-go, the collector, whatever. We're all 'otel' to them.

This is extremely challenging to balance with our federated governance structure -- the real goal here is to get us set up to have more centralized control around releases through the epoch release concept, which would give us a central project-wide stability gate. Heck, maybe the ultimate way we implement 90% of this is just through that! I'd love for that to be the answer, it'd make life simpler for everyone!
- Clarify that instrumentation stability contracts apply to telemetry the library itself produces, not telemetry from third-party libraries it subscribes to (addresses nrcventura's auto-instrumentation concern)
- Replace dissolved Configuration SIG reference with note that a new project is needed (flagged by trask and jack-berg)
- Remove anecdotal quotes from motivation section per trask's suggestions
Force-pushed from 7c29295 to 95932d2
Co-authored-by: Joshua MacDonald <jmacd@users.noreply.github.com>
lmolkova left a comment:
I support the direction outlined in this OTEP and believe we can clarify detailed solutions to these challenges as we implement this vision.
mx-psi left a comment:
I support this OTEP, while it is high-level I think these all are things we should prioritize and it is valuable to have this high-level plan.
Is the performance part something that could be discussed as an (immediate) follow-up? My impression is that it is something we all know we need but don't know how to face (strategy- and resource/iteration-wise).
I spent a few hours reviewing this document, and I'm concerned that it tries to cover too many areas at once, including some decisions that may be too significant to settle within a single OTEP. Initially, I had more comments, but I decided to remove them as they would not help progress this PR.
I want to be clear that I strongly support the broader goal of moving toward stabilization. My work over the past several years hopefully reflects that commitment. That said, I think this proposal would benefit from being broken down into more focused pieces so each aspect can be properly discussed and evaluated, rather than attempting to address everything in one place.
I may be mistaken, but it also feels like we might be mixing “compliance requirements and constraints” with “user challenges,” which could benefit from clearer separation.
As a starting point, perhaps we could align on defining key areas or workstreams to focus on, such as:
- the definition of stable instrumentation/component
- documentation requirements for stable components
- performance expectations/compliance
- security compliance for stable components
This might help structure the work more clearly, enable more focused discussions, and get approvals sooner.
> # Stable by Default: Improving OpenTelemetry's Default User Experience
>
> This OTEP defines goals and acceptance criteria for making OpenTelemetry production-ready by default. It identifies workstreams requiring dedicated effort and coordination across SIGs, each of which may spawn follow-up OTEPs with detailed designs.
I have not seen any clear acceptance criteria.
Suggested change (drop "and acceptance criteria"):

> This OTEP defines goals for making OpenTelemetry production-ready by default. It identifies workstreams requiring dedicated effort and coordination across SIGs, each of which may spawn follow-up OTEPs with detailed designs.
> ## Motivation
> OpenTelemetry has grown into a massive ecosystem supporting four telemetry signals across a dozen programming languages. This growth has come with complexity that creates real barriers to production adoption.
I do not see how this sentence fits other parts of this OTEP. How about removing it? Do we need it?
Suggested change: remove this paragraph.
> Community feedback consistently identifies several pain points. Experimental features break production deployments—users report configuration breaking between minor versions, silent failures in telemetry pipelines, and unexpected performance regressions that only appear at scale.
It seems that something is logically missing in this statement. It is totally fine for experimental features to break between versions; I do not want this happening in stable releases. It is not clear if the problem is global or limited to some parts of OTel (e.g. Collector components). For instance, I have not heard such consistent complaints about OTel Go or OTel .NET Auto. If the tire gets flat you do not need to fix the whole car.
Respectfully, I do not think this view is an accurate representation of how our users view OpenTelemetry. End users, broadly, do not understand or accept that different SIGs will have different conventions around stability or reliability. We are, as a project, viewed as a single contiguous product surface. The point of this OTEP -- indeed, this entire effort -- is an attempt to provide a single global standard that we can hold SIGs to (or, at least, provides a roadmap to the effort of creating global project releases).
I see, but something is missing here. This just says "experimental features are not stable". Maybe we should add a statement like: some long-time experimental features are critical and users badly need them to be stable.
> attempt to provide a single global standard that we can hold SIGs to
Didn't we do that five years ago here? The only reference to that document I see is buried at the end in the Prior Art section. What does this OTEP propose needs to change from that document? What is working and should be retained?
More generally, I'm not sure we're all talking about the same thing here. @pellared seems to be saying "some things are experimental and that's normal and expected" while @austinlparker seems to be saying "too many things are experimental and users are confused about what is and isn't stable, particularly across SIG boundaries that may be transparent to them". I can empathize with that position, having several times heard "OTel breaks things too often" that, on further probing, boils down to " SDK/instrumentation doesn't adhere to (my understanding of) semconv" or "I haven't updated this v0.x component for 35 minor revisions and am surprised something broke". I'm not sure I would describe those situations as a lack of stability or that they would be addressed by what's proposed here.
If this is the case (we know we need it but don't know how), I fail to see how splitting it out of this really helps. The point of this OTEP is to define goals that we, as project leadership, would like to see accomplished. The exact implementation details should be discussed in follow-ups by groups that are better suited to own them.
The fact there are implementation details still in the document is probably what's causing the sense of unease around approving this, even though I believe you have wide support for the concept and broad strokes. In my opinion, the document tone has shifted enough from "solutions" to "requirements + problems" that we can proceed with merging and move forward with concrete proposals for specific topics as @pellared suggests. However, we need to make it clear to maintainers that there is still room for discussion in how we achieve these goals. Some aspects may be non-negotiable - like providing stable expectations to users. Others may be negotiable - like how we decide on versioning policies between components in OTel. @pellared - The goal of this OTEP is to fragment the problem into workstreams and come up with specific proposals for each workstream to address the problem. For example, the semantic convention + instrumentation related issues have been spearheaded by @lmolkova and myself; you can see the proposals already showing up here:
We'd like to make sure these workstreams can kick off. @pellared If you re-read this proposal looking at it not as a solution but as a definition of workstreams, do you think any are missing? If you see wording that implies "solution" vs. "requirements + workstreams" please call it out.
💯 I would love the PR description and content to reflect that. Otherwise, reviewing this PR is inconvenient, and someone in the future may blame us for doing things against the OTEP, especially things that I find very opinionated, like the stability of instrumentation libraries.
As I have written before, I have no issues with defining the workstreams, and I think this should be the main outcome of this OTEP. However, the PR description and content should reflect it. Once I am back from KubeCon, I can continue reviewing, as I stopped just before the definition of workstreams.
> - Stability information should be visible and consistent. Users should be able to easily determine the stability status of any component before adopting it, and this information should be presented consistently across all OpenTelemetry projects.
>
> - Instrumentation should be able to stabilize based on production readiness. The bar for a stable instrumentation library should be whether the instrumentation code itself is production-ready, not whether the semantic conventions it depends on have been finalized. However, once an instrumentation library stabilizes, any breaking change to its telemetry output must be treated as a breaking change requiring a major version bump. This stability guarantee applies to telemetry that the instrumentation library itself produces. When an instrumentation library subscribes to telemetry emitted natively by a third-party library (e.g., auto-instrumentation that captures spans produced by an HTTP client's own OTel integration), the content of that telemetry is governed by the third-party library's release cycle, not the instrumentation library's stability contract.
It is not clear to me: if I have an instrumentation library reporting semantic convention 1.0.0 (unstable) together with a fully stable library-level API, can I switch it to semconv 1.1.0, where it was stabilized, without making a major release of that package?
If a major bump is needed in such cases (it shouldn't be, IMO) we should not allow stable releases on unstable semconv.
> - Performance characteristics should be known. Users should be able to understand the overhead implications of OpenTelemetry before deploying to production, and maintainers should be able to detect regressions between releases.
What do you mean by "Performance characteristics should be known"?
Do you expect that we will publish some numbers from our benchmarks? I doubt that would be reliable in any way. What we should recommend, IMO, is to provide instructions for how to measure overhead in the customer's environment.
Especially in the auto-instrumentation world it is hard to consider all possible deployments.
> - Security commitments should be documented. Users should be able to evaluate OpenTelemetry's security posture, including CVE response timelines and dependency management practices.
I think this is already covered by the Security SIG and should not impact stability guarantees.
> ## Success Criteria
> This initiative succeeds when official OpenTelemetry distributions—Collector distributions, the Java agent, and similar—enable only stable components by default. Users should be able to enable experimental features through a consistent, well-documented mechanism. Each component's stability status should be clearly documented and discoverable. Instrumentation libraries should be able to reach stable status based on the production readiness of their code, even if the semantic conventions they depend on are still evolving. Once stable, any breaking change to telemetry output requires a major version bump. Performance benchmarks should exist for stable components, with published baseline characteristics. Security policies and CVE response commitments should be documented and followed.
> ### Workstream 2: Federated Schema and Stability
> Instrumentation libraries are blocked from stabilization because they depend on experimental semantic conventions, even when the instrumentation code itself is mature and battle-tested. There is also no consistent mechanism to declare which semantic conventions an instrumentation uses or to report schema URLs consistently.
The statement that we do not have a consistent way to emit information about semantic conventions is not true. It is not emitted everywhere, but that is a different topic from saying there is no consistent mechanism to utilize.
> ### On Existing Distributions
> Distributions that currently enable experimental components by default will need to audit their component list and develop a migration plan. To avoid breaking existing users, implementations may provide a transitional period with deprecation warnings before changing defaults. The specifics of this transition are left to individual distributions and the workstreams above.
I think that, for this, we now have a pretty good solution.
If you want to control what you have, just use file-based configuration (it is implemented or in development at least in Java and .NET Auto). We can avoid breaking changes with this simple recommendation.
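A sketch of the kind of pinning being suggested. The keys approximate the SDK declarative configuration file format and vary by schema version, so treat the exact names as illustrative rather than a reference:

```yaml
# Declarative configuration pinning pipeline behavior explicitly, so a
# distribution upgrade cannot silently change defaults. Key names are
# approximate; consult the declarative config schema for your version.
file_format: "0.3"
tracer_provider:
  processors:
    - batch:
        exporter:
          otlp:
            endpoint: http://localhost:4318
```

With an explicit file like this, "what is enabled" is owned by the user rather than by the distribution's defaults, which is the breaking-change mitigation being described.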
> OTEP 0143 on Versioning and Stability established the foundation for stability guarantees in OpenTelemetry clients. This OTEP extends those concepts to distributions and instrumentation.
>
> OTEP 0232 on Maturity Levels defined maturity levels: Development, Alpha, Beta, RC, Stable, and Deprecated. This OTEP builds on these levels by specifying how they should affect default behavior. Workstreams should use these maturity levels consistently rather than inventing new terminology.
>
> OTEP 0227 on Separate Semantic Conventions moved semantic conventions to a separate repository with independent versioning. This OTEP leverages that separation to enable independent stability assessments.
>
> OTEP 0152 on Telemetry Schemas defined schema URLs and transformation mechanisms for semantic convention evolution. Workstream 2 builds on this foundation.
Missing links for particular OTEPs?
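For context on the Telemetry Schemas piece, OTEP 0152-style transformations record attribute renames between schema versions, letting a consumer upgrade telemetry emitted under an older schema. A minimal sketch with an invented rename table (though `http.method` to `http.request.method` is a real semconv rename):

```python
# Minimal sketch of schema translation: a consumer that understands a
# newer schema version upgrades telemetry emitted under an older one by
# applying recorded attribute renames. This rename table is illustrative,
# not loaded from a real schema file.
RENAMES = {
    ("1.0.0", "1.1.0"): {"http.method": "http.request.method"},
}


def upgrade(attributes: dict, from_version: str, to_version: str) -> dict:
    """Return a copy of `attributes` with renames applied for the given
    version transition; unknown transitions pass attributes through."""
    mapping = RENAMES.get((from_version, to_version), {})
    return {mapping.get(key, key): value for key, value in attributes.items()}


old = {"http.method": "GET", "url.path": "/healthz"}
new = upgrade(old, "1.0.0", "1.1.0")
```

In a full implementation the rename table would be parsed from the schema file addressed by the telemetry's schema URL, which is why consistent schema URL reporting matters to this workstream.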
pellared left a comment:
The scope of work appears quite extensive, and I didn’t notice supporting research, references, or analysis that would help clarify why these are the most critical areas to focus on. I’m also concerned that achieving everything outlined in the OTEP may not be realistic in its current form, and the priorities could benefit from clearer definition. Additionally, it is unclear what a workstream actually is.
> - Stability information should be visible and consistent. Users should be able to easily determine the stability status of any component before adopting it, and this information should be presented consistently across all OpenTelemetry projects.

> - Instrumentation should be able to stabilize based on production readiness. The bar for a stable instrumentation library should be whether the instrumentation code itself is production-ready, not whether the semantic conventions it depends on have been finalized. However, once an instrumentation library stabilizes, any breaking change to its telemetry output must be treated as a breaking change requiring a major version bump. This stability guarantee applies to telemetry that the instrumentation library itself produces. When an instrumentation library subscribes to telemetry emitted natively by a third-party library (e.g., auto-instrumentation that captures spans produced by an HTTP client's own OTel integration), the content of that telemetry is governed by the third-party library's release cycle, not the instrumentation library's stability contract.

This statement is very opinionated and I think a lot of users would disagree with it:

> The bar for a stable instrumentation library should be whether the instrumentation code itself is production-ready, not whether the semantic conventions it depends on have been finalized.
> - Performance characteristics should be known. Users should be able to understand the overhead implications of OpenTelemetry before deploying to production, and maintainers should be able to detect regressions between releases.

This goal is not measurable.
> This OTEP aims to achieve six outcomes:

> - Users should be able to trust default installations. Someone who installs an OpenTelemetry SDK, agent, or Collector distribution without additional configuration should receive production-ready functionality that will not break between minor versions.

Is this a vision or a goal? How would we know that the goal has been achieved?
> - Security commitments should be documented. Users should be able to evaluate OpenTelemetry's security posture, including CVE response timelines and dependency management practices.

I am not sure if this goal is realistic, especially regarding "CVE response timelines".
> ## Success Criteria

> This initiative succeeds when official OpenTelemetry distributions—Collector distributions, the Java agent, and similar—enable only stable components by default. Users should be able to enable experimental features through a consistent, well-documented mechanism. Each component's stability status should be clearly documented and discoverable. Instrumentation libraries should be able to reach stable status based on the production readiness of their code, even if the semantic conventions they depend on are still evolving. Once stable, any breaking change to telemetry output requires a major version bump. Performance benchmarks should exist for stable components, with published baseline characteristics. Security policies and CVE response commitments should be documented and followed.

How do we know that users would be happy with the outcome? E.g. regarding:

> This initiative succeeds when official (...), the Java agent, and similar—enable only stable components by default

I created issue open-telemetry/opentelemetry-dotnet-instrumentation#2439 three years ago, and we got zero feedback that having experimental instrumentation enabled by default is bad. Also related: open-telemetry/opentelemetry-dotnet-instrumentation#2416.

I like this title for what you are trying to accomplish. I like this title because, in a lot of ways, being ready for production means preparing for instability. "Stability" is a good thing, but it's not a great thing; for example, in the emergency room "stable" means not deteriorating. We want to see improvement, which is to say positive change, and change means unstable. If you've ever dual-emitted telemetry with an old convention and a new convention, to give yourself time to update a dashboard, you know what we're talking about. "How things change in production."
> OpenTelemetry has grown into a massive ecosystem supporting four telemetry signals across a dozen programming languages. This growth has come with complexity that creates real barriers to production adoption.

> Community feedback consistently identifies several pain points. Experimental features break production deployments—users report configuration breaking between minor versions, silent failures in telemetry pipelines, and unexpected performance regressions that only appear at scale.

> attempt to provide a single global standard that we can hold SIGs to

Didn't we do that five years ago here? The only reference to that document I see is buried at the end in the Prior Art section. What does this OTEP propose needs to change from that document? What is working and should be retained?

More generally, I'm not sure we're all talking about the same thing here. @pellared seems to be saying "some things are experimental and that's normal and expected" while @austinlparker seems to be saying "too many things are experimental and users are confused about what is and isn't stable, particularly across SIG boundaries that may be transparent to them". I can empathize with that position, having several times heard "OTel breaks things too often" that, on further probing, boils down to "SDK/instrumentation doesn't adhere to (my understanding of) semconv" or "I haven't updated this v0.x component for 35 minor revisions and am surprised something broke". I'm not sure I would describe those situations as a lack of stability, or that they would be addressed by what's proposed here.
> Semantic convention changes destroy existing dashboards. When conventions change, users must update instrumentation across their entire infrastructure while simultaneously updating dashboards, alerts, and downstream tooling. Organizations report significant resistance from developers asked to coordinate these changes.

If these changes are that disruptive and important, why would we consider instrumentation that uses unstable semantic conventions to be stable? Simply saying those components should bump their major version every time their underlying conventions break isn't a complete answer. It doesn't address the fact that giving a 1.x version to a library that knows it will be making "breaking" changes frequently gives only the appearance of stability and not actual stability, nor does it address the challenges inherent in some language ecosystems regarding the management of multiple major versions of the same library.
> Many instrumentation libraries are stuck on pre-release because they depend on experimental semantic conventions, even when the instrumentation API surface itself is mature and battle-tested. The "batteries not included" philosophy means users must assemble many components before achieving basic functionality. Documentation assumes expertise, and newcomers describe the experience as "overwhelming" with "no discoverability." Auto-instrumentation can add significant resource consumption that only becomes apparent at scale. Users evaluating OpenTelemetry for production deployment need confidence in CVE response timelines, dependency hygiene, and supply chain security—areas where commitments are not well documented.

I'm not sure where "stability" relates to any of these statements except perhaps the first sentence, which is followed by several non sequitur statements.
> Many instrumentation libraries are stuck on pre-release because they depend on experimental semantic conventions, even when the instrumentation API surface itself is mature and battle-tested. The "batteries not included" philosophy means users must assemble many components before achieving basic functionality. Documentation assumes expertise, and newcomers describe the experience as "overwhelming" with "no discoverability." Auto-instrumentation can add significant resource consumption that only becomes apparent at scale, with reports of "four times the CPU usage" compared to simpler alternatives. Users evaluating OpenTelemetry for production deployment need confidence in CVE response timelines, dependency hygiene, and supply chain security—areas where commitments are not well documented.

> These all stem from the same problem: OpenTelemetry's default configuration prioritizes feature completeness over production readiness. This OTEP establishes the goals and workstreams needed to address this.

To adapt a snowclone I'm sure we all see frequently these days: it's not just stability, it's reliability. And scalability, adaptability, extensibility, configurability, etc. All of the quality attributes we want our software and systems to have.
> - Users should be able to trust default installations. Someone who installs an OpenTelemetry SDK, agent, or Collector distribution without additional configuration should receive production-ready functionality that will not break between minor versions.

I think this also highlights something that's often implicit in how the Collector SIG operates, and potentially others. There is an understanding that, regardless of what major version is attached to our releases, many users have deployed production systems that rely on what we deliver and expect some level of stability. Indeed, the Collector's coding guidelines specify how to make compatibility-breaking changes in ways that minimize the likelihood of disruption to users.
> Instrumentation libraries are blocked from stabilization because they depend on experimental semantic conventions, even when the instrumentation code itself is mature and battle-tested. There is also no consistent mechanism to declare which semantic conventions an instrumentation uses or to report schema URLs consistently.

> This workstream should establish a path for instrumentation libraries to stabilize based on the production readiness of their code, rather than requiring all upstream semantic conventions to be stable first. Once stable, instrumentation libraries own the stability of their full output—any breaking change to emitted telemetry must be treated as a breaking change requiring a major version bump, regardless of whether the change originates from updated semantic conventions or from the instrumentation itself. The workstream should also address how instrumentation communicates its semantic convention dependencies to users and downstream tooling, and how migration works when conventions evolve after instrumentation has stabilized.

Can we reasonably do this without a mechanism for migrating telemetry produced under one version of the conventions to another? If nothing changes about an instrumentation library but the version of the semantic conventions it uses to emit telemetry, why should that instrumentation library have to undergo multiple major version bumps? Nothing about the library itself would change in a way that would cause compilation failures. This also does nothing to address the inconsistency that would result from some applications updating to the next major version of an instrumentation library while others remain behind. That would seem to require schema migration capabilities to reconcile those inconsistencies, and, once those are in place, changing the telemetry emitted by an instrumentation library no longer seems like a breaking change worth incrementing the major version.
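The migration mechanism this comment asks about already has a foundation in OTEP 0152's telemetry schema files, which describe how telemetry changes between convention versions so consumers can translate it. A sketch of such a schema file follows; the format is the published telemetry schema file format, but the specific schema URL and attribute rename are illustrative examples, not a record of any real convention change:

```yaml
# Sketch of an OTEP 0152-style telemetry schema file. The transformation
# below tells consumers how spans emitted under version 1.20.0 map to
# version 1.21.0 (the rename shown is an illustrative example).
file_format: 1.1.0
schema_url: https://opentelemetry.io/schemas/1.21.0
versions:
  1.21.0:
    spans:
      changes:
        - rename_attributes:
            attribute_map:
              net.peer.name: server.address
  1.20.0: {}
```

If instrumentation published files like this alongside each convention bump, backends could reconcile telemetry from applications on different instrumentation versions, which bears directly on whether a convention change needs to be a major version bump at all.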
> ### Workstream 4: Production Readiness Criteria

> Users cannot easily assess whether a component is ready for production use. Stability status alone does not convey documentation quality, performance characteristics, or operational readiness.

This seems to acknowledge the difference between "stability" and "production readiness" that seems to be conflated elsewhere. If API stability or configuration stability or telemetry stability don't mean that something is "ready for production use", why should something being deemed so ready be the gate for calling it "stable"?
> ### On Instrumentation Libraries

> Instrumentation library maintainers will be able to stabilize based on the production readiness of their code, without waiting for all upstream semantic conventions to stabilize. Once stable, they own the stability of their telemetry output—any breaking change to emitted telemetry requires a major version bump. They will need to clearly document which semantic conventions they use and provide migration guidance when conventions evolve.

> They will need to clearly document which semantic conventions they use and provide migration guidance when conventions evolve.

Is this not implicitly saying "some of what this component does is not stable, despite its version and stability level"?
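One way to make that documentation machine-readable, consistent with the unified metadata schema the OTEP proposes, would be to extend the Collector's `metadata.yaml` pattern to instrumentation libraries. A hypothetical sketch — the `semantic_conventions` block and its field names are assumptions for illustration, not an agreed schema:

```yaml
# Hypothetical extension of the Collector's metadata.yaml pattern to an
# instrumentation library. The semantic_conventions block is an assumed,
# illustrative schema; only the general shape follows existing practice.
type: instrumentation/http-client
status:
  class: instrumentation
  stability:
    stable: [traces]
    development: [metrics]
semantic_conventions:
  schema_url: https://opentelemetry.io/schemas/1.21.0
  groups:
    - id: http.client.span
      stability: development
```

A declaration like this would let tooling surface exactly which parts of a "stable" component's output rest on experimental conventions, which is the gap the comment above identifies.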
> Defining workstreams and requiring cross-SIG coordination may slow progress compared to individual SIGs acting independently. However, each workstream can proceed independently once acceptance criteria are agreed. This OTEP provides alignment on goals without requiring lockstep execution.

> Allowing instrumentation to stabilize before its upstream semantic conventions may confuse users who see "stable" instrumentation emitting telemetry based on "experimental" semantic conventions. However, this does not mean telemetry output is free to change without consequence—once stable, the instrumentation library commits to the telemetry it emits, and any breaking change requires a major version bump. How to communicate this to users is something the workstreams will need to sort out. The alternative — keeping production-ready instrumentation in pre-release indefinitely — is worse.

Is there a problem with instrumentation authors keeping otherwise stable instrumentation libraries at v0.x because the semconv are not stable, and then using that semconv status as justification for making breaking API changes that otherwise wouldn't justify the effort of maintaining a new major version? If so, that's not explicit anywhere in here, and I'm left trying to read between the lines for why there is repeated insistence that semconv stability shouldn't gate component stability while component stability is still broken by unstable semconvs changing.
> ## Future Possibilities

> Once the workstreams defined in this OTEP complete, several additional improvements become possible. Users could specify minimum stability thresholds—for example, "only enable beta or above components"—through configuration files or environment variables. Tooling could automatically assess and surface stability information such as documentation completeness, benchmark availability, and test coverage to help users and maintainers. Mechanisms for coordinating stability status across language implementations would ensure users have consistent expectations regardless of language choice. Decoupling instrumentation stability from semantic conventions enables domain experts outside core OpenTelemetry to develop and stabilize conventions for their domains.

I'm not sure I see how any of these things are not possible with the current policies, structures, and tools we have today. Many of them will require significant development effort, for sure, but that doesn't change when we start shipping new major versions every month. In fact, it probably gets worse.
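For concreteness, the "minimum stability threshold" idea quoted above might look like the following in a distribution's configuration file. Everything here — the key names, levels, and component identifier — is a hypothetical illustration; no such schema exists today:

```yaml
# Hypothetical illustration only: a minimum-stability gate for loading
# components. None of these keys exist in any current configuration schema.
component_policy:
  # Refuse to load any component below this maturity level
  # (levels from OTEP 0232: development, alpha, beta, rc, stable).
  minimum_stability: beta
  # Explicit opt-in for individual experimental components.
  allow_experimental:
    - receiver/examplereceiver
```

Whether such a gate belongs in declarative configuration, environment variables, or build-time distribution tooling is exactly the kind of question the workstreams would need to settle.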
Summary

This OTEP proposes that OpenTelemetry distributions enable only stable components by default, decouple instrumentation stability from semantic convention stability, and establish expanded stability criteria.

Key Proposals

- Stable by default: distributions should only enable stable components
- Decouple instrumentation/semconv stability: let instrumentation stabilize independently when the API surface is stable
- Expanded stability criteria: docs, benchmarks, tested integrations
- Unified component metadata schema: extend the Collector's `metadata.yaml` pattern to instrumentation libraries

Motivation

Community feedback consistently identifies pain points that this OTEP addresses.

Related

Test plan