4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -151,6 +151,10 @@ release.
- Stabilize complex `AnyValue` attribute value types and related attribute limits.
([#4794](https://github.com/open-telemetry/opentelemetry-specification/issues/4794))

- OTEP: Stable by Default - distributions enable only stable components by default,
decouple instrumentation stability from semantic convention stability.
([#4813](https://github.com/open-telemetry/opentelemetry-specification/pull/4813))

## v1.52.0 (2025-12-12)

### Context
153 changes: 153 additions & 0 deletions oteps/4813-stable-by-default.md
@@ -0,0 +1,153 @@
# Stable by Default: Improving OpenTelemetry's Default User Experience

This OTEP defines goals and acceptance criteria for making OpenTelemetry production-ready by default. It identifies workstreams requiring dedicated effort and coordination across SIGs, each of which may spawn follow-up OTEPs with detailed designs.
> **@pellared (Member, Mar 24, 2026):** I have not seen any clear acceptance criteria.

> **Suggested change (Member):**
> - This OTEP defines goals and acceptance criteria for making OpenTelemetry production-ready by default. It identifies workstreams requiring dedicated effort and coordination across SIGs, each of which may spawn follow-up OTEPs with detailed designs.
> + This OTEP defines goals for making OpenTelemetry production-ready by default. It identifies workstreams requiring dedicated effort and coordination across SIGs, each of which may spawn follow-up OTEPs with detailed designs.


## Motivation

OpenTelemetry has grown into a massive ecosystem supporting four telemetry signals across a dozen programming languages. This growth has come with complexity that creates real barriers to production adoption.
> **@pellared (Member, Mar 24, 2026):** I do not see how this sentence fits other parts of this OTEP. How about removing it? Do we need it?

> **Suggested change (Member):** delete the sentence:
> - OpenTelemetry has grown into a massive ecosystem supporting four telemetry signals across a dozen programming languages. This growth has come with complexity that creates real barriers to production adoption.

Community feedback consistently identifies several pain points. Experimental features break production deployments—users report configuration breaking between minor versions, silent failures in telemetry pipelines, and unexpected performance regressions that only appear at scale.
> **@pellared (Member, Mar 24, 2026):** It seems that something is logically missing in this statement. It is totally fine for experimental features to break between versions. I do not want this happening to stable releases. It is not clear if the problem is global or just for some parts of OTel (e.g. Collector components). For instance, I have not heard such consistent complaints about OTel Go or OTel .NET Auto. If the tire gets flat you do not need to fix the whole car.

> **Member Author:** Respectfully, I do not think this view is an accurate representation of how our users view OpenTelemetry. End users, broadly, do not understand or accept that different SIGs will have different conventions around stability or reliability. We are, as a project, viewed as a single contiguous product surface. The point of this OTEP -- indeed, this entire effort -- is an attempt to provide a single global standard that we can hold SIGs to (or, at least, provides a roadmap to the effort of creating global project releases).

> **Member:** I see, but something is missing here. This is just saying "experimental features are not stable". Maybe we should add a statement like: some of the long-time experimental features are critical, and users badly need them to be stable.

> **Member:**
> > attempt to provide a single global standard that we can hold SIGs to
>
> Didn't we do that five years ago here? The only reference to that document I see is buried at the end in the Prior Art section. What does this OTEP propose needs to change from that document? What is working and should be retained?
>
> More generally, I'm not sure we're all talking about the same thing here. @pellared seems to be saying "some things are experimental and that's normal and expected" while @austinlparker seems to be saying "too many things are experimental and users are confused about what is and isn't stable, particularly across SIG boundaries that may be transparent to them". I can empathize with that position, having several times heard "OTel breaks things too often" that, on further probing, boils down to "SDK/instrumentation doesn't adhere to (my understanding of) semconv" or "I haven't updated this v0.x component for 35 minor revisions and am surprised something broke". I'm not sure I would describe those situations as a lack of stability or that they would be addressed by what's proposed here.


Semantic convention changes destroy existing dashboards. When conventions change, users must update instrumentation across their entire infrastructure while simultaneously updating dashboards, alerts, and downstream tooling. Organizations report significant resistance from developers asked to coordinate these changes.
> **Member:** If these changes are that disruptive and important, why would we consider instrumentation that uses unstable semantic conventions to be stable? Simply saying those components should bump their major version every time their underlying conventions break isn't a complete answer. It doesn't address the fact that giving a 1.x version to a library that knows it will be making "breaking" changes frequently gives only the appearance of stability and not actual stability, nor does it address the challenges inherent in some language ecosystems regarding the management of multiple major versions of the same library.


Many instrumentation libraries are stuck on pre-release because they depend on experimental semantic conventions, even when the instrumentation API surface itself is mature and battle-tested. The "batteries not included" philosophy means users must assemble many components before achieving basic functionality. Documentation assumes expertise, and newcomers describe the experience as "overwhelming" with "no discoverability." Auto-instrumentation can add significant resource consumption that only becomes apparent at scale. Users evaluating OpenTelemetry for production deployment need confidence in CVE response timelines, dependency hygiene, and supply chain security—areas where commitments are not well documented.
> **Member:** I'm not sure where "stability" relates to any of these statements except perhaps the first sentence, which is followed by several non sequitur statements.


These all stem from the same problem: OpenTelemetry's default configuration prioritizes feature completeness over production readiness. This OTEP establishes the goals and workstreams needed to address this.
> **Member:**
> > These all stem from the same problem: OpenTelemetry's default configuration prioritizes feature completeness over production readiness.
>
> This seems a bit of an oversimplification given some of the examples above.

> **Member:** To adapt a snowclone I'm sure we all see frequently these days: it's not just stability, it's reliability. And scalability, adaptability, extensibility, configurability, etc. All of the quality attributes we want our software and systems to have.


## Goals

This OTEP aims to achieve six outcomes:

- Users should be able to trust default installations. Someone who installs an OpenTelemetry SDK, agent, or Collector distribution without additional configuration should receive production-ready functionality that will not break between minor versions.
> **Contributor:** I'm concerned this document's scope is creeping into reliability. "Reliable by default" is a good goal. I see us preferring stability over reliability. Using the "silent failure" example: the exporterhelper now has an option called `wait_for_result`, which is false by default. When you don't wait for the result, the result is a lie to the client. The client sees silent failure. The Collector might log something, depending on sample rate. This is a reliability issue, but if we fix it, stability suffers.

> **Member:** I think this also highlights something that's often implicit in how the Collector SIG operates, and potentially others. There is an understanding that, regardless of what major version is attached to our releases, many users have deployed production systems that rely on what we deliver and expect some level of stability. Indeed, the Collector's coding guidelines specify how to make compatibility-breaking changes in ways that minimize the likelihood of disruption to users.

> **Member:** Is this a vision or a goal? How would we know that the goal has been achieved?

> **Member:** What is a "default installation" in this context? Does this apply to every artifact produced by any SIG? I think between "[s]omeone who installs [some software] without additional configuration should receive production-ready functionality that will not break between minor versions" and the OTEP title "Stable by Default" there is little room left for the development process. If we cannot have v0.x software that can break between minor versions, how are we to develop new software? Should we expect that all components are developed elsewhere and donated to OTel only once they're "stable"? That seems unreasonable to me.


- Experimental features should be clearly marked and require explicit opt-in. Users who want cutting-edge functionality can access it, but they must take deliberate action that signals they understand the stability trade-offs.

- Stability information should be visible and consistent. Users should be able to easily determine the stability status of any component before adopting it, and this information should be presented consistently across all OpenTelemetry projects.
> **Member (on lines +23 to +25):** Does v0.x versioning and inclusion in a go.mod or project.json or cargo.toml satisfy these requirements? What about standalone binaries released with a v0.x version? If so, I'm not sure what changes would arise from the inclusion of these goals, as we're already there. If not, I'm not sure how much more explicit we can be, and I think these statements need to be much clearer about what they're trying to achieve and what mechanisms would be satisfactory.


- Instrumentation should be able to stabilize based on production readiness. The bar for a stable instrumentation library should be whether the instrumentation code itself is production-ready, not whether the semantic conventions it depends on have been finalized. However, once an instrumentation library stabilizes, any breaking change to its telemetry output must be treated as a breaking change requiring a major version bump. This stability guarantee applies to telemetry that the instrumentation library itself produces. When an instrumentation library subscribes to telemetry emitted natively by a third-party library (e.g., auto-instrumentation that captures spans produced by an HTTP client's own OTel integration), the content of that telemetry is governed by the third-party library's release cycle, not the instrumentation library's stability contract.
> **@Kielek (Member, Apr 1, 2026):** It is not clear to me: if I have an instrumentation library reporting semantic convention 1.0.0 (unstable) together with a fully stable library-level API, can I switch it to semconv 1.1.0, where it was stabilized, without making a major release of such a package?
>
> If the major bump is needed in such cases (it shouldn't be, IMO) we should not allow stable releases on unstable semconv.

> **Member:** This statement is very opinionated, and I think a lot of users would disagree with it:
> > The bar for a stable instrumentation library should be whether the instrumentation code itself is production-ready, not whether the semantic conventions it depends on have been finalized.

> **Member:**
> > The bar for a stable instrumentation library should be whether the instrumentation code itself is production-ready, not whether the semantic conventions it depends on have been finalized. However, once an instrumentation library stabilizes, any breaking change to its telemetry output must be treated as a breaking change requiring a major version bump.
>
> I read this as "you can be stable with unstable dependencies, but every time they break you have to bump your MV", which is not at all stable. It's simultaneously saying too much and too little. What if the code is production-ready but the authors aren't sure that the configuration interface is correct and want more feedback? What if the code has been used in production in many cases but has a known issue and the most likely solutions to that known issue involve breaking API changes? I think it's not a simple boolean function on a binary input of (production_ready, semconv_stable).


- Performance characteristics should be known. Users should be able to understand the overhead implications of OpenTelemetry before deploying to production, and maintainers should be able to detect regressions between releases.
> **Member:** What do you mean by "Performance characteristics should be known"?
>
> Do you expect that we will produce some numbers publicly from our benchmarks? I doubt that would be reliable in any way. What we should recommend, IMO, is to provide instructions on how to measure it in the customer environment.
>
> Especially in the auto-instrumentation world it is hard to consider all possible deployments.

> **@pellared (Member, Apr 1, 2026):**
> > Performance characteristics should be known. Users should be able to understand the overhead implications of OpenTelemetry before deploying to production,
>
> This goal is not measurable.


- Security commitments should be documented. Users should be able to evaluate OpenTelemetry's security posture, including CVE response timelines and dependency management practices.
> **Member:** I think that this is already considered by the Security SIG and should not impact stability guarantees.

> **Member:** I am not sure this goal is realistic, especially regarding "CVE response timelines".


## Success Criteria

This initiative succeeds when official OpenTelemetry distributions—Collector distributions, the Java agent, and similar—enable only stable components by default. Users should be able to enable experimental features through a consistent, well-documented mechanism. Each component's stability status should be clearly documented and discoverable. Instrumentation libraries should be able to reach stable status based on the production readiness of their code, even if the semantic conventions they depend on are still evolving. Once stable, any breaking change to telemetry output requires a major version bump. Performance benchmarks should exist for stable components, with published baseline characteristics. Security policies and CVE response commitments should be documented and followed.
> **Member:** How do we know that the users would be happy with the outcome? E.g. regarding:
> > This initiative succeeds when official (...), the Java agent, and similar—enable only stable components by default
>
> I created this issue open-telemetry/opentelemetry-dotnet-instrumentation#2439 three years ago and we got zero feedback that having experimental instrumentation enabled by default is bad.
>
> Also related: open-telemetry/opentelemetry-dotnet-instrumentation#2416

> **Member:**
> > This initiative succeeds when official OpenTelemetry distributions—Collector distributions, the Java agent, and similar—enable only stable components by default.
>
> Does this mean that the collector-contrib distribution must be eliminated? Do we really have a reasonable expectation that it can both ship only "stable" components and meet its goal of including "all the components from both the OpenTelemetry Collector repository and the OpenTelemetry Collector Contrib repository"?
>
> > Users should be able to enable experimental features through a consistent, well-documented mechanism.
>
> Like this?
>
> > Each component's stability status should be clearly documented and discoverable.
>
> Like this? Or this? Or this?


## Workstreams

Achieving these goals requires coordinated effort across multiple areas. Each workstream below represents a body of work that may require its own detailed OTEP, tooling, or process changes. The current recommendations are just that -- it's probable that separate projects may need to be created to focus on these specific workstreams.

### Workstream 1: Experimental Feature Opt-In

There is no consistent mechanism across OpenTelemetry for users to opt into experimental features. The Collector uses feature gates, some SDKs use environment variables like `OTEL_SEMCONV_STABILITY_OPT_IN`, and others have ad-hoc approaches. Users have no reliable way to know what they are opting into or what the stability implications are.

This workstream should result in a consistent pattern for experimental feature opt-in that works across SDKs, the Collector, and instrumentation libraries.
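One shape such a pattern could take is a single, well-known environment variable parsed identically by every component. The sketch below assumes a hypothetical `OTEL_EXPERIMENTAL_FEATURES` variable, modeled loosely on the existing `OTEL_SEMCONV_STABILITY_OPT_IN`; nothing here is specified by this OTEP:

```python
import os


def experimental_features(environ=None):
    """Parse a comma-separated opt-in list from the environment.

    OTEL_EXPERIMENTAL_FEATURES is a hypothetical variable used for
    illustration only; the actual name and syntax would be defined by a
    follow-up OTEP.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("OTEL_EXPERIMENTAL_FEATURES", "")
    return {token.strip() for token in raw.split(",") if token.strip()}


def feature_enabled(name, environ=None):
    """An experimental feature is active only via explicit opt-in."""
    return name in experimental_features(environ)
```

If SDKs, the Collector, and instrumentation libraries shared one parsing rule like this, opting in would be the same deliberate action regardless of component type.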
> **Member:** Is this even a feasible goal? The Collector can use feature gates with CLI flags because it is itself a standalone application that is in control of its process lifecycle. SDKs and instrumentation libraries should absolutely not be adding CLI flags to control feature gates, even if it is possible, as they're not in control of the process that contains them and could cause any number of unintended consequences that would be significantly worse than having applications configured one way and libraries configured another.


A new project will be needed to drive this work.

### Workstream 2: Federated Schema and Stability

> *(austinlparker marked a conversation here as resolved.)*
Instrumentation libraries are blocked from stabilization because they depend on experimental semantic conventions, even when the instrumentation code itself is mature and battle-tested. There is also no consistent mechanism to declare which semantic conventions an instrumentation uses or to report schema URLs consistently.
> **Member:** The statement that we do not have a consistent way to emit information about semantic conventions is not true. It is not emitted everywhere, but that is a different topic from stating that there is no consistent mechanism to utilize it.


This workstream should establish a path for instrumentation libraries to stabilize based on the production readiness of their code, rather than requiring all upstream semantic conventions to be stable first. Once stable, instrumentation libraries own the stability of their full output—any breaking change to emitted telemetry must be treated as a breaking change requiring a major version bump, regardless of whether the change originates from updated semantic conventions or from the instrumentation itself. The workstream should also address how instrumentation communicates its semantic convention dependencies to users and downstream tooling, and how migration works when conventions evolve after instrumentation has stabilized.
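The versioning rule above can be stated mechanically. The helper below is a sketch of the decision table implied by the text; the function name and string labels are illustrative, not part of any specification:

```python
def required_version_bump(library_stable: bool, telemetry_output_changed: bool) -> str:
    """Version bump implied by the stability contract sketched above.

    Once a library is stable, any breaking change to emitted telemetry is a
    major bump, whether it originates from a semantic convention update or
    from the instrumentation code itself.
    """
    if telemetry_output_changed and library_stable:
        return "major"
    if telemetry_output_changed:
        return "minor"  # pre-stable: telemetry output may still evolve
    return "patch"
```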
> **Member:** Can we reasonably do this without a mechanism for migrating telemetry produced under one version of the conventions to another? If nothing changes about an instrumentation library but the version of the semantic conventions it uses to emit telemetry, why should that instrumentation library have to undergo multiple major version bumps? Nothing about the library itself would change in a way that would cause compilation failures. This also does nothing to address the inconsistency that would result from some applications updating to the next major version of an instrumentation library while others remain behind. That would seem to require schema migration capabilities to reconcile those inconsistencies and, once those are in place, changing the telemetry emitted by an instrumentation library no longer seems like a breaking change worth incrementing the major version.


The Semantic Conventions SIG and Weaver maintainers are the natural owners. Related work includes the [OTEP on federated semantic conventions](https://github.com/open-telemetry/opentelemetry-specification/pull/4815).
> **Member:** cc @open-telemetry/weaver-maintainers

> **Contributor:** Draft proposal to put all the pieces together for this workstream - #4906


### Workstream 3: Distribution and Component Definitions

The term "component" means different things in different contexts—a Collector receiver is quite different from an SDK plugin or an instrumentation library. There is no clear definition of what criteria a component must meet to be included in an official distribution, or what "official distribution" even means.

This workstream needs to define what a component is, what an official distribution is, and what criteria govern inclusion in distributions. The definitions need to work across the Collector, SDKs, and instrumentation.

The GC and Technical Committee should own this work.

### Workstream 4: Production Readiness Criteria

Users cannot easily assess whether a component is ready for production use. Stability status alone does not convey documentation quality, performance characteristics, or operational readiness.
> **Member:** This seems to acknowledge the difference between "stability" and "production readiness" that seems to be conflated elsewhere. If API stability or configuration stability or telemetry stability don't mean that something is "ready for production use", why should something being deemed so ready be the gate for calling it "stable"?


This workstream should define what "production-ready" means for OpenTelemetry components. The goal is visibility, not gatekeeping — helping maintainers understand what production users need without creating barriers to stabilization.

The End User SIG and Communications SIG should own this work.
> **Contributor:** I love our End User + Communications SIG - but is this the right owner?
>
> I think examples of this are crafting the Collector resiliency documentation, but the key questions to ask here involve core architectural decisions around the architectures OTel components support and making sure our releases fit into that cohesive whole.
>
> In lieu of a better SIG, I'd suggest this belongs to the TC (today, by charter), and we should step up what we offer here.


### Workstream 5: Performance Benchmarking

Users report unexpected performance overhead with OpenTelemetry, sometimes discovering issues only at scale. Maintainers lack consistent tooling to detect performance regressions.

This workstream should address how users understand performance overhead and how maintainers detect regressions. Benchmarks will take different forms depending on the component.

Each implementation SIG should own this work with coordination from the TC.
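Whatever form the benchmarks take, regression detection reduces to comparing a candidate measurement against a stored baseline with an agreed tolerance. A minimal stdlib sketch, where the 10% threshold and helper names are illustrative assumptions rather than anything this OTEP prescribes:

```python
import statistics
import timeit


def measure_ns(fn, repeats=5, number=1000):
    """Median cost per call in nanoseconds, using timeit to amortize noise."""
    runs = timeit.repeat(fn, repeat=repeats, number=number)
    return statistics.median(runs) / number * 1e9


def is_regression(baseline_ns, candidate_ns, tolerance=0.10):
    """True when the candidate is more than `tolerance` slower than baseline."""
    return candidate_ns > baseline_ns * (1.0 + tolerance)
```

A CI job could store `measure_ns` results per release and fail when `is_regression` fires, giving maintainers the detection capability this workstream asks for without publishing absolute numbers.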
> **Contributor (on lines +75 to +79):** I have concerns about this requirement. I just think it's unreasonable to expect every SIG to do its own performance testing; otherwise we will end up with a dozen relatively weak performance tests and a lot of wasted effort. I would support an effort to centralize performance testing in which each SDK SIG builds a synthetic benchmark subject following a specification. For example, the benchmark subject will start with a YAML file, the YAML file will give a port to listen on, then the benchmark apparatus will send the subject commands like "with N threads: create 1 span and then perform 1 microsecond of busy work". (Reference.)


### Workstream 6: Security Standards

Users evaluating OpenTelemetry for production need confidence in security practices, but commitments around CVE response timelines, dependency updates, and supply chain security are not well documented.

This workstream should result in documented, consistent security commitments across OpenTelemetry projects.

The Security SIG, GC, and TC should own this work.
> **Contributor (on lines +83 to +87):** Agree!
>
> **Member:** Why is this part of this document? This structure is already established.


## Impact

### On Existing Distributions

Distributions that currently enable experimental components by default will need to audit their component list and develop a migration plan. To avoid breaking existing users, implementations may provide a transitional period with deprecation warnings before changing defaults. The specifics of this transition are left to individual distributions and the workstreams above.
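One possible shape for that transitional period: experimental components keep loading for now but emit a deprecation warning unless explicitly opted in. The function, stability labels, and warning text below are hypothetical, not part of this OTEP:

```python
import warnings


def load_component(name, stability, opted_in=frozenset()):
    """Transitional default: still load experimental components, but warn.

    After the transition window, a distribution would refuse to enable the
    component without explicit opt-in instead of merely warning.
    """
    if stability != "stable" and name not in opted_in:
        warnings.warn(
            f"{name} is {stability} and will be disabled by default in a "
            "future release; opt in explicitly to keep using it.",
            DeprecationWarning,
            stacklevel=2,
        )
    return name  # stand-in for the real component object
```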
> **Member:** I think that, for this, we now have a pretty good solution. If you want to control what you have, just utilize file-based configuration (it is implemented/in development at least in Java/.NET Auto). We can avoid breaking changes with this simple recommendation.


### On Instrumentation Libraries

Instrumentation library maintainers will be able to stabilize based on the production readiness of their code, without waiting for all upstream semantic conventions to stabilize. Once stable, they own the stability of their telemetry output—any breaking change to emitted telemetry requires a major version bump. They will need to clearly document which semantic conventions they use and provide migration guidance when conventions evolve.
> **Contributor:** I suspect these authors will require help from the project. How do semantic conventions and instrumentation libraries evolve independently of the SDK versions, across the project?
>
> **Member:**
> > They will need to clearly document which semantic conventions they use and provide migration guidance when conventions evolve.
>
> Is this not implicitly saying "some of what this component does is not stable, despite its version and stability level"?


Note that this stability contract covers telemetry the instrumentation library itself produces. In cases where auto-instrumentation subscribes to telemetry emitted natively by a third-party library—for example, an HTTP client that directly uses OpenTelemetry APIs—the telemetry content is controlled by that library, not by the instrumentation package. The instrumentation library's stability commitment in this case is to its subscription surface (which telemetry sources it captures and how it processes them), not to the content of telemetry it does not control.

### On Users

Users will experience a more predictable default installation. Those who depend on experimental features will need to explicitly opt in, which may require configuration changes during the transition period.

## Trade-offs

Disabling experimental features by default means users get less functionality out of the box, which could worsen the "batteries not included" perception. The workstreams above will need to account for this.

Defining workstreams and requiring cross-SIG coordination may slow progress compared to individual SIGs acting independently. However, each workstream can proceed independently once acceptance criteria are agreed. This OTEP provides alignment on goals without requiring lockstep execution.

Allowing instrumentation to stabilize before its upstream semantic conventions may confuse users who see "stable" instrumentation emitting telemetry based on "experimental" semantic conventions. However, this does not mean telemetry output is free to change without consequence: once stable, the instrumentation library commits to the telemetry it emits, and any breaking change requires a major version bump. How to communicate this to users is something the workstreams will need to address. The alternative, keeping production-ready instrumentation in pre-release indefinitely, is worse.

Expanding what "production-ready" means could make it harder for components to stabilize, worsening the "stuck on pre-release" problem. The workstreams should avoid creating new barriers to stabilization.

## Prior Art

OTEP 0143 on Versioning and Stability established the foundation for stability guarantees in OpenTelemetry clients. This OTEP extends those concepts to distributions and instrumentation.

OTEP 0232 on Maturity Levels defined six maturity levels: Development, Alpha, Beta, RC, Stable, and Deprecated. This OTEP builds on those levels by specifying how they should affect default behavior. Workstreams should use these maturity levels consistently rather than inventing new terminology.
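
To make the progression concrete, here is a minimal sketch (hypothetical names, not an API from any SIG) of how a distribution might gate default enablement on these maturity levels:

```python
from enum import IntEnum


class Maturity(IntEnum):
    """Maturity levels from OTEP 0232, ordered least to most mature.

    DEPRECATED is orthogonal to this progression and is omitted here.
    """
    DEVELOPMENT = 0
    ALPHA = 1
    BETA = 2
    RC = 3
    STABLE = 4


def enabled_by_default(level: Maturity,
                       threshold: Maturity = Maturity.STABLE) -> bool:
    """A component ships enabled only at or above the distribution's threshold."""
    return level >= threshold


# "Stable only" default: beta components stay off until opted in.
assert enabled_by_default(Maturity.STABLE)
assert not enabled_by_default(Maturity.BETA)

# A Kubernetes-style "beta and above" default.
assert enabled_by_default(Maturity.BETA, threshold=Maturity.BETA)
```

The open question below, "stable only" versus "beta and above," is just a choice of `threshold` in this model.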

OTEP 0227 on Separate Semantic Conventions moved semantic conventions to a separate repository with independent versioning. This OTEP leverages that separation to enable independent stability assessments.

OTEP 0152 on Telemetry Schemas defined schema URLs and transformation mechanisms for semantic convention evolution. Workstream 2 builds on this foundation.

The OpenTelemetry Collector's `metadata.yaml` and feature gates provide established patterns for component metadata and experimental feature opt-in that workstreams should consider.
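
For reference, the Collector's per-component `metadata.yaml` records stability per signal class. The shape below follows that pattern (component chosen as an illustration):

```yaml
type: filelog
status:
  class: receiver
  stability:
    beta: [logs]
```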

Kubernetes uses [feature gates](https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/) with alpha/beta/stable progression, where beta features are typically enabled by default. Workstreams should consider whether OpenTelemetry should follow a similar pattern.

## Alternatives Considered

An earlier version of this OTEP attempted to specify detailed requirements for stability criteria, metadata schemas, and opt-in mechanisms. Community feedback indicated this approach was too prescriptive and should be broken into manageable workstreams that can be tackled independently with their own detailed designs.

We also considered keeping current defaults but improving documentation about stability. This does not address the core problem: users hit production issues with experimental features they did not realize they were using. Documentation alone is insufficient.

We considered requiring semantic conventions to be stable before instrumentation can stabilize. This blocks useful, mature instrumentation indefinitely and does not match how users evaluate stability.

## Open Questions

Who will own each workstream? Should ownership be assigned before this OTEP is approved, or can workstreams proceed as volunteers emerge?

Can workstreams proceed in parallel, or do some depend on others? For example, does "Distribution and Component Definitions" need to complete before "Experimental Feature Opt-In" can finalize its design?

Should the default be "stable only" or "beta and above"? The Collector and Kubernetes enable beta features by default. Is that the right model for OpenTelemetry broadly?

Which distributions are considered "official" and subject to these requirements? Just the Collector distributions and Java agent? What about language-specific SDK packages?

How do we ensure workstream outcomes are adopted across the federated OpenTelemetry project? What enforcement mechanisms exist beyond social pressure?

How will we measure whether this initiative is successful? User surveys? Reduced support burden? Faster adoption?

## Future Possibilities

Once the workstreams defined in this OTEP complete, several additional improvements become possible:

- Users could specify minimum stability thresholds (for example, "only enable beta or above components") through configuration files or environment variables.
- Tooling could automatically assess and surface stability signals such as documentation completeness, benchmark availability, and test coverage to help users and maintainers.
- Mechanisms for coordinating stability status across language implementations would give users consistent expectations regardless of language choice.
- Decoupling instrumentation stability from semantic conventions would enable domain experts outside core OpenTelemetry to develop and stabilize conventions for their domains.
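
For example, a future declarative configuration could express a minimum stability threshold directly (a hypothetical sketch; no such key exists in the configuration schema today):

```yaml
# Hypothetical: enable only components at or above this maturity level.
minimum_stability: beta
```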