
Commit 40fcf21

docs: Add new blueprint for non-k8s environments
1 parent 4e247eb commit 40fcf21

3 files changed

Lines changed: 367 additions & 2 deletions

.cspell.yml

Lines changed: 2 additions & 0 deletions

```diff
@@ -120,3 +120,5 @@ words:
 - warnf
 - warnidf
 - Wayback
+- Lukasz
+- Ciukaj
```

content/en/docs/guidance/blueprints/_index.md

Lines changed: 0 additions & 2 deletions

```diff
@@ -12,5 +12,3 @@ Each blueprint is tightly scoped to address specific challenges, so you might
 need to refer to multiple blueprints, depending on your environment.

 <!-- Add how to propose a new blueprint once the issue template and repo are ready. -->
-
-**Coming soon!**
```
Lines changed: 365 additions & 0 deletions (new file)

---
title: Instrumenting Infrastructure and Processes on Non-K8s Environments
linkTitle: Instrumenting Infrastructure and Processes on Non-K8s Environments
date: 2026-04-21
author: Lukasz Ciukaj (Splunk)
---

## Summary

This blueprint outlines a strategic reference for Platform Engineering and SRE
teams operating in traditional virtual machine (VM), bare metal, and on-premises
environments, including scenarios where containers run directly on an operating
system without an orchestrator such as Kubernetes.

It addresses the friction often found when attempting to establish consistent
observability across heterogeneous infrastructure, legacy processes, and
containerized workloads.

By implementing the patterns in this blueprint, organizations can expect to
achieve the following outcomes:

- Out-of-the-box, high-quality telemetry for applications and services running
  in non-Kubernetes environments, including directly managed containers.
- Consistent lifecycle management for OpenTelemetry agents, together with
  standardized bootstrap and configuration patterns for SDK-based
  instrumentation.
- Unified observability across mixed infrastructure: VMs, bare metal, and
  containers without an orchestrator.
- Improved governance over telemetry signal quality, data enrichment, routing,
  and export pipelines.
- Reduced manual toil and cognitive load for developers and operators.
## Background

Many organizations maintain a blend of legacy infrastructure, VMs, bare metal
servers, and direct-to-runtime container deployments, in addition to or instead
of Kubernetes. These environments can be complex and often lack the automation
and standardization provided by orchestrators. Ensuring consistent, high-quality
observability in these scenarios is critical, yet frequently hampered by
fragmented tooling and manual processes.

The introduction of the Open Agent Management Protocol (OpAMP) provides a
standardized, scalable way to remotely manage, configure, and monitor
OpenTelemetry agents across diverse infrastructure. In parallel, shared
libraries, pre-baked images, and centrally maintained configuration artifacts
can help standardize SDK-based instrumentation. Together, these approaches
reduce friction and improve reliability for both host-based and containerized
workloads.
## Common challenges

Organizations operating in non-Kubernetes environments, such as those relying on
traditional virtual machines, bare metal servers, or containers running directly
on hosts, typically face a distinct set of challenges that hinder effective
observability. Unlike environments with orchestrators such as Kubernetes, these
setups often lack built-in automation, standardization, and centralized
management for deploying and configuring observability tooling. As a result,
ensuring consistent, high-quality telemetry across a diverse landscape of
infrastructure and applications can be complex and resource-intensive.

### 1. Fragmented instrumentation approaches

Without standardized deployment and management patterns, teams often adopt
different OpenTelemetry agents, SDKs, or exporters for host-based and
containerized workloads.

This leads to:

- **Inconsistent metadata:** Telemetry signals may lack standard resource
  attributes such as `service.name`, `host.id`, `host.name`, `container.id`,
  and `deployment.environment`, making cross-system correlation difficult.
- **Divergent instrumentation behavior:** Different teams may apply different
  defaults for sampling, propagation, resource detection, or export, producing
  uneven telemetry quality.
- **Manual configuration drift:** Host- and container-based agents frequently
  require manual configuration, resulting in drift and an increased risk of
  errors.
### 2. Limited automation for telemetry deployment and management

Deploying agents and instrumentation on VMs, bare metal, and directly run
containers is often manual or script-based, and ongoing configuration is
difficult to manage at scale. This decentralized, ad hoc approach typically
requires operators or developers to install, configure, and update OpenTelemetry
agents individually on each host or workload.

This leads to:

- **High toil:** New workloads or hosts require repeated, error-prone
  configuration steps.
- **Slow rollout and update cycles:** Updates to instrumentation or
  configuration are slow and difficult to propagate fleet-wide.
- **Operational risk:** Rollbacks, version control, and health monitoring are
  harder to perform consistently across the estate.
### 3. Siloed data processing and export

Data collection and export pipelines are often set up per application, per host,
or per team. In the absence of centralized management, individual teams may
independently configure telemetry agents, exporters, and data processing logic
for each workload or environment.

This leads to:

- **Duplicated effort:** Teams may duplicate data enrichment, filtering, and
  routing logic across environments.
- **Inconsistent policy enforcement:** Redaction, retry behavior, batching, and
  routing policies may vary between teams.
- **Lack of visibility:** Operations and governance teams lack unified control
  over what telemetry is collected and how it is processed or exported.
## General guidelines

To address the challenges described above, organizations should adopt a set of
strategic guidelines designed to streamline observability practices across
diverse non-Kubernetes environments. These guidelines provide a foundation for
standardizing telemetry instrumentation, automating agent management, and
ensuring consistent data quality, whether workloads are running on VMs, bare
metal, or as containers managed directly by a runtime. By following these
recommendations, teams can reduce operational overhead, improve visibility, and
lay the groundwork for scalable, sustainable observability in complex
infrastructure landscapes.

### 1. Centrally manage agent lifecycle while allowing controlled customization

<small>Challenges addressed: 1, 2</small>

Use OpAMP, where supported, to centrally manage OpenTelemetry agents running as
system services or service containers. Platform teams should own the baseline
agent distribution, required processors and exporters, security settings, health
reporting, and default resource detection behavior.

At the same time, organizations should explicitly define how
environment-specific or workload-specific customization is allowed. A practical
model is to use a **layered configuration approach**:

- A **platform-owned baseline** for mandatory defaults, security controls, and
  organization-wide processors and exporters.
- An **environment overlay** for differences such as endpoints, tenancy,
  deployment environment, or site-specific metadata.
- A **workload overlay** for approved variations such as opt-in receivers,
  additional resource attributes, or safe tuning parameters.

This creates a clear boundary between standardization and flexibility: teams can
extend approved parts of the configuration without creating one-off, unmanaged
deployments.
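
As an illustration, the three layers might be expressed as Collector
configuration fragments that are merged at deployment time. This is a sketch
only: the merge tooling is not prescribed here, and the endpoints, receiver, and
log path are hypothetical.

```yaml
# baseline.yaml — platform-owned defaults, mandatory for every agent
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
  batch: {}
exporters:
  otlp:
    endpoint: otel-gateway.example.com:4317   # hypothetical organization gateway
---
# environment overlay — e.g. staging-specific differences only
exporters:
  otlp:
    endpoint: otel-gateway.staging.example.com:4317
---
# workload overlay — approved, opt-in additions for one service
receivers:
  filelog:
    include: [/var/log/myapp/*.log]
```

Keeping each layer in its own versioned artifact makes it clear which team owns
which keys, and lets the baseline evolve without touching workload overlays.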

By implementing this guideline, organizations can expect to achieve:

- Automated, consistent telemetry configuration across all environments.
- Reduced manual errors and simplified onboarding for new workloads.
- Faster, safer upgrades and rollbacks of agent configurations.
- A controlled mechanism for local customization without sacrificing central
  governance.

### 2. Centralize telemetry collection and processing through an OpenTelemetry Collector Gateway layer

<small>Challenges addressed: 1, 3</small>

Deploy one or more OpenTelemetry Collector Gateways as central aggregation
points for telemetry data from hosts and directly managed containers. In
non-Kubernetes environments, these gateways can be deployed using several
patterns, depending on scale and operational model, including:

- Dedicated gateway VMs or bare metal hosts.
- A service pool behind a load balancer.
- Containerized gateway services running on general-purpose compute.
- Regional or site-local gateways for distributed environments.
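
In the simplest topology, each host-local agent forwards everything to the
gateway tier over OTLP. A minimal agent-side sketch, assuming a load balancer in
front of the gateway pool (the hostname is hypothetical):

```yaml
receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu: {}
      memory: {}
  otlp:
    protocols:
      grpc:
        endpoint: 127.0.0.1:4317   # local applications send OTLP here
processors:
  batch: {}
exporters:
  otlp:
    endpoint: otel-gateway.example.com:4317   # hypothetical gateway load balancer
service:
  pipelines:
    metrics:
      receivers: [hostmetrics, otlp]
      processors: [batch]
      exporters: [otlp]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

The agent stays deliberately thin; enrichment, redaction, and routing live in
the gateway tier.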

By implementing this guideline, organizations can expect to achieve:

- Unified control over data processing, enrichment, and export pipelines.
- Simplified governance and easier implementation of organization-wide policies.
- Better resilience and scalability than per-host or per-application export
  topologies.
- Clear separation between local collection and centralized policy enforcement.

### 3. Standardize resource attribution and distribute reusable instrumentation building blocks

<small>Challenges addressed: 1</small>

Define an organization-wide telemetry contract for resource attribution and
ensure it is applied consistently across all workloads. This should not rely
only on documentation; it should be delivered through reusable building blocks
such as:

- Pre-baked agent images.
- Shared libraries or starter packages for SDK-based instrumentation.
- Standard startup wrappers or environment-variable conventions.
- Centrally maintained configuration snippets or templates.

At minimum, the standard resource model for non-Kubernetes environments should
cover:

- **Host:** `host.id`, `host.name`, `host.arch`
- **Device (where applicable):** `device.id` and other relevant device
  attributes
- **Process:** `process.pid`, `process.executable.name`, `process.command`
- **Process runtime:** `process.runtime.name`, `process.runtime.version`
- **Operating system:** `os.type`, `os.description`, `os.version`
- **Container (where applicable):** `container.id`
- **Service identity:** `service.name`, `service.version`,
  `deployment.environment`

Application telemetry should include at least `host.id` or `host.name` so that
application signals can be correlated with host- and infrastructure-level
telemetry.
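
Much of this resource model can be populated automatically rather than by hand.
A hedged Collector-side sketch using the resource detection and resource
processors (detector availability and option names vary between Collector
distributions, and the environment value is illustrative):

```yaml
processors:
  resourcedetection:
    # env reads OTEL_RESOURCE_ATTRIBUTES; system fills host/OS attributes
    detectors: [env, system]
  resource/environment:
    attributes:
      - key: deployment.environment
        value: production        # illustrative; set per environment overlay
        action: upsert
```

Baking this into the platform-owned baseline means every workload inherits the
contract by default instead of opting into it.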

By implementing this guideline, organizations can expect to achieve:

- Improved correlation and searchability of telemetry data across systems.
- Easier analysis and troubleshooting regardless of infrastructure type.
- Consistent metadata quality without requiring every team to reinvent
  instrumentation patterns.
- Faster adoption through reusable, supported building blocks.

## Implementation

Translating these guidelines into practice requires a combination of automation,
standardized tooling, and centralized management. The implementation steps below
are written as roadmap items, with checklist-style actions that organizations
can plan and execute in sequence.

### 1. Define a baseline telemetry contract and layered configuration model

<small>Guidelines supported: 1, 3</small>

Define the minimum required telemetry contract for the organization and document
which parts of agent and SDK configuration are centrally owned versus locally
customizable. This is the foundation for consistency at scale.

Checklist:

- Define the mandatory resource attributes and signal conventions that all
  workloads must emit.
- Define the baseline agent configuration, including exporters, authentication,
  TLS, health reporting, and default processors.
- Define the allowed extension points for environment-specific and
  workload-specific customization.
- Version all baseline and overlay configurations so they can be rolled out and
  rolled back safely.
- Publish ownership boundaries so teams know what they can and cannot modify.

Documentation:

- [OpAMP Specification](https://github.com/open-telemetry/opamp-spec)
- [OpenTelemetry Semantic Conventions](/docs/specs/semconv/)

### 2. Stand up an OpAMP management plane for agents

<small>Guidelines supported: 1</small>

Deploy a central OpAMP management service to manage agent configuration, status
reporting, health monitoring, and controlled rollouts for supported agents.

Checklist:

- Select the supported OpAMP-capable agent distributions.
- Stand up a central OpAMP server or management endpoint.
- Register agents and enable health and status reporting.
- Define rollout rings such as development, staging, and production.
- Define rollback procedures for failed updates or bad configurations.
- Monitor management-plane health and agent connectivity.
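
On the agent side, one way to connect a Collector to the management plane is the
OpAMP Supervisor pattern, in which a small supervisor process speaks OpAMP and
manages the local Collector. A sketch of a supervisor-style configuration — key
names follow the OpAMP Supervisor work in opentelemetry-collector-contrib and
may differ between versions, and the endpoint and paths are hypothetical:

```yaml
server:
  endpoint: wss://opamp.example.com/v1/opamp   # hypothetical management endpoint
capabilities:
  accepts_remote_config: true
  reports_health: true
  reports_effective_config: true
agent:
  executable: /usr/local/bin/otelcol-contrib   # locally installed Collector
storage:
  directory: /var/lib/otelcol/supervisor       # persisted remote config and state
```

With this pattern, configuration rollouts and rollbacks are driven from the
OpAMP server rather than by touching each host.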

Documentation:

- [OpAMP Specification](https://github.com/open-telemetry/opamp-spec)
- [OpenTelemetry Collector Documentation](/docs/collector/)

### 3. Package and deploy standardized agents and SDK bootstrap artifacts

<small>Guidelines supported: 1, 3</small>

Use configuration management and image packaging to deliver supported telemetry
components consistently across hosts and containerized workloads.

Checklist:

- Package host-based agents as standard system services.
- Provide pre-baked images or service-container definitions for containerized
  deployments.
- Publish shared libraries, starter packages, or bootstrap wrappers for
  supported SDK languages.
- Standardize environment-variable and configuration-file conventions across
  environments.
- Validate that new workloads inherit the baseline configuration by default.
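
On systemd-based hosts, "standard system service" typically means shipping a
unit file with the agent package. A minimal sketch, assuming illustrative paths
and a dedicated service user:

```ini
# /etc/systemd/system/otelcol.service — illustrative unit for a host-based agent
[Unit]
Description=OpenTelemetry Collector
After=network-online.target
Wants=network-online.target

[Service]
User=otelcol
ExecStart=/usr/local/bin/otelcol --config /etc/otelcol/config.yaml
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Distributing this unit through configuration management keeps service behavior
(restart policy, user, config path) identical across the fleet.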

Documentation:

- [OpenTelemetry Collector Documentation](/docs/collector/)
- [OpenTelemetry Documentation](/docs/)

### 4. Deploy an OpenTelemetry Collector Gateway layer

<small>Guidelines supported: 2</small>

Deploy one or more OpenTelemetry Collector Gateways as the central processing
and export tier. Choose a topology appropriate for the environment, such as
dedicated VMs, service pools behind a load balancer, or regional gateway nodes.

Checklist:

- Select the gateway deployment topology for each environment.
- Define how local agents discover and connect to gateways.
- Configure processors for batching, memory protection, enrichment, retry, and
  routing.
- Separate lightweight ingest from heavier centralized processing where scale
  requires it.
- Define high-availability and failover behavior for gateways.
- Validate end-to-end routing to observability backends.
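
A gateway node's configuration then concentrates the heavier processing. A
sketch of a gateway traces pipeline (the enrichment attribute and backend
endpoint are hypothetical):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # agents connect here, typically via a load balancer
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
  resource/site:
    attributes:
      - key: site.name           # hypothetical site-level enrichment
        value: eu-datacenter-1
        action: upsert
  batch: {}
exporters:
  otlp:
    endpoint: backend.observability.example.com:4317   # hypothetical backend
    retry_on_failure:
      enabled: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource/site, batch]
      exporters: [otlp]
```

Running several identical gateway nodes behind the load balancer provides the
high-availability behavior called for in the checklist.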

Documentation:

- [OpenTelemetry Collector Documentation](/docs/collector/)
- [OpenTelemetry Collector Configuration](/docs/collector/configuration/)

### 5. Enforce resource attribution and correlation standards

<small>Guidelines supported: 1, 3</small>

Ensure that all telemetry includes the required metadata for correlation across
infrastructure and application layers.

Checklist:

- Publish the minimum required resource attribute set for hosts, processes,
  runtimes, operating systems, and containers where applicable.
- Ensure application telemetry includes at least `host.id` or `host.name`.
- Enable resource detection and enrichment wherever supported.
- Validate emitted telemetry against the standard attribute contract.
- Add conformance checks to deployment pipelines or post-deployment validation
  steps.
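
For SDK-based instrumentation, much of this contract can be enforced through the
standard OpenTelemetry environment variables, for example in a shared startup
wrapper. The service name, version, and entry point below are illustrative:

```shell
# Illustrative bootstrap wrapper: set standard OTel resource env vars,
# then hand off to the real application entry point.
export OTEL_SERVICE_NAME="checkout-service"   # hypothetical service
export OTEL_RESOURCE_ATTRIBUTES="service.version=1.4.2,deployment.environment=production,host.name=$(hostname)"
# exec /opt/app/bin/checkout-service "$@"     # hypothetical entry point
```

Because every SDK reads these variables the same way, one wrapper covers all
supported languages without per-team instrumentation changes.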

Documentation:

- [OpenTelemetry Semantic Conventions](/docs/specs/semconv/)
- [OpenTelemetry Resource Semantic Conventions](/docs/specs/semconv/resource/)

### 6. Centralize governance, policy enforcement, and change management

<small>Guidelines supported: 2, 3</small>

Use the Collector Gateway layer and centrally owned configuration to enforce
organization-wide rules for processing, routing, and exporting telemetry.

Checklist:

- Define approved exporters and backend destinations.
- Centralize redaction, filtering, enrichment, and routing policies.
- Define standard retry, batching, and sampling policies.
- Establish an exception process for workloads that need non-default behavior.
- Review telemetry quality and policy compliance regularly.
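
Redaction and filtering policies enforced at the gateway can be sketched with
the attributes and filter processors. The attribute keys and the health-check
route below are examples, not an organization-specific policy:

```yaml
processors:
  attributes/redact:
    actions:
      - key: user.email                           # example sensitive attribute
        action: delete
      - key: http.request.header.authorization    # never export credentials
        action: delete
  filter/health:
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'   # drop health-check spans
```

Keeping these processors only in the centrally owned gateway configuration means
policy changes roll out fleet-wide without touching individual workloads.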
353+
354+
Documentation:
355+
356+
- [OpenTelemetry Collector Documentation](/docs/collector/)
357+
- [OpenTelemetry Semantic Conventions](/docs/specs/semconv/)
358+
359+
## Reference architectures
360+
361+
The patterns described above have been successfully implemented by the following
362+
end-users:
363+
364+
- [Reference Architecture 1][]
365+
- [Reference Architecture 2][]
