---
title: Instrumenting Infrastructure and Processes on Non-K8s Environments
linkTitle: Instrumenting Infrastructure and Processes on Non-K8s Environments
date: 2026-04-21
author: Lukasz Ciukaj (Splunk)
---

## Summary

This blueprint outlines a strategic reference for Platform Engineering and SRE
teams operating in traditional virtual machine (VM), bare metal, and on-premises
environments, including scenarios where containers are run directly on an
operating system without an orchestrator like Kubernetes.

It addresses the friction often found when attempting to establish consistent
observability across heterogeneous infrastructure, legacy processes, and
containerized workloads.

By implementing the patterns in this blueprint, organizations can expect to
achieve the following outcomes:

- Out-of-the-box, high-quality telemetry for applications and services running
  in non-Kubernetes environments, including directly managed containers.
- Consistent lifecycle management for OpenTelemetry agents, together with
  standardized bootstrap and configuration patterns for SDK-based
  instrumentation.
- Unified observability across mixed infrastructure: VMs, bare metal, and
  containers without an orchestrator.
- Improved governance over telemetry signal quality, data enrichment, routing,
  and export pipelines.
- Reduced manual toil and cognitive load for developers and operators.

## Background

Many organizations maintain a blend of legacy infrastructure, VMs, bare metal
servers, and direct-to-runtime container deployments, in addition to or instead
of Kubernetes. These environments can be complex and often lack the automation
and standardization provided by orchestrators. Ensuring consistent, high-quality
observability in these scenarios is critical, yet frequently hampered by
fragmented tooling and manual processes.

The introduction of the Open Agent Management Protocol (OpAMP) provides a
standardized, scalable way to remotely manage, configure, and monitor
OpenTelemetry agents across diverse infrastructure. In parallel, shared
libraries, pre-baked images, and centrally maintained configuration artifacts
can help standardize SDK-based instrumentation. Together, these approaches
reduce friction and improve reliability for both host-based and containerized
workloads.

## Common challenges

Organizations operating in non-Kubernetes environments, such as those relying on
traditional virtual machines, bare metal servers, or containers running directly
on hosts, typically face a distinct set of challenges that hinder effective
observability. Unlike environments with orchestrators such as Kubernetes, these
setups often lack built-in automation, standardization, and centralized
management for deploying and configuring observability tooling. As a result,
ensuring consistent, high-quality telemetry across a diverse landscape of
infrastructure and applications can be complex and resource-intensive.

### 1. Fragmented instrumentation approaches

Without standardized deployment and management patterns, teams often adopt
different OpenTelemetry agents, SDKs, or exporters for host-based and
containerized workloads.

This leads to:

- **Inconsistent metadata:** Telemetry signals may lack standard resource
  attributes such as `service.name`, `host.id`, `host.name`, `container.id`, and
  `deployment.environment`, making cross-system correlation difficult.
- **Divergent instrumentation behavior:** Different teams may apply different
  defaults for sampling, propagation, resource detection, or export, producing
  uneven telemetry quality.
- **Manual configuration drift:** Host- and container-based agents frequently
  require manual configuration, resulting in drift and an increased risk of
  errors.

### 2. Limited automation for telemetry deployment and management

Instrumentation and agent deployment on VMs, bare metal, and directly run
containers are often manual or script-based, and ongoing configuration is
difficult to manage at scale. This decentralized, ad hoc approach typically
requires operators or developers to install, configure, and update OpenTelemetry
agents individually on each host or workload.

This leads to:

- **High toil:** New workloads or hosts require repeated, error-prone
  configuration steps.
- **Slow rollout and update cycles:** Updates to instrumentation or
  configuration are slow and difficult to propagate fleet-wide.
- **Operational risk:** Rollbacks, version control, and health monitoring are
  harder to perform consistently across the estate.

### 3. Siloed data processing and export

Data collection and export pipelines are often set up per application, per host,
or per team. In the absence of centralized management, individual teams may
independently configure telemetry agents, exporters, and data processing logic
for each workload or environment.

This leads to:

- **Duplicated effort:** Teams may duplicate data enrichment, filtering, and
  routing logic across environments.
- **Inconsistent policy enforcement:** Redaction, retry behavior, batching, and
  routing policies may vary between teams.
- **Lack of visibility:** Operations and governance teams lack unified control
  over what telemetry is collected and how it is processed or exported.

## General guidelines

To address the challenges described above, organizations should adopt a set of
strategic guidelines designed to streamline observability practices across
diverse non-Kubernetes environments. These guidelines provide a foundation for
standardizing telemetry instrumentation, automating agent management, and
ensuring consistent data quality, whether workloads are running on VMs, bare
metal, or as containers managed directly by a runtime. By following these
recommendations, teams can reduce operational overhead, improve visibility, and
lay the groundwork for scalable, sustainable observability in complex
infrastructure landscapes.

### 1. Centrally manage agent lifecycle while allowing controlled customization

<small>Challenges addressed: 1, 2</small>

Use OpAMP, where supported, to centrally manage OpenTelemetry agents running as
system services or service containers. Platform teams should own the baseline
agent distribution, required processors and exporters, security settings, health
reporting, and default resource detection behavior.

At the same time, organizations should explicitly define how
environment-specific or workload-specific customization is allowed. A practical
model is to use a **layered configuration approach**:

- A **platform-owned baseline** for mandatory defaults, security controls, and
  organization-wide processors/exporters.
- An **environment overlay** for differences such as endpoints, tenancy,
  deployment environment, or site-specific metadata.
- A **workload overlay** for approved variations such as opt-in receivers,
  additional resource attributes, or safe tuning parameters.

This creates a clear boundary between standardization and flexibility: teams can
extend approved parts of the configuration without creating one-off, unmanaged
deployments.

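One way to realize this layering is to keep each layer in its own OpenTelemetry
Collector configuration fragment. The file names, the environment variable, and
the attribute values below are illustrative, not a prescribed standard:

```yaml
# baseline.yaml (platform-owned): mandatory processors and exporters.
processors:
  batch: {}
exporters:
  otlphttp:
    endpoint: ${env:OTEL_GATEWAY_ENDPOINT} # resolved per environment
    tls:
      insecure: false

# --- environment-overlay.yaml: environment-specific enrichment -------------
processors:
  resource/environment:
    attributes:
      - key: deployment.environment
        value: production # example value
        action: upsert
```

Collector distributions that accept multiple `--config` flags merge such
fragments at startup, so workloads inherit the baseline by default and overlays
stay small and reviewable.
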
| 148 | +By implementing this guideline, organizations can expect to achieve: |
| 149 | + |
| 150 | +- Automated, consistent telemetry configuration across all environments. |
| 151 | +- Reduced manual errors and simplified onboarding for new workloads. |
| 152 | +- Faster, safer upgrades and rollbacks of agent configurations. |
| 153 | +- A controlled mechanism for local customization without sacrificing central |
| 154 | + governance. |
| 155 | + |
| 156 | +### 2. Centralize telemetry collection and processing through an OpenTelemetry Collector Gateway layer |
| 157 | + |
| 158 | +<small>Challenges addressed: 1, 3</small> |
| 159 | + |
| 160 | +Deploy one or more OpenTelemetry Collector Gateways as central aggregation |
| 161 | +points for telemetry data from hosts and directly managed containers. In |
| 162 | +non-Kubernetes environments, these gateways can be deployed using several |
| 163 | +patterns, depending on scale and operational model, including: |
| 164 | + |
| 165 | +- Dedicated gateway VMs or bare metal hosts. |
| 166 | +- A service pool behind a load balancer. |
| 167 | +- Containerized gateway services running on general-purpose compute. |
| 168 | +- Regional or site-local gateways for distributed environments. |
| 169 | + |
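As a sketch of the first two patterns, a host-local agent can collect local
telemetry and forward everything to a load-balanced gateway pool. The gateway
hostname below is a placeholder:

```yaml
# Host agent: collect locally, forward to the gateway pool.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 127.0.0.1:4317 # local applications send OTLP here
  hostmetrics:
    scrapers:
      cpu:
      memory:
exporters:
  otlp:
    endpoint: otel-gateway.internal.example:4317 # load balancer VIP (placeholder)
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
    metrics:
      receivers: [otlp, hostmetrics]
      exporters: [otlp]
```

Keeping the host agent this thin pushes enrichment, policy, and export fan-out
into the gateway tier, where they can be governed centrally.
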
By implementing this guideline, organizations can expect to achieve:

- Unified control over data processing, enrichment, and export pipelines.
- Simplified governance and easier implementation of organization-wide policies.
- Better resilience and scalability than per-host or per-application export
  topologies.
- Clear separation between local collection and centralized policy enforcement.

### 3. Standardize resource attribution and distribute reusable instrumentation building blocks

<small>Challenges addressed: 1</small>

Define an organization-wide telemetry contract for resource attribution and
ensure it is applied consistently across all workloads. This should not rely
only on documentation; it should be delivered through reusable building blocks
such as:

- Pre-baked agent images.
- Shared libraries or starter packages for SDK-based instrumentation.
- Standard startup wrappers or environment-variable conventions.
- Centrally maintained configuration snippets or templates.

At minimum, the standard resource model for non-Kubernetes environments should
cover:

- **Host:** `host.id`, `host.name`, `host.arch`
- **Device (where applicable):** `device.id` and other relevant device
  attributes
- **Process:** `process.pid`, `process.executable.name`, `process.command`
- **Process runtime:** `process.runtime.name`, `process.runtime.version`
- **Operating system:** `os.type`, `os.description`, `os.version`
- **Container (where applicable):** `container.id`
- **Service identity:** `service.name`, `service.version`,
  `deployment.environment`

Application telemetry should include at least `host.id` or `host.name` so that
application signals can be correlated with host- and infrastructure-level
telemetry.

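A contract like this is easiest to enforce when it is executable. The following
plain-Python sketch checks a resource-attribute map against a minimal version of
the contract; the `validate_resource` helper and the exact required set are
illustrative, not part of any OpenTelemetry SDK:

```python
# Hypothetical conformance check for the minimum resource contract above.
REQUIRED = {"service.name", "service.version", "deployment.environment"}
HOST_IDENTITY = {"host.id", "host.name"}  # at least one must be present


def validate_resource(attrs: dict) -> list:
    """Return a list of violations; an empty list means the contract is met."""
    problems = [f"missing attribute: {key}" for key in sorted(REQUIRED - attrs.keys())]
    if not HOST_IDENTITY & attrs.keys():
        problems.append("missing host identity: set host.id or host.name")
    return problems


print(validate_resource({
    "service.name": "billing",
    "service.version": "1.4.2",
    "deployment.environment": "production",
    "host.name": "vm-eu-42",
}))  # -> []
```

A check like this can run in deployment pipelines or as a post-deployment probe
against emitted telemetry.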
By implementing this guideline, organizations can expect to achieve:

- Improved correlation and searchability of telemetry data across systems.
- Easier analysis and troubleshooting regardless of infrastructure type.
- Consistent metadata quality without requiring every team to reinvent
  instrumentation patterns.
- Faster adoption through reusable, supported building blocks.

## Implementation

Translating these guidelines into practice requires a combination of automation,
standardized tooling, and centralized management. The implementation steps below
are written as roadmap items, with checklist-style actions that organizations
can plan and execute in sequence.

### 1. Define a baseline telemetry contract and layered configuration model

<small>Guidelines supported: 1, 3</small>

Define the minimum required telemetry contract for the organization and document
which parts of agent and SDK configuration are centrally owned versus locally
customizable. This is the foundation for consistency at scale.

Checklist:

- Define the mandatory resource attributes and signal conventions that all
  workloads must emit.
- Define the baseline agent configuration, including exporters, authentication,
  TLS, health reporting, and default processors.
- Define the allowed extension points for environment-specific and
  workload-specific customization.
- Version all baseline and overlay configurations so they can be rolled out and
  rolled back safely.
- Publish ownership boundaries so teams know what they can and cannot modify.

Documentation:

- [OpAMP Specification](https://github.com/open-telemetry/opamp-spec)
- [OpenTelemetry Semantic Conventions](/docs/specs/semconv/)

### 2. Stand up an OpAMP Management Plane for agents

<small>Guidelines supported: 1</small>

Deploy a central OpAMP management service to manage agent configuration, status
reporting, health monitoring, and controlled rollouts for supported agents.

Checklist:

- Select the supported OpAMP-capable agent distributions.
- Stand up a central OpAMP server or management endpoint.
- Register agents and enable health/status reporting.
- Define rollout rings such as development, staging, and production.
- Define rollback procedures for failed updates or bad configurations.
- Monitor management-plane health and agent connectivity.

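On the agent side, one concrete option is the OpAMP Supervisor from the
OpenTelemetry Collector Contrib project, which connects a locally installed
Collector to a central OpAMP server. The sketch below assumes that component;
the endpoint is a placeholder and field names may differ between releases:

```yaml
# supervisor.yaml: connects the local Collector to the OpAMP server.
server:
  endpoint: wss://opamp.internal.example/v1/opamp # placeholder management endpoint
capabilities:
  accepts_remote_config: true # allow centrally pushed configuration
  reports_effective_config: true
  reports_health: true
agent:
  executable: /usr/local/bin/otelcol-contrib # locally installed Collector binary
```

The supervisor then reports status upstream and applies remotely delivered
configuration, which is what makes ringed rollouts and rollbacks practical.
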
Documentation:

- [OpAMP Specification](https://github.com/open-telemetry/opamp-spec)
- [OpenTelemetry Collector Documentation](/docs/collector/)

### 3. Package and deploy standardized agents and SDK bootstrap artifacts

<small>Guidelines supported: 1, 3</small>

Use configuration management and image packaging to deliver supported telemetry
components consistently across hosts and containerized workloads.

Checklist:

- Package host-based agents as standard system services.
- Provide pre-baked images or service-container definitions for containerized
  deployments.
- Publish shared libraries, starter packages, or bootstrap wrappers for
  supported SDK languages.
- Standardize environment-variable and configuration-file conventions across
  environments.
- Validate that new workloads inherit the baseline configuration by default.

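For host-based packaging, one common pattern is a systemd unit that layers the
centrally distributed baseline configuration with a local overlay. All paths and
file names below are illustrative:

```ini
# /etc/systemd/system/otelcol.service (illustrative paths)
[Unit]
Description=OpenTelemetry Collector host agent
After=network-online.target
Wants=network-online.target

[Service]
User=otelcol
EnvironmentFile=/etc/otelcol/otelcol.env
ExecStart=/usr/local/bin/otelcol --config=/etc/otelcol/baseline.yaml --config=/etc/otelcol/overlay.yaml
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Shipping this unit and the baseline file through configuration management means
every new host starts from the same supported defaults.
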
Documentation:

- [OpenTelemetry Collector Documentation](/docs/collector/)
- [OpenTelemetry Documentation](/docs/)

### 4. Deploy an OpenTelemetry Collector Gateway layer

<small>Guidelines supported: 2</small>

Deploy one or more OpenTelemetry Collector Gateways as the central processing
and export tier. Choose a topology appropriate for the environment, such as
dedicated VMs, service pools behind a load balancer, or regional gateway nodes.

Checklist:

- Select the gateway deployment topology for each environment.
- Define how local agents discover and connect to gateways.
- Configure processors for batching, memory protection, enrichment, retry, and
  routing.
- Separate lightweight ingest from heavier centralized processing where scale
  requires it.
- Define high-availability and failover behavior for gateways.
- Validate end-to-end routing to observability backends.

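A gateway-tier pipeline typically places memory protection first, then
enrichment, then batching before export. The backend endpoint and the
environment value in this sketch are placeholders:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317 # agents connect here
processors:
  memory_limiter: # should run first so overload is shed safely
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20
  resource/enrich:
    attributes:
      - key: deployment.environment
        value: production # placeholder
        action: upsert
  batch:
    send_batch_size: 8192
    timeout: 5s
exporters:
  otlphttp:
    endpoint: https://telemetry-backend.example.com # placeholder backend
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource/enrich, batch]
      exporters: [otlphttp]
```

Because all agents converge here, changing a policy in this one pipeline changes
it for the whole estate.
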
Documentation:

- [OpenTelemetry Collector Documentation](/docs/collector/)
- [OpenTelemetry Collector Configuration](/docs/collector/configuration/)

### 5. Enforce resource attribution and correlation standards

<small>Guidelines supported: 1, 3</small>

Ensure that all telemetry includes the required metadata for correlation across
infrastructure and application layers.

Checklist:

- Publish the minimum required resource attribute set for hosts, processes,
  runtimes, operating systems, and containers where applicable.
- Ensure application telemetry includes at least `host.id` or `host.name`.
- Enable resource detection and enrichment wherever supported.
- Validate emitted telemetry against the standard attribute contract.
- Add conformance checks to deployment pipelines or post-deployment validation
  steps.

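Much of the required metadata can be filled in automatically on the collection
side. For example, the Collector Contrib `resourcedetection` processor can
populate host and OS attributes; this is a sketch, and the available fields vary
by version:

```yaml
processors:
  resourcedetection:
    # env reads OTEL_RESOURCE_ATTRIBUTES; system detects host metadata.
    detectors: [env, system]
    system:
      resource_attributes:
        host.id:
          enabled: true # emit host.id in addition to host.name
```

Automatic detection reduces the number of attributes each application team must
set by hand, leaving mainly service identity to the workload itself.
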
Documentation:

- [OpenTelemetry Semantic Conventions](/docs/specs/semconv/)
- [OpenTelemetry Resource Semantic Conventions](/docs/specs/semconv/resource/)

### 6. Centralize governance, policy enforcement, and change management

<small>Guidelines supported: 2, 3</small>

Use the Collector Gateway layer and centrally owned configuration to enforce
organization-wide rules for processing, routing, and exporting telemetry.

Checklist:

- Define approved exporters and backend destinations.
- Centralize redaction, filtering, enrichment, and routing policies.
- Define standard retry, batching, and sampling policies.
- Establish an exception process for workloads that need non-default behavior.
- Review telemetry quality and policy compliance regularly.

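Such policies can be expressed directly in the gateway configuration. As a small
sketch, the Collector `attributes` processor can drop sensitive attributes
before export; the attribute names here are examples only:

```yaml
processors:
  attributes/redact:
    actions:
      - key: user.email # example sensitive attribute
        action: delete
      - key: http.request.header.authorization
        action: delete
```

Because this runs in the centrally owned gateway tier, the redaction policy is
applied uniformly regardless of which team produced the data.
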
Documentation:

- [OpenTelemetry Collector Documentation](/docs/collector/)
- [OpenTelemetry Semantic Conventions](/docs/specs/semconv/)

## Reference architectures

The patterns described above have been successfully implemented by the following
end users:

- [Reference Architecture 1][]
- [Reference Architecture 2][]