OpenTelemetry Collector: The Observability Pipe You’re Going To Be On-Call For

Sun, Jun 21, 2026 · 11 min read

The pager goes off because the app is sick. You start the usual dance: open the traces, pull the logs, check the deploy timeline. Except the traces are thin, the logs are lagging by four minutes, and the dashboards that should be screaming are politely flat. For a second you wonder if the incident is actually over. It isn’t. Your telemetry pipeline is the thing that’s down, and it’s been quietly dropping spans since the memory pressure started — which is exactly when you needed it most.

There is no worse moment to learn that your observability pipe is production infrastructure than during the incident it was supposed to explain.

This is not a post about OpenTelemetry being a mistake. It’s the opposite. OpenTelemetry won — and winning is precisely why the Collector stopped being a harmless little sidecar and became a data plane you are now responsible for operating.

OTel won, and that’s the problem

In May 2026 the CNCF announced that OpenTelemetry graduated. The project framed it right — graduation is a milestone, not a finish line — but the milestone is real. OTel was formed in 2019 as a CNCF-assisted merger of OpenTracing and OpenCensus, to do one specific, valuable thing: standardize how you collect, process, and export metrics, logs, and traces so you can change backends without re-instrumenting your entire codebase. That is a hard, genuinely good goal, and they mostly pulled it off.

The numbers behind graduation are not niche-project numbers. More than 12,000 contributors from more than 2,800 companies. The JavaScript API package crossed 1.36 billion downloads in the prior twelve months; the Python API package passed 1.3 billion, both setting monthly records in April 2026. Graduation came with a third-party independent security audit and reviews of core components, including the Collector itself.

Hit that kind of adoption and something subtle happens: the project becomes boring. Boring infrastructure is the most dangerous kind, because boring is what you stop budgeting attention for. Nobody writes a runbook for the thing everyone assumes just works. OTel is now boring the way DNS and load balancers are boring — load-bearing, and you find out how the hard way.

The Collector is not an agent. It’s a data plane.

Here is the mental shift I had to make, and that a lot of teams still haven’t. The Collector started life in most people’s heads as “the thing that scrapes telemetry and forwards it.” An agent. A pipe. Install once and forget. That framing was accurate in 2020 and is dangerously out of date now.

What actually flows through a mature Collector deployment today: metrics, logs, and traces, sure. But also routing — this telemetry to the cheap long-term store, that one to the expensive queryable backend. Tail sampling, where you hold traces in memory and decide after the fact which are worth keeping. Transformations and filtering through OTTL. Enrichment with Kubernetes metadata, so a raw span becomes “the checkout pod in the payments namespace.” Redaction of fields you legally cannot ship to a vendor. Export to one or more backends, each with its own auth and its own bill. Cost control before the invoice surprises you. And increasingly, new signals: continuous profiles and AI agent telemetry.
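To make the tail sampling point concrete, here is a minimal sketch in Python of the idea, not the Collector's actual processor. The Span and TailSampler names, the 500 ms threshold, and the keep-if-error-or-slow policy are all assumptions chosen for illustration. What it shows is why this work costs memory: every span has to sit in a buffer until its trace is judged.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    name: str
    duration_ms: float
    is_error: bool = False


class TailSampler:
    """Toy tail sampler: buffers spans per trace and decides once the trace is done."""

    def __init__(self, slow_threshold_ms: float = 500.0):
        self.slow_threshold_ms = slow_threshold_ms
        self._buffer = defaultdict(list)  # trace_id -> spans held in memory

    def add(self, span: Span) -> None:
        self._buffer[span.trace_id].append(span)

    def buffered_spans(self) -> int:
        # This is the number that grows when traffic spikes.
        return sum(len(spans) for spans in self._buffer.values())

    def finish(self, trace_id: str) -> list:
        """Keep the whole trace if any span errored or was slow; otherwise drop it."""
        spans = self._buffer.pop(trace_id, [])
        keep = any(
            s.is_error or s.duration_ms >= self.slow_threshold_ms for s in spans
        )
        return spans if keep else []


sampler = TailSampler(slow_threshold_ms=500.0)
sampler.add(Span("t1", "GET /cart", 40.0))
sampler.add(Span("t1", "db.query", 12.0))
sampler.add(Span("t2", "POST /checkout", 80.0))
sampler.add(Span("t2", "payment.charge", 950.0, is_error=True))

kept_t1 = sampler.finish("t1")  # fast and healthy: dropped
kept_t2 = sampler.finish("t2")  # contains an error: kept in full
```

The decision is cheap. The holding is what costs you, and it scales with traffic and with how long you wait before calling a trace complete.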

That is not an agent. That is a data plane. Every byte of visibility your on-call engineer relies on passes through it, gets reshaped by it, and depends on it being healthy — so when it is unhealthy, you are blind in the exact dimension you need most. I’ve argued before that I don’t want more dashboards; I want answers during an incident. The cruel joke is that the pipe feeding those answers is itself a system that can fail, and when it fails it fails silently.

The survey tells the story

I’m not extrapolating from one bad night. The official OpenTelemetry Collector follow-up survey, analyzed in January 2026, describes a fleet that looks an awful lot like production infrastructure and is treated like a hobby.

Start with scale. 65% of respondents run more than ten Collectors in production. Not one, not a tidy pair for redundancy. More than ten. Kubernetes is the dominant home at 81% of deployment locations, VMs climbed to 51%, bare metal sat at 18%, and a meaningful slice (about half the large deployments) runs both Kubernetes and VMs at once. Inside Kubernetes the topology fragments further: roughly 58% run a gateway, 50% a DaemonSet, 23% a sidecar, 14% a StatefulSet, often several at once, in layers. That is already a distributed system, stitched into one pipeline.

Now the uncomfortable part. 46% of respondents build their own Collector rather than running an off-the-shelf one. Among those, 86% use the Collector Builder — but only 39% confidently agreed it is easy to use; the rest were neutral or found it hard. Nearly half the field is compiling a custom binary, with a tool most don’t find easy, and shipping it into production data paths.

And then: about 23% reported not monitoring their Collectors at all — roughly one in four teams running the thing that carries all their telemetry, not watching the thing itself. Of those that do, 83% gather metrics, 61% logs, and only 25% traces — blind spots even where monitoring exists.
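For a sense of how little it takes to start watching the thing itself, here is a rough sketch that reads Prometheus-style text from a Collector's own metrics endpoint and computes an export failure ratio. The sample text is made up, and the metric names are assumptions modelled on the Collector's internal telemetry; check what your version actually emits. The parser is deliberately naive and does not handle timestamps.

```python
def parse_metrics(text: str) -> dict:
    """Sum sample values per metric name, ignoring labels and comment lines."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is whatever follows the last space on the line.
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


def export_failure_ratio(totals: dict) -> float:
    """Failed spans divided by all spans the exporters attempted to send."""
    sent = totals.get("otelcol_exporter_sent_spans", 0.0)
    failed = totals.get("otelcol_exporter_send_failed_spans", 0.0)
    attempted = sent + failed
    return failed / attempted if attempted else 0.0


# Hypothetical scrape output, not taken from a real Collector.
SAMPLE = """
# TYPE otelcol_exporter_sent_spans counter
otelcol_exporter_sent_spans{exporter="otlp"} 9000
otelcol_exporter_send_failed_spans{exporter="otlp"} 1000
otelcol_processor_refused_spans{processor="memory_limiter"} 250
"""

totals = parse_metrics(SAMPLE)
ratio = export_failure_ratio(totals)
refused = totals.get("otelcol_processor_refused_spans", 0.0)
```

In practice you would alert on these in your metrics backend rather than script them. The point is only that the signals exist, and a quarter of teams are not looking at them.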

The improvement requests are the tell. Configuration management and resolution topped the list at around 63%, then stability at 52%, Collector observability at 43%, more receiver and exporter support at 29%. People specifically asked for safer pipeline-level reconfiguration — changing one pipeline without restarting or disturbing the whole Collector. That is the request of an operator who has caused an outage by editing a config, and who learned in production that this is infrastructure.

Where it actually hurts

“It’s a data plane” stays abstract until you’ve been paged for one. The concrete failure modes are these.

Config reload blast radius. Change one exporter and the Collector restarts the whole process; for a few seconds every pipeline — including the ones you didn’t touch — drops data. One edit should not be able to blind every signal at once, which is exactly what the survey respondents were begging to fix.
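One way to see the gap between what you changed and what gets disturbed is to diff two configs and ask which pipelines actually reference a changed component. The sketch below assumes the config has already been parsed into a dict with the usual receivers, processors, exporters, and service.pipelines sections. The function name, the example config, and the endpoints are made up for illustration.

```python
import copy

SECTIONS = ("receivers", "processors", "exporters")


def affected_pipelines(old: dict, new: dict) -> set:
    """Return the pipelines that reference any component whose settings changed."""
    changed = set()
    for section in SECTIONS:
        old_c, new_c = old.get(section, {}), new.get(section, {})
        for name in set(old_c) | set(new_c):
            if old_c.get(name) != new_c.get(name):
                changed.add((section, name))

    old_p = old.get("service", {}).get("pipelines", {})
    new_p = new.get("service", {}).get("pipelines", {})
    hit = set()
    for pname in set(old_p) | set(new_p):
        if old_p.get(pname) != new_p.get(pname):
            hit.add(pname)  # the pipeline wiring itself was added, removed, or edited
            continue
        for section in SECTIONS:
            if any((section, c) in changed for c in new_p[pname].get(section, [])):
                hit.add(pname)
    return hit


old_cfg = {
    "receivers": {"otlp": {"protocols": {"grpc": {}}}},
    "processors": {"batch": {}},
    "exporters": {
        "otlphttp/vendor": {"endpoint": "https://old.example.invalid"},
        "debug": {},
    },
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["otlphttp/vendor"],
            },
            "logs": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["debug"],
            },
        }
    },
}

new_cfg = copy.deepcopy(old_cfg)
new_cfg["exporters"]["otlphttp/vendor"]["endpoint"] = "https://new.example.invalid"

touched = affected_pipelines(old_cfg, new_cfg)  # only "traces" uses the edited exporter
```

Logically, only the traces pipeline was touched. If the reload still interrupts logs, the difference between those two sets is your blast radius, and it is worth printing in the PR that carries the change.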

Backpressure and memory. Tail sampling and batching hold telemetry in memory. When the backend slows or volume spikes, memory climbs; hit the limit and you OOM or start dropping. The memory_limiter exists for exactly this — but it means your worst data loss happens during your worst incidents, when load is highest.
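A toy simulation makes the timing problem visible. This is not how the memory_limiter is implemented; it is a sketch of a buffer with a soft limit that refuses new batches once it is full. The class name, the byte sizes, and the drain rates are all invented for the example.

```python
class BoundedBuffer:
    """Toy in-memory buffer that refuses batches once a soft limit would be exceeded."""

    def __init__(self, soft_limit_bytes: int):
        self.soft_limit_bytes = soft_limit_bytes
        self.used_bytes = 0
        self.accepted = 0
        self.refused = 0

    def offer(self, batch_bytes: int) -> bool:
        if self.used_bytes + batch_bytes > self.soft_limit_bytes:
            self.refused += 1  # this batch is lost unless the sender retries
            return False
        self.used_bytes += batch_bytes
        self.accepted += 1
        return True

    def drain(self, n_bytes: int) -> None:
        # The backend accepted n_bytes during this tick.
        self.used_bytes = max(0, self.used_bytes - n_bytes)


def simulate(arrivals, drain_per_tick: int, soft_limit_bytes: int) -> BoundedBuffer:
    buf = BoundedBuffer(soft_limit_bytes)
    for batch_bytes in arrivals:
        buf.offer(batch_bytes)
        buf.drain(drain_per_tick)
    return buf


# Same inbound traffic in both runs; only the backend's speed differs.
calm = simulate([100] * 10, drain_per_tick=100, soft_limit_bytes=500)
incident = simulate([100] * 10, drain_per_tick=20, soft_limit_bytes=500)
```

With a healthy backend nothing is refused. Slow the backend to a fifth of its speed and the same traffic starts bouncing off the limit after a few ticks. The limit is doing its job, which is to keep the process alive, and the price is data.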

Cardinality. A well-meaning developer adds a label carrying the user ID. Cardinality explodes, the Collector’s memory and your backend’s bill both climb, and the pipe is straining because of a one-line change three teams away that nobody reviewed as an infrastructure change.
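The arithmetic is worth doing once. The worst-case number of time series for a metric is the product of the distinct values of each label. The label names and counts below are illustrative assumptions, and real traffic rarely hits every combination, so treat the result as an upper bound.

```python
from math import prod


def series_upper_bound(distinct_values_per_label: dict) -> int:
    """Worst-case series count: the product of distinct values across all labels."""
    return prod(distinct_values_per_label.values())


before = {"service": 20, "endpoint": 50, "status_code": 5}
after = {**before, "user_id": 100_000}  # the one-line change

series_before = series_upper_bound(before)  # 5_000
series_after = series_upper_bound(after)    # 500_000_000
growth = series_after // series_before      # 100_000
```

One extra label multiplies the ceiling by the number of users. Nothing else in the system changed.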

Component maturity. Not every receiver, processor, and exporter is equally mature; the survey flagged missing components and documentation gaps directly. When you build a custom Collector you own that supply chain, and I’ve written enough about supply chain risk to be wary of a self-compiled binary full of third-party components that sits in the critical path and holds credentials for every backend.

Secrets and egress. That Collector holds API keys for every backend and opens outbound connections to vendor endpoints — a fat target with credentials and network reach that almost never gets the security review it deserves.

Upgrades. It’s OTel; it moves fast. Configs drift, components get deprecated, behavior changes between versions. If you’re not pinning versions and testing upgrades, a single pull of the latest image tag can change your sampling behavior or break a pipeline at the worst time.

The vendor-neutral trade nobody priced in

Here’s the deal OpenTelemetry actually offered, stated honestly. It reduces instrumentation lock-in, and that is real and valuable. You instrument once against an open standard, and switching backends becomes a config change instead of a six-month re-instrumentation project. I argued in my New Relic vs Datadog comparison that OSS observability is never “free Datadog,” and this is the same lesson from a different angle: you didn’t delete the lock-in, you moved it.

You moved it into your telemetry pipeline. The instrumentation got portable; the operational responsibility got concentrated. Your visibility used to be coupled to a vendor’s agent the vendor operated; now it’s coupled to a Collector fleet you operate. That is a trade I’d take every time — but only knowing I’m taking it. The teams that get burned heard “vendor-neutral” and assumed it meant less work. It means different work, on your side of the line.

The vendors know which way the wind blows. In April 2026, AWS put native OpenTelemetry metrics support for CloudWatch into public preview — send metrics straight over OTLP, no custom conversion glue, query with PromQL. It was region-limited and should be treated as preview, not a stable universal default. But even the hyperscalers are now adapting to OTLP rather than fighting it. The protocol won, which only raises the stakes on the pipe that speaks it.

More signals are coming, and they’re not all ready

The Collector’s job is only getting busier. In March 2026, OpenTelemetry Profiles entered public Alpha — continuous production profiling as a first-class signal alongside traces, metrics, and logs. The eBPF profiling agent runs as a Collector receiver: the Collector can receive profiles, enrich them with Kubernetes metadata via the k8sattributesprocessor, and transform or filter them with OTTL. Profile samples can carry trace_id and span_id, so you jump from a slow span to the exact code that was on-CPU. That’s genuinely exciting, and it fits the pattern from eBPF eating Kubernetes’ iptables plumbing: more of our observability is moving into the kernel and flowing out through the Collector.

But it’s Alpha. The project says plainly it should not be used for critical production workloads yet — exactly the right warning, and exactly the kind that gets ignored by someone who saw a demo and wants the feature. Another signal class, another receiver, more memory pressure, and maturity that hasn’t caught up with enthusiasm.

The same story is starting with AI agents. The GenAI special interest group has been building semantic conventions for LLM, vector database, and agent telemetry — though that work moved to a dedicated GenAI repository and the older guidance page is no longer maintained, so don’t assume everything there is stable yet. The operational logic doesn’t wait for the spec: if your AI agent telemetry flows through the Collector, then prompts, completions, and session data flow through the Collector. That telemetry is useful — a feedback loop for evaluation and quality, not just troubleshooting — but it’s also expensive, high-cardinality, and full of things you do not want sprayed at every backend. Cost controls, sampling, privacy redaction, and ownership aren’t optional for that traffic. They’re the entry fee.

How I’d actually run it

So treat the Collector like what it is. The operating model I’d put in place before the pipe teaches the team during an incident:

Name an owner. A team, on a page, accountable for the fleet. “Everyone owns it” means nobody does — the survey’s 23% who don’t monitor it are what that looks like.
Version the Collector and its config like production code. Pin versions, review config changes in PRs. The config is infrastructure-as-code, even when it lives in a ConfigMap nobody diffed.
Monitor the Collector explicitly. Its own metrics, logs, and health: queue depth, dropped spans, memory_limiter activity, exporter failures. If you can’t see the pipe straining, you find out when it bursts.
Decide in advance what can be dropped under pressure. Debug traces? Verbose logs? Make that a deliberate policy, not an accident of which component OOMs first.
Isolate critical pipelines from experimental ones. Don’t let the new Alpha profiling receiver share a process with the pipeline carrying your production error rate. Separate Collectors, separate blast radii.
Treat custom builds as release artifacts. If you’re in the 46% compiling your own, that binary needs the same provenance, scanning, and version discipline as anything else you ship.
Test pipeline changes before production. A staging Collector on representative traffic catches the cardinality bomb and the broken exporter before your on-call engineer does.
Control high-cardinality labels and AI payloads before they hit the bill. Drop or aggregate at the Collector — the cheapest place to enforce cost. The vendor invoice is the most expensive place to discover you didn’t.
Keep rollback simple. One known-good config and version you can revert to in seconds. An incident is not when you debug a Collector config from scratch.
Don’t send secrets, prompts, or session data blindly to every backend. Redact at the Collector. Egress to a vendor is a data-sharing decision whether or not anyone treated it like one.
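The last few items, dropping high-cardinality labels and redacting sensitive fields before egress, come down to a policy applied to every record's attributes. In a real Collector you would express this in processor config; the Python below is only a sketch of the policy's shape. The attribute keys are hypothetical and not taken from any semantic convention.

```python
# Hypothetical policy: keys to remove entirely, and keys whose values get masked.
DROP_KEYS = frozenset({"user.id", "session.id"})
REDACT_KEYS = frozenset({"llm.prompt", "llm.completion", "http.authorization"})


def scrub(attributes: dict, drop_keys=DROP_KEYS, redact_keys=REDACT_KEYS) -> dict:
    """Return a copy with high-cardinality keys dropped and sensitive values masked."""
    clean = {}
    for key, value in attributes.items():
        if key in drop_keys:
            continue  # gone before it can multiply series or reach a vendor
        clean[key] = "[REDACTED]" if key in redact_keys else value
    return clean


span_attributes = {
    "service.name": "checkout",
    "user.id": "u-8841",
    "llm.prompt": "Summarise this customer's order history",
    "http.authorization": "Bearer abc123",
    "http.status_code": 200,
}

safe = scrub(span_attributes)
```

Keeping the policy as two explicit sets makes it reviewable: adding a key to either one is a one-line diff that someone can approve as a data-sharing decision.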

None of this is exotic. It’s the same discipline you already apply to any production data path — load balancers, message queues, ingress. The Collector earned a spot on that list; most org charts just haven’t updated to reflect it.

The pipe is infrastructure. Act like it.

OpenTelemetry graduating is good news. A vendor-neutral telemetry standard is something our industry needed and spent years failing to produce, and I’d recommend it without hesitation.

But “adopt OpenTelemetry” and “stand up a Collector and forget about it” are very different decisions, and too many teams make the first while accidentally doing the second. The operational weight didn’t disappear — it relocated, into a pipeline teams consistently under-own.

So here’s the test. If losing this pipe blinds your on-call engineer in the middle of an incident, it is production infrastructure. Full stop. There is no version of that sentence where it’s still a sidecar. Give it an owner, an SLO, a rollout strategy, monitoring, and a runbook now — while it’s calm — instead of letting it explain all of that to you at 2 a.m., right after it stops working.