I Don’t Want More Dashboards. I Want Answers.

It is 2:14 a.m. The page fires. Checkout error rate is up, the on-call channel is already three messages deep, and someone has dropped a link to a dashboard. I open it. Forty-two panels. Twelve of them are red, which tells me nothing, because nine of those twelve are red on a good day and nobody has touched the thresholds since 2022. I scroll. I find a latency graph that is clearly angry. It does not tell me why. It does not tell me what changed at 2:09. It does not tell me which of our seventeen downstream dependencies is the actual problem. It just confirms, in beautiful gradients, that we are having a bad night.

I already knew that. The pager told me.

This is the part of observability nobody wants to admit out loud: we have gotten extremely good at producing surfaces and extremely bad at producing conclusions. We have wall after wall of charts, and during the one moment they are supposed to earn their keep — a real incident, real money leaking, real customers affected — they mostly make me feel busy while I do the actual diagnosis somewhere else, in my head and in a terminal.

How dashboards became the whole job

It is worth being fair about how we got here, because the dashboard did not become the default artifact of observability by accident or by stupidity. It became the default because it is the easiest thing to build and the easiest thing to show.

A panel is a query plus a visualization. The barrier to creating one is almost zero, which is wonderful and is also the entire problem. Every new service ships with a “starter dashboard.” Every postmortem ends with someone adding three panels so “we’ll see it next time.” Every tool — Grafana, Datadog, CloudWatch, the dashboards you build on top of Prometheus and OpenTelemetry — makes creation frictionless and deletion socially awkward. Nobody gets praised for removing a graph.

And dashboards demo well. You cannot put “we correlated a deploy with a dependency regression and shipped a rollback in four minutes” on a slide as cleanly as you can put a glowing grid of sparklines. So the grid is what gets funded, copied, and multiplied. Telemetry volume becomes a proxy for maturity. More metrics, more traces, more panels — surely that means we understand our systems better.

It does not. It means we can see more. Seeing is not understanding, and it is definitely not deciding.

The actual problem is sprawl

I want to be precise here, because this is where people hear “dashboards are bad” and stop listening. Dashboards are not bad. A dashboard is a fine starting point — a known place to begin looking. The problem is what happens when starting points multiply without discipline.

Dashboard sprawl looks like this. You have hundreds of dashboards and no idea which are authoritative. Half were built for a launch that shipped two years ago. Three different teams have a “service health” board for the same service and they disagree. Panels have no owner, so thresholds rot, queries silently break when a label changes, and nobody notices because nobody is accountable for that specific square of glass. Worst of all, the signals are not attached to any decision. A graph goes red and the honest answer to “so what do we do?” is a shrug.

That last one is the real disease. A signal with no decision attached to it is not observability. It is decoration. If I cannot tell you what action a panel is supposed to trigger, that panel is costing you attention at 2 a.m. and giving you nothing back. During an incident, every unowned, undecidable panel is noise wearing the costume of insight.

What I actually need at 2 a.m.

When I am holding the pager, I do not need more telemetry. The telemetry almost always already exists. I need it assembled into answers to a very short list of questions, in roughly this order.

What changed. In my experience, most incidents trace back to a change: a deploy, a config flip, a feature flag, a migration, a traffic shift, a certificate that finally expired. The single most valuable thing any platform can show me is a timeline of changes lined up against the timeline of symptoms. Not in a separate tool I have to remember exists. Right there, next to the graph that is on fire.
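To make "changes lined up against symptoms" concrete, here is a minimal sketch of the lookup I mean. It assumes change events have already been collected into one list; the `ChangeEvent` type, its fields, and the 30-minute window are all hypothetical choices for illustration, not any vendor's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical record of one change: a deploy, flag flip, config edit, and so on."""
    at: datetime
    kind: str
    service: str
    ref: str


def changes_before(symptom_start, events, window=timedelta(minutes=30)):
    """Return changes that landed in the window before the symptom, newest first."""
    earliest = symptom_start - window
    hits = [e for e in events if earliest <= e.at <= symptom_start]
    # Newest first: the change closest to the symptom is usually the first suspect.
    return sorted(hits, key=lambda e: e.at, reverse=True)
```

Proximity in time is a lead, not proof. The point is only that the list of suspects should be one function call away from the graph, not three tools away.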

What is affected, and how badly. Not “errors are up.” Which endpoints, which customers, which regions, what fraction of traffic. Blast radius is the difference between “wake up the VP” and “fix it before standup.”
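A rough sketch of what "which regions, what fraction" means in practice. It assumes request records are plain dicts with a numeric `status` field plus whatever dimension you slice by; those field names, and treating any 5xx as an error, are assumptions for the example.

```python
from collections import Counter


def blast_radius(requests, dimension):
    """Error rate per value of `dimension`, worst first.

    Returns (value, error_rate, total_requests) tuples.
    """
    totals, errors = Counter(), Counter()
    for r in requests:
        key = r[dimension]
        totals[key] += 1
        if r["status"] >= 500:  # assumption: 5xx is what counts as an error here
            errors[key] += 1
    rows = [(k, errors[k] / totals[k], totals[k]) for k in totals]
    # Sort by error rate descending, then by name so the order is stable.
    return sorted(rows, key=lambda row: (-row[1], row[0]))
```

Run it once per dimension (endpoint, region, customer tier) and the answer to "how bad" stops being a single red number.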

Which dependency is sick. Modern systems fail sideways. My service is red because its database is slow because a noisy neighbor is saturating a shared disk. I should not have to reconstruct that chain by hand, hopping between four tools, at the exact moment I am least equipped to do careful detective work. The topology already knows the dependency graph. Use it.
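The chain-walking I keep doing by hand can be sketched in a few lines. This assumes the topology is available as a simple adjacency map (service to the things it depends on) and that something else has already marked which nodes look unhealthy; the function name and both inputs are hypothetical.

```python
def sick_roots(graph, unhealthy, start):
    """Follow unhealthy dependencies down from `start`.

    Returns the unhealthy nodes that have no unhealthy dependency of their
    own, which makes them the most likely places the trouble originates.
    """
    roots, seen, stack = [], set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        sick_deps = [d for d in graph.get(node, []) if d in unhealthy]
        if node in unhealthy and not sick_deps:
            roots.append(node)
        # Only descend through unhealthy edges; healthy branches are not suspects.
        stack.extend(sick_deps)
    return roots
```

With a graph where checkout depends on payments-db, which depends on shared-disk, and all three are unhealthy, this returns only shared-disk. That is the one line I want to read at 2 a.m.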

What is the safest next action. Roll back the 2:09 deploy? Drain a node? Fail over? Shed load? I am not asking the platform to decide for me. I am asking it to lay out the options and the evidence so I can decide in seconds instead of minutes.

Notice that none of these are “show me another chart.” They are investigation, correlation, context, and judgment. That is the work. The charts are raw material for the work, not the work itself.

On AI, carefully

So naturally everyone is now bolting an assistant onto the dashboard, promising it will answer exactly these questions. I want this to be good. I am also deeply suspicious, and you should be too.

An AI summary that is grounded in real telemetry, real deployment history, real service topology, and the actual runbook is genuinely useful. Consider: “Error rate rose at 2:11, two minutes after deploy a4f1c to the payments service, which added a call to a dependency now showing a p99 of 4 seconds; the runbook for this pattern says roll back.” That is the thing. That is a junior engineer’s first hour of investigation, done in three seconds, with its sources attached so I can verify it.

An AI summary that is not grounded in those things is prettier guessing. It will write a confident, well-punctuated paragraph that sounds like an answer and points me at the wrong service. At 2 a.m., a confident wrong answer is worse than no answer, because I will spend ten minutes chasing it before my own skepticism catches up. The test is simple and unforgiving: can it show its work? If the assistant cannot cite the specific deploy, the specific trace, the specific dependency it is blaming, it is autocomplete with a monitoring logo. Ungrounded summarization does not reduce my cognitive load during an incident. It adds a plausible liar to the war room.
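The "show its work" test can be mechanical. Here is a minimal sketch, assuming the assistant returns its citations as structured records with a `kind` and an `id`, and that you can build a lookup of identifiers you actually have on file. Both shapes are assumptions for illustration; no real assistant's output format is implied.

```python
def ungrounded_citations(citations, known):
    """Return the citations that do not match any record we actually hold.

    `known` maps a citation kind ("deploy", "trace", ...) to a set of real ids.
    """
    return [c for c in citations if c["id"] not in known.get(c["kind"], set())]


def is_grounded(citations, known):
    """A summary passes only if it cites something and every citation checks out."""
    return bool(citations) and not ungrounded_citations(citations, known)
```

This only proves the cited things exist, not that the reasoning connecting them is right. It is a floor, but it is a floor that filters out the confident paragraph with nothing behind it.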

A better operating model

The fix is not a new product category. It is restraint plus correlation.

Restraint means treating dashboards like code that has to be owned and pruned. The best dashboards I have ever used are boring: small, maybe five to eight panels, each one owned by a named team, each one answering a specific question, each one with a clear “if this looks like that, do this” attached. Boring is the compliment. A dashboard you fully understand before the second coffee is a dashboard that works at 2 a.m. Delete the rest. If a panel has not informed a decision in six months, it is not load-bearing, it is clutter, and clutter is not free.
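If dashboards are code, pruning can be a lint rule. A sketch under loud assumptions: panel metadata is exported as dicts with `title`, `owner`, and a `last_decision` date that someone records when a panel actually informs an action. None of those fields come from a real tool; you would have to track them yourself.

```python
from datetime import date, timedelta


def prune_candidates(panels, today, max_idle=timedelta(days=182)):
    """List panels that fail the ownership or six-month usefulness test.

    Returns (title, reasons) tuples; an empty list means nothing to prune.
    """
    flagged = []
    for p in panels:
        reasons = []
        if not p.get("owner"):
            reasons.append("no owner")
        last = p.get("last_decision")
        if last is None or today - last > max_idle:
            reasons.append("no decision in window")
        if reasons:
            flagged.append((p["title"], reasons))
    return flagged
```

The output is a list of candidates for a human to review, not an automatic delete. The hard part is not the code, it is agreeing to record when a panel earned its place.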

Correlation means investing the effort where it actually pays: stitching changes, topology, and impact together so the platform participates in the investigation instead of just hosting graphs. Spend less energy adding the four-hundredth metric and more energy making the deploy markers, the dependency map, and the customer-impact view show up in the same place at the same time.

None of this is exotic. The telemetry is already there. We have just been pointing our craftsmanship at the wrong layer — at producing more views instead of producing more answers.

So here is where I have landed after enough bad nights. The dashboard is where an investigation starts. It is the front door, not the destination. The moment it becomes the place we expect insight to live, the wall we stare at hoping understanding will condense out of the gradients, we have quietly given up on the actual job. I do not want more dashboards. I have plenty. I want the answer to “what changed, what broke, who is affected, and what do we do next,” and I want it before the customers tell me.