The Problem With Changing Tools Too Often
There is a particular kind of meeting I have sat in too many times. A team is frustrated with a tool. Something broke, or the config is ugly, or a louder team down the hall is using something newer, and someone says the words: “maybe we should just switch to X.” And everyone nods, because switching feels like progress, and nobody in the room wants to be the person defending the thing that’s annoying them today.
I used to be the person nodding. After enough years and enough 3 a.m. pages, I have become the person who asks the boring question instead: what, specifically, does the new tool remove? Not “is it better.” What class of operational pain does it delete, and is that worth paying for the migration twice — once to build it, once to relearn how it fails?
Most of the time, honestly, the answer is nothing. We are not solving a problem. We are bored, or we are tired of a problem we never actually understood.
This is not an argument against new tools
I want to be clear up front, because this topic attracts a lazy reading. I am not telling you to freeze your stack in amber and run the same thing until you retire. New tools are good. Some of them are genuinely transformative. When Helm 4 shipped OCI-first distribution and finally fixed CRD lifecycle, that removed a real, recurring, blood-pressure-raising failure I’d lived with for years. That’s a reason. When a tool kills an entire category of pages, adopt it and don’t look back.
The problem isn’t adopting new tools. The problem is replacing working tools for reasons that evaporate the moment you say them out loud: hype, boredom, aesthetics, conference envy, or frustration with a problem the team never sat down and properly diagnosed. Those five produce the same migration as a real reason does (same cost, same risk, same two-week tail of weirdness) and they buy you nothing.
What you’re actually throwing away
The thing nobody costs out is operational knowledge, because it doesn’t live in a repo. It lives in people’s heads, and it is enormous.
When you’ve run a tool for three years, you know its failure modes the way you know the sounds your car makes. You know that this error message actually means that unrelated thing. You know which flag is a trap. You know the one config block everyone copies wrong. You know, without thinking, where to look first when it misbehaves at 2 a.m., because you’ve been there before and you remember the shape of it. That knowledge is the single most valuable thing your team owns about that tool, and it is completely non-transferable. Switch tools and it goes to zero overnight.
So you swap a tool you understand for one you don’t, and the first thing that happens, guaranteed, is that you start rediscovering failures. Every mature tool’s “ugly” parts are scar tissue. They look ugly precisely because the team has already found the sharp edges and filed them down with workarounds, comments, and runbook entries. The new tool looks clean and elegant by comparison, and a lot of that cleanliness is an illusion: you haven’t found its sharp edges yet. They’re there. You’ll find them the hard way, in production, on a night you didn’t budget for.
I made exactly this trade once, early on, swapping a config-management setup the team knew cold for something newer and prettier. The syntax migration took a sprint. The knowledge migration took the better part of a year of small, stupid, avoidable incidents — each one a problem we’d already solved in the old tool, now solved again in a new interface. We didn’t fix anything. We just re-paid for things we already owned.
The duplicate-stack tax
There’s a second cost that’s easier to see and just as expensive: you rarely migrate cleanly. You migrate partially, and now you run two of everything.
This is how teams end up with half their services behind one reverse proxy and half behind another, both alive, both in the on-call rotation — a state I’ve watched play out more than once and tried to talk people out of in the Caddy/Nginx/Traefik post. It’s how you end up running Helm and Kustomize for the same kind of workload because the migration stalled at 60%. It’s how a fleet ends up half Ansible and half something else, with two mental models, two sets of secrets handling, two ways everything can break. Every “we’re migrating” that never finishes leaves you carrying both tools forever, and the second tool isn’t half the cost of the first. It’s full price, plus the cost of the seam between them.
Now your people have to be fluent in both. Your runbooks have to cover both. Your new hire has to learn both before they can take a shift. And the person paged at 3 a.m. has to first figure out which stack this particular service is even on before they can start fixing it. That hesitation — that extra thirty seconds of “wait, is this the old way or the new way” — is exactly the tax you pay, at the exact moment you can least afford it.
The question that actually filters
So when a team brings me a tool change, I’ve stopped asking “is the new one better?” Better is cheap. Everything is better at something; that’s how it got built and marketed. The question that actually does work is narrower and meaner:
What measurable operational pain does this remove, and is it worth the migration debt?
Measurable is the load-bearing word. “It feels cleaner” is not measurable. “The HCL is annoying” is not measurable. “We page on this specific failure twice a month and the new tool makes that failure structurally impossible” — that’s measurable, and that’s a reason. I walked through this exact logic with IaC in the Terraform vs OpenTofu vs Pulumi post: all three will build the same VPC. I stay on the one I know not because the others are worse, but because nothing about my work has made the migration pay for itself. Boredom is not a migration budget.
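One way to make “measurable” concrete is a back-of-the-envelope break-even estimate. The sketch below is illustrative only: the `MigrationCase` fields, the function names, and every number in the example are hypothetical placeholders a team would replace with its own estimates, not figures from any real migration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MigrationCase:
    """Hypothetical inputs; every value is an estimate the team supplies."""
    incidents_removed_per_month: float  # pages the new tool makes structurally impossible
    hours_per_incident: float           # average engineer-hours lost per such page
    migration_hours: float              # one-time effort to build and cut over
    relearning_hours: float             # one-time effort rediscovering failure modes
    dual_stack_hours_per_month: float   # ongoing overhead while both tools stay alive


def monthly_net_savings(case: MigrationCase) -> float:
    """Engineer-hours saved per month, minus the dual-stack overhead."""
    saved = case.incidents_removed_per_month * case.hours_per_incident
    return saved - case.dual_stack_hours_per_month


def break_even_months(case: MigrationCase) -> Optional[float]:
    """Months until the one-time costs are recovered, or None if they never are."""
    net = monthly_net_savings(case)
    if net <= 0:
        return None
    return (case.migration_hours + case.relearning_hours) / net


# A real, recurring pain: two pages a month at three hours each.
real_pain = MigrationCase(2, 3, migration_hours=80, relearning_hours=160,
                          dual_stack_hours_per_month=2)
print(break_even_months(real_pain))     # 60.0

# "It feels cleaner": no pages removed, so the debt is never repaid.
feels_cleaner = MigrationCase(0, 0, migration_hours=80, relearning_hours=160,
                              dual_stack_hours_per_month=2)
print(break_even_months(feels_cleaner))  # None
```

With these made-up numbers, even a genuine twice-a-month page takes five years to pay back the migration, and an aesthetic complaint never does. The model is deliberately crude (it ignores incident severity and anything that isn't engineer-hours), but it forces the conversation onto numbers someone has to defend.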
When you actually should change
To be fair to the other side, here is my real list — the reasons I’ll sign off on a migration without flinching, because every one of them is a problem the new tool genuinely deletes rather than relocates:
- Measurable, recurring operational pain that the new tool structurally removes — not softens, removes.
- Licensing or vendor hostility. A relicense that puts your usage at legal risk, or a vendor that’s turned adversarial toward its own users.
- Cost explosions that don’t track any value you’re getting — pricing that grew an order of magnitude while your usage didn’t.
- A dead ecosystem. No releases, no security patches, no maintainers, no answers to your weird error. That’s not a tool anymore, it’s a liability with a logo.
- Security or compliance requirements the current tool genuinely cannot meet — a control an auditor demands and the incumbent can’t provide.
- The new tool removes an entire class of problems at once, the way a good change should.
And the one reason that is never on the list, no matter how good it feels in the moment: novelty. New for the sake of new is not a reason. It’s a cost wearing the costume of a reason.
The maturity is in the restraint
The tools I trust most in production are not the most elegant ones I’ve ever used. They’re the ones I can operate calmly when I’m woken up and running on no sleep — the ones whose failures I’ve already met, whose quirks are already in muscle memory, whose error messages I can read half-asleep and know what to do. That calm is worth more than any feature on a comparison page, and it is the exact thing a needless migration sets on fire.
Early-career engineers measure their stack by how modern it is. The best operators I know measure theirs by how few surprises it has left. Chasing the newest tool feels like progress and reads like initiative, but a lot of it is just churn — motion that looks like momentum and leaves you exactly where you started, only less practiced.
The best tool is usually the one your team can operate calmly at 3 a.m. And the real mark of a mature DevOps team isn’t the length of the list of tools they’ve adopted. It’s the shorter, quieter, much harder-won list of the ones they decided not to replace.