Danilo Falcão da Silva — DevOps Blog

About

Hi, I’m Danilo Falcão da Silva — a Senior DevOps Engineer with 20+ years of experience designing and operating infrastructure across on-premises and cloud environments. I’m based in Belo Horizonte, Brazil.

My day-to-day is mostly Linux systems, containerization, CI/CD, and the kind of glue that keeps large-scale distributed infrastructure honest. This blog is where I write up the things I learn, fix, or kubectl delete and never speak of again.

Find me

Open any infrastructure repository from the last five years and count the file extensions. I did this recently on a platform I inherited, and the answer was bleak in a familiar way: a handful of Go files, some Dockerfiles, a couple of shell scripts, and then a sprawling continent of .yaml. Cluster manifests. Helm values. CI pipelines. Argo Applications. Alerting rules. Service definitions. Policy. The actual program was a rounding error. The system was YAML.

There is a particular kind of meeting I have sat in too many times. A team is frustrated with a tool. Something broke, or the config is ugly, or a louder team down the hall is using something newer, and someone says the words: “maybe we should just switch to X.” And everyone nods, because switching feels like progress, and nobody in the room wants to be the person defending the thing that’s annoying them today.

I did not start caring about MCP because of the protocol spec. I read the spec, it is fine, it is JSON-RPC with some sensible primitives, and on its own it would not have earned a post. I started caring about MCP the day it changed what an agent is. Before tool access, an agent generates text. You read the text, you decide, you act. After tool access, the agent acts. It opens the pull request, it queries the database, it restarts the container, it pages someone. The text was a suggestion. The tool call is an operation.

It is 2:14 a.m. The page fires. Checkout error rate is up, the on-call channel is already three messages deep, and someone has dropped a link to a dashboard. I open it. Forty-two panels. Twelve of them are red, which tells me nothing, because nine of those twelve are red on a good day and nobody has touched the thresholds since 2022. I scroll. I find a latency graph that is clearly angry. It does not tell me why. It does not tell me what changed at 2:09. It does not tell me which of our seventeen downstream dependencies is the actual problem. It just confirms, in beautiful gradients, that we are having a bad night.

I do not use Go for everything, and I want to be honest about that.

My stack has a hierarchy. Shell handles most of what I do. Python takes over when shell gets too complex. Go is what I reach for when I am explicitly optimizing for performance. That last tier is what this post is about — when performance is the goal, Go is awesome in a specific, practical way.

I have been waiting years to write this sentence without rolling my eyes: hybrid Kubernetes on a major managed platform finally looks practical.

Not perfect. Not magical. Practical.

The new EKS Hybrid Nodes Gateway is the first AWS move in this space that feels like an operations feature, not a slide-deck feature. It directly targets one of the worst recurring pain points in hybrid Kubernetes: getting reliable traffic flow between cloud-side components and on-prem or edge pods without building a pile of bespoke routing logic that only two people on the team understand.

I use Zsh with Oh My Zsh.

That one line summarizes a big shift in how I work every day. I did not switch because Bash is bad. I love Bash and respect it. Bash is stable, everywhere, script-friendly, and still the safest common denominator when I touch unknown Linux boxes.

I switched because my daily workflow changed, and Bash stopped fitting that workflow.

The Bash pain point that finally made me move

The biggest Bash pain point for me was history behavior across multiple terminals.

Firefox is my choice.

That sentence is boring, and that is exactly why I trust it. I have spent long stretches on Chrome, pure Chromium builds, and Brave. I keep coming back to Firefox. Not because it is perfect, not because it wins every benchmark, and not because I enjoy contrarian browser takes. I come back because it fits my real daily workflow better, with less policy friction and less account-management nonsense.

I’ll make this simple up front: I love New Relic for APM. I also love what Datadog built as an integrated operational platform.

If you only want a tribal answer, stop here: for app-centric teams with strong APM focus, New Relic still punches above its weight. For platform-heavy operations across infra, security, delivery, and incident flow, Datadog usually wins.

But the right answer depends on team shape, telemetry habits, and how much operational burden you are willing to carry.

I decided to test Codex CLI today because I have liked the quality I get from GPT-5.3-Codex enough to take it seriously in real work. I did not open it looking for a demo. I opened it with the same question I apply to any tool that can touch production-bound code: is this operationally trustworthy, or just impressive for fifteen minutes?

My conclusion is clear: Codex CLI is already good enough to pilot seriously, but it is not my daily driver yet.

Btrfs is king on Linux. Not because it’s technically superior to ZFS in every dimension — it isn’t — but because it ships in the kernel, installs without friction, and integrates with every major distro’s tooling out of the box. In 2026, if you format a fresh Fedora, openSUSE, or Ubuntu Desktop installation, you’re running Btrfs by default. ZFS requires you to want it badly enough to fight the packaging.

Two things happened in the same week. Both of them involved Anthropic. One was a major hire. The other was a Vatican press conference. Neither was fake. But the internet managed to smash them together into something that was, and now people are asking me if the Pope works at Anthropic.

Let me untangle this.

What actually happened: the Karpathy hire

On May 19, 2026, Andrej Karpathy posted on X: “I’ve joined Anthropic. I think the next few years at the frontier of LLMs will be especially formative. I am very excited to join the team here and get back to R&D.”

Every few years someone publishes a blog post titled something like “PostgreSQL: The Everything Database” and the comments fill with people saying “well, obviously.” The thing is, it wasn’t obvious. In 2010, if you told a room full of engineers that the correct database for their document store, their geospatial queries, their full-text search, and their vector similarity lookups was the same 30-year-old relational database, they would have politely suggested you hadn’t used MongoDB yet.

kubectl is the Swiss Army knife. Nobody disputes this. But Swiss Army knives are terrible at most of the individual jobs they claim to do, and kubectl is no different: it can tail logs, but only one pod at a time. It can switch contexts, but with zero guardrails. It can describe resources, but in a wall of YAML that buries the thing you actually care about.

Day-2 operations — the part where the cluster is live, traffic is flowing, and someone pages you at 2 a.m. — need sharper instruments. What follows is the utility belt I’d recommend to any Kubernetes operator building their toolkit in 2026. Not everything here is new. Some of these tools have been around since 2018. The point is that they’re still maintained, still solve real problems, and still faster than the kubectl incantation you’d otherwise be typing.

Most teams get SLAs, SLOs, and SLIs wrong. Not because the concepts are hard, but because they treat them as compliance paperwork instead of operational tools. The result is dashboards nobody trusts, targets nobody chose deliberately, and on-call rotations that burn people out chasing noise.

This post is a field guide for teams that actually run production systems and want reliability engineering to work as an engineering discipline — not a slide deck exercise.

On January 9, 2026, at roughly 02:20 UTC, Anthropic flipped a switch on their servers and broke thousands of developer workflows overnight. No blog post, no advance notice, no migration path. Third-party coding tools — OpenCode, Cline, RooCode, OpenClaw, and others — that had been using Claude subscription OAuth tokens suddenly got a single error: “This credential is only authorized for use with Claude Code and cannot be used for other API requests.”

I ran GNU Screen for more than twenty years. It was one of the first tools I installed on every machine, right after vim and ssh. Screen kept long compilations alive through flaky connections, let me juggle IRC and tail logs on the same VT100, and never once lost a session I cared about. For a tool born in 1987, that is a remarkable track record.

About fifteen years ago I switched to tmux. It was not because Screen broke. It was because tmux made me faster at work I was already doing, and then it kept getting better while Screen stood still. I have not looked back.

This is a late review. GPT-5.3-Codex shipped on February 5, 2026, and here I am almost four months later writing about it. I have no excuse beyond the usual one — too many things to look at, not enough evenings. But having spent the past few weeks reading benchmarks, watching demos, and following what developers are actually saying about it, I want to be honest: it is a good model. A genuinely good one.

On May 18, 2026, Linus Torvalds called the Linux kernel security mailing list “almost entirely unmanageable.” The reason: a flood of AI-generated bug reports. The reaction was predictable — ban AI, blame researchers, declare the tools aren’t ready.

I wrote about the maintenance crisis last week and I think that framing misses the deeper story. The problem is not that AI is generating too many reports. The problem is that the code was more broken than we thought, and for twenty years nobody had the tools to look at it properly.

I don’t pay for Snyk. Not because it’s bad — it’s a genuinely good product — but because there is a free stack that catches the vast majority of the same issues directly in CI, and the remaining gap hasn’t been worth roughly $600 per developer per year to close. On a team of fifteen engineers, that’s the price of a small EC2 fleet you actually need.

This post is about the open-source security tooling I actually wire into pipelines: Trivy for containers, dependencies and IaC, Semgrep for application code, Nuclei and OWASP ZAP for the live app, and a few honourable mentions. It’s not an exhaustive catalogue. It’s the stack I keep coming back to.

There are four names that come up when you ask how to manage configuration at scale. I’ve used all four, in production, over the last fifteen years. The answer to “which one” is not the one that wins benchmarks — it’s the one that matches how your team thinks about change.

The shortlist hasn’t changed since the early 2010s: Ansible, Puppet, Chef, SaltStack. The ownership has changed. The licences have changed. The community gravity has very much changed. Ansible sits at 31.94% market share for new projects as of early 2026 and is, by a wide margin, the most-adopted tool for new configuration-management work. Puppet holds 12.41%, Chef around 6.70%, and Salt has fallen out of the top-tier mindshare entirely even though the codebase is still actively developed under Broadcom.

Infrastructure as Code stopped being a single-answer question a couple of years ago. Terraform went BSL in August 2023, OpenTofu forked under MPL, Pulumi kept arguing that infrastructure should be real code in real languages, and IBM closed the $6.4B HashiCorp acquisition in February 2025. Three legitimate tools, three different bets on who owns the abstraction.

I still run Terraform. For everything.

This post is the honest version of why. I’ve read the OpenTofu release notes, watched Pulumi talks, played with both on weekend projects, and read more migration write-ups than I can remember. I have never put OpenTofu or Pulumi in front of production traffic. Most of what follows about those two is research, not muscle memory — and I’ll flag that as I go.

I have spent most of my career running things that boot. Bare metal, VMs, containers, Kubernetes — boxes that come up, hold state, and need somebody to think about their lifecycle. AWS Lambda is the opposite of that mental model, and for a long time I treated it the way a lot of old-school infrastructure people treat it: useful for toy apps, fine for a Slack bot, not a serious tool.

For about a decade, “Kubernetes ingress” effectively meant one thing: ingress-nginx, the community-maintained controller that wrapped the Nginx engine behind the Kubernetes Ingress resource. It was fine. It was the default everyone reached for. It was also, quietly, the wrong long-term shape for the problem.

That era ended in 2026. The community ingress-nginx project reached end-of-life in March 2026. The Kubernetes Gateway API, which graduated to GA on the v1.2 line, is now the forward-looking standard. Envoy Gateway is the CNCF reference implementation. Cilium does L7 routing in eBPF without a sidecar or an extra proxy. RKE2 flipped to Traefik by default in v1.36 and removes ingress-nginx entirely in v1.37.

I configure infrastructure for a living. Containers, reverse proxies, NFS mounts, certificate renewals, sync layers between machines — that’s most days. The last thing I want when I open my personal note-taking app is another sync layer to babysit. That, more than anything else, is why I picked Notion over Obsidian.

This is the honest read on Notion vs Obsidian for one person, one phone, one Linux desktop, and the occasional browser tab on someone else’s machine. The headline: Obsidian is the purer architecture and the wrong fit for me. Notion is the lazy architecture that does the right thing without asking. For a personal knowledge base, lazy wins.

The Kubernetes Secret object is a YAML manifest with a base64-encoded value in it. That’s the entire encryption story. Anyone with get permission on the namespace can read every credential the workload holds. Base64 is not encryption. It is barely obfuscation.

Teams discover this the wrong way — usually during a security review, occasionally during an incident — and the next question is always the same: so where do the real secrets live?

For most of 2025 the age-verification conversation was about porn sites. By the end of the year it had moved up the stack. By 2026 it is at the operating system, and that is where the story gets interesting for anyone who cares about how Linux is built.

Nine US states put age-verification laws in force during 2025 alone: South Carolina (Jan 1), Florida (Jan 1), Tennessee (Jan 13), Georgia (Jul 1), Wyoming (Jul 1), North Dakota (Aug 1), Arizona (Sep 26), Ohio (Sep 30), and Missouri (Nov 30). Roughly half the country now mandates some form of age gate for adult content, social media, or both.

You’ll see this comparison on r/kubernetes every couple of months, phrased as if it’s a real choice: Rancher or Lens? The framing is wrong. They occupy different layers of the stack. Asking which one “wins” is like asking whether VS Code beats Kubernetes.

But the question keeps coming up — usually from someone who has Lens installed, has heard about Rancher, and is trying to figure out whether they should swap. So let me lay out what each one actually is, where they overlap, where they don’t, and which one earns a place in a serious on-prem setup.

For most of Kubernetes’ life, the cluster data path has been a tower of iptables rules. Pod-to-service routing, NAT, network policy, even the way kube-proxy programs a Service IP — all of it expressed as netfilter chains evaluated linearly on every packet. It worked. It also aged badly.

In 2026, the answer the ecosystem has converged on is eBPF, and the project doing most of the convergence is Cilium. The shift is no longer aspirational: kube-proxy itself shipped an nftables mode that is expected to go GA in Kubernetes 1.33, the old IPVS backend is deprecated as of v1.35, and the major managed Kubernetes providers (EKS, GKE, AKS) all offer a Cilium-powered data plane as a first-class option. Azure CNI Powered by Cilium is GA on K8s 1.33.

There is a kind of infrastructure question that never really gets settled, just re-litigated every couple of years as the surrounding ecosystem moves. “Which reverse proxy?” is one of those questions.

The shortlist hasn’t changed much: Caddy, Nginx, Traefik. The context around them has changed a lot. The community ingress-nginx project reached end-of-life in March 2026. RKE2 v1.36 flipped to Traefik as the default ingress. Caddy quietly shipped 2.11 with better health-checking and ECH rotation. Nginx is on 1.31 mainline / 1.30.1 stable and treats HTTP/3 as a first-class but still-evolving feature.

There are two open-source autonomous agents in 2026 worth a serious DevOps engineer’s time, and they have made opposite architectural bets. I tried both. I run Hermes Agent every day. This is the analysis of why — not a both-sides post, not a head-to-head benchmark, but a direct argument that one of these two architectures is right for infrastructure work and the other one isn’t.

The headline: agent-first beats gateway-first when the work rewards familiarity. Most infrastructure work does. The rest of this post is the why.

On May 18, 2026, Linus Torvalds said the Linux kernel security mailing list had become “almost entirely unmanageable” because of duplicate AI-generated bug reports. Two months earlier, longtime stable maintainer Willy Tarreau had already shared the numbers: a list that received two to three reports per week in 2024 was getting five to ten reports per day by March 2026. In January, Daniel Stenberg shut down the curl bug bounty after the valid-report rate on HackerOne dropped from above 15% to below 5%, with twenty submissions in 21 days — seven of them in one 16-hour window — and zero real vulnerabilities among them.

KDE shipped the Plasma 6.7 Beta on May 14, 2026. The final release lands on June 16. I’m still on Plasma 6.6.5 — the stable that shipped in Fedora 44 — and I’m going to stay there until 6.7 hits Fedora’s repos. But I have been reading every announcement, every “This Week in Plasma” post, and every release-note dump KDE has been putting out, and I want to say this clearly:

My home media server is a Raspberry Pi 4 named uther. It runs Plex, the entire *arr stack, qBittorrent, a request manager, a reverse proxy, Vaultwarden as a password manager, and this very blog — all in Docker Compose. It sips around five watts at idle, more like seven during peak indexing, and the most expensive part of the whole setup is a separate Synology NAS sitting on the same shelf doing nothing but being a quiet, reliable hard drive over NFS.

NVIDIA on Linux has been a punchline for so long that “just buy AMD” was the standard advice in every Linux subreddit thread. In 2026, on Fedora 44 with KDE Plasma 6.6 on Wayland, running an RTX 5070 (Blackwell) with the 595 driver branch, I want to say something different:

It’s fine. It’s actually fine.

Not “fine if you don’t use Wayland.” Not “fine if you stick to X11.” Fine, full stop. This is the short, honest report of how I run it.

Here’s the test I use for any deployment tooling:

It is 3 a.m. on a Sunday. PagerDuty just woke you up. A production service is degraded. You roll out of bed, open your laptop, and have to figure out what the cluster thinks is true, what’s actually true, and what changed in the last twelve hours. The faster you can answer those three questions, the better the tooling.

The right GitOps stack collapses all three questions into one dashboard. The wrong one has you SSH-hopping between five servers running kubectl rollout history against unlabeled deployments. I’ve done both. I’m writing this post about the former.

Helm has been the punchline of Kubernetes packaging for about as long as Kubernetes has been called Kubernetes. Helm 2 had Tiller, an in-cluster component running as cluster-admin that read every chart’s YAML and applied it from inside the cluster — a security horror show that drove half the community to invent its own deployment tooling just to avoid it. Helm 3 finally killed Tiller in 2019 and went client-side, which fixed the worst of it. And then Helm 3 sat there, relatively unchanged, for six years.

Most of the Kubernetes conversation in 2026 happens around managed services — EKS, GKE, AKS — and most of the rest happens around K3s for edge and homelab. Somewhere in the middle, on the hardware that lives in a rack in a datacenter you can drive to, there’s a Kubernetes story that nobody talks about loudly enough.

That story is RKE2 — the Rancher Kubernetes Engine 2, SUSE’s hardened, security-focused, single-binary distribution designed for on-premises production. I’ve been running it for two years across two different employers and one home lab, and it’s the rare piece of infrastructure that gets more impressive the longer you live with it.

I ran Ubuntu on my workstation from Hardy Heron in 2008 to about Jammy in 2022. Fifteen years. It was the first Linux I trusted on servers I cared about, the first one that made hardware support feel solved, and for a long stretch it was the obvious answer to “what Linux should a sane person install?”

I don’t run it on my desktop anymore. I don’t run Ubuntu Server either — though that’s a personal-taste call and not a knock on the product, which is still a perfectly good distribution that I’d happily recommend to most people. Ubuntu has split in half: the server side is still strong, the desktop side has gone somewhere I’m not willing to follow, and the two halves deserve very different treatment. Especially because we just got Ubuntu 26.04 LTS “Resolute Raccoon” on April 23, and the headlines are mostly cheerful about features that don’t fix the things that actually drove me away.

I have been carrying a quiet grudge about Tailscale for two years. The product is excellent. The clients are polished. The “log in with Google and you have a mesh in 60 seconds” experience is genuinely magical. But the control plane — the part that decides which of your machines can talk to which other machines, and where the keys for that decision live — is closed source and hosted exclusively by Tailscale. You can run Headscale to reimplement it open-source, but you’re still bringing along Tailscale’s proprietary clients on most platforms.

I’ve spent enough of the last six months working alongside an AI coding agent that I now have actual opinions, in the way you only get from shipping production code with a tool, not from reading benchmarks about it. There are three names that dominate the conversation in 2026 and they represent three genuinely different bets about how humans and language models should collaborate on code.

This is my honest read on Cursor, Claude Code, and opencode. The headline: Claude Code is my daily driver. opencode is my Plan B. Cursor is not what I reach for. Here’s why.

“What would you give and what would you keep?”Mase, From Scratch (Double Up, 1999)

Mase asked it about rewinding your whole life and starting over. I ask it every time someone on my team picks a development container stack. Because the moment you decide to let a container be your workstation — not just hold a service, but hold your editor, your shell, your language toolchains, your AUR packages on top of a Fedora host — you’re making a series of small, ugly trade-offs. What would you give up from your bare-metal workflow, and what would you keep?

The first Steam Machine was a 2015 disaster. A confused, fragmented launch of third-party boxes running an immature SteamOS 1, no AAA Linux catalogue, no Proton, and a controller everyone politely pretended to like. The whole effort quietly evaporated within a couple of years and Valve, to its credit, didn’t try to spin it. They went away, built the Steam Deck, built Proton into something that actually works, and waited.

I have a complicated relationship with code editors. I used vim for fifteen years out of stubbornness, switched to VS Code because the ecosystem made it impossible not to, and have spent the last four years quietly resenting it every time the laptop fans spin up because I opened three monorepos and a markdown file.

Zed 1.0 shipped at the end of April 2026. I’ve been running it as my daily driver for the two weeks since, and I’m writing this post in it. Here’s where I’ve actually landed, including the things I think the typical Zed review under-sells.

I’ve been running personal infrastructure on Hetzner for the better part of a decade. Cloud VMs for side projects. A couple of dedicated boxes for the things I didn’t want to babysit on AWS pricing. The calculus was always the same: pay a third of what the hyperscalers charge, manage it yourself, accept the lack of a managed-service safety net. The deal was very, very good.

On April 1, 2026, that deal got noticeably worse.

A nine-person jury in the Northern District of California took less than two hours today to unanimously dismiss every claim in Elon Musk’s lawsuit against Sam Altman, Greg Brockman, OpenAI, and Microsoft. Musk had sought up to $134 billion in disgorgement, the removal of Altman and Brockman from OpenAI’s leadership, and the unwinding of OpenAI’s October 2025 conversion into an $852-billion public benefit corporation.

He got none of it. The technicality matters more than the headline.

There’s a particular kind of project I love: the one that takes something new and weird and turns it into something you barely have to think about. Docker did this for containers. Hugo did it for static sites. In 2026, Ollama is doing it for open-source language models.

You install one binary. You type ollama pull <model>. You type ollama run <model>. Your application talks to localhost:11434 over an OpenAI-compatible HTTP API. That’s the whole user manual. The fact that this is now boring — that nobody has to argue about it on Hacker News anymore — is the point of this post.

Fedora Linux 44 shipped on April 28, 2026, two weeks behind its original date after a late-cycle batch of blockers. I’ve been running it on my KDE daily-driver for a couple of weeks now. It’s the kind of release that doesn’t scream — no single tentpole feature — but if you spend your day on Linux, the cumulative effect is real. Here’s what stood out to me, with a bias toward what I actually noticed.