SLA, SLO, SLI, and Error Budgets: A DevOps Reality Check

Most teams get SLAs, SLOs, and SLIs wrong. Not because the concepts are hard, but because they treat them as compliance paperwork instead of operational tools. The result is dashboards nobody trusts, targets nobody chose deliberately, and on-call rotations that burn people out chasing noise.

This post is a field guide for teams that actually run production systems and want reliability engineering to work as an engineering discipline — not a slide deck exercise.

The Basics, Without the Textbook

| | SLA | SLO | SLI |
|---|---|---|---|
| What it is | A contract with consequences | An internal reliability target | A measurement of user-facing behavior |
| Who owns it | Legal/business + engineering | Engineering + product | Engineering |
| Consequence of breach | Financial penalties, credits, churn | Deployment freeze, prioritization shift | Nothing; it is just a number |
| Typical example | “99.9% uptime monthly or credits issued” | “99.95% success rate on checkout, rolling 28 days” | “Proportion of HTTP 200 responses on /api/checkout in < 400ms” |

The relationship is simple: SLIs are what you measure. SLOs are the targets you set on those measurements. SLAs are the business promises backed by SLOs.

If your SLO is 99.95%, your SLA should be something like 99.9% — always leave a buffer. The gap between your SLO and SLA is your engineering margin. Lose that margin and every SLO breach becomes a contractual incident.

Choosing SLIs That Actually Matter

The most common mistake is picking SLIs because they are easy to collect, not because they reflect user experience. CPU utilization is not an SLI. Pod restart counts are not an SLI. These are infrastructure signals — useful for debugging, useless for reliability targets.

Good SLIs answer one question: is the user getting what they came for?

For most services, this means some combination of:

  • Availability: Did the request succeed? (HTTP 5xx ratio, gRPC status codes)
  • Latency: Was it fast enough? (p50, p95, p99 at the edge, not at the service mesh)
  • Correctness: Was the response right? (Harder to measure, critical for data pipelines and financial systems)
  • Freshness: Is the data current? (Essential for read replicas, caches, async pipelines)

Measure at the boundary closest to the user. If you have a CDN or load balancer in front, that is where your SLI lives — not inside your Kubernetes cluster where retries and circuit breakers mask failures.
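As a minimal sketch of what "measuring an SLI" means in practice, the snippet below computes an availability SLI and a latency SLI from a batch of request records. The `Request` type and its fields are hypothetical stand-ins for whatever your CDN or load balancer logs expose, and the 400 ms threshold is borrowed from the checkout example in the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One request as seen at the edge (hypothetical log record)."""
    status: int        # HTTP status code returned to the user
    latency_ms: float  # end-to-end latency observed at the edge


def availability_sli(requests):
    """Fraction of requests that did not fail with a 5xx status."""
    if not requests:
        return None  # no traffic: the SLI is undefined, not 100%
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests, threshold_ms=400.0):
    """Fraction of requests that returned HTTP 200 under the threshold."""
    if not requests:
        return None
    good = sum(
        1 for r in requests
        if r.status == 200 and r.latency_ms < threshold_ms
    )
    return good / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 380.0),
    Request(200, 950.0),  # succeeded, but too slow to count as good
    Request(503, 40.0),   # fast, but failed
]
print("availability:", availability_sli(sample))
print("latency:", latency_sli(sample))
```

Both SLIs are "good events divided by valid events", which is what lets you set a percentage target on them later.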

The Golden Signals and How They Map

Google’s four golden signals — latency, traffic, errors, saturation — are a good starting framework, but they are not all SLI material:

  • Latency and errors map directly to SLIs.
  • Traffic is context, not a target. You do not set an SLO on requests per second.
  • Saturation is a capacity planning signal. It predicts future SLI degradation but is not an SLI itself.

The USE method (utilization, saturation, errors) covers infrastructure. The RED method (rate, errors, duration) covers services. SLIs live in RED territory.

Error Budgets: Turning Reliability Into a Resource

An error budget is the complement of your SLO: 100% minus the target. If your SLO is 99.95% over 28 days, your error budget is 0.05%, roughly 20 minutes of total downtime or the equivalent in failed requests.
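The arithmetic behind that figure is short enough to check by hand. Here is a small sketch; the function names are illustrative, not from any particular tool.

```python
def error_budget_minutes(slo_percent, window_days):
    """Minutes of full downtime the SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def error_budget_requests(slo_percent, total_requests):
    """Number of failed requests the SLO tolerates for a given volume."""
    return total_requests * (100.0 - slo_percent) / 100.0


# 28 days is 40,320 minutes; 0.05% of that is about 20 minutes.
print(round(error_budget_minutes(99.95, 28), 2))
# For 10 million requests in the window, about 5,000 may fail.
print(round(error_budget_requests(99.95, 10_000_000)))
```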

This reframes reliability from “never break anything” to “we have a budget for breakage, spend it wisely.” That shift changes everything about how engineering teams operate:

  • Budget is healthy (more than 50% remaining): Ship features, run experiments, do risky migrations. This is the whole point: velocity is the default.
  • Budget is strained (25–50% remaining): Slow down. Increase review rigor, batch deployments, skip the optional infrastructure change.
  • Budget is critical (less than 25% remaining): Feature freeze. All engineering effort goes to reliability. No exceptions, no negotiations.
  • Budget is exhausted (0% remaining): Full stop on changes. Incident-level response. Postmortem before anything else ships.

The policy must be written down and agreed to by engineering and product leadership before it triggers. If you wait until the budget is blown to negotiate what happens next, you have already lost.
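One way to make the written policy unambiguous is to encode the thresholds once and let every tool read the same function. The sketch below assumes the remaining budget is expressed as a fraction (1.0 means untouched, 0.0 means spent); the state names and action strings simply restate the tiers above.

```python
def budget_state(remaining_fraction):
    """Map remaining error budget to a policy state.

    Boundaries follow the tiers above: exactly 25% and exactly 50%
    both count as "strained".
    """
    if remaining_fraction <= 0.0:
        return "exhausted"
    if remaining_fraction < 0.25:
        return "critical"
    if remaining_fraction <= 0.50:
        return "strained"
    return "healthy"


POLICY_ACTIONS = {
    "healthy": "ship features, run experiments, do risky migrations",
    "strained": "slow down, increase review rigor, batch deployments",
    "critical": "feature freeze, all effort goes to reliability",
    "exhausted": "full stop on changes, postmortem before shipping",
}

for remaining in (0.80, 0.40, 0.10, 0.0):
    state = budget_state(remaining)
    print(f"{remaining:.0%} remaining -> {state}: {POLICY_ACTIONS[state]}")
```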

Error Budget Policy and Deployment Governance

A budget without enforcement is a suggestion. Here is what an operational error budget policy looks like:

Deployment gates: If less than 25% of the rolling 28-day error budget remains, automated deployments pause. The release pipeline checks budget status before promoting to production. This is not a manual process; it is a CI/CD gate.
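A gate like this can be a very small script that the pipeline runs before the promote step. The sketch below is an assumption about how you might wire it, not a specific CI product's API: it takes the remaining budget as a fraction (however your metrics backend provides it) and returns a process exit code, where non-zero blocks the promotion.

```python
import sys

GATE_THRESHOLD = 0.25  # pause deployments below 25% remaining budget


def deployment_allowed(remaining_fraction, threshold=GATE_THRESHOLD):
    """True when enough error budget remains to keep deploying."""
    return remaining_fraction >= threshold


def gate_exit_code(remaining_fraction):
    """Return 0 to let the pipeline proceed, 1 to block it."""
    if deployment_allowed(remaining_fraction):
        print(f"budget gate: OK ({remaining_fraction:.0%} remaining)")
        return 0
    print(
        f"budget gate: BLOCKED ({remaining_fraction:.0%} remaining, "
        f"need at least {GATE_THRESHOLD:.0%})",
        file=sys.stderr,
    )
    return 1


# In a real pipeline the value would come from your SLO tooling;
# these are illustrative numbers.
print(gate_exit_code(0.40))
print(gate_exit_code(0.10))
```

In a pipeline step you would pass the result to `sys.exit()` so the job fails when the gate is closed.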

Escalation path: Budget exhaustion triggers a formal review. The on-call team writes a brief (not a 30-page postmortem) covering what consumed the budget and what changes would prevent recurrence. Product and engineering leads review together.

Budget reset: Budgets roll. Do not use calendar-month windows — they create perverse incentives to push risky changes early in the month. Use a rolling 28-day or 30-day window so every day matters equally.
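To illustrate what "budgets roll" means mechanically, here is a sketch that keeps per-day request counts in a fixed-length window, so the oldest day drops off as each new day arrives. The class name and the daily granularity are assumptions for illustration; a metrics backend would normally do this with a range query.

```python
from collections import deque


class RollingBudget:
    """Track remaining error budget over a rolling window of days."""

    def __init__(self, slo, window_days=28):
        self.slo = slo  # target as a fraction, e.g. 0.9995
        # deque with maxlen discards the oldest entry automatically
        self.days = deque(maxlen=window_days)

    def record_day(self, total_requests, bad_requests):
        self.days.append((total_requests, bad_requests))

    def remaining_fraction(self):
        total = sum(t for t, _ in self.days)
        bad = sum(b for _, b in self.days)
        allowed = total * (1.0 - self.slo)
        if allowed == 0:
            return 1.0  # no traffic recorded, nothing spent
        return 1.0 - bad / allowed


# A 2-day window keeps the demo short; use 28 or 30 in practice.
budget = RollingBudget(slo=0.99, window_days=2)
budget.record_day(1000, 10)  # a bad day spends the whole budget
budget.record_day(1000, 0)
print(round(budget.remaining_fraction(), 3))
budget.record_day(1000, 0)   # the bad day rolls out of the window
print(round(budget.remaining_fraction(), 3))
```

There is no reset date to game here: a bad day costs the same whenever it happens, and it stops counting only when it ages out of the window.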

Exemptions: Infrastructure incidents outside your control (cloud provider outages, upstream dependency failures) can be excluded from budget consumption if you have clear attribution. This requires good observability — you need to prove it was not your fault, not just assume it.

Dashboards and Observability Signals

Your SLO dashboard is not your monitoring dashboard. They serve different audiences and different decisions.

SLO dashboard (for product and engineering leadership):

  • Current SLI value vs. target
  • Error budget remaining (absolute and percentage)
  • Burn rate over the last 1h, 6h, 24h
  • Budget projection: at current burn rate, when does it hit zero?
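The burn rate and projection figures on the SLO dashboard come from simple ratios. As a sketch, assuming burn rate is defined the usual way (observed error ratio divided by the error ratio the SLO allows, so a burn rate of 1 spends exactly the whole budget over one window):

```python
def burn_rate(error_ratio, slo):
    """How fast budget is being spent relative to the sustainable pace."""
    return error_ratio / (1.0 - slo)


def hours_until_exhausted(remaining_fraction, rate, window_days=28):
    """Project when the budget hits zero at the current burn rate."""
    if rate <= 0:
        return None  # not burning, so no exhaustion time to project
    return max(remaining_fraction, 0.0) * window_days * 24 / rate


# Illustrative numbers: 0.1% of requests failing against a 99.95% SLO.
rate = burn_rate(error_ratio=0.001, slo=0.9995)
print(f"burn rate: {rate:.1f}x")
print(f"hours left: {hours_until_exhausted(0.5, rate):.0f}")
```

With half the budget left and a burn rate of about 2, the projection is about 168 hours, or one week.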

Operational dashboard (for on-call engineers):

  • Latency distributions (p50/p95/p99) broken down by endpoint
  • Error rates by status code and service
  • Saturation metrics: CPU, memory, disk, connection pools
  • Dependency health: upstream latency and error rates

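Alerting should use multi-window burn rates, not raw thresholds. A single spike that recovers in seconds should not page anyone at 3 AM. Instead, page only when the burn rate over a short window (5 minutes) and the burn rate over a long window (1 hour) both exceed the threshold. The long window confirms that a meaningful share of budget has been spent; the short window confirms the problem is still happening. This filters out most pages caused by brief, self-healing spikes.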
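The paging condition can be sketched as a single function. The 14.4 threshold here is an assumption: it is a commonly cited value for a 30-day window, equivalent to spending 2% of the budget in one hour, and you should tune it for your own window and tolerance.

```python
def should_page(short_window_error_ratio, long_window_error_ratio,
                slo, threshold=14.4):
    """Page only if both the 5-minute and 1-hour burn rates are high.

    Error ratios are fractions of failed requests in each window;
    slo is the target as a fraction (e.g. 0.999).
    """
    allowed = 1.0 - slo
    short_burn = short_window_error_ratio / allowed
    long_burn = long_window_error_ratio / allowed
    return short_burn >= threshold and long_burn >= threshold


SLO = 0.999

# Sustained outage: both windows are burning fast, so page.
print(should_page(0.02, 0.016, SLO))
# Brief spike: the last 5 minutes look bad, the hour does not.
print(should_page(0.05, 0.002, SLO))
# Already recovered: the hour looks bad, the last 5 minutes do not.
print(should_page(0.0005, 0.02, SLO))
```

Requiring both windows is what makes the alert reset quickly after recovery instead of staying red for the rest of the hour.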

The classic mistake is alerting on internal causes instead of user-facing symptoms. “CPU is at 90%” is a possible cause. “Checkout success rate dropped below SLO” is a symptom users actually feel, and it is what your SLI measures. Page on the second, investigate the first.

Implementing This in a Real Company With Legacy Systems

You do not need a greenfield project to adopt SLO-driven reliability. Here is a realistic rollout:

Week 1–2: Pick one service, define one SLI. Choose your most important user-facing service. Define a single availability SLI — usually the ratio of successful responses. Instrument it using whatever you already have: Prometheus, Datadog, CloudWatch. Do not buy new tooling yet.

Week 3–4: Set a baseline SLO. Look at 30 days of historical data. Your SLO should be slightly better than your current performance — ambitious enough to matter, achievable enough to not be ignored. If your service has been running at 99.8%, do not set the SLO at 99.99%. Start at 99.9%.
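As a rough sketch of that baseline step, the snippet below computes historical availability from daily request counts and picks the next target up from a fixed ladder. The ladder of candidate targets and the "next step up" rule are assumptions for illustration; the right target is still a judgment call made with product.

```python
CANDIDATE_TARGETS = (99.0, 99.5, 99.9, 99.95, 99.99)  # assumed ladder


def baseline_availability(daily_counts):
    """Availability (%) over the period, from (total, failed) per day."""
    total = sum(t for t, _ in daily_counts)
    failed = sum(f for _, f in daily_counts)
    if total == 0:
        raise ValueError("no traffic in the baseline period")
    return 100.0 * (total - failed) / total


def suggest_slo(baseline_percent, candidates=CANDIDATE_TARGETS):
    """Pick the lowest candidate that is stricter than the baseline."""
    for target in sorted(candidates):
        if target > baseline_percent:
            return target
    # Baseline already beats every candidate; fall back to the strictest.
    return max(candidates)


# 30 days of illustrative data: 100,000 requests and 200 failures a day.
history = [(100_000, 200)] * 30
baseline = baseline_availability(history)
print(f"baseline: {baseline:.2f}%")
print(f"suggested SLO: {suggest_slo(baseline)}%")
```

With a 99.8% baseline this lands on 99.9%, matching the example above, rather than jumping straight to 99.99%.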

Month 2: Add a latency SLI and build the dashboard. Add a latency SLI (p95 or p99 at the edge). Build the SLO dashboard with budget tracking. Share it with the team. Do not set policy yet — let people get used to seeing the numbers.

Month 3: Write and adopt the error budget policy. Now that people trust the numbers, formalize the policy. Get product and engineering to sign off on what happens at each budget threshold. Start gating deployments.

Quarter 2: Expand to more services. Roll the pattern out to your next three to five services. Adjust targets based on what you learned. This is where you discover which services are genuinely coupled and which SLOs need to account for dependency chains.

The key is that you do not need organizational buy-in for the whole framework on day one. Start measuring, show the value, expand from there.

Common Mistakes and How to Fix Them

Setting SLOs without baseline data. If you do not know your current reliability, any target is a guess. Measure first, set targets second.

Vanity SLIs. Uptime percentage measured by a synthetic ping to your health check endpoint tells you almost nothing about user experience. Measure real user transactions.

The “five nines” trap. 99.999% availability allows about 26 seconds of downtime per 30-day month. If rolling back a bad deployment takes longer than that, a single failed release exhausts the entire budget. Most teams need 99.9% or 99.95%, not more.
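To see how quickly the budget shrinks with each extra nine, here is a short sketch that prints the allowed downtime for a few common targets, assuming a 30-day window.

```python
def allowed_downtime_seconds(availability_percent, window_days=30):
    """Seconds of full downtime a target tolerates over the window."""
    window_seconds = window_days * 24 * 60 * 60
    return window_seconds * (100.0 - availability_percent) / 100.0


for target in (99.9, 99.95, 99.99, 99.999):
    seconds = allowed_downtime_seconds(target)
    print(f"{target}%: {seconds:.0f} s ({seconds / 60:.1f} min)")
```

Going from 99.9% to 99.999% cuts the allowance from about 43 minutes to about 26 seconds, a factor of 100.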

SLOs detached from user journeys. An SLO on your API gateway means nothing if the user’s actual journey spans three services, a queue, and a database. Define SLIs at the user journey level, then decompose into per-service budgets.
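A quick calculation shows why journey-level targets need decomposing. The sketch below assumes the components sit in series (the journey fails if any one fails) and that their failures are independent, which is a simplification. Under those assumptions, journey availability is the product of the component availabilities.

```python
import math


def journey_availability(component_availabilities):
    """Availability of a serial chain with independent failures."""
    return math.prod(component_availabilities)


def per_component_target(journey_target, n_components):
    """Equal-split target each component needs to meet the journey SLO."""
    return journey_target ** (1.0 / n_components)


# Three services, a queue, and a database, each at 99.9%.
chain = [0.999] * 5
print(f"journey: {journey_availability(chain):.4%}")

# To hit 99.9% for the whole journey, each component needs more.
print(f"per component: {per_component_target(0.999, 5):.4%}")
```

Five components at 99.9% each give a journey of roughly 99.5%, so a 99.9% journey SLO requires each component to run at about 99.98%.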

Too many SLOs. If you have 40 SLOs, you have zero SLOs. Nobody can prioritize against 40 targets. Start with three to five per service, maximum.

Ignoring error budget enforcement. If you define SLOs but never freeze deployments or shift priorities when budget runs out, the entire framework is theater. The budget policy is the mechanism that makes SLOs useful.

Alerting on every metric. Alert fatigue kills reliability programs faster than any technical problem. Page only on SLI breaches and imminent budget exhaustion. Everything else is a dashboard, not an alert.

Conclusion

SLAs, SLOs, SLIs, and error budgets are not about perfection — they are about making reliability a conscious engineering decision with clear tradeoffs. The goal is not zero downtime. The goal is understanding exactly how much unreliability you can afford and spending that budget on shipping value.

The teams that do this well share a few traits: they measure at the user boundary, they set targets based on data, they enforce budgets through automation, and they treat reliability work as first-class engineering — not a tax on feature delivery.

Quick-Start Checklist

  • Identify your top three user-facing services
  • Define one availability SLI per service, measured at the edge
  • Collect 30 days of baseline data before setting targets
  • Set SLOs stricter than your SLA commitments, so you keep a buffer between the two
  • Build an SLO dashboard with burn rate and budget projection
  • Write an error budget policy with clear thresholds (50%, 25%, 0%)
  • Configure multi-window burn rate alerts — not raw threshold pages
  • Get product and engineering to co-sign the budget policy
  • Gate deployments on budget status in your CI/CD pipeline
  • Review and adjust SLOs quarterly based on operational data