sre observability error budgets reliability

Why Your Error Budget Is Lying to You (And How Observability Actually Fixes It)

Error budgets sound clean in theory but fail silently when you can't see where the unreliability actually is. Here's how to build observability that makes them real.

2026-06-03 · 8 min read

The Comfort of a Number That Means Nothing

You've got a 99.9% SLO. That means you can afford 43 minutes of downtime per month. You've got a spreadsheet. You've got a Slack alert that tells you when you're "at risk". Everyone nods at the planning meeting. You're doing SRE now.

The spreadsheet is lying to you. Not because the math is wrong. The math is fine. It's lying because it never tells you why you're spending the budget.

You burned through your entire monthly error budget in week 2. Was it a database connection pool exhaustion? A deploy that shipped stale cache keys? A third-party API that started timing out? A cascading failure in a service you didn't know was on the critical path? Your error budget spreadsheet says "43 minutes, gone". It says nothing about which of those it was. It says nothing about what you could have prevented.

That's not an error budget. That's a receipts file you never read.

What Error Budgets Actually Need (And What They Don't Have)

An error budget only works if three things are true:

First: you know when you're spending it. Not "we got an alert". Not "someone noticed latency was high". You know the exact moment, the exact service, the exact endpoint, the exact user population affected.

Second: you know why you spent it. Not a root cause analysis written three days later. Not "there was load". You know, while it's happening, what the mechanism is. Database slow? CPU bound? Memory leaks? Cascading timeout from dependency X?

Third: you know if that same thing could happen again undetected. This is the hard one. Your observability might be good enough to catch this failure. Will it catch the similar one that takes a different path through your system?

Without all three, an error budget collapses into reliability theater. You're tracking a number. The number has no texture. It's not connected to anything you can actually fix.

The Real Cost of Guessing

Here's what happens when observability is missing but you're still trying to use an error budget:

You hit your SLO breach. Your error budget goes red. Incident command starts: who was affected, and why? Everyone starts guessing. Someone checks the logs. Someone else checks the database metrics. A third person looks at GCP Monitoring. None of those views talk to each other.

You spend the first 15 minutes of your incident in a fog trying to establish what actually happened. You're not fixing the problem. You're not preventing the next one. You're arguing about what the symptoms mean.

By the time you have enough context, the incident is over. You write it up. Root cause: "traffic spike" or "deployment issue" or "resource exhaustion". That's technically true. It's also the worst possible truth because it tells you nothing about what was actually broken.

Now next quarter rolls around. The same sequence happens. Different endpoint, same pattern of cascade. Your error budget takes another hit. You notice the pattern in a retro meeting three weeks later. By then you've spent another month's worth of budget on variants of the same underlying issue.

The error budget was never helping you. It was just a counter on the wall that said "you're broken" with no instruction on how to stop being broken.

Where Observability Meets Error Budget: The Actual Fix

Real error budgets need real observability. Not a dashboard with 47 charts you check post-incident. Instrumentation that connects the failure to the decision.

Start here: structured logs with trace context. When something fails, the log entry needs to know which request it came from, which service it touched next, which database query it triggered. Not as a string that you parse later. As fields you query now.

Run a query like this and get an answer in seconds:

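SELECT service, endpoint, failure_mode,
       SUM(budget_impact_minutes) AS budget_impact
FROM request_events  -- hypothetical table name
WHERE error_budget_consumed = true
  AND timestamp > now() - INTERVAL 30 DAY
GROUP BY service, endpoint, failure_mode
ORDER BY budget_impact DESC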

Your failure modes should be explicit in your instrumentation. Not "error: request failed". Something like:

{
  "failure_mode": "database_connection_pool_exhausted",
  "request_count": 1200,
  "affected_endpoints": ["POST /api/checkout", "GET /api/user"],
  "budget_impact_minutes": 12.4,
  "first_occurrence": "2024-11-15T08:23:11Z"
}

That's data you can act on.

Then: cardinality on the dimensions that matter. You care about:

Which service broke (service name)
Which endpoint failed (path, method)
Which user segment (geography, subscription tier, traffic source)
Which dependency cascaded the failure (database, cache, external API, queue)
The mechanism itself (timeout, memory, connection pool, rate limit)

The trick is logging those dimensions at instrumentation time, not extracting them from strings later. In a multi-tenant system, where different customer environments could hit different resource constraints, you need to know which tenant's requests started failing first. That's information you build in at the logging layer, not something you reverse engineer from the crash.

Most teams get this wrong. They log events. They don't log the dimensions of the events. So you end up with:

ERROR: checkout service failed
ERROR: checkout service failed
ERROR: checkout service failed
... 1,247 times ...

And a Slack message that says "checkout is down" and no way to know if it's down for everyone, one customer, or one geographic region.
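For contrast, here is a minimal sketch of logging the dimensions instead of a string, using only Python's standard library. The field names follow the ones used in this post; the `build_event` and `log_request_event` helpers, the logger name, and the in-memory buffer are assumptions made for illustration, not a prescribed setup.

```python
import io
import json
import logging
from datetime import datetime, timezone


def build_event(trace_id, service, endpoint, tenant_id, region,
                status_code, failure_mode=None):
    """Return one request event as a dict of queryable fields."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "trace_id": trace_id,
        "service": service,
        "endpoint": endpoint,
        "tenant_id": tenant_id,
        "region": region,
        "status_code": status_code,
        "failure_mode": failure_mode,  # None on success
    }


def log_request_event(logger, **fields):
    """Emit the event as one JSON line so the backend can index each field."""
    event = build_event(**fields)
    level = logging.ERROR if event["failure_mode"] else logging.INFO
    logger.log(level, json.dumps(event))
    return event


# Route log lines to an in-memory buffer (a stand-in for stdout or a log shipper).
buffer = io.StringIO()
logger = logging.getLogger("checkout")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(buffer))
logger.propagate = False

log_request_event(
    logger,
    trace_id="xyz",
    service="checkout",
    endpoint="POST /api/checkout",
    tenant_id="merchant_456",
    region="my-southeast",
    status_code=503,
    failure_mode="database_connection_pool_exhausted",
)
logged = json.loads(buffer.getvalue())
```

Now "checkout is down" comes with a tenant, a region, and a mechanism attached, and each of them is a field you can filter on.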

The Observability -> Error Budget Loop

Once you have this, something shifts. Your error budget becomes a driver of prioritization, not just a flag on the wall.

You run your query. You see: "We spent 60 minutes of budget this month. 45 of it came from a connection pool timeout in the payment service. 12 came from a cache miss cascade in the user service. 3 came from an external API SLA breach."

Now you can prioritize:

The connection pool timeout is killing your budget. It's predictable, preventable, and probably easy to fix (connection pool tuning, circuit breaker, bulkhead isolation).
The cache cascade is also preventable. Probably needs cache warming or a fallback path.
The external API SLA breach: can't fix the vendor. But you can add a fallback, or circuit break them earlier to preserve budget for your own service.
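"Circuit break them earlier" can be sketched roughly like this. It is a deliberately small breaker, assuming a consecutive-failure threshold and a fixed cool-down; the class name, the parameters, and the `flaky_vendor` / `cached_rate` functions are all hypothetical, and a production breaker would also need thread safety and per-dependency tuning.

```python
class CircuitBreaker:
    """Stop calling a failing dependency for a cool-down period."""

    def __init__(self, failure_threshold, reset_after_seconds):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.consecutive_failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, fallback, now):
        if self.opened_at is not None:
            if now - self.opened_at < self.reset_after_seconds:
                return fallback()      # open: skip the dependency entirely
            self.opened_at = None      # cool-down elapsed: allow one trial call
        try:
            result = func()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = now   # (re)open the breaker
            return fallback()
        self.consecutive_failures = 0
        return result


calls = []


def flaky_vendor():
    calls.append("vendor")
    raise TimeoutError("vendor timed out")


def cached_rate():
    return "cached"


breaker = CircuitBreaker(failure_threshold=2, reset_after_seconds=30)
results = [breaker.call(flaky_vendor, cached_rate, now=t) for t in (0, 1, 2, 3)]
```

After two failures the vendor stops being called, so the remaining requests return the fallback quickly instead of waiting on a timeout that eats your budget.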

The error budget is now a tool. Not a compliance checkbox.

You see patterns. If you're bleeding budget in the same service repeatedly, that service is a reliability liability. It becomes a project priority. You have numbers. "This service has consumed 30% of our error budget despite being 5% of our traffic." That's a data-driven reliability story, not a guess.
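The "30% of budget, 5% of traffic" comparison is simple arithmetic once both numbers are available per service. A rough sketch, with made-up numbers and a hypothetical `budget_vs_traffic` helper; it assumes every service has non-zero traffic:

```python
def budget_vs_traffic(budget_minutes, request_counts):
    """Compare each service's share of consumed budget with its share of traffic."""
    total_budget = sum(budget_minutes.values())
    total_requests = sum(request_counts.values())
    rows = []
    for service, minutes in budget_minutes.items():
        budget_share = minutes / total_budget
        traffic_share = request_counts[service] / total_requests
        rows.append({
            "service": service,
            "budget_share": budget_share,
            "traffic_share": traffic_share,
            # Above 1.0: the service burns more budget than its traffic suggests.
            "disproportion": budget_share / traffic_share,
        })
    return sorted(rows, key=lambda r: r["disproportion"], reverse=True)


# Illustrative numbers only, not measurements.
budget_minutes = {"payment": 18.0, "user": 30.0, "search": 12.0}
request_counts = {"payment": 50_000, "user": 600_000, "search": 350_000}

ranking = budget_vs_traffic(budget_minutes, request_counts)
```

Sorting by the ratio rather than by raw minutes is the design choice here: the user service spends the most minutes in this example, but the payment service is the one spending far out of proportion to its traffic.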

If one user segment is disproportionately affected by failures, that's a signal. You can reproduce and debug with precision because you logged the dimensions.

What Bad Observability Looks Like (And Costs You)

The opposite: you have error budgets but observability that's mostly metrics.

You check Prometheus. You see CPU spikes. You see memory spikes. You see request latency spikes. You don't see why they spiked together or which business transaction was affected or which customer noticed. You're looking at symptoms through a straw.

Now you're in the classic incident loop: someone suggests it was probably the deploy from 8 minutes ago. Someone else says maybe it's the database. A third person thinks it might be a traffic spike. You revert the deploy. Metrics go back down. Did the deploy actually break it, or did it just coincide with something else? You'll never know because you didn't log the right dimensions.

Your error budget counter goes red. No idea why. Next incident, same dance.

This is why teams with strong observability can burn through their error budgets intentionally (shipping risky features, scaling experiments, chaos engineering tests) while teams with weak observability burn through them blindly (a third-party API timeout, a configuration drift, a queue buildup, something nobody noticed until the budget was gone).

The Actual Implementation: Three Layers

Layer one: instrument for failure modes, not just metrics.

In your request handler, log:

{
  "trace_id": "xyz",
  "service": "payment",
  "endpoint": "POST /checkout",
  "failure_mode": null,
  "database_query_time_ms": 145,
  "external_api_call_time_ms": 230,
  "cache_hit": false,
  "user_id": "cust_123",
  "tenant_id": "merchant_456",
  "region": "my-southeast",
  "status_code": 200
}

If something fails, failure_mode gets populated with a value such as "timeout", "connection_pool_exhausted", "rate_limited", "cache_miss", or "dependency_circuit_breaker_open". On success it stays null.
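One way to populate failure_mode without touching every handler is a wrapper that classifies exceptions. This is a sketch under assumptions: the `PoolExhausted` and `RateLimited` exception classes stand in for whatever your database driver and HTTP client actually raise, and the `events` list stands in for a real log sink.

```python
import functools
import time


# Hypothetical exception types; substitute the ones your libraries raise.
class PoolExhausted(Exception): ...
class RateLimited(Exception): ...


FAILURE_MODES = [
    (PoolExhausted, "connection_pool_exhausted"),
    (RateLimited, "rate_limited"),
    (TimeoutError, "timeout"),
]


def classify(exc):
    """Map an exception to an explicit failure mode label."""
    for exc_type, mode in FAILURE_MODES:
        if isinstance(exc, exc_type):
            return mode
    return "unclassified"


def instrumented(service, endpoint, sink):
    """Wrap a handler so every call appends one structured event to `sink`."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            event = {"service": service, "endpoint": endpoint,
                     "failure_mode": None, "status_code": 200}
            start = time.monotonic()
            try:
                return handler(*args, **kwargs)
            except Exception as exc:
                event["failure_mode"] = classify(exc)
                event["status_code"] = 500
                raise  # the caller still sees the original error
            finally:
                event["duration_ms"] = (time.monotonic() - start) * 1000
                sink.append(event)
        return wrapper
    return decorator


events = []


@instrumented("payment", "POST /checkout", events)
def checkout(pool_available):
    if not pool_available:
        raise PoolExhausted("no free connections")
    return "ok"


checkout(True)
try:
    checkout(False)
except PoolExhausted:
    pass
```

Anything the mapping doesn't recognise lands in "unclassified", which is itself a useful number: if it grows, your failure mode list is out of date.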

Layer two: aggregate by dimensions, not just count.

Instead of "error rate", calculate "error rate by service by endpoint by failure_mode by tenant". Use your logging backend to bucket these. ClickHouse, BigQuery, or even Datadog LTS if that's what you're already paying for. The point: you can query "show me all the failure modes that consumed more than 5 minutes of my error budget this month" and get a ranked list.
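In a real setup this aggregation would run inside your logging backend. To show the shape of the result, here is the same idea in plain Python, assuming each event already carries a `budget_impact_minutes` field; the `rank_failure_modes` helper and the sample events are invented for illustration.

```python
from collections import defaultdict


def rank_failure_modes(events, min_minutes=5.0):
    """Sum budget impact per (service, endpoint, failure_mode, tenant), largest first."""
    totals = defaultdict(float)
    for e in events:
        if e["failure_mode"] is None:
            continue  # successful requests consume no budget
        key = (e["service"], e["endpoint"], e["failure_mode"], e["tenant_id"])
        totals[key] += e["budget_impact_minutes"]
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(key, minutes) for key, minutes in ranked if minutes > min_minutes]


def make_event(service, endpoint, failure_mode, tenant_id, minutes):
    return {"service": service, "endpoint": endpoint, "failure_mode": failure_mode,
            "tenant_id": tenant_id, "budget_impact_minutes": minutes}


# Made-up sample data.
events = [
    make_event("payment", "POST /checkout", "connection_pool_exhausted", "merchant_456", 4.0),
    make_event("payment", "POST /checkout", "connection_pool_exhausted", "merchant_456", 3.5),
    make_event("user", "GET /api/user", "cache_miss", "merchant_456", 2.0),
    make_event("payment", "POST /checkout", "timeout", "merchant_789", 6.0),
    make_event("payment", "POST /checkout", None, "merchant_456", 0.0),
]

ranked = rank_failure_modes(events)
```

The cache_miss group totals 2 minutes, so it falls below the 5-minute cutoff and drops out of the ranked list.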

Layer three: connect the error budget calculation to the observability query.

Your SLO says "99.9% availability on POST /checkout". Your observability needs to answer: "how many POST /checkout requests in the last 30 days succeeded vs failed?" and then calculate the actual error budget consumption from that, not from a separate alert or manual calculation.

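error_budget_total = 30 days * 24 hours * 60 minutes * (1 - slo)
= 30 * 24 * 60 * 0.001 = 43.2 minutes

actual_budget_used = sum(error_duration_by_failure_mode)
error_budget_remaining = error_budget_total - actual_budget_used

And crucially: actual_budget_used should match the consumption your SLO tracker or alerting reports for the same window. If the two disagree, your observability is lying to you, and your error budget is doubly worthless.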
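Since the SLO above is stated per request while the budget is stated in minutes, it helps to be explicit about how one maps to the other. A sketch, assuming a request-based SLO over a 30-day window; the function name, the "minutes equivalent" conversion, and the request counts are assumptions for illustration.

```python
def error_budget_report(total_requests, failed_requests, slo=0.999, window_days=30):
    """Derive error budget consumption from request counts."""
    window_minutes = window_days * 24 * 60
    budget_minutes = window_minutes * (1 - slo)   # 43.2 for 99.9% over 30 days
    allowed_failures = total_requests * (1 - slo)
    if allowed_failures == 0:
        consumed_fraction = 0.0                   # no traffic, nothing consumed
    else:
        consumed_fraction = failed_requests / allowed_failures
    return {
        "budget_minutes": budget_minutes,
        "allowed_failures": allowed_failures,
        "consumed_fraction": consumed_fraction,
        # Expresses request failures on the same scale as the time-based budget.
        "consumed_minutes_equivalent": consumed_fraction * budget_minutes,
    }


# Made-up counts for POST /checkout over the window.
report = error_budget_report(total_requests=2_000_000, failed_requests=500)
```

With these numbers, 500 failures against roughly 2,000 allowed is a quarter of the budget, or about 10.8 of the 43.2 minutes.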

Why Most Teams Skip This

Building observability that drives error budgets takes work. It's cheaper in the short term to just track SLO breaches in a spreadsheet. You don't need to instrument a dozen services. You don't need to architect a logging pipeline. You don't need to think about cardinality and cost.

And for the first six months it feels fine. You breach your SLO maybe once a quarter. You spend a day in a retro. You move on.

Then you ship a multi-tenant system. Or you cross the threshold where one team's service affects three others. Or you scale to the point where cascading failures become common enough to notice. Suddenly your error budget is spinning red every other week and you have no idea why.

By then, rebuilding observability from scratch is a two-sprint project, and you've already trained your organization to ignore the error budget because it's been meaningless noise for months.

The Mistake That Feels Like Progress

Here's the trap: building dashboards with your existing observability and calling it done.

You make a dashboard. It shows your error budget consumption by day. You make another dashboard that shows your SLO percentage by service. You link them. You feel like you've solved the problem.

You haven't. You've added visual noise. The dashboards don't tell you why the error budget is red. They just confirm that it is.

Worse: dashboard-driven observability is a tool for post-incident analysis, not incident prevention. You look at it after something breaks. It doesn't help you catch the pattern before the budget is spent.

What you actually need is queryable observability. The ability to ask questions in the moment: "show me all failures in the last hour that would breach this SLO" or "what's the p99 latency for requests that hit the payment service vs requests that don't" or "which tenant is consuming the most error budget this month and why".
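To make "queryable" concrete, here is the p99 question answered over raw events in Python. Treat it as a sketch: the `latency_ms` and `services_touched` fields are assumed to exist on each event, the nearest-rank percentile is one of several common definitions, and your backend would normally do this for you.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; `values` must be non-empty."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def p99_by_payment_path(events):
    """Split events by whether they touched the payment service, then take p99 of each."""
    touched = [e["latency_ms"] for e in events if "payment" in e["services_touched"]]
    other = [e["latency_ms"] for e in events if "payment" not in e["services_touched"]]
    return {
        "payment": percentile(touched, 99) if touched else None,
        "other": percentile(other, 99) if other else None,
    }


# Synthetic events: 100 requests through payment, 50 that never touch it.
events = (
    [{"latency_ms": ms, "services_touched": ["gateway", "payment"]} for ms in range(101, 201)]
    + [{"latency_ms": ms, "services_touched": ["gateway", "user"]} for ms in range(1, 51)]
)

result = p99_by_payment_path(events)
```

A dashboard can only show the splits someone thought of in advance. A question like this one works only if the dimension was logged as a field in the first place.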

That's a fundamentally different architecture from dashboards.

How to Start Today

You don't need to rebuild everything. You probably already have logging somewhere. Start:

Add failure mode tags to your critical request paths. In your HTTP middleware or your function handler, catch exceptions and tag them. "database_connection_pool_exhausted", "timeout", "rate_limited", "circuit_breaker_open", "bad_response_from_dependency".
Add cardinality to your logs. Service name, endpoint, user/tenant segment, dependency name. These should be structured fields, not strings.
Pick one SLO you actually care about. Something user-facing that would hurt if it broke. Run a query: "how many times did we miss this SLO last month, and for which failure modes?"
Estimate: if we'd prevented just the top failure mode, how much error budget would we have left? That's your signal.
Now: set up an alert that fires the moment that failure mode hits a threshold. Not "CPU is high". Not "requests are slow". "Connection pool is exhausted on the payment service." Actionable.
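The last step, an alert keyed on a failure mode rather than a symptom, can be as small as a sliding-window counter. A sketch with a hypothetical `FailureModeAlert` class; the threshold and window values are placeholders, and in practice you would likely express this as a rule in your alerting system instead.

```python
from collections import deque


class FailureModeAlert:
    """Fire when one failure mode on one service hits a threshold within a time window."""

    def __init__(self, service, failure_mode, threshold, window_seconds):
        self.service = service
        self.failure_mode = failure_mode
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._hits = deque()  # timestamps of matching events

    def observe(self, event, now):
        """Record one event; return an alert message if the threshold is reached, else None."""
        if event["service"] != self.service or event["failure_mode"] != self.failure_mode:
            return None
        self._hits.append(now)
        # Drop hits that have fallen out of the window.
        while self._hits and self._hits[0] <= now - self.window_seconds:
            self._hits.popleft()
        if len(self._hits) >= self.threshold:
            return (f"{self.failure_mode} on {self.service}: "
                    f"{len(self._hits)} hits in {self.window_seconds}s")
        return None


alert = FailureModeAlert("payment", "connection_pool_exhausted",
                         threshold=3, window_seconds=60)
event = {"service": "payment", "failure_mode": "connection_pool_exhausted"}

messages = [alert.observe(event, now=t) for t in (0, 10, 20)]
later = alert.observe(event, now=100)  # earlier hits have expired by now
```

The message names the mechanism and the service, which is the difference between "requests are slow" and something the on-call engineer can act on.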

That's not a complete observability system. It's the beginning of one that actually connects to your error budget.

The Result

Three months later: you're not burning through error budgets mysteriously. You know exactly where the budget went. You're not fixing the same failure modes twice, because you caught the pattern the first time. Your incident retros are shorter because everyone knows what happened.

Your error budget stops being a compliance number and becomes what it should always have been: a tool that tells you where your system is actually fragile.

Written by

Faiz Kasman

Software engineer in Kuala Lumpur. Payments, multi-tenant SaaS, and inventory infrastructure.
