Why Your SRE Team's Toil Budget Is Lying to You
Most engineering organizations track toil as a single number. If they track it at all.
That single number is hiding the most important signal in your reliability practice.
Toil Is Two Buckets. Most Teams Track Neither.
When you plan your team’s capacity, there are three categories of work you need to account for.
Reactive Capacity: Outages, incidents, pages, interrupt-driven work. Things you cannot plan for, but must absorb. Your team is in it right now, or they were last night.
Remediation Capacity: Alert tuning, runbook creation, tech debt paydown, ownership cleanup. Proactive work that prevents the next incident. It can be planned and scheduled. Almost nobody does it.
Project Capacity: Sprint work, customer features, internal tooling. The work leadership normally tracks. The work that shows up in the roadmap.
Reactive and Remediation together are your toil. That is the work that consumes capacity without advancing the product. Project work is the third category and separate from both. It is where you want the majority of your team’s time to go.
Most teams formally plan only project work and absorb toil as invisible overhead. This makes the cost of reactive work impossible to measure and makes remediation work impossible to protect.
The Catch Nobody Warns You About
The goal is to move time from Reactive into Remediation. Fewer fires. More prevention. Better sleep.
Here is the part that trips every team up: you have to commit to Remediation time before your Reactive load improves. Not after things settle down. Not once you get through this quarter’s incidents. Before. While you are still drowning.
That time is hard to find. It will feel irresponsible to protect it. That is exactly why leaders have to be the ones to protect it, visibly and repeatedly, even when it is inconvenient.
When that time is protected and the debt starts moving, Reactive load shrinks. As Reactive shrinks, Remediation and Project capacity both grow. The investment compounds.
When enough technical debt is paid down, you will see Remediation time start migrating into Project time. SRE and DevOps teams shift from fighting fires to building systems that guide engineers toward better decisions in the first place.
What Commitment Actually Looks Like
Commitment is visible. Your team hears and sees you carve out and protect Remediation capacity in each sprint or planning cycle. They know they will not win every cross-team priority battle. But they know you are in the fight with them.
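One way to make that carve-out concrete is to reserve Remediation hours first and let Project take what is left. The sketch below is only an illustration: the function name `plan_sprint`, the 15% default share, and the sample hours are all assumptions, not figures from any team or from the book.

```python
def plan_sprint(total_hours, expected_reactive_hours, remediation_share=0.15):
    """Split one sprint's hours into the three buckets.

    Remediation is reserved before anything else, so a bad on-call
    forecast squeezes Project work rather than prevention work.
    The 15% default is an illustrative assumption, not a recommendation.
    """
    if not 0 <= remediation_share <= 1:
        raise ValueError("remediation_share must be between 0 and 1")

    # Step 1: protect Remediation up front.
    remediation = total_hours * remediation_share

    # Step 2: Reactive absorbs what is forecast, but never eats the reservation.
    reactive = min(expected_reactive_hours, total_hours - remediation)

    # Step 3: Project gets whatever remains.
    project = total_hours - remediation - reactive

    return {"remediation": remediation, "reactive": reactive, "project": project}


# Hypothetical team: 480 available hours, 200 of them expected to go to incidents.
plan = plan_sprint(total_hours=480, expected_reactive_hours=200)
for bucket, hours in plan.items():
    print(f"{bucket:<12}{hours:6.1f} h")
```

If the Reactive forecast alone exceeds what is left after the reservation, Project drops to zero and the plan shows the shortfall openly instead of taking it out of Remediation.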
Here is an example from my own work as an Observability Architect doing On-Call Reviews with SRE teams.
We kept getting paged for latency. SREs knew the drill: scale the database, shift load to replicas, get things moving again. It worked, mostly. But under review, every one of these incidents traced back to the same root cause: poor SQL code.
The real fix required code commits, deployments, and coordination across the SRE team, the data platform team, and the engineers who owned the feature. That coordination almost never happened on its own. Everyone was overbooked.
When leadership stepped in, protected the cross-team time, and unblocked the priority conflict, an entire class of pageable events disappeared. Engineers walked away with better SQL skills. Customer support cases visibly dropped.
That is what committed leadership unlocks. Not just faster recovery. Fewer incidents in the first place.
The Number to Track
The Google SRE book established a ceiling: toil should be 50% or less of your SRE team’s time. That recommendation is right. But it does not tell you how to use the number.
The goal is to keep toil, Reactive and Remediation combined, at 50% or less. That number is a health indicator.
Teams over 50% have too much tech debt and too many reliability issues to scale. They are losing ground. Teams under 50% are usually in good shape.
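The arithmetic behind that indicator is simple enough to sketch. The class name `CapacityHours` and the sample hours are hypothetical; the only figure taken from the text is the 50% ceiling.

```python
from dataclasses import dataclass

TOIL_CEILING = 0.50  # the 50% ceiling discussed above


@dataclass
class CapacityHours:
    """Hours logged in one planning cycle, split into the three buckets."""
    reactive: float
    remediation: float
    project: float

    @property
    def toil(self) -> float:
        # Toil, as this article defines it, is Reactive plus Remediation.
        return self.reactive + self.remediation

    @property
    def toil_share(self) -> float:
        total = self.toil + self.project
        return self.toil / total if total else 0.0

    @property
    def over_ceiling(self) -> bool:
        return self.toil_share > TOIL_CEILING


# Hypothetical numbers for one sprint.
sprint = CapacityHours(reactive=140, remediation=60, project=280)
print(f"toil share: {sprint.toil_share:.1%}")   # 200 of 480 hours
print(f"over ceiling: {sprint.over_ceiling}")
```

Keeping Reactive and Remediation as separate fields matters: two teams can report the same toil share while one spends it almost entirely on fires and the other mostly on prevention.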
Track the split from your pager data and your ticket system. Tag work to the bucket it belongs in. Make the invisible visible.
If you can’t measure the split, you can’t shift it.
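A minimal sketch of that tagging step, assuming you can export pager and ticket records as simple rows with tags and hours. The tag names, the `BUCKET_BY_TAG` mapping, and the record shape are all hypothetical; substitute whatever labels your own tools already use.

```python
from collections import Counter

# Hypothetical tag-to-bucket mapping; adapt it to your own pager and ticket labels.
BUCKET_BY_TAG = {
    "incident": "reactive",
    "page": "reactive",
    "interrupt": "reactive",
    "alert-tuning": "remediation",
    "runbook": "remediation",
    "tech-debt": "remediation",
    "feature": "project",
    "tooling": "project",
}


def bucket_for(tags):
    """Return the bucket of the first recognized tag, or 'untagged'."""
    for tag in tags:
        bucket = BUCKET_BY_TAG.get(tag)
        if bucket is not None:
            return bucket
    return "untagged"


def split_hours(items):
    """Return each bucket's share of the total logged hours."""
    hours = Counter()
    for item in items:
        hours[bucket_for(item["tags"])] += item["hours"]
    total = sum(hours.values())
    if total == 0:
        return {}
    return {bucket: h / total for bucket, h in hours.items()}


# Made-up records standing in for a pager export and a ticket export.
items = [
    {"source": "pager", "tags": ["page"], "hours": 3.0},
    {"source": "pager", "tags": ["incident"], "hours": 5.0},
    {"source": "tickets", "tags": ["runbook"], "hours": 4.0},
    {"source": "tickets", "tags": ["feature"], "hours": 6.0},
    {"source": "tickets", "tags": [], "hours": 2.0},
]

split = split_hours(items)
for bucket, share in sorted(split.items()):
    print(f"{bucket:<12}{share:.0%}")
```

The "untagged" bucket is kept on purpose. Its size shows how much work is still invisible, which is usually the first number worth driving down.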
This Is a Chapter in a Book I’m Writing
The SRE On-Call Review Practice is a hands-on guide for SRE and DevOps practitioners and the leaders who support them. It covers the full lifecycle of building a weekly on-call review practice, from getting the first meeting on the calendar to building the organizational habits that make reliability improvements stick.
The framework above is part of a broader chapter on organizational commitment and time capacity planning. There is a lot more where this came from.
Grab the preview and get early access when the book ships.
Preview readers get first access to the finished book, and I welcome feedback. This is the kind of book that gets better with real-world input from practitioners and leaders who have lived these problems.