I Wrote a Book About the Thing That Almost Broke Me

 


[Image: The SRE On-Call Review Practice book cover]

I remember the knock on the door.

I don’t remember what I was doing. Probably something that felt urgent. Something I was under pressure to deliver, which was pretty much everything. What I remember is that when someone knocked on the door of my office — a shed in the back yard — I started screaming. I didn’t know who it was yet. It didn’t matter. It was another interruption from another direction, and every neuron in my body had been trained to respond to interruption with panic.

I scared my friend. I scared myself. That’s when I knew I needed a therapist.

I’ve been in this industry for 25 years. I’ve worked on systems at Fortune 500 scale: 150+ TiB of logs a day, 8 million samples per second, 400 million active time series. I’ve watched teams get paged into the ground and then quietly rebuild, year after year, usually without ever naming what was happening to them. I’ve had the burnout. I’ve had the bad rotations. And I’ve spent years helping teams find their way back from them.

Nobody had written the book I kept wishing existed. Vendors tried to fill the space, but let’s just say their version was a bit one-sided. I spent years creating a practice to help my teams recover, and I finally wrote it down.

The Problem Is Bigger Than Your Alerting Config

Here’s what I see, over and over, when I work with a team that’s carrying an on-call problem: they think the problem is technical. Tune the thresholds. Add more runbooks. Add AI. Consolidate the noisy alerts. And yes, those things matter. But they’re treating symptoms.

The deeper problem is what sustained, unpredictable interruption does to a human being over time.

Phantom Vibration Syndrome is real. It’s the hallucination of a page or notification when none has fired, a conditioned fear response that develops after months or years of being constantly on alert. Researchers first documented it in hospital staff. It’s now common enough in engineering that most people I talk to recognize it immediately when I name it.

Attention residue is real. Gloria Mark’s research at UC Irvine found it takes an average of 23 minutes to return to a complex task after an interruption. If your engineers are getting paged multiple times a shift, they are never fully recovering between pages. They’re operating on the residue of focus they never quite got back.
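To make that arithmetic concrete, here is a minimal sketch. It assumes every page costs a flat 23-minute recovery window, that overlapping windows merge, and that handling the page itself takes no time, so it understates the real cost. The function name and the sample page times are hypothetical.

```python
RECOVERY_MINUTES = 23  # the average recovery time cited above


def focused_minutes(shift_minutes: int, page_times: list[int]) -> int:
    """Minutes of a shift that fall outside every post-page recovery window.

    page_times are minutes since the start of the shift. Overlapping
    recovery windows are merged, so a minute is only counted as lost once.
    """
    recovering = set()
    for t in page_times:
        end = min(t + RECOVERY_MINUTES, shift_minutes)
        recovering.update(range(t, end))
    return shift_minutes - len(recovering)


# Three pages in an 8-hour shift: about an hour of focus is gone.
print(focused_minutes(480, [30, 45, 200]))           # 419

# A page every 20 minutes: no fully recovered minute is left.
print(focused_minutes(480, list(range(0, 480, 20))))  # 0
```

The second case is the point: once pages arrive faster than the recovery window, the number of focused minutes drops to zero, even though no single page looked expensive.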

Decision fatigue is real. Every time someone on-call has to evaluate an alert (is this real, how bad is it, who needs to know, what do I do), they’re drawing on a finite cognitive budget. Late in a bad rotation, that budget is gone. The decisions get worse. The mistakes get made.

These aren’t failures of character or focus. They’re the predictable outputs of a system that treats engineers like perpetual motion machines.

I’ve believed something for a long time, and this book says it plainly: the foundation of any reliable system is the sustained health and well-being of the people who build it. There is no business imperative, no uptime SLA, no customer escalation that supersedes this truth. When leadership treats this as a soft concern rather than an operational one, the system eventually makes the case for them, usually at the worst possible moment.

Why I Finally Wrote It Down

I kept seeing the same pattern. More and more pages. The rotation would become unsustainable. Leadership would notice. Someone would try to fix it, usually by tuning alerts or hiring more people, and the underlying dysfunction would persist. Six months later, new faces, same problem.

Nobody had a playbook that took both dimensions seriously at once: the human cost and the operational mechanics. The vendor-sponsored content treats it as a product problem. The academic literature is too abstract to hand to an SRE team.

I’ve spent decades accumulating pattern recognition across large, complex systems. I’ve worked through the research on cognitive science and reliability, through the data behind Accelerate, through Karl Weick’s work on small wins. And I’ve watched those things work in practice, on real teams, in real rotations. I’ve seen a team go from dreading their pager to actually trusting their alerts again.

It was time to write it down.

Three Responses to Any Alert

Here’s one of the first things in the book, because I think it’s genuinely useful on its own.

Every alert that fires during an on-call shift has exactly three valid responses. Only three.

Action It. It’s a real event. You own it. You acknowledge, you assess impact, you remediate, and when it’s done, you update the runbook. The next person who sees this alert should be a little better equipped than you were.

Fix the Alert Rule. It’s not a real event. Or the thresholds are wrong. Or the signal doesn’t mean what the alert says it means. Either way, the alert itself is the problem. The fix isn’t in production. The fix is in your alerting configuration.

Escalate. This isn’t yours. It belongs to another team, another system, another owner. Route it correctly. Then, critically, update the routing so it doesn’t come back to you next time.

That’s it. Three options. The discipline is in actually choosing one, every time, without letting alerts accumulate in an unresolved pile where their status becomes ambiguous and their eventual resolution becomes impossible to track.
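The three responses above can be sketched as a small decision function. This is an illustration under stated assumptions, not the book’s implementation: the `Alert` fields, the `triage` and `unresolved` names, and the choice to check ownership before anything else are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Response(Enum):
    ACTION_IT = "action it"
    FIX_ALERT_RULE = "fix the alert rule"
    ESCALATE = "escalate"


@dataclass
class Alert:
    name: str
    is_real_event: bool  # did something actually go wrong?
    owned_by_us: bool    # does this alert belong to our team?
    response: Optional[Response] = None  # None means still in the pile


def triage(alert: Alert) -> Response:
    """Choose exactly one of the three responses and record it on the alert.

    Assumption: ownership is checked first, so an alert that is not ours
    is escalated even if it also looks like noise.
    """
    if not alert.owned_by_us:
        chosen = Response.ESCALATE
    elif alert.is_real_event:
        chosen = Response.ACTION_IT
    else:
        chosen = Response.FIX_ALERT_RULE
    alert.response = chosen
    return chosen


def unresolved(alerts: list[Alert]) -> list[str]:
    """Names of alerts that nobody has chosen a response for yet."""
    return [a.name for a in alerts if a.response is None]


shift = [
    Alert("disk-full", is_real_event=True, owned_by_us=True),
    Alert("flappy-cpu", is_real_event=False, owned_by_us=True),
    Alert("payments-5xx", is_real_event=True, owned_by_us=False),
    Alert("untouched", is_real_event=True, owned_by_us=True),
]
for alert in shift[:3]:
    print(alert.name, "->", triage(alert).value)
print("still in the pile:", unresolved(shift))
```

The `unresolved` helper is the part that mirrors the discipline: an alert with no recorded response is exactly the ambiguous pile the practice is meant to prevent.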

This sounds simple. It is not easy. The book is about building the practice that makes it sustainable.

The Book

The SRE On-Call Review Practice: A practical framework for combating alert fatigue and rebuilding on-call trust is the first book in the Observability Practitioner Series from Cardinality Cloud.

It covers alert standards and hygiene, the hidden mechanics of grouping and auto-resolution, how to design alerts that mean something, how to run the weekly review, how to measure whether things are actually getting better, and how to make the case to leadership that this work is worth protecting.

It’s available June 1st, 2026. Print edition on Amazon KDP for $14.99. PDF edition from this site.

Get the Free Preview Now

Before the book launches, I’m making a preview available: the full table of contents, the introduction, and the first chapter. If you read it and something resonates, or something’s missing, I want to hear from you.

And you’ll be the first to know the moment the book goes live.

If you’re reading this because someone on your team sent it to you, pay attention to that. They’re telling you something.

And if you’ve been in this, if you recognized yourself somewhere in these paragraphs, I hope the book gives you language for what you’ve lived through, and a path out.