Observability Isn't Tools, It's How You Think
When an incident hits, most teams don’t lack data. They lack observability. They lack clarity.
Observability isn’t tooling. It’s not about vendors. It’s not dashboards, metrics, or traces. Observability is a practice. The practice of knowing what we understand about production systems. The practice of reasoning from evidence under uncertainty.
What I’m about to walk through isn’t a framework for observability. It IS observability. This is how we practice knowing what we understand about production systems.
The Five Steps ARE Observability
These five steps define observability as a practice:
- Ask falsifiable questions
- Understand your measurements
- Design valid comparisons
- Interpret evidence
- Build knowledge
This is epistemics applied to production systems. How we know what we understand when systems are failing. How we move from confusion to knowledge.
We’ve all seen this: dashboards everywhere, alerts firing, and the same questions keep coming up. What’s going on? Is it the database? Did the deploy break something?
Step 1: Ask Falsifiable Questions
The first failure in most incidents isn’t technical. It’s epistemic.
We ask questions that feel urgent but aren’t answerable:
- “Is the database slow?”
- “Is the network acting up?”
- “Could it be Kubernetes?”
These questions cannot be proven wrong. Which means they can’t be answered, only argued.
What Makes a Question Falsifiable?
A falsifiable question sounds like this: For checkout requests in us-east-1, did the schema migration at 14:05 increase p95 database latency compared to the previous deploy?
Now we have:
- Scope: checkout requests in us-east-1
- Comparison: before and after the schema migration
- Measurable outcome: p95 database latency
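That question translates almost directly into something you can run. Here is a minimal sketch against the Prometheus HTTP API, assuming a server at localhost:9090 and a hypothetical db_query_duration_seconds histogram labeled by service and region; the metric name, labels, and timestamps are illustrative, not anything your system necessarily exposes.

```python
# A minimal before/after check for the falsifiable question above.
# Metric and label names are hypothetical examples, not a prescription.
import requests

PROM = "http://localhost:9090/api/v1/query"
P95 = (
    'histogram_quantile(0.95, sum by (le) ('
    'rate(db_query_duration_seconds_bucket{service="checkout",region="us-east-1"}[10m])))'
)

def p95_at(unix_ts: float) -> float:
    """Evaluate the p95 expression at one point in time."""
    resp = requests.get(PROM, params={"query": P95, "time": unix_ts}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")

migration_ts = 1_700_000_000          # hypothetical: the 14:05 schema migration
before = p95_at(migration_ts - 600)   # ten minutes before the migration
after = p95_at(migration_ts + 600)    # ten minutes after
print(f"p95 before: {before:.3f}s, after: {after:.3f}s")
# Either the p95 moved after the migration or it didn't.
```

Either the number moved or it didn't. That is what makes the question falsifiable.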
Tools That Help
Deploy markers, version labels, trace annotations: these don't answer questions. They make better questions possible.
Deploy markers on your dashboards, version labels in your Prometheus metrics (like a build_info metric), Kubernetes labels and annotations, and distributed tracing all enable this practice.
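As one concrete sketch of the versioning piece, assuming the Python prometheus_client library: a build_info-style gauge whose value is always 1 and whose labels carry the version, so any other series can be compared or joined against it. The metric name and labels follow a common convention but are not required by anything.

```python
import time
from prometheus_client import Gauge, start_http_server

# Value is always 1; the labels do the work.
BUILD_INFO = Gauge(
    "myapp_build_info",
    "Build metadata for this process: the value is always 1, the labels carry the version.",
    ["version", "git_sha"],
)

if __name__ == "__main__":
    # Hypothetical identifiers; in practice these come from your build pipeline.
    BUILD_INFO.labels(version="1.4.7", git_sha="0abc123").set(1)
    start_http_server(8000)   # expose /metrics for Prometheus to scrape
    while True:
        time.sleep(60)
```

The point is that a version label turns "did the deploy break it?" into a comparison you can actually run.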
Common Mistake
If you can’t say what evidence would prove your question wrong, the question isn’t ready yet. It won’t make a valid alert, and it won’t support a valid comparison.
Step 2: Understand Your Measurements
Once we have a question, the next mistake is assuming our data is telling us the truth.
A metric is not reality. It’s a lossy measurement with assumptions baked in.
Before you use a signal to answer a question, you need to understand three things:
- What it measures
- What it hides
- When it lies
The Latency Example
Take latency. From where to where? Retries included? Client or server side? Sampled or complete?
Latency from the load balancer may well include retries; latency measured inside your application likely doesn’t. Is the signal sampled, or a full distribution? Or worse, is it an average that’s actively lying to you?
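A toy example of how an average lies, using nothing beyond the Python standard library: ninety-five fast requests and five very slow ones. The numbers are made up.

```python
from statistics import mean, quantiles

# Toy data: 95 fast requests and 5 very slow ones.
latencies_ms = [20] * 95 + [2_000] * 5

avg = mean(latencies_ms)                   # 119 ms: looks only "a bit slow"
p95 = quantiles(latencies_ms, n=100)[94]   # ~1901 ms: one in twenty users is suffering

print(f"average: {avg:.0f} ms")
print(f"p95:     {p95:.0f} ms")
```

Same data, very different stories. The average says "a bit slow"; the p95 says a twentieth of your users are having a terrible time.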
Tools That Help
Metric definitions (the HELP text on your Prometheus metrics), histogram awareness, sampling configs. This is boring stuff, but it’s where most incidents go sideways.
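Here is what that looks like in practice, as a sketch with the Python prometheus_client library; the metric name, wording, and bucket boundaries are illustrative.

```python
from prometheus_client import Histogram

# The HELP string records what this measurement includes and excludes, so the
# people reading the dashboard three months from now don't have to guess.
REQUEST_LATENCY = Histogram(
    "checkout_request_duration_seconds",
    "Server-side checkout request duration, measured inside the application "
    "handler. Excludes client retries and load-balancer queueing.",
    buckets=(0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10),
)

@REQUEST_LATENCY.time()   # observes each handler invocation into the histogram
def handle_checkout(order_id: str) -> None:
    ...  # hypothetical handler body
```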
Common Mistake
Green dashboards during outages are often a warning sign, not reassurance.
Absence of evidence is not evidence of absence.
Step 3: Design Valid Comparisons
Numbers by themselves don’t explain systems. Differences do.
Observability data is almost always relative. The question is: Compared to what?
Good comparisons isolate one variable and hold everything else as constant as possible.
Examples of Good Comparisons
- v1.4.7 vs v1.4.6
- Canary vs baseline
- One availability zone vs another
Tools That Help
Canary deploys, version tags, high-cardinality dimensions: these enable clean comparisons. It’s why we run canaries in the first place. Version tags and high-cardinality labels on your logs and traces let us slice the data and ask specific questions that isolate specific variables.
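To make the canary comparison concrete, here is a sketch against the Prometheus HTTP API, assuming the same hypothetical checkout latency histogram now carries a version label; the names and versions are illustrative.

```python
# A canary-vs-baseline check: same query, same window, sliced by version.
import requests

PROM = "http://localhost:9090/api/v1/query"
QUERY = (
    'histogram_quantile(0.95, sum by (le, version) ('
    'rate(checkout_request_duration_seconds_bucket{version=~"1.4.6|1.4.7"}[10m])))'
)

resp = requests.get(PROM, params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    version = series["metric"].get("version", "unknown")
    p95_seconds = float(series["value"][1])
    print(f"version {version}: p95 = {p95_seconds:.3f}s")
# Same traffic, same time window, same region: the version is the one
# variable left to explain any difference.
```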
Common Mistake
Comparing “now” to “yesterday” without accounting for traffic differences, intervening deploys, or seasonality.
If everything is different, nothing is explanatory.
Step 4: Interpret Evidence
Once you’ve designed a comparison, the temptation is to jump to conclusions.
But evidence doesn’t eliminate uncertainty. It changes it.
Interpreting evidence means asking: Does this increase confidence? Decrease it? Refute the hypothesis? Or tell us nothing?
Four Ways Evidence Affects Certainty
- Consistent: increases confidence in the hypothesis
- Competing: decreases confidence (suggests alternative without refuting)
- Refutes: contradicts the hypothesis directly
- Inconclusive: no change to uncertainty
Worked Example
Say we hypothesize the schema migration caused the latency spike. The metrics show p95 increased at 14:05, but the issue started at 14:00.
This doesn’t refute the migration, but it’s competing evidence that decreases our confidence.
If read latency stayed flat while only writes spiked, that would refute the idea that the migration degraded both paths.
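One way to stay honest about which of the four buckets each observation lands in is to write it down explicitly. A minimal bookkeeping sketch, with the worked example above recorded as hypothetical entries:

```python
# Tag each observation with how it affects confidence in the current
# hypothesis, instead of letting it all blur together mid-incident.
from dataclasses import dataclass
from enum import Enum

class Effect(Enum):
    CONSISTENT = "increases confidence"
    COMPETING = "decreases confidence without refuting"
    REFUTES = "contradicts the hypothesis"
    INCONCLUSIVE = "changes nothing"

@dataclass
class Observation:
    hypothesis: str
    evidence: str
    effect: Effect

log = [
    Observation(
        hypothesis="The 14:05 schema migration caused the latency spike",
        evidence="p95 rose at 14:05, but the incident started at 14:00",
        effect=Effect.COMPETING,
    ),
    Observation(
        hypothesis="The migration degraded both the read and write paths",
        evidence="read latency stayed flat; only writes spiked",
        effect=Effect.REFUTES,
    ),
]

for obs in log:
    print(f"[{obs.effect.name}] {obs.evidence}")
```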
Tools That Help
Multiple signal types (metrics, logs, traces) help cross-check reality. Being able to correlate them, including logs from different systems, is what lets one signal keep another honest.
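Correlation across signals usually hinges on a shared identifier. Here is a small sketch of trace-aware structured logging in Python; current_trace_id() is a stand-in for whatever your tracing library actually exposes, not a real API.

```python
# Attach the active trace ID to every log line so logs and traces can be
# joined during an incident. current_trace_id() is a placeholder here.
import json
import logging

def current_trace_id() -> str:
    return "4bf92f3577b34da6a3ce929d0e0e4736"  # placeholder; normally from the active span

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("schema migration applied", extra={"trace_id": current_trace_id()})
```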
Common Mistakes
Correlation is not causation. Just because two events happened at the same time doesn’t mean they caused each other.
Narrative lock-in. If you have evidence pointing in one direction, and all your SRE buddies are converging on a different conclusion, stick to your evidence. Don’t just conform.
Premature closure. Finding evidence that looks reasonable, shipping a fix, watching the graphs improve, and walking away. That doesn’t teach us anything, and it probably didn’t solve the problem either.
If your data can’t surprise you, you’re not really interpreting it.
Step 5: Build Knowledge
Most incident response frameworks stop here. Fix the issue. Write the postmortem. Move on.
Here’s why that fails: postmortems document what happened, but not how we knew.
Knowledge isn’t timelines or dashboards. It’s what remains when the graphs are gone.
What to Capture
Capture these things in your postmortems (a minimal record sketch follows the list):
- Which hypotheses were wrong and which were correct
- Which signals misled us and which confirmed a hypothesis
- Which comparisons worked and which ones surfaced that confirming signal
- Why decisions were made under uncertainty (perhaps a time crunch, but record why we acted with less than full certainty)
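Here is that minimal sketch, just to make the shape concrete; the fields mirror the list above and every value shown is hypothetical.

```python
# A minimal knowledge record to accompany (not replace) the postmortem.
# All field values here are hypothetical examples.
from dataclasses import dataclass, field

@dataclass
class IncidentKnowledge:
    incident: str
    hypotheses_wrong: list[str] = field(default_factory=list)
    hypotheses_confirmed: list[str] = field(default_factory=list)
    misleading_signals: list[str] = field(default_factory=list)
    useful_comparisons: list[str] = field(default_factory=list)
    decisions_under_uncertainty: list[str] = field(default_factory=list)

record = IncidentKnowledge(
    incident="checkout latency spike",
    hypotheses_wrong=["Schema migration caused the spike"],
    hypotheses_confirmed=["Connection-pool exhaustion on the writer"],
    misleading_signals=["Load-balancer latency (includes retries)"],
    useful_comparisons=["Canary v1.4.7 vs baseline v1.4.6"],
    decisions_under_uncertainty=["Rolled back at 14:30 before traces confirmed the cause"],
)
```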
Common Mistake
If learning doesn’t compound, observability becomes just an expense.
Without building knowledge, you’re not practicing observability.
The Practice of Observability
Let’s recap.
Observability isn’t about more data. It’s about knowing what we understand.
These five steps ARE observability:
- Ask falsifiable questions that can be proven wrong
- Understand your measurements (what they measure, hide, and when they lie)
- Design valid comparisons that isolate variables
- Interpret evidence carefully without jumping to conclusions
- Build knowledge that makes the next incident easier
This is how we practice reasoning from evidence under uncertainty. This is how observability becomes leverage, not a tax.
Need help training your team on these practices? Want to get in front of your CFO before that six-figure observability bill hits? Contact me today.