Independent Observability Architect
Observability isn't a tax. It's leverage when it's designed right. But too many teams pay too much for noise, drown in alert fatigue, and build dashboards nobody trusts. I've spent 25 years fixing this. Whether you're an engineer looking to level up or a team ready for hands-on help, I've got you covered.
Available June 1st — The SRE On-Call Review Practice
The SRE On-Call Review Practice
OBSERVABILITY PRACTITIONER SERIES • BOOK 1
A practical framework for combating alert fatigue and rebuilding on-call trust. Nobody had written the book I kept wishing existed, so I finally wrote it.
Print and PDF editions available June 1st
Learn Observability & SRE
Level up on your own. Free resources from 25 years of hands-on experience. No signup required.
Observability Architecture and Cost Optimization
Free tools and books from 25 years of hands-on experience building observability systems at scale.
Prometheus Alert Generator
Production-ready SLO alerting and error budget rules in minutes, not hours. No account. No install. Paste-ready YAML.
- Multi-window burn rate alerts — fast burn (critical) and slow burn (warning)
- Error budget tracking over 7, 30, or 90-day windows
- Riemann Sum technique for high-cardinality environments — battle-tested at Fortune 500 scale
- Liveness and availability monitoring included
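To make the burn rate and error budget bullets above concrete, here is a minimal Python sketch of the arithmetic behind a multi-window burn rate alert. It is an illustration, not the generator's actual output or logic. The window pairs and thresholds (1h/5m at 14.4 for fast burn, 3d/6h at 1.0 for slow burn) are assumptions taken from the commonly cited Google SRE Workbook values for a 30-day SLO, and the names `BurnRateRule`, `burn_rate`, `is_firing`, and `budget_remaining` are hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class BurnRateRule:
    """One multi-window rule: both windows must exceed the threshold."""
    severity: str
    long_window: str
    short_window: str
    threshold: float


# Assumed values (SRE Workbook style, 30-day SLO); the generator may differ.
RULES = (
    BurnRateRule("critical", "1h", "5m", 14.4),  # fast burn
    BurnRateRule("warning", "3d", "6h", 1.0),    # slow burn
)


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo
    if budget <= 0.0:
        raise ValueError("SLO must be below 1.0 to leave an error budget")
    return error_ratio / budget


def is_firing(rule: BurnRateRule, error_ratios: Mapping[str, float], slo: float) -> bool:
    """Fire only when the long AND short windows both burn above the threshold."""
    windows = (rule.long_window, rule.short_window)
    return all(burn_rate(error_ratios[w], slo) > rule.threshold for w in windows)


def budget_remaining(error_ratio_over_slo_window: float, slo: float) -> float:
    """Fraction of the error budget left, given the error ratio over the full SLO window."""
    return 1.0 - burn_rate(error_ratio_over_slo_window, slo)


if __name__ == "__main__":
    slo = 0.999  # 99.9% availability target
    # Hypothetical measured error ratios per window.
    ratios = {"1h": 0.02, "5m": 0.03, "3d": 0.0005, "6h": 0.0004}
    for rule in RULES:
        print(rule.severity, is_firing(rule, ratios, slo))
    print(round(budget_remaining(0.00025, slo), 3))
```

Requiring the short window to agree with the long one is what lets an alert clear quickly once the problem stops, instead of staying red until the long window ages out.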
The SRE On-Call Review Practice
A practical framework for combating alert fatigue and rebuilding on-call trust
The preview includes the full table of contents:
- Three responses to any alert
- Alert standards and hygiene
- When to silence an alert
- The hidden cost of alert fatigue
- Weekly alert review meetings
- Distributed on-call practices
- Measuring progress
Part of the Observability Practitioner Series
Your feedback shapes the final book. Get the preview, see what's covered, and let me know what resonates.
Independent Observability Architect
Prometheus, Grafana, ClickHouse, OpenTelemetry, OpenSearch, Datadog & Splunk
25 years building observability systems that don't break, and don't break the bank. Battle-tested at Fortune 500 scale.
Led Observability at Fortune 500 Companies:
- Successfully migrated away from Splunk, saving $2.5 million annually
- Supported 300+ engineers
- 1 billion+ active time series in Prometheus
- 8M+ samples/second and 150 TiB of logs/day at scale
- Implemented Thanos and Mimir for Prometheus clustering
Built systems that actually work:
- HIPAA-, GDPR-, and FedRAMP-compliant observability platforms
- Architected Thanos/Grafana cluster: 1B+ unique time series
- Open source contributor: Graphite, Prometheus, Thanos
- Built StatsRelay (multi-million UDP packets/second capacity for StatsD)
Recognition:
- Gertrude Cox Award recipient for innovative teaching with technology
- Host of Cardinality Cloud YouTube channel
- Host of operations.fm podcast
- Conference speaker: Monitorama PDX 2023, Monitorama PDX 2019, All Things Open 2020
- Industry thought leader
Technology-agnostic expertise:
Prometheus • Grafana • Thanos • Mimir • Loki • Tempo • Datadog • ClickHouse • OpenSearch • Splunk • OpenTelemetry • Graphite • StatsD • InfluxDB • Honeycomb • Coralogix
Office Hours
I keep a few slots open each month for conversations worth having: architecture questions, career decisions, tool choices, on-call problems you can't quite name yet. Not a sales call. Just two engineers talking through a hard problem.
Book a Slot
Prefer a different way to connect?
Email: jjneely@cardinality.cloud
YouTube: @cardinalitycloud
Podcast: operations.fm