Observability as Automation: A Philosophy for Understanding Systems

 


Your Observability Bill Is Too High

I know folks whose observability bill is higher than their compute bill. And no human can deal with a hundred-plus alert pages per shift – or even half that – whether they’re actionable or not.

Do you feel like you’re flying blind in chaos? Do customers tell you about incidents before you even know what’s going on? Does it feel impossible to localize the root cause of a problem?

Does this sound familiar? Let’s fix this.

25 Years of Lessons

Hi, I’m Jack Neely, owner and observability architect at Cardinality Cloud, LLC. I’ve been doing DevOps and SRE work for more than 25 years, and I’ve spent thirteen of those years specifically on observability for companies big and small: startups, mid-sized companies, Fortune 500s, even a couple of companies that got bought by Google.

I’ve been there. I’ve seen this.

I’m also an open source contributor with code in Graphite, Prometheus, and Thanos. I’ve presented at conferences. And now I want to use this YouTube series to share what it’s taken me 25 years in the field to learn.

The Revelation: Observability is Automation

In Site Reliability Engineering, we often talk about toil: those repeatable tasks that are usually manual, even if that means manually running the same script over and over again. Toil is repeatable and interrupt-driven, but the key bit that identifies it is this:

It’s work that doesn’t have any enduring value.

If the application or system is in the same state after we’ve done the work as it was before the work was needed, then that work was toil.

These are tasks we’ve learned to automate. In fact, many of our careers are built on it. Much of the SRE movement is about automating these tasks, standing on top of that automation, automating again, improving, and repeating that same cycle.

The longstanding joke in our field was that our job was to automate ourselves out of a job. That rings a touch hollow now with AI and its promises (or lack thereof). But the point is that we value a learning culture. We value building things of enduring value. It’s more fun to work on new and exciting projects and to learn new things. That drives our careers, and it drives the business.

The Scientific Method in SRE

As I was doing observability work, this led to a lightbulb moment for me. I really enjoy distributed systems and modeling them with my observability tools. One day it clicked:

We are scientists and engineers. Software engineers. Site reliability engineers. I have a degree in computer science. So isn’t it our responsibility – isn’t it what we’re hired to do – to use the skills of science and engineering in our field?

And isn’t that the scientific method?

The scientific method is how we:

  1. Phrase a question
  2. Build a hypothesis
  3. Test it with an experiment
  4. Evaluate our results
  5. Determine if that’s acceptable, or if we need to rinse and repeat or start over

The scientific method has been how we as a global civilization have gained knowledge and understanding for centuries. I think it’s just as applicable today in site reliability engineering and software engineering.
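To make the five steps above concrete, here is a minimal sketch of how they might look during an incident investigation. Everything in it is an assumption for illustration: the `Hypothesis` class, the `investigate` function, and the telemetry keys and values are hypothetical, not part of any real tool.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Hypothesis:
    description: str
    # The experiment inspects telemetry and returns True when the
    # evidence supports the hypothesis.
    experiment: Callable[[dict], bool]


def investigate(hypotheses: Sequence[Hypothesis], telemetry: dict) -> Optional[Hypothesis]:
    """Test each hypothesis in order; return the first one the evidence supports."""
    for hypothesis in hypotheses:
        if hypothesis.experiment(telemetry):
            return hypothesis
    # Nothing held up: time to form new hypotheses and repeat.
    return None


# 1. Phrase a question.
question = "Why did p99 latency spike?"

# Hypothetical telemetry snapshot; every key and value here is invented.
telemetry = {
    "deploys_last_hour": 0,
    "db_pool_in_use": 50,
    "db_pool_size": 50,
}

# 2. Build hypotheses, each paired with 3. an experiment that tests it.
hypotheses = [
    Hypothesis("A recent deploy introduced a regression",
               lambda t: t["deploys_last_hour"] > 0),
    Hypothesis("The database connection pool is exhausted",
               lambda t: t["db_pool_in_use"] >= t["db_pool_size"]),
]

# 4. Evaluate the results, and 5. decide whether to accept or start over.
supported = investigate(hypotheses, telemetry)
if supported is not None:
    print(f"{question} Supported hypothesis: {supported.description}")
else:
    print(f"{question} No hypothesis supported; form new ones and repeat.")
```

In practice the experiments would be queries against live telemetry rather than lookups in a dictionary, but the loop of question, hypothesis, experiment, and evaluation stays the same.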

Building Models of Understanding

When we use the scientific method, we tend to build some sort of model of our understanding. It may not be exactly complete or exactly correct, but we try to build this model so that when something new happens, we can use that model to predict:

  • What should be the solution?
  • What should happen?
  • What is the cause?
  • What is the next step?

It’s that combination of model and prediction that becomes so powerful. In observability, that model is a dashboard, except today’s dashboards have live data that enables us to do real-time debugging and real-time problem solving in production.
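As one small illustration of a model that predicts, the sketch below uses Little's Law (average in-flight requests = arrival rate × mean latency) to compute what a service should look like, then compares that with what is observed. The choice of Little's Law, the function names, the 25% tolerance, and all of the numbers are assumptions made for this example, not something prescribed by any particular tool.

```python
def expected_in_flight(arrival_rate_rps: float, mean_latency_s: float) -> float:
    """Little's Law: average requests in flight = arrival rate * mean latency."""
    return arrival_rate_rps * mean_latency_s


def deviates(observed: float, expected: float, tolerance: float = 0.25) -> bool:
    """Return True when the observation differs from the prediction by more
    than the given relative tolerance (an arbitrary 25% by default)."""
    if expected == 0:
        return observed != 0
    return abs(observed - expected) / expected > tolerance


# Hypothetical numbers: 200 requests/second at 50 ms mean latency.
expected = expected_in_flight(arrival_rate_rps=200.0, mean_latency_s=0.05)
observed = 42.0  # hypothetical reading from a concurrency gauge

if deviates(observed, expected):
    print(f"Model predicts ~{expected:.0f} in flight but observed {observed:.0f}: investigate.")
else:
    print("Observation matches the model.")
```

The value of the model is in the mismatch: when the observation disagrees with the prediction, either the system has changed or the model is incomplete, and both are worth learning about.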

Observability Is Automation for Understanding

Observability has been defined as many things, but the definition of observability doesn’t mean anything unless it drives value to both the engineers and the business.

Here’s what I’ve learned. We use a script to automate a simple task, a CI/CD pipeline to automate a complex movement of data and code from one end to the other, and a whole software application to automate an entire solution. In the same way, observability is automation for our understanding of a system.

Observability is not a tax we pay to do our jobs. Observability is leverage – for me, for you, for the business.

A Better Approach

With this mindset, it’s clear that observability is not shoveling endless cash at a vendor and hoping for the best. The last thing you need is another prebuilt dashboard or bit of “AI sprinkling” that doesn’t help you with the custom business logic you’ve created.

I believe there are fiscally responsible techniques for implementing observability that still give us advanced analytics – and that encourage and build the learning environment where we teach, understand, iterate, and improve.

These techniques:

  • Allow us to integrate with any AI model
  • Allow us to insource that understanding – the model of the system
  • Still let us outsource the toil and the hard parts – database management

When I’ve talked about observability before, a common refrain has been: “Jack, our core competency is not running some crazy database for our observability data.” I’ve done that. I’ve been there and I’ve seen what happens. So yeah, I get that.

I think these techniques let us outsource the data lakes – the hard part – while keeping our focus on enabling our engineering staff to build those models, build that understanding, and build a learning environment.

The Framework: Crawl, Walk, Run

As we progress through this series, I’m going to break this down into a crawl, walk, and run framework:

Crawl: The Basics

Monitoring and observability basics starting from scratch with a small AWS footprint or as a startup. Foundational data types, alerting principles.

Walk: Everyday Reality

This is where I suggest most of us live and breathe every day: instrumentation at scale, managing data at scale, and enterprise challenges including dealing with cardinality. And it’s not an SRE or observability YouTube channel without Service Level Objectives.

Run: Advanced Techniques

Advanced automation, prediction, advanced analysis, plugging in AI for that acceleration ability that AI gives us. Using observability truly as a force multiplier in our organization.

What’s Next

In the next episode, I plan to start at the crawl stage: setting up logs well in a very small AWS footprint, like a startup or very small project. I’ll cover how to get off on the right foot in a way that plans for the future and enables the analysis you want to do at any scale.

Need an Expert on Your Team?

Do you need help implementing these techniques? At Cardinality Cloud, observability audits are always free. We’ll help you cut costs by 10-20% or more while improving reliability.

Book your free cost audit or contact me directly to get started.