Alex Hidalgo on Reliability Reporting

Dec 21, 2020 | Author: Alex Hidalgo


Chock-full of example scenarios you’ve probably experienced yourself, Chapter 17 of Implementing Service Level Objectives by our own Alex Hidalgo covers the ins and outs of SLO-based reliability reporting. Hidalgo walks you through how to make the best use of this approach, why other methods fall short, and how using SLOs can really make a difference for the people in your organization.

SLOs allow for you to think about your service in better ways than raw data ever will

TL;DR: being able to properly report your status based on SLOs benefits the humans behind the data: your external customers and internal coworkers.

In basic reporting, it’s common to track how many failure incidents occur. Many companies also use severity levels to rank those incidents by seriousness. Once severity levels are established, companies then turn to a family of metrics called MTTX, or “Mean Time to X,” to measure reliability over time.

Although MTTX has some useful applications, Alex runs through a few of the problems with this approach: the uniqueness of each incident, the fallacy of using averages for your math, and the subjectivity of your organization’s internal definition of “severity levels.” He concludes that you should be asking yourself a different set of questions:

During which time period have our users experienced more unreliability?
How can we quantify this better?

The answer to both: SLOs.
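To make the contrast concrete, here is a minimal Python sketch. It is not taken from the chapter: the incident durations, the 90-day quarter, and the function names are all assumptions made up for illustration. It compares a mean-time-to-resolve figure with a simple time-based measure of how much unreliability users actually saw.

```python
from statistics import mean

# Hypothetical incident durations in minutes (illustrative data only).
q1_incidents = [10, 10, 10, 10, 10, 10]  # six short incidents
q2_incidents = [30]                      # one longer incident

MINUTES_PER_QUARTER = 90 * 24 * 60  # assumption: a 90-day quarter


def mttr(durations):
    """Mean time to resolve: the average incident duration."""
    return mean(durations)


def bad_minutes(durations):
    """Total minutes during which users experienced unreliability."""
    return sum(durations)


def good_fraction(durations, window_minutes=MINUTES_PER_QUARTER):
    """Share of the window that was good: a very simple time-based SLI."""
    return 1 - bad_minutes(durations) / window_minutes


for name, incidents in (("Q1", q1_incidents), ("Q2", q2_incidents)):
    print(
        f"{name}: MTTR={mttr(incidents)} min, "
        f"bad minutes={bad_minutes(incidents)}, "
        f"good fraction={good_fraction(incidents):.5f}"
    )
```

With this made-up data, the mean time to resolve is three times worse in the second quarter, yet users experienced half as many unreliable minutes. The average alone would point you at the wrong quarter.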

The chapter also shows you how SLOs can give you a holistic sense of your reliability, help you report more accurately on the state of your services, and help you craft more concrete goals for the future. You’ll see examples of basic reporting, DDoS attacks, SLIs, and error budgets, as well as more advanced reporting methods such as error budget status and reliability burndown dashboards.
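As a rough illustration of what error budget status and a burndown might look like in practice, consider the following Python sketch. It is an assumption-laden example, not the chapter’s method: the function names, the event-based SLO, and all of the numbers are hypothetical.

```python
from itertools import accumulate


def error_budget_status(slo_target, total_events, bad_events):
    """Report how much of the error budget has been consumed.

    slo_target is a fraction such as 0.999; the budget is the number
    of bad events the target allows over the reporting window.
    """
    allowed_bad = (1 - slo_target) * total_events
    consumed = bad_events / allowed_bad
    return {
        "allowed_bad_events": allowed_bad,
        "budget_consumed": consumed,
        "budget_remaining": 1 - consumed,
    }


def burndown(slo_target, total_events, daily_bad_events):
    """Fraction of the error budget remaining after each day."""
    allowed_bad = (1 - slo_target) * total_events
    return [1 - used / allowed_bad for used in accumulate(daily_bad_events)]


# Hypothetical window: 1,000,000 requests against a 99.9% SLO.
status = error_budget_status(0.999, 1_000_000, bad_events=250)
remaining_by_day = burndown(0.999, 1_000_000, [100, 50, 100])

print(f"budget consumed: {status['budget_consumed']:.1%}")
print("remaining by day:", [f"{r:.1%}" for r in remaining_by_day])
```

Plotting a series like `remaining_by_day` over time is one simple way to approximate a burndown view. A real dashboard would pull its event counts from your monitoring system instead of hard-coded lists.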

If it hasn’t been emphasized enough, Alex’s point is that “SLOs allow for you to think about your service in better ways than raw data ever will.” SLO-based approaches are entirely focused on your users or customers. Measuring the things your users actually care about is the golden ticket to understanding their experiences, learning from them, and making the best decisions you can moving forward.

To get a better look at reliability reporting with SLOs, download the full chapter here and order the book from Amazon.