Originally published at ToolBox.com on January 19, 2021
Let me start by saying I love monitoring. Monitoring is great. Monitoring, which is defined as collecting metrics to improve understanding of how a system behaves, is a significant source of the data we need to practice the customer-centric, SLO-based approach to software reliability. In that sense, monitoring and SLOs are complementary: they go hand in hand.
Monitoring helps us gather data to get a clearer picture of what’s happening in our software and systems. If you’ve already invested in good monitoring tools like Datadog, Prometheus, New Relic, or others, that’s great. Keep them. You need them. But you need SLOs too, and here’s why.
Can Monitoring Tell You If You’re Meeting Customer Expectations Over Time?
Monitoring and alerts can inform you about the status of a system component at a moment in time, but that information by itself doesn’t tell you whether the system is performing well or poorly from your customer’s perspective.
Let’s look at a simplified example:
How do you determine reliability over time? How about counting incidents, such as PagerDuty alerts? If you have 20 incidents in Q1 and 10 in Q2, can you definitively say that you had higher reliability in Q2? Maybe. Or maybe not.
What if the incidents in Q1 added up to 120 minutes of unreliability but Q2 incidents added up to 150 minutes? In that case, your users would have experienced better reliability in Q1.
What if you took a mean time to resolution (MTTR) approach? Using the same data as before, we can compute that incidents in Q1 averaged 6 minutes each, while incidents in Q2 averaged 15 minutes each. So does this mean Q2 was more than twice as unreliable as Q1? Not necessarily. Each of these metrics gives a different, conflicting answer, and none of them measures what your customers actually experienced.
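The conflicting answers above are easy to see if you run the numbers. Here is a quick sketch in Python using the made-up quarterly figures from the example:

```python
# Made-up incident data matching the example above:
# Q1: 20 incidents totaling 120 minutes of downtime
# Q2: 10 incidents totaling 150 minutes of downtime
q1 = {"incidents": 20, "downtime_min": 120}
q2 = {"incidents": 10, "downtime_min": 150}

def mttr(q):
    """Mean time to resolution, in minutes per incident."""
    return q["downtime_min"] / q["incidents"]

# Three views of "reliability" from the same data:
print("Fewest incidents:", "Q2" if q2["incidents"] < q1["incidents"] else "Q1")
print("Least downtime: ", "Q1" if q1["downtime_min"] < q2["downtime_min"] else "Q2")
print("Best MTTR:      ", "Q1" if mttr(q1) < mttr(q2) else "Q2")  # 6.0 vs 15.0 minutes
```

Each metric crowns a different "winner," which is exactly why none of them, on its own, is a trustworthy report of reliability over time.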
That’s what’s great about SLOs and error budgets: they give you the best way to report on reliability over any window of time, based on the real issues that customers actually care about. Or, put another way, SLOs filter and translate all that monitoring data and help you identify precisely where to expend your energy to achieve the biggest return for the business.
SLOs Guide More Advanced Efforts to Improve Reliability and Efficiency
Once you have your SLOs in place and have error budget to spare, you can justify experimentation and low-risk chaos engineering (e.g., running tests to identify hidden dependencies in systems). SLOs help you determine when to schedule load and stress tests and even when to try turning systems down or off. With SLOs, you can be proactive rather than reactive. Here’s an illustration:
Your monitoring system probably doesn’t page you upon every ‘500’ error code that is thrown. Instead, it is probably set to alert you if X number of error codes are thrown over a Y period of time. Whereas your monitoring system sees this period as one of “no alert needed,” an SLO approach will give you a clearer picture of your risk by revealing that you are, in fact, burning tiny bits of error budget during this time.
For example, let’s assume you have a large service that requires that you do your releases in a rolling manner across 50 pods. Although you’ve never heard complaints during your rollouts, your error budget shows burn during this time. Knowing this, you may be able to proactively improve reliability just by slowing down your pace of rollout across the pods.
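A back-of-the-envelope sketch shows how that kind of silent burn adds up. All the numbers here (the SLO target, request volume, and failures per rollout) are assumptions for illustration only:

```python
# Assumed: a 99.9% success-rate SLO over a 30-day window.
SLO_TARGET = 0.999
total_requests = 10_000_000                        # assumed requests in the window
error_budget = total_requests * (1 - SLO_TARGET)   # ~10,000 allowed failures

# Assumed: each rolling release across 50 pods drops ~20 requests per pod,
# never enough in any one interval to trip an "X errors in Y minutes" alert.
failures_per_release = 50 * 20                     # 1,000 failed requests per rollout
releases_in_window = 8

burned = failures_per_release * releases_in_window
print(f"Budget burned by rollouts alone: {burned / error_budget:.0%}")
```

Under these assumptions, routine rollouts alone consume 80% of the monthly error budget without a single alert firing, which is precisely the kind of risk an SLO makes visible and a threshold alert hides.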
SLOs can even tell you when your reliability goals may be unnecessarily high or when you should ignore non-important false issues. Monitoring alone does not provide this kind of insight.
Monitoring Is Not the Best Language for Organizational Collaboration
Most monitoring systems are used by highly technical engineers, and a discussion about monitoring metrics can quickly make non-engineers mentally check out of a conversation. Monitoring data therefore provides a poor basis for objective discussions with other stakeholders in the company. SLOs, on the other hand, provide a higher-level language of reliability that everyone in the company can understand and use. If you’re trying to communicate with your CEO, for example, SLOs formalize the “squinting at metrics” exercise and elevate the discussion above technical jargon to business outcomes. Furthermore, SLOs are a language engineers and product managers can use to agree on threshold targets, prioritize work, track progress, and develop valuable reports. By providing a common language, SLOs contribute to healthier cross-team collaboration.
As I’ve said, monitoring is great. Monitoring is critical. But monitoring data is only a piece of the picture. SLOs complete the picture by focusing your attention on what truly matters to your customers.