
From Sparse Metrics to Actionable Alerts: An SLO Case Study

Building an effective monitoring system based on Service Level Objectives requires more than just defining a set of SLOs. Once SLOs are established, the next priority is consistent and accurate alerting, so that reliability is not tracked through inefficient manual monitoring. That is why alerting on SLOs is a crucial element of effective site reliability engineering practices. However, when grappling with sparse data, setting up alerts that trigger appropriately becomes considerably more complex. Sparse metrics, characterized by low data point volumes over a given period, can disproportionately highlight or hide events, making alerting strategies less effective. Let’s explore the nuances of alerting on SLOs with sparse data and strategies for effective monitoring.

The challenge

To create actionable alerts based on SLOs, you need a reasonable volume of data points to ensure statistical significance. Sparse data lacks this volume, making it difficult to distinguish normal variability from real issues. A single error in a small dataset can appear disproportionately severe, while another issue may go undetected if it doesn't fall within the sparse sampling windows. This can drastically skew your perceived error rate, leading to false positives or missed alerts.
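To see how much a handful of data points can distort the picture, consider a minimal, purely illustrative example (the request counts are made up):

```python
# Illustrative numbers only: the same single failure looks very different
# depending on how many requests arrived in the evaluation period.

def error_rate(failures: int, total: int) -> float:
    """Observed error rate as a fraction of total requests."""
    return failures / total if total else 0.0

# Busy hour: 1 failure out of 1,000 requests stays well within a 99% target.
print(f"{error_rate(1, 1000):.2%}")  # 0.10%

# Quiet hour: the same single failure out of 5 requests looks like an outage.
print(f"{error_rate(1, 5):.2%}")     # 20.00%
```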

Let’s take a closer look at the following example. We have an SLO that monitors the performance of a User Management API endpoint that depends on an external identity provider. The endpoint's usage frequency varies by day and time: during periods of high traffic, we may receive responses several times an hour, whereas at night or on weekends, the frequency may drop to once every few hours or even less often.



Image 1: An SLO monitoring User Management API availability

To monitor this endpoint, we established an SLO targeting 99% successful API responses out of the total number of responses. This SLO detected an incident during a low-traffic period caused by an error in the code responsible for caching data. Our on-call engineer received a PagerDuty alert and immediately implemented a hotfix. By promptly responding to the alert, he mitigated the issue and restored API performance to acceptable limits. While this process may seem straightforward, it took time to get there: we had to fine-tune our alerting system so that it provided accurate information even with sparse data.

The first challenge was the "lasts for" parameter used to define our alert policies. Simply put, "lasts for" specifies how long an incident must persist before an alert is validated. With sparse data, some events were missed because the data arrived at much longer intervals. Conversely, extending the "lasts for" duration to accommodate the sparse data resulted in delayed alerts.
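The following sketch illustrates the problem. It is a simplified conceptual model of "lasts for" semantics, not Nobl9's actual evaluation engine, and the threshold and timestamps are invented for the example:

```python
from datetime import datetime, timedelta

THRESHOLD = 130.0               # e.g. average burn rate >= 130x
LASTS_FOR = timedelta(hours=1)  # breach must persist this long to alert

def lasts_for_alert(samples: list[tuple[datetime, float]]) -> bool:
    """Fire only if the threshold breach holds, unbroken, for LASTS_FOR."""
    breach_started = None
    for ts, burn_rate in samples:
        if burn_rate >= THRESHOLD:
            breach_started = breach_started or ts
            if ts - breach_started >= LASTS_FOR:
                return True
        else:
            breach_started = None  # any recovering data point resets the clock
    return False

# Sparse traffic: one very bad data point at 02:10, then a healthy one
# hours later. The breach never "lasts" a full hour, so nothing fires.
t0 = datetime(2024, 6, 1, 2, 10)
samples = [(t0, 400.0), (t0 + timedelta(hours=3, minutes=30), 0.5)]
print(lasts_for_alert(samples))  # False -> the event goes unnoticed
```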

The second challenge was finding the appropriate alert policy setting and time window to accommodate the sparse data.

The solution

A standout feature of Nobl9 is its ability to create alerts with different time windows, rather than relying solely on the "lasts for" parameter to handle SLOs with varying data volumes. This flexibility allows teams to tailor their alerting strategies to the specific data characteristics, ensuring alerts are both meaningful and actionable. The first thing we did was replace the problematic "lasts for" parameter with an alerting window in our alert policy.
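In contrast to the "lasts for" sketch above, an alerting window evaluates whatever data points fall inside a fixed trailing window instead of requiring an unbroken breach. Again, this is a simplified conceptual model with invented values, not Nobl9's internal implementation:

```python
from datetime import datetime, timedelta

THRESHOLD = 130.0            # average burn rate >= 130x
WINDOW = timedelta(hours=1)  # alerting window

def alerting_window_alert(samples: list[tuple[datetime, float]],
                          now: datetime) -> bool:
    """Fire if the average burn rate over the trailing window breaches
    the threshold, however few data points arrived in that window."""
    in_window = [rate for ts, rate in samples if now - WINDOW <= ts <= now]
    return bool(in_window) and sum(in_window) / len(in_window) >= THRESHOLD

# The same sparse traffic as before: a single bad data point is enough,
# because the window averages what it has rather than waiting for more.
t0 = datetime(2024, 6, 1, 2, 10)
samples = [(t0, 400.0)]
print(alerting_window_alert(samples, now=t0 + timedelta(minutes=20)))  # True
```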

The difference between using alerting windows and the "lasts for" parameter is clearly visible in the example below. A critical event was missed because the breach did not persist for the required "lasts for" duration, so the alert conditions were never satisfied.

Image 2: An example of an SLO monitored by the "average burn rate ≥ 130x lasts for 1 hour" alert policy. Due to the sparse data, the conditions for the "lasts for" parameter were not met, and thus no alerts were triggered for this SLO.

With an adjusted alerting window, however, the critical event did not go unnoticed, and the appropriate teams were promptly alerted to the potential problem.

Image 3: The same SLO is monitored by an alert policy with the condition "average burn rate ≥ 130x within a 1-hour alerting window." The use of an alerting window allowed data points to accumulate, which triggered an alert.

Image 4: Alert details illustrating the "average burn rate ≥ 130x within a 1-hour alerting window" alert condition that triggered an alert.

That was the first step, but we still needed to adjust alerting windows to different data volumes so that our alerts fired precisely when expected. The SRE book suggests using different time windows for low- and high-traffic periods to balance alert sensitivity and minimize false positives. Increasing the evaluation window helps ensure that alerts are based on significant, sustained issues rather than isolated, minor errors. Shorter windows, for example, quickly detect issues during high-traffic periods, while extending the window to 30 minutes or even an hour during low-traffic periods helps accumulate enough data for meaningful analysis. These suggested values are just a starting point; you'll need to adjust the windows to your system's usage and needs. Moreover, applying multiple time windows and burn-rate policies to the same SLO helps distinguish early indicators of an incident from severe cases that require immediate attention.
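To see why window length matters, it helps to relate burn rate, window length, and budget consumption. The sketch below is a rough illustration assuming a rolling 28-day SLO period; the burn rates are example values, not thresholds from our policies:

```python
SLO_PERIOD_HOURS = 28 * 24  # assumed rolling 4-week error budget period

def budget_consumed(burn_rate: float, window_hours: float) -> float:
    """Fraction of the total error budget burned if `burn_rate`
    is sustained for `window_hours`."""
    return burn_rate * window_hours / SLO_PERIOD_HOURS

# A fast burn caught by a short window vs. a slow burn caught by a long one.
print(f"{budget_consumed(burn_rate=14.4, window_hours=1):.1%}")  # ~2.1%
print(f"{budget_consumed(burn_rate=1.0, window_hours=72):.1%}")  # ~10.7%
```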

Here's how it worked in our case. Since the usage of the API endpoint fluctuates throughout the day and week, we decided to apply alert policies with multiple time windows. By configuring these various alerting windows, Nobl9 enables more precise event detection and better alignment with business objectives. To account for different severities and to detect events despite the sparse data, we attached four alert policies evaluating alerting windows of 12 hours, 6 hours, 1 hour, and 15 minutes.
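Summarized as plain data, the escalation ladder looked roughly like this. This is only a sketch: the burn-rate thresholds and Slack channel names are illustrative placeholders rather than our exact production values; only the four window lengths and the PagerDuty routing for the shorter windows come from the setup described here.

```python
# The escalation ladder as plain data. Thresholds and channel names are
# placeholders for illustration, not exact production values.
ALERT_POLICIES = [
    # (alerting window, example burn-rate threshold, notification route)
    ("12h", 1.5,  "Slack #sre"),         # early warning on a long window
    ("6h",  3.0,  "Slack #sre"),         # slow but steady budget burn
    ("1h",  13.0, "PagerDuty"),          # concerning burn, declare an incident
    ("15m", 56.0, "PagerDuty on-call"),  # severe burn, page immediately
]

for window, threshold, route in ALERT_POLICIES:
    print(f"average burn rate >= {threshold}x within {window} -> {route}")
```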

The first alert policy was designed to notify us of significant budget drops within a longer, 12-hour window. A single API request during a low-traffic period was slower than expected, triggering a Slack notification to our SRE team and indicating that attention might be needed. The SRE team promptly informed the application team responsible for the affected component. A single slower request could be an isolated case, making a Slack notification appropriate. With a satisfactory error budget margin, the application team began gathering the information needed to determine whether this was a one-time incident or a sign of a more severe problem. There was no need to pause current work, but an investigation was warranted.

Image 5: An alert triggered due to a single slower API request within a 12-hour alerting window.

However, after some time, the SRE team received another alert, this time indicating a slow but significant budget burn within a shorter 6-hour window. The recurrence of events in shorter intervals suggested that the first incident might not have been an isolated glitch. Consequently, the team began a more thorough investigation to identify the possible root cause.

Image 6: An alert triggered due to slow, but steady error budget burn within a 6-hour alerting window.

Our suspicions were confirmed when another alert was triggered, this time from a 1-hour window. It indicated that the budget was burning at a concerning rate due to the increasing frequency of slow requests. The SRE team received a PagerDuty alert. We declared an incident and prioritized its resolution over other tasks.

Image 7: An alert triggered by a concerning budget burn within a 1-hour alerting window.

We also attached another policy to this SLO, monitoring short 15-minute windows to page the on-call engineer in case of serious incidents. The growing number of unsuccessful API requests eventually triggered this policy as well, and an alert paged the on-call engineer. By this time, the team had accumulated enough information to address the problem quickly: the root cause was identified and the fix was ready, so it was just a matter of deploying the hotfix and closing the incident as the budget slowly started to recover.

Image 8: An alert that triggered a PagerDuty notification due to an alarming error budget burn within a short 15-minute alerting window.

Image 9: An SLO impacted by the incident, showing a gradual recovery of the error budget after the hotfix was applied.

As a side note, we encountered an unexpected issue while fine-tuning our alerting. When reviewing our SLOs, not everyone felt comfortable discussing alerting in terms of burn rate, and some team members struggled to translate that value into alert policies. With the new Nobl9 alerting capabilities, we can express the same conditions more intuitively: instead of dealing with an abstract value like the burn rate, we can specify a concerning scale of budget drop.

The "budget drop" alert condition gave us the same results regarding alert policy outcomes but significantly improved our conversations. Instead of calculating how quickly we’ll burn the budget based on the burn rate, we could simply say, "I want to be alerted when my budget drops by 10%."

Image 10: Alert details showing an alert triggered by an alert policy based on the average error budget burn rate.

Image 11: Alert details showing an alert triggered by an alert policy based on the scale of the error budget drop.

Using our Budget Drop to Average Burn Rate Converter, you can experiment with the error budget values to find the ones that best match your business context.
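For reference, the conversion rests on one proportion: an average burn rate B sustained over an alerting window of length T consumes B × T / P of the error budget, where P is the SLO time window. Here is a minimal sketch of that arithmetic, assuming a rolling 28-day SLO window:

```python
SLO_PERIOD_HOURS = 28 * 24  # assumed rolling 4-week SLO time window

def burn_rate_for_budget_drop(drop_fraction: float, window_hours: float) -> float:
    """Average burn rate that consumes `drop_fraction` of the error budget
    within an alerting window of `window_hours`."""
    return drop_fraction * SLO_PERIOD_HOURS / window_hours

# "Alert me when my budget drops by 10% within 1 hour" corresponds to an
# average burn rate of roughly 67x on a 28-day SLO window.
print(f"{burn_rate_for_budget_drop(0.10, 1):.1f}x")  # 67.2x
```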

Conclusion

Effectively managing SLOs with sparse data requires a more nuanced approach to alerting. Sparse data can lead to challenges in distinguishing between normal variability and real issues, making it difficult to maintain reliable performance monitoring. However, by using flexible alerting strategies and tools like Nobl9, teams can ensure that their alerts are timely, accurate, and actionable, even when data points are scarce.

Here are key takeaways for effective SLO management with sparse metrics:

  1. Tailor alerting strategies: Use different time windows to adapt to the available data. Shorter windows can quickly detect issues during high traffic, while longer windows are beneficial during low traffic to accumulate sufficient data for meaningful analysis.
  2. Leverage advanced tools: Tools like Nobl9 offer the flexibility to create SLO-based alerts with different time windows and use metrics like average burn rate and budget drop conditions. This adaptability ensures that alerts are meaningful and aligned with your business objectives.
  3. Fine-tune iteratively: Arriving at a simple, effective alerting setup often takes time. Continuously monitor and adjust your alert policies to improve their accuracy and reliability, especially in difficult cases like sparse data.
  4. Find a common language: Among other purposes, SLOs are meant to build a shared understanding of reliability across your team. It's crucial that your team feels comfortable discussing these topics, and if something can be changed to make those conversations easier, it's worth doing.

Adapting to your data's unique characteristics will lead to more effective SLO management and, ultimately, better service reliability.
