More by Jeremy Cooper:
Chaos Engineering and SLOs: Taking Reliability to the Next Level| Author: Jeremy Cooper
I’m sure you’ve heard it before from management: “How was our uptime on Black Friday?” Depending on your business, you might get asked about a specific day, week, month, quarter, or year, or all of the above. How do you balance actionable operations metrics with giving management answers to their business questions?
Most teams set up Service Level Objectives (SLOs) based on a rolling window because it's more actionable, especially when trying to get ahead of incidents. Many engineering teams work in sprints, and rolling time windows match that model well. To answer management questions, you might attempt to extrapolate from and repurpose rolling-time-window SLOs, which is tricky and can lead to over- or under-reporting of uptime for your services. If you’re using rolling time windows and trying to extrapolate calendar data, there’s a better way!
Calendar-Aligned SLOs
Defining SLOs to align directly to a calendar makes it much easier to report on uptime. As the name implies, calendar-aligned SLOs are tied to a specific window of time on the calendar with a clear start and stop date. Calendar-aligned SLOs reset at the end of each window, and error budget burned inside a window is never regained before that reset. Rolling-time-window SLOs, on the other hand, never reset but “earn” the error budget back as old “bad” occurrences drop off the back end of the time window. You can easily reuse the SLIs from your existing rolling-time-window SLOs.
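To make the difference concrete, here is a minimal Python sketch of the two budget behaviors. It is not how Nobl9 computes budgets internally; the 99.9% target, the function names, and the single 20-minute outage are all assumptions chosen for illustration, and the budget is measured in minutes of allowed downtime.

```python
import calendar
from datetime import date, timedelta

TARGET = 0.999            # hypothetical SLO target (99.9%)
MINUTES_PER_DAY = 24 * 60


def burned(bad_minutes, start, end):
    """Total bad minutes recorded between start and end, inclusive."""
    return sum(m for day, m in bad_minutes.items() if start <= day <= end)


def rolling_budget_left(bad_minutes, today, window_days=28):
    """Budget left, in minutes, for a window that trails `today`."""
    budget = (1 - TARGET) * window_days * MINUTES_PER_DAY
    start = today - timedelta(days=window_days - 1)
    return budget - burned(bad_minutes, start, today)


def calendar_budget_left(bad_minutes, today):
    """Budget left, in minutes, for the calendar month containing `today`."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    budget = (1 - TARGET) * days_in_month * MINUTES_PER_DAY
    return budget - burned(bad_minutes, today.replace(day=1), today)


# Hypothetical history: one 20-minute outage on January 5, 2024.
outages = {date(2024, 1, 5): 20}

for day in (date(2024, 1, 31), date(2024, 2, 1), date(2024, 2, 2)):
    print(day,
          "rolling:", round(rolling_budget_left(outages, day), 2),
          "calendar:", round(calendar_budget_left(outages, day), 2))
```

In this sketch the calendar-aligned budget stays reduced through January 31 and returns to full on February 1, when the new month starts. The rolling budget only recovers on February 2, once January 5 falls out of the 28-day window.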
Screenshot of Nobl9 with both calendar-aligned (top) and rolling-time-window (bottom) SLOs using the same Service Level Indicator (SLI) query.
We’re not advocating you throw out your rolling-time-window SLOs. These are incredibly useful, especially with error budget alerts that bring your attention to at-risk SLOs at the right level of urgency.
You can configure calendar-aligned SLOs in the Nobl9 web console (see image). You can then export the SLO definitions as YAML using sloctl and add the SLOs-as-code to a source control platform like GitHub or GitLab. For detailed instructions on setting up calendar-aligned SLOs, see our documentation.
Operational Alerts vs. Management Reports
Even when looking at the same data, different audiences have slightly different needs when comparing actual results to goals. Platform teams, reliability engineers, application developers, and operations folks need to know what’s happening now and how it is trending, regardless of arbitrary month-end boundaries. Using 28-day rolling windows for SLOs is a convenient and consistent way to ride out minor blips while still dealing with significant trends proactively.
On the other hand, management and business-focused users like customer support, procurement, and executive teams need to see how well the services (including third-party vendors) are performing over calendar periods. Your organization can use this for planning, compliance reporting, and holding vendors accountable to SLAs.
Case Study: Are We Achieving our Uptime Goals?
A Nobl9 customer asked us to help them create a management report using their existing SLOs. Since they run a large e-commerce platform, they had primarily used SLOs to ensure critical user journeys like catalog display and checkout were working correctly for customers and had set up rolling time-window SLOs.
They tried to use their existing SLO data to answer questions from management about overall uptime. Still, they were stuck merely estimating, because the rolling-window data had to be converted to calendar dates after the fact.
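One reason the conversion can only be an estimate is that a rolling window rarely lines up with a calendar month. The sketch below is an illustration, not the customer's actual analysis: the function name and the 28-day default are assumptions. It lists the days of a month that a trailing window ending on the last day of that month does not cover.

```python
import calendar
from datetime import date, timedelta


def days_missed_by_rolling_window(year, month, window_days=28):
    """Days of the given month that fall before a trailing window
    ending on the month's last day."""
    last_day = date(year, month, calendar.monthrange(year, month)[1])
    window_start = last_day - timedelta(days=window_days - 1)

    missed = []
    day = date(year, month, 1)
    while day < window_start:
        missed.append(day)
        day += timedelta(days=1)
    return missed


# A 28-day window read on January 31 says nothing about January 1 to 3.
print(days_missed_by_rolling_window(2024, 1))
# A 28-day February lines up exactly, so nothing is missed.
print(days_missed_by_rolling_window(2023, 2))
```

An incident on any of the missed days would be absent from a month-end reading of the rolling SLO. The opposite mismatch also occurs: a window longer than the month pulls in days from the previous month.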
To simplify their setup, they created calendar-aligned SLOs using the same SLI queries, which made the configuration trivial and the reporting very clear. Because the customer already had error budget alerts (via Slack and PagerDuty, depending on severity) on their rolling-time-window SLOs, the new calendar-aligned SLOs required no error budget alert policies of their own.
Instantly, they could provide up-to-the-minute uptime reports without data manipulation or extra reporting steps. The combination of SLOs gave all audiences what they needed. The platform, reliability engineering, operations, and application development teams got proactive alerts when their services started seeing elevated errors, and the management and customer support teams could see precisely how well the service worked in any given month, quarter, or year.
Without SLOs, the organization had trouble understanding whether it was meeting its SLAs and was constantly reacting to incidents and outages. Now it has a much clearer picture of what is happening and can adjust resources based on data-driven decisions around contractual SLAs, user experience, and the velocity of features.
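The idea behind a calendar-aligned report can be sketched in a few lines of Python. This illustrates the grouping only and is not Nobl9's implementation; the function names, the daily (good, total) event counts, and the period labels are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def period_key(day, period):
    """Label a date with the calendar period it belongs to."""
    if period == "month":
        return f"{day.year}-{day.month:02d}"
    if period == "quarter":
        return f"{day.year}-Q{(day.month - 1) // 3 + 1}"
    if period == "year":
        return str(day.year)
    raise ValueError(f"unknown period: {period}")


def uptime_report(daily_counts, period):
    """Ratio of good to total events per calendar period."""
    totals = defaultdict(lambda: [0, 0])
    for day, (good, total) in daily_counts.items():
        key = period_key(day, period)
        totals[key][0] += good
        totals[key][1] += total
    # Skip periods with no traffic to avoid dividing by zero.
    return {k: g / t for k, (g, t) in sorted(totals.items()) if t}


# Hypothetical daily counts of (good events, total events).
counts = {
    date(2024, 1, 15): (990, 1000),
    date(2024, 2, 15): (1000, 1000),
    date(2024, 4, 1): (500, 1000),
}

for period in ("month", "quarter", "year"):
    print(period, uptime_report(counts, period))
```

Because each period has fixed start and end dates, the same raw counts roll up cleanly into months, quarters, and years, with no extrapolation from a moving window.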
Get it on the Calendar
If you’re already on Nobl9, try creating a new calendar-aligned SLO with an existing SLI query. You can quickly see how the rolling and calendar-aligned SLOs compare.
You don’t have to add an alert policy to the calendar-aligned SLO; treat it as informational and use it for reporting. These SLOs appear in the Nobl9 Grid View, Service Health Dashboard, and Historical Reports. You can also export any SLO (calendar-aligned or rolling time window) for deeper analysis.
You can set up SLOs using existing monitoring and observability data or follow this step-by-step, hands-on lab using free synthetic monitoring.