Managing Reliability of a Complex Data Processing Pipeline with Composite Service Level Objectives

By Jakub Gruszecki

Composite SLOs are a new tool in the Nobl9 platform that lets you define SLOs in terms of other SLOs. They are a flexible, versatile way to express your reliability expectations for complex scenarios where multiple things contribute to the reliability of an entire system, user journey, or business process, and no single SLI can easily measure it.

For example, here at Nobl9, we are using Composite SLOs to manage the reliability of our SLI processing pipeline. The SLI processing pipeline is the data streaming component of the Nobl9 platform responsible for ingesting all SLI metrics from agents and direct integrations and turning them into error budget charts, sending alert notifications, detecting data anomalies, etc. Logically, the SLI processing pipeline can be represented like this:

[Diagram: the Nobl9 SLI processing pipeline]

First, “Data intake” ingests SLI metrics sent by agents or queried directly from data sources. Here the pipeline splits into two parallel branches. The first branch starts with “Anomaly detection” which can detect issues with incoming “SLI metrics”. If an anomaly is detected, an event is sent, and “Anomaly notifications” sends a notification to users who have it configured. 

The second branch starts with “Error budget calculation” where error budgets, burn rates, and other auxiliary metrics are calculated based upon incoming SLIs and parameters configured within an SLO. Once error budgets are calculated, they are downsampled, enriched with various metadata in “Post-processing,” and saved to a time series database from which they are served to the UI and available via the status API. In parallel to that, error budgets are ingested by “Alerts detection” which evaluates all configured Alert Policy conditions. When all conditions are met, an “alert event” is raised which triggers three other things in parallel: sending notifications to all configured alert methods, adding an annotation about alerts on charts, and persisting an alert so that it can be retrieved using the UI or sloctl.

An important question to ask from a reliability standpoint is always: what is the current user experience? We can make this more precise by asking about the specific user-facing features that correspond to the ends of the pipeline:

  • Are alert or anomaly notifications delayed?
  • Are the data on charts up-to-date?
  • Are alerts annotated on charts without delay?
  • Are alerts displayed on the alert list without delay?

The first challenge in managing the reliability of these journeys is that different parts of the SLI processing pipeline were created, and are maintained, by different teams:

  • The integrations team, responsible for “Data intake”;
  • The processing team, responsible for “Error budget calculations”;
  • The alerting team, responsible for notifications for both alert policies and anomalies; and
  • A separate team responsible for presenting data on charts and for all “Post-processing”.

The second challenge is that people in different roles – or even the same person working in different contexts – want to know different things:

  • Development and operations teams want to know how particular services are performing, especially the ones they maintain themselves.
  • Team managers want to know the real impact on end users when prioritizing maintenance tasks, deciding how much time to devote to new features versus bug fixes and optimizations, or making staffing decisions. (We often “borrow” developers between teams to ease current workloads and to keep knowledge fresh.)
  • During an incident, on-call staff need to quickly identify which services are contributing to the outage.

Composite SLOs help us answer all of these questions and provide common ground when different roles and teams discuss reliability. To track data freshness and delay at different stages of the SLI processing pipeline, we started by configuring separate throughput SLOs for each service. Then we grouped these SLOs, roughly mirroring the first diagram, to hide some complexity, and created a separate composite SLO for each group. For example, alert notifications are not delivered by a single service; each alerting method has its own service. As a last step, we created a separate composite SLO for each end-to-end (E2E) path through the SLI processing pipeline, composed of the lower-level SLOs mentioned above. These E2E paths correspond to specific user-facing features.
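The layering described above can be sketched in a few lines of code. This is an illustrative model only – the SLO names are hypothetical, and real Nobl9 composite SLOs are defined declaratively (e.g., in YAML applied via sloctl), not in Python:

```python
from dataclasses import dataclass, field


@dataclass
class SLO:
    """A basic SLO (e.g., a throughput SLO for one service),
    or a composite SLO built from component SLOs."""
    name: str
    components: list["SLO"] = field(default_factory=list)

    def all_basic(self) -> set[str]:
        """Names of the basic SLOs this (possibly composite) SLO depends on."""
        if not self.components:
            return {self.name}
        basic: set[str] = set()
        for component in self.components:
            basic |= component.all_basic()
        return basic


# Layer 1: basic throughput SLOs, one per service (hypothetical names).
intake = SLO("data-intake-throughput")
budget = SLO("error-budget-calc-throughput")
slack = SLO("alert-notify-slack-throughput")
email = SLO("alert-notify-email-throughput")

# Layer 2: a group composite hiding per-method complexity of alert notifications.
alert_notifications = SLO("alert-notifications", [slack, email])

# Layer 3: an E2E composite for one user-facing feature,
# reusing the lower-level SLOs.
e2e_alerting = SLO("e2e-alert-notifications", [intake, budget, alert_notifications])

print(sorted(e2e_alerting.all_basic()))
```

The same basic SLOs can appear in several E2E composites, which is exactly how reuse works across the pipeline's paths.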

Here is a diagram showing what the Composite structure looks like:

[Diagram: the composite SLO structure]

Composite SLOs are themselves composable, which allows for the creation of arbitrarily complex structures. These structures can represent any concept around which we want to define our reliability requirements:

  • System architecture and dependencies between services
  • Logical steps in a multi-step user journey
  • Organizational structure with different teams, departments, and areas of responsibility

In the above case, we are building meaningful composite SLOs for user-facing features on top of SLOs reflecting our system architecture and organizational structure.

There are parts of the pipeline on which all of the features depend, and parts on which only some features – or just one – depend. A single SLO can be a component of multiple composite SLOs: we reuse the basic throughput SLOs configured for lower-level components in every E2E composite SLO that depends on them. Each development team has only one set of SLOs to focus on, and can tell what impact their services have on each user-facing feature. This is especially useful when we see degraded performance in a single service and need to decide whether to declare an incident and inform our users via our status page.

Layers of composed SLOs help us answer questions on different levels:

  • How is the specific service doing?
  • How reliable is the part of the system maintained by team X?
  • What is the actual user experience of feature Y?

When the reliability measurement of an entire composite goes down, an on-call engineer can use the “Component impact” flame graph to tell which underlying services are most responsible. The same information is available on the “Structure” tab, in the “Highest impact” table, or by sorting the “Composite structure” table by impact in descending order. Component impact is a value calculated for each descendant of a composite SLO that indicates what portion of the composite's burned budget can be attributed to that component.
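Conceptually, component impact is each component's share of the composite's burned budget. The sketch below is our own simplification to convey the idea, not Nobl9's exact formula; the per-component burn figures are made up:

```python
def component_impact(burned_by_component: dict[str, float]) -> list[tuple[str, float]]:
    """Attribute a composite's burned budget to its components.

    burned_by_component maps each component SLO name to the amount of the
    composite's error budget its bad events consumed (hypothetical units).
    Returns (component, share) pairs sorted by impact, descending --
    like sorting the "Composite structure" table by impact.
    """
    total = sum(burned_by_component.values())
    if total == 0:
        return [(name, 0.0) for name in burned_by_component]
    return sorted(
        ((name, burn / total) for name, burn in burned_by_component.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


# Example: three components contributed to the composite's budget burn.
impacts = component_impact({
    "data-intake-throughput": 0.5,
    "error-budget-calc-throughput": 3.0,
    "alert-notifications": 1.5,
})
print(impacts)  # highest-impact component first
```

Reading the result top-down mirrors what an on-call engineer does with the flame graph: start with the component carrying the largest share and work downward.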

Component impacts come up during our internal SLO review meetings and help us decide how urgently a given SLO's state needs further investigation. Even when the budget of a given SLO is at risk, if its impact on higher-level composites is relatively small, we might decide not to investigate immediately. The ability to decide what not to do is crucial for our small development teams: it lets us divert effort to where it will have the most impact and make data-driven decisions on how to minimize maintenance work while keeping reliability at an acceptable level.

In summary, Composite SLOs provide a powerful framework for managing reliability across complex systems, enabling teams to assess and optimize user experience by aggregating and analyzing service-level data. Key benefits include:

  • End-to-End Reliability Tracking: Monitor the performance of entire user-facing features by combining multiple lower-level SLOs into composite views.
  • Cross-Team Coordination: Facilitate collaboration by aligning diverse teams around shared reliability goals and insights.
  • Impact Prioritization: Identify services with the greatest impact on overall reliability, helping focus efforts where they matter most.
  • Simplified Maintenance: Reduce maintenance overhead by reusing and layering SLOs for consistency and scalability.
  • Data-Driven Decision Making: Support informed decisions on incident management and resource allocation, balancing reliability with development priorities.

Hopefully you find composite SLOs as useful as we have. Leave a comment below or on our LinkedIn page with any questions or your own interesting use cases!