Chaos Engineering and SLOs: Taking Reliability to the Next Level

Mar 16, 2021 | Author: Jeremy Cooper

Everyone in SRE (Site Reliability Engineering) is talking about chaos engineering. The term may bring to mind an evil genius mastermind, but in actuality, chaos engineering is a weapon wielded by reliability heroes.

Andre Newman of Gremlin put it nicely in his article for DevOps.com: “The goal of chaos engineering isn’t to add chaos, but to mitigate chaos.”

Simply put, chaos engineering is the practice of intentionally “breaking things” to proactively find weaknesses in your applications and in your response models. The goal is to expose and fix those weaknesses before they can cause real damage. Chaos engineering is an evolution of software testing, applied to production.

Examples include adding latency, shutting down some Kubernetes pods or ports, deleting certificates, and dropping database connections. The idea is to thoughtfully schedule service disruptions to gauge how your system holds up, how customer experiences might be impacted, and whether your alert and response protocols are working as intended.
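To make the first of those examples concrete, here is a minimal sketch of latency injection in Python. It is an illustration only, not how Gremlin or any other chaos tool works: the names inject_latency and fetch_profile, and the delay_seconds and probability parameters, are hypothetical and chosen for this example.

```python
import random
import time
from functools import wraps


def inject_latency(delay_seconds, probability, rng=None, sleep=time.sleep):
    """Decorator that delays a call by delay_seconds with the given probability.

    All names and parameters here are assumptions made for illustration.
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Only a fraction of calls is disrupted, which keeps the
            # experiment planned and its blast radius small.
            if rng.random() < probability:
                sleep(delay_seconds)
            return func(*args, **kwargs)

        return wrapper

    return decorator


@inject_latency(delay_seconds=0.05, probability=1.0)
def fetch_profile(user_id):
    # Hypothetical stand-in for a real downstream call.
    return {"user_id": user_id, "status": "ok"}
```

The sleep function is passed in as a parameter so the disruption can be swapped out or recorded in a test, and the probability knob is what lets you decide how much chaos to apply.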

It is important to note that, despite what the name suggests, chaos engineering events are planned, and a certain level of system maturity and stability is needed before SRE teams start injecting chaos. If you are constantly in chaos already, don’t add more. To put it another way, if your warehouse is on fire, it’s not the time to set a new blaze.

Also, chaos engineering efforts will be most fruitful and productive if you have Service Level Objectives (SLOs) in place. Here’s why:

SLOs Measure How Your System Is Performing Without Chaos

To understand how resistant your system is to unexpected failure, you should have a goal for your service levels and know how the system is performing against that goal before you initiate chaos. If you don’t have SLOs in place, how will you know if the chaos engineering exercise is telling you anything about your service resiliency? If you’re basing your chaos engineering results on monitoring alerts alone, you’re not getting an accurate view of how users are impacted.

Without SLOs, when something breaks, you can only say “this is broken” and measure how long it takes to fix. In a cloud-native world, we expect the overall system to be resilient to small failures. The important thing is knowing whether the system’s inherent resiliency can absorb this particular failure. Does it need to be fixed immediately, or can it be fixed a little later? Your monitoring and alerts may tell you a service is in a very degraded or broken state, but SLOs tell you if the actual user experience is significantly impacted. If it’s not, the issue may not be as critical as you think it is. Instead of sounding the alarm for every little thing, wait for a real issue and send up the bat signal.

SLOs give you a much better analysis of whether your chaos engineering efforts are identifying critical issues that actually affect the user experience.
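One way to picture this is to compare a Service Level Indicator measured before the experiment with the same indicator measured during it, and check both against the SLO target. The sketch below assumes a simple availability SLI (good requests divided by total requests). The Window class, the assess_experiment function, and the request counts are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one measurement window (hypothetical data)."""

    good_requests: int
    total_requests: int

    @property
    def sli(self):
        # Availability SLI: the fraction of requests that succeeded.
        if self.total_requests == 0:
            return 1.0
        return self.good_requests / self.total_requests


def assess_experiment(baseline, during_chaos, slo_target):
    """Report whether the injected failure pushed the SLI below the SLO target."""
    return {
        "baseline_sli": baseline.sli,
        "chaos_sli": during_chaos.sli,
        "sli_drop": baseline.sli - during_chaos.sli,
        "slo_met_during_chaos": during_chaos.sli >= slo_target,
    }


# Made-up numbers: a 99.9% target, measured before and during an experiment.
baseline = Window(good_requests=99_950, total_requests=100_000)
during_chaos = Window(good_requests=99_920, total_requests=100_000)
report = assess_experiment(baseline, during_chaos, slo_target=0.999)
```

In this invented scenario the SLI dips during the experiment but stays above the target, which is the kind of result that suggests the failure was absorbed rather than felt by users.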

SLOs Help You Decide When to Insert Chaos

Another important role SLOs play in chaos engineering is helping you choose the best time and place to conduct chaos experiments. You want to be careful and thoughtful about what kind of chaos and how much chaos you apply to the system. Using error budgets, you can tell if you’re in a good place to implement chaos in your production environment. If you’re in a stable state and things are under control, that’s when you have the bandwidth to implement chaos engineering. You also have better insight into where to direct your investigations. As my colleague, Nobl9 SRE Alex Hidalgo, says, “Using error budgets is a perfect way to determine when and where you should apply chaos.”
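As a rough sketch of that idea, the error budget is the amount of unreliability the SLO allows (one minus the target), and a team can gate experiments on how much of it is left. The function names and the 50% threshold below are assumptions for illustration, not a Nobl9 API or a recommended policy.

```python
def error_budget_remaining(slo_target, good_events, total_events):
    """Fraction of the error budget still unspent in the current window.

    A negative result means the budget has already been overspent.
    """
    # The budget is the number of bad events the SLO tolerates.
    allowed_bad = (1.0 - slo_target) * total_events
    if allowed_bad <= 0:
        return 0.0
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


def ok_to_run_chaos(slo_target, good_events, total_events, min_remaining=0.5):
    # min_remaining is an assumed policy threshold: only experiment while
    # at least half of the budget is still available.
    remaining = error_budget_remaining(slo_target, good_events, total_events)
    return remaining >= min_remaining
```

For example, with a 99.9% target over 1,000,000 requests, the budget is roughly 1,000 failed requests. If 300 have failed so far, about 70% of the budget remains and the gate opens. If 800 have failed, only about 20% remains and the experiment should wait.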

Chaos engineering can boost your SLO efforts and vice versa. Invoked chaos offers the perfect opportunity to test out your SLOs. It can help you validate that you are looking at the right Service Level Indicators (SLIs). You may realize you missed the perfect indicator. Chaos engineering is also a great chance to test out the workflows that are in place for responding to critical alerts from both SLOs and your various monitoring tools.

SLOs and chaos engineering are both valuable SRE tools, and like the “Dynamic Duo,” they are most beneficial when working together. The Nobl9 SLO Platform is perfectly suited to help you unleash these superhero powers to improve reliability and the user experience.

Interested in learning more about SLOs? Contact us and remember to follow Nobl9 on Twitter!

Image Credit: Ali Kobab on Unsplash