Steve McGhee, Reliability Advocacy Engineer for Google, spoke to us about SLOs and our upcoming SLOconf – the first Service Level Objective Conference for Site Reliability Engineers, taking place May 17-20. Steve will be there, speaking and attending.
The following interview has been lightly edited and condensed.
Tell us about your talk at SLOConf.
The basic concept of SLOs is something most people can understand, but the idea of where to apply SLOs is more difficult. One SLO for each product or one for each company? It’s about getting to the idea of how many SLOs are helpful and where they’re helpful, and once you get there, how do you interpret them? I also wrote two articles about SLOs, “Defining SLOs” and “Adopting SLOs”, to help people get started; they pull from many sources, including the SRE Book.
So the gist of my talk is about “SLO Math”. I like to think of this math as really, really basic electronics. When you have serial and parallel circuits you can make complicated circuits out of those two basic building blocks. Systems can be built in similar ways and what you’re doing is measuring the goodness of each component of those systems. That’s really the entire talk. It’s simply, how do you measure across serial systems versus parallel ones, and what’s the math we can use to have it make sense at a higher level. Also, how do we do something more complicated? Turns out, it’s much easier once you have these basic formulas in place. I even made some diagrams because saying math out loud doesn’t always make sense but showing a picture really lands for most people.
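The serial and parallel building blocks described above can be sketched in a few lines of Python. This is a minimal illustration rather than the formulas from the talk itself: it assumes component failures are independent, and the function names and the example availabilities are hypothetical.

```python
from math import prod

def serial_availability(availabilities):
    """Serial chain: every component must work, so multiply availabilities."""
    return prod(availabilities)

def parallel_availability(availabilities):
    """Redundant set: the group fails only if every component fails at once."""
    return 1 - prod(1 - a for a in availabilities)

# Hypothetical system: a load balancer, then two redundant backends,
# then a database. All numbers are made up for illustration.
backends = parallel_availability([0.99, 0.99])
system = serial_availability([0.999, backends, 0.999])

print(f"redundant backends: {backends:.4f}")
print(f"whole system:       {system:.4f}")
```

Because the two functions compose, a more complicated topology is just a nesting of serial and parallel calls. In this sketch the redundant pair comes out more available than either backend alone, while the serial chain is less available than its weakest link.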
Why do you think Google SRE is sponsoring SLOConf (which is very exciting btw)?
SLOs are foundational to SRE. SRE came out of Google and it’s being widely adopted. This is just my opinion, but I feel like we as Google SRE have been doing SLOs for a long time and we’ve made some mistakes along the way, so we want to help people not have to make those same mistakes. Learning from failure is the SRE way! SLOs are a powerful concept that can be explained pretty quickly but they’re also super nuanced and very easy to screw up. We feel like it’s important to not give people the abstract concept and then walk away. We want to make sure they can do good work with the internet. If you have a good app or website and it’s working, you should be able to keep that going. And it turns out today that keeping things working is pretty hard and anything we can do to help is for the common good.
Why are SLOs so important?
SLOs are the common currency between operations and product development. In the past, you had to make a trade-off between velocity and stability. You could change a lot and make it unstable, or change nothing and keep it stable. But now we have that common currency, which is the error budget. You know it’s bad for customers when a product is down for some amount of time, and you have defined how much time is okay, so you get to decide how to spend that budget. If we (the operations team) run out of this “money”, we know who to talk to about whether we should get a bigger budget, or maybe we should slow down on spending down this budget. SLOs are important because they define that budget.
Basically, you have two historically warring factions and introducing a way they can trade with each other means they can make peace. SLOs are for diplomacy. That’s my short answer.
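To make the “currency” concrete, here is a small sketch of how an error budget falls out of an SLO target. It assumes a time-based availability SLO over a 30-day window; the function names and the numbers are hypothetical, not something from the interview.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of allowed unavailability in the window for an SLO target."""
    total_minutes = window_days * 24 * 60
    return (1 - slo_target) * total_minutes

def budget_remaining(slo_target, bad_minutes, window_days=30):
    """Fraction of the budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - bad_minutes) / budget

# Assumed example: a 99.9% target with 30 bad minutes so far this window.
budget = error_budget_minutes(0.999)
left = budget_remaining(0.999, bad_minutes=30)
print(f"budget: {budget:.1f} min, remaining: {left:.0%}")
```

When the remaining fraction approaches zero, that is the cue for the conversation described above: either agree on a bigger budget or slow down the spending.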
What about smaller companies – not at Google’s scale? How do SLOs apply to them?
I think size is irrelevant, with one exception: if you have a single human in the room, you probably don’t need SLOs. In my opinion, SLOs are about communication between people. My favorite use for SLOs is for prioritizing engineering decision-making (prioritizing risk mitigation). If you have more than one person working for your company, you have to make trade-offs around what to work on first. If something is the highest risk to your company, fix that one first. By doing this, you’re going to have a higher impact. This is all based on Pareto analysis. If you have infinite work in front of you but you sort it based on its impact, you begin by chipping away at the most impactful problems.
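As a rough sketch of that sorting step, the snippet below ranks a list of risks by expected impact. The risks, their frequencies, and their durations are invented for illustration, and scoring by “expected bad minutes per year” is just one assumed way to estimate impact.

```python
# Hypothetical risk catalog:
# (name, expected incidents per year, expected bad minutes per incident)
risks = [
    ("bad config push", 6, 45),
    ("database failover", 2, 30),
    ("expired certificate", 1, 240),
    ("dependency outage", 4, 10),
]

def rank_by_impact(risks):
    """Sort risks by expected bad minutes per year, largest first."""
    scored = [(name, freq * minutes) for name, freq, minutes in risks]
    return sorted(scored, key=lambda item: item[1], reverse=True)

ranked = rank_by_impact(risks)
total = sum(cost for _, cost in ranked)

running = 0
for name, cost in ranked:
    running += cost
    print(f"{name:22s} {cost:4d} min/yr  cumulative {running / total:.0%}")
```

The cumulative column shows the Pareto effect: in this made-up data, the top two risks account for most of the expected bad minutes, so they are where to start.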
Why do you think this (SREs and SLOs) is catching on?
Operating complicated systems has been somewhat solved, but within the past 10-15 years what was complicated has become complex. Something that has “known unknowns” is complicated. Something that has “unknown unknowns” is complex. A television is complicated. Something that is adaptive and changing, has many people working on it, and is constantly evolving is complex.
If you’re successfully managing something complex, you can say that all outages are novel. Anything that breaks is breaking in that way for the first time. Nowhere in our history did we predict that this failure would happen. Working this way is different. It follows what is referred to as the OODA loop – observe, orient, decide, and act – and an SRE will constantly be doing those things because there will never be a playbook. All new outages are novel. SRE evolved out of the need for this type of operations, and it’s caught on because it solved the next layer of problems. As cloud computing emerged, there needed to be new processes. Observability and postmortems are both side effects of this new way of working – people were already doing it a little, but it wasn’t organized. Google wrapped it into a package, and from there, it took off.
How are business and tech folks using SLOs to make better decisions?
Well, I can tell you how I hope they’re using them. I hope they’re using SLOs to make data-driven decisions – mostly around prioritizing resilience engineering work compared to feature engineering work. That means making trade-offs about how much time we should spend on the resilience of the platforms vs. the new features customers are asking for. Without SLOs, error budgets, and risk analysis in place, it’s just a gut feeling, and a gut feeling is not data-driven – you can’t count on it.
I also hope companies start using SLOs as their primary source of alerts for operations teams. I’m talking about symptom-based alerting vs. cause-based alerting. When you go to the doctor, you tell them your symptoms – you don’t guess at the causes. The doctor starts with symptoms and diagnoses a cause based on an understanding of medicine and your environment. Unfortunately, cause-based alerting is how a lot of alert systems work today. SLOs, however, are inherently symptom-based. Using them as your alerting mechanism is a great place to start. It will make for a more sustainable operations culture and won’t burn people out. Almost every time you have an SLO-based alert, something can be fixed – it’s actionable.
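One common way to turn an SLO into a symptom-based alert is to watch how fast the error budget is burning. The sketch below assumes a request-based SLO; the function names, the traffic numbers, and the paging threshold of 10x are illustrative assumptions, not values from the interview.

```python
def burn_rate(bad_events, total_events, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total_events == 0:
        return 0.0  # no traffic, so no user-visible symptoms to alert on
    error_rate = bad_events / total_events
    return error_rate / (1 - slo_target)

def should_page(bad_events, total_events, slo_target, threshold=10.0):
    """Page when users see errors much faster than the SLO allows.

    The default threshold is an arbitrary assumption for this sketch.
    """
    return burn_rate(bad_events, total_events, slo_target) >= threshold

# Assumed last hour: 1,200 failed requests out of 100,000, 99.9% SLO.
rate = burn_rate(1_200, 100_000, 0.999)
page = should_page(1_200, 100_000, 0.999)
print(f"burn rate: {rate:.1f}x, page: {page}")
```

Nothing here inspects CPU, disk, or any other presumed cause. The alert fires only when users are actually experiencing failures faster than the budget allows, which is what makes it actionable.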
What is the biggest challenge you face when it comes to reliability?
Realizing that reliability is a large landscape prepares you for the journey ahead. Recognizing what you’re already doing and deciding where you want to go are the key steps. Unless you’re doing pacemakers as a service, you probably don’t need all the nines. You have a current position and a destination based on your customers’ expectations, but how do you get there? Being able to do all that planning is complicated. Being able to show this complexity, and showing people there is a path through all that madness, is the SRE goal. The high-level understanding of reliability is straightforward, but the underlying nuances are complex. And those two worlds haven’t met yet. The next ten years will be about exposing that and making an industry out of it. I think reliability is a lot like how security was, maybe 20 years ago.
How do you define reliability?
I call it r9y (or Ronny). Let’s make that trend (everyone laughs). I’d define Ronny as the ability not to be surprised when things go wrong. Being able to predict the future of your service is what reliability is about.
How do you think the pandemic has affected reliability?
I think reliability is going to play an even larger role in future enterprise software development, and COVID accelerated it. I don’t think it’s going back to where it was before the pandemic. I think it’s going to stay accelerated – maybe not where it was before and there will be a fall-off, but people are going to be a lot more open to it. And this forcing function has been equal across all generations.