Author: Kit Merker
Congratulations! You are ready to sit down with your team and establish your first Service Level Objective (SLO).
You might be wondering where to start. Here’s an outline of how you could approach your first SLO-setting discussion with your developer and operations teams:
- Share a user story. Suppose you have an e-commerce user story that says the user expects to be able to add things to their cart and immediately check out. Your user has a certain latency threshold for checkout, and when checkout takes longer than that, your user gets upset and abandons their cart.
- Phrase this customer experience issue more precisely as an SLO. What proportion of users should be able to add items to their cart and check out within X amount of time?
- Identify and quantify the risks. What happens if a customer isn’t able to check out within that time frame? What does it cost when the SLO is missed?
- Brainstorm the risk categories together. What are the things that can go wrong that would cause us not to be able to meet the SLO? Your team will respond with a wide variety of risks, likely including “our underlying infrastructure might go down,” “maybe we pushed a buggy update,” “we didn’t anticipate so much demand all at once,” and more.
- Ask “how could we mitigate these risks?” Weigh the resources and cost required to mitigate each risk against the cost of failure: which risks do you leave to chance, and which do you address proactively? Use this information to determine the service level indicators (SLIs) you will use to measure and track your ability to meet the SLO.
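To make the second and last steps concrete, here is a minimal Python sketch of how the checkout SLO and its SLI might be expressed. It is an illustration, not a prescribed implementation: the names `LatencySLO`, `sli`, and `is_met` are hypothetical, and the 2-second threshold, 90% target, and sample latencies are made-up stand-ins for the "X amount of time" and proportion your team would agree on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySLO:
    """Hypothetical SLO: `target` fraction of checkouts finish within `threshold_s`."""
    threshold_s: float  # assumed latency threshold in seconds (the "X" in the discussion)
    target: float       # assumed proportion of checkouts that must be fast enough


def sli(latencies_s, threshold_s):
    """SLI: fraction of measured checkouts that completed within the threshold."""
    if not latencies_s:
        raise ValueError("need at least one latency measurement")
    good = sum(1 for t in latencies_s if t <= threshold_s)
    return good / len(latencies_s)


def is_met(latencies_s, slo):
    """True when the measured SLI reaches the SLO target."""
    return sli(latencies_s, slo.threshold_s) >= slo.target


# Made-up checkout latencies (seconds) for illustration only.
samples = [0.8, 1.2, 0.9, 3.5, 1.1, 0.7, 1.0, 2.4, 0.6, 1.3]
slo = LatencySLO(threshold_s=2.0, target=0.9)

print(sli(samples, slo.threshold_s))  # 0.8 (8 of 10 checkouts were within 2 seconds)
print(is_met(samples, slo))           # False
```

In this sketch the SLI is simply "good events divided by total events," so the same discussion that settles the threshold and the target also settles what you need to measure.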
As you might imagine, this can be a fairly involved discussion, and every stakeholder needs to contribute their perspective if the outcome is to have real buy-in. It may take a while to reach agreement, but when you do, you will find that developers and operators share a much more genuine understanding of (a) what the customer wants, (b) what new product features will truly cost, including operational support and capacity, and (c) how to prioritize and make customer-centered decisions when tradeoffs are necessary.
We’d love to hear how your first SLO-setting discussion goes. Let us know on Twitter @nobl9inc.
Image credit: Chris Lawton on Unsplash