Author: Kit Merker
Congratulations! You are ready to sit down with your team and establish your first Service Level Objective (SLO).
You might be wondering where to start. Here’s an outline of how you could approach your first SLO-setting discussion with your developer and operations teams:
- Share a user story. Suppose you have an e-commerce user story that says the user expects to be able to add things to their cart and immediately check out. Your user has a certain latency threshold for checkout, and when checkout takes longer than that, your user gets upset and abandons their cart.
- Phrase this customer experience issue more precisely as an SLO. What proportion of users should be able to add items to their cart and check out within X amount of time?
- Identify and quantify the risks. What happens if a customer isn’t able to check out within that time frame? What does it cost when the SLO is missed?
- Brainstorm the risk categories together. What could go wrong and cause us to miss the SLO? Your team will respond with a wide variety of risks, likely including “our underlying infrastructure might go down,” “maybe we pushed a buggy update,” “we didn’t anticipate so much demand all at once,” and more.
- Ask “how could we mitigate these risks?” Weigh the resources and costs required to mitigate each risk against the cost of failure: which risks do you leave to chance, and which do you address proactively? Use this information to determine the service level indicators (SLIs) you will use to measure and track your ability to meet the SLO.
As you might imagine, this can be a fairly involved discussion, and all the stakeholders need to contribute their perspectives if they are to buy in to the result. It may take a while to reach agreement, but when you do, you will find that the developers and operators share a much more genuine understanding of (a) what the customer wants, (b) what new product features will truly cost, including operational support and capacity, and (c) how to prioritize and make customer-centered decisions when tradeoffs are necessary.
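To make the outline above concrete, here is a minimal Python sketch of the two calculations the discussion tends to produce: an SLI for the checkout story (the share of checkouts that finish within the threshold) compared against an SLO target, and a rough comparison of each risk's expected cost against the cost of mitigating it. Everything in it is an assumption for illustration: the names `CheckoutSLO`, `checkout_sli`, `Risk`, and `worth_mitigating`, the 2-second threshold, the 75% target, and all of the sample latencies and cost figures are made up, not taken from a real system or from any particular SLO tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckoutSLO:
    """Hypothetical SLO: a target share of checkouts must finish within a threshold."""
    threshold_seconds: float  # the "X amount of time" agreed in the discussion
    target: float             # e.g. 0.75 means 75% of checkouts


def checkout_sli(latencies_seconds, threshold_seconds):
    """Return the fraction of checkouts that completed within the threshold."""
    if not latencies_seconds:
        raise ValueError("need at least one latency measurement")
    good = sum(1 for t in latencies_seconds if t <= threshold_seconds)
    return good / len(latencies_seconds)


@dataclass(frozen=True)
class Risk:
    """Hypothetical risk entry produced by the brainstorming step."""
    name: str
    expected_annual_cost: float    # estimated cost of SLO misses this risk causes
    mitigation_annual_cost: float  # estimated cost of proactively preventing it


def worth_mitigating(risks):
    """Return the names of risks whose mitigation costs less than the expected failure cost."""
    return [r.name for r in risks if r.mitigation_annual_cost < r.expected_annual_cost]


# Made-up sample: ten checkout durations in seconds.
sample_latencies = [0.8, 1.2, 0.9, 3.5, 1.1, 0.7, 1.9, 2.4, 1.0, 1.3]
slo = CheckoutSLO(threshold_seconds=2.0, target=0.75)

sli = checkout_sli(sample_latencies, slo.threshold_seconds)
slo_met = sli >= slo.target

# Made-up cost estimates for the risks named in the brainstorming step.
risks = [
    Risk("infrastructure outage", expected_annual_cost=50_000, mitigation_annual_cost=20_000),
    Risk("buggy update", expected_annual_cost=30_000, mitigation_annual_cost=10_000),
    Risk("demand spike", expected_annual_cost=5_000, mitigation_annual_cost=15_000),
]
proactive = worth_mitigating(risks)

print(f"SLI: {sli:.0%}, SLO met: {slo_met}")
print("Mitigate proactively:", ", ".join(proactive))
```

With this sample, 8 of the 10 checkouts finish within 2 seconds, so the SLI is 80% and the 75% target is met. In the made-up cost table, only the demand spike costs more to mitigate than it is expected to cost in failures, so it is the one risk this simple rule would leave to chance. A real discussion would weigh more than a single cost comparison, but writing the numbers down this way gives the team something specific to argue about.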
We’d love to hear how your first SLO-setting discussion goes. Let us know on Twitter @nobl9inc.
Image credit: Chris Lawton on Unsplash