Author: Kit Merker
“How much unreliability can we get away with?”
Seems like a strange question for a business leader to ask, doesn’t it? And yet, every single part of a business relies on this question. Providing the level of service our customers demand while keeping costs under control and risks managed is the ultimate goal of a profitable business. Asking about unreliability is a huge part of this equation.
Still, the temptation to think of this as a “lazy” question is real. We’re taught from an early age to strive for perfection. We want it all: quality, efficiency, performance, speed. Despite the allure, “having it all” is a mirage—perhaps even a lie.
Even if you have a small team – maybe one person delivering a service to a few customers – striving for perfection is overkill. Even with very few “moving parts,” mistakes happen: you’ll be late to a meeting, forget a customer’s name, or drop the ball in some other way. Of course, you can’t do this all the time or you’ll get a reputation for being unreliable. But many of those everyday mistakes will be acceptable to your customers. And in most cases, your response after a screw-up matters more than the mistake itself.
Even with unlimited resources, we can’t attain perfection. How are we expected to reach it on a budget? Suppose we set the goal of putting a human on Mars and we’re given nation-state levels of resources to put towards safety, redundancy, testing, contingency, and overall reliability. What happens then? While many missions will succeed, some will, sadly, end in failure.
Perfection is not only elusive: it’s unattainable, impractical, and too lofty an ideal to discuss in an abstract sense. If NASA can’t do it, why would you expect your company to be able to?
In Defense of the Reasonable Goal
So if striving for perfection is quixotic, what then? Should we avoid lofty goals? That sounds equally ludicrous!
The ideal situation is to have a reasonable goal – one that’s slightly out of reach. Let’s call it “excellence” (a term that is a bit easier to define than perfection and not nearly as difficult to achieve).
Excellence means doing better than expected. It means doing so consistently over many repetitions. It also means improving and handling unexpected situations with ease. Excellence is achieving results in a way that goes far beyond chance or casual management.
One example of excellence is the Michelin star rating we often associate with high-end restaurants like The French Laundry. The rating system starts with quality and ends with consistency of experience from one visit to the next. As with Michelin stars, we should strive to consistently deliver the “core product.”
It’s hard, sure. But doing this “at scale” is even harder. Today, delivering that experience depends on the software systems that support our business. And providing excellent service in a way that creates a competitive moat (meaning competitors or newcomers cannot easily replicate it) is the bar to get over here. While excellence (consistently creating a premium experience) is an incredibly high bar and one that may be expensive to deliver, it is an infinitely lower bar than perfection.
Designing to Approach the Edge of Excellence
Business has gone digital, and now we need to address excellence in our customer-facing software systems. Excellence cannot be achieved by simple choice or determination. It must be designed. There are six critical design principles to abide by to live on the edge of excellence:
- Set reasonable goals based on customer expectations. We begin by identifying the interactions or services we’re providing to customers and defining clear good/bad, pass/fail criteria for each. In other words, how large can the “excellence-to-perfection gap” be? Informed by data, experience, and a degree of “gut instinct,” we set performance goals (aka Service Level Objectives, or SLOs) as an acceptable percentage of good vs. total interactions for each service. I was recently talking with Niall Murphy, an SRE lead in Microsoft Azure. He told me something instructive: “You want your customers to have an experience that you as the business deliberately decided to give them, and SLOs let you be much more deliberate about how you treat customers, not leaving anything to accident.”
- Track metrics vs. reasonable goals over time. We measure our ability to deliver to standards consistently, and we express this as a percentage of total interactions. Doing something well once might give you a good feeling (we were 100% successful!), but it tells you nothing about consistency. Consistency can only be proven over many repetitions (remember, that’s how you get a Michelin star).
- Systematically handle the unexpected. When things go wrong, we want to proactively prevent and rapidly respond. Again, the operative question is, “How much unreliability can we get away with?” We want two things here. First, a mechanism for overcoming mistakes or situations outside our control (failures) that put our service at risk of dipping below the excellence line. Second, rapid response systems that anticipate falling below the excellence line and correct the situation before there is meaningful deterioration to the service. If we can isolate our possible failures to avoid ripple effects (cascading failures), we can prevent small local failures from becoming broader impacts overall.
- Embrace mistakes. I would argue that the most important aspect of excellence is mistakes: creating opportunities to make them, learn from them, and improve, and doing this in a meaningful, organized, and persistent way, ideally in private or low-stakes environments. Embracing mistakes is a crucial step to building a company of excellence.
- Seek broad-based, incremental improvements. I think it was in the movie “Jiro Dreams of Sushi,” where they said excellence comes not from doing one thing 100% better, but from doing 100 things 1% better. A complex system with many moving parts will benefit more from incremental, near-diminishing-return improvement across every aspect than it will from massive gains in one or two areas.
- Constantly tune your goals. The expectations of your customers are a moving target based on what you’ve done before and also the market landscape around you. As the general public gets more used to the industry norm of a product or service’s quality, they will judge you more harshly for failing to meet that bar. On the other hand, if you’re outperforming a competitor, your customers might not be totally happy with your service, yet improving it might not be justified until the competition catches up. You may even be delivering something beyond excellence and therefore cutting into your margins, which is not good for business.
You can’t just exist on the edge of excellence without making constant adjustments. In practice, you will cycle back and forth between expensive over-delivery and slight under-delivery. So set your sights on maintaining a slightly-better-than-excellent service and design your system to sit in that sweet spot.
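To make the first two principles and the tuning loop a little more concrete, here is a minimal Python sketch of an SLO expressed as a percentage of good vs. total interactions, plus the error budget that falls out of it. Everything named here is an assumption made for illustration: the `SloWindow` type, the function names, the 99% target, and especially the `headroom` threshold used to flag possible over-delivery. None of it is a Nobl9 API, and it is only one way to frame the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Good and total interaction counts for one time window (hypothetical type)."""
    good: int
    total: int


def attainment(window: SloWindow) -> float:
    """Fraction of interactions that passed the good/bad criteria."""
    if window.total == 0:
        return 1.0  # assumption: a window with no traffic counts as meeting the goal
    return window.good / window.total


def error_budget_remaining(window: SloWindow, target: float) -> float:
    """Share of the allowed bad interactions still unspent (negative if overspent)."""
    allowed_bad = (1.0 - target) * window.total
    actual_bad = window.total - window.good
    if allowed_bad == 0:
        # A 100% target (or an empty window) leaves no room for any failure.
        return 0.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


def assess(window: SloWindow, target: float, headroom: float = 0.5) -> str:
    """Classify a window; `headroom` is an illustrative threshold, not a standard."""
    remaining = error_budget_remaining(window, target)
    if remaining < 0:
        return "under-delivering"
    if remaining > headroom:
        return "possibly over-delivering"
    return "on target"


if __name__ == "__main__":
    target = 0.99  # assumed goal: 99% of interactions are good
    for good in (9990, 9940, 9850):
        window = SloWindow(good=good, total=10_000)
        print(f"{attainment(window):.2%} good:", assess(window, target))
```

The point of the three-way result is the question this article keeps asking: a window that leaves most of its budget unspent may be costing more than customers need, while one that overspends has dipped below the excellence line. Where to draw those thresholds is a business decision, and it is exactly the thing the last principle says to keep tuning.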
Seek Excellence, Not Perfection
As you can see, perfection is almost impossible to discuss, much less design or implement. The longer you believe your company – even theoretically – can deliver a perfect service, the longer you’ll be in denial. Instead, embrace failure as a reality of the universe you cannot escape and build a system to handle the risks of failure. Design your business to live on the edge of excellence, consistently delivering beyond expectations, even by a little.
If you want to learn more about how SLOs and Error Budgets can help your organization achieve consistent levels of excellence at a reasonable cost, let’s talk.
Image Credit: Louis Hansel on Unsplash