As we discussed in the previous post, Site Reliability Engineering (SRE) is an operating model that helps your organization grow and innovate with velocity while maintaining infrastructure reliability for service levels that keep customers happy. (It’s like DevOps on steroids.) The benefits of SRE are many. The end result is customer satisfaction, and the bottom line is efficient revenue growth. Conceptually, it’s a no-brainer.
But in order to champion SRE, a CEO also needs to understand the decision from a dollars-and-cents viewpoint and be able to defend it with all the stakeholders in the organization.
- How much is adopting SRE going to cost?
- How do you put metrics on the benefits you’ll receive?
- How do you calculate the ROI?
At its very core, SRE is a framework for thinking about ROI and risk, so you might say that the dollars-and-cents analysis you need is built right in, in the form of SLOs and error budgets.
In SRE, service level objectives (SLOs) are defined for many different aspects of an IT service. Each one marks the level of service that must be achieved to avoid an unacceptable risk of displeasing the customer. An error budget, then, is the complement of the SLO (100% minus the objective): the error rate we will tolerate for a given set of services, because we expect that rate will not upset customers enough to warrant the cost of preventing it. (The beauty of SRE is that these service level objectives* are not arbitrary; they are tied directly to business outcomes.)
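To make the relationship concrete, here is a minimal sketch of an error budget computed as 100% minus the SLO. The function names and the request counts are illustrative assumptions, not part of any particular SRE tool.

```python
from decimal import Decimal


def error_budget(slo_percent: float) -> Decimal:
    """Return the error budget as a fraction of all events (1 - SLO).

    Decimal is used so that 99.99% yields exactly 0.0001 rather than a
    binary floating-point approximation.
    """
    slo = Decimal(str(slo_percent))
    if not (Decimal(0) < slo <= Decimal(100)):
        raise ValueError("SLO must be greater than 0 and at most 100")
    return 1 - slo / 100


def allowed_failures(slo_percent: float, total_requests: int) -> int:
    """Number of failed requests the budget tolerates in one window."""
    return int(error_budget(slo_percent) * total_requests)


if __name__ == "__main__":
    # Hypothetical traffic: one million requests in the measurement window.
    for slo in (99.9, 99.99):
        print(slo, error_budget(slo), allowed_failures(slo, 1_000_000))
```

Under these assumptions, a tighter SLO directly shrinks the number of failures you can absorb before the budget is spent.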
Let’s use availability as an example. When we talk about how often your infrastructure is available (uptime), we typically speak in terms of “nines.” If your infrastructure is available 99.99% of the time (“four nines”), it will be unavailable for about 52.6 minutes a year. If it achieves “five nines,” it is up and working 99.999% of the time; that is, it is down only about 5.26 minutes a year. Note that once you have multiple overlapping services and redundant regions, you need to calculate uptime differently: as the proportion of customers served successfully. Measuring uptime in minutes alone may overstate your actual reliability.
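The arithmetic behind the “nines” can be sketched in a few lines. This example assumes a 365-day year, and it approximates “customers served successfully” by counting successful requests, which is an assumption about how you would measure it in practice.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored (assumption)


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of allowed downtime per year at a given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


def request_availability(successful: int, total: int) -> float:
    """Availability as the share of requests served successfully."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successful / total


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        print(f"{pct}% -> {downtime_minutes_per_year(pct):.2f} min/year")
    # Hypothetical counts: 500 failed requests out of one million.
    print(f"request-based availability: {request_availability(999_500, 1_000_000):.4%}")
```

The request-based measure is the one that stays meaningful when only some regions or some customers are affected by an incident.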
Avoiding Gold-Plated Infrastructure
In an ideal world, we’d want our infrastructure to achieve as many nines as possible. However, moving from one class of nines to the next is roughly ten times more expensive, because the leap requires significant people and infrastructure costs. And when you consider the inherent limitations of physics and the architecture of public networks, consistently delivering five nines of reliability becomes very nearly impossible.
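As a rough illustration of that trade-off, the sketch below applies the “ten times per nine” rule of thumb and compares it with the downtime each extra nine actually removes. The baseline of two nines and the factor of 10 are assumptions taken from the rule of thumb, not measured costs.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 365-day year assumed


def relative_cost(nines: int, baseline_nines: int = 2, factor: int = 10) -> int:
    """Cost relative to the baseline, growing by `factor` per added nine."""
    if nines < baseline_nines:
        raise ValueError("nines must be at least baseline_nines")
    return factor ** (nines - baseline_nines)


def downtime_minutes(nines: int) -> float:
    """Allowed downtime per year at the given number of nines."""
    return MINUTES_PER_YEAR / 10 ** nines


if __name__ == "__main__":
    for n in range(3, 6):
        saved = downtime_minutes(n - 1) - downtime_minutes(n)
        print(f"{n} nines: cost x{relative_cost(n)}, "
              f"downtime avoided vs. previous class: {saved:.1f} min/year")
```

Each step costs ten times more while buying back ten times fewer minutes, which is why the last nines are so rarely worth it.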
So how many nines are good enough? At what point on the “nines class scale” do your customers become unhappy with their service, that is, at what point do they notice and complain, or even walk away? In this case, the SLO is the uptime goal, and the error budget is a small acceptable allowance for the system being down.
SLOs and error budgets keep your customers happy while balancing the competing interests of product stakeholders who want to rapidly launch new features/products and IT operators who want to maximize infrastructure uptime. Here are two examples:
- In the case of a new deployment, both product teams and operators have to ask: “If this deployment causes an outage, will we still be within our error budget?” If yes, developers have the leverage to deploy; if no, IT operators have the leverage to refuse. SRE gives you a green light to build features when reliability is under control.
- What if IT operators want to invest in upgrades for the sake of improved reliability? The question must be asked, “Is it worth eroding our margins for the sake of better reliability?” Or, put another way, “Am I throwing money at an issue that isn’t a real problem?” (Am I paying for gold-plated infrastructure?)
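The deployment question in the first example can be sketched as a simple budget check. The class, its field names, and the 30-day window are hypothetical; a real policy would likely draw these numbers from your monitoring system.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_percent: float       # availability objective, e.g. 99.9
    window_minutes: float    # length of the SLO window in minutes
    spent_minutes: float     # downtime already consumed in this window

    @property
    def total_minutes(self) -> float:
        """Total downtime the SLO tolerates over the window."""
        return (1 - self.slo_percent / 100) * self.window_minutes

    @property
    def remaining_minutes(self) -> float:
        """Budget left after subtracting downtime already incurred."""
        return self.total_minutes - self.spent_minutes


def can_deploy(budget: ErrorBudget, worst_case_outage_minutes: float) -> bool:
    """True if a worst-case outage from this deploy still fits the budget."""
    return worst_case_outage_minutes <= budget.remaining_minutes


if __name__ == "__main__":
    # Hypothetical: 99.9% SLO over 30 days, 10 minutes already spent.
    budget = ErrorBudget(slo_percent=99.9, window_minutes=30 * 24 * 60, spent_minutes=10)
    print(f"remaining budget: {budget.remaining_minutes:.1f} minutes")
    print("deploy with 20-minute worst case:", can_deploy(budget, 20))
    print("deploy with 40-minute worst case:", can_deploy(budget, 40))
```

The same remaining-budget figure informs the second example: if plenty of budget is left at the end of every window, further reliability spending may be gold plating.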
You might be concerned about putting in place an error budget system that you can’t overrule. Think of the error budget as a fiat currency, with you (and upper management) as the central bank. You can always “print” more error budget, but do it too often and you will devalue the currency!
So, in terms of dollars and cents, here’s your cost/benefit equation:
- How much is it worth to you to retain a customer? Conversely, how much does it cost when you lose a customer? What value do you place on reducing customer churn? That’s one side of the ROI equation you must consider when adopting SRE.
- The other side of the equation is the cost to adopt the SRE approach, and, to be fair, there will be operational costs incurred, primarily in terms of staffing and training. A good training option is the SLO Boot Camp.
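The two sides of this equation can be put into a back-of-the-envelope calculation. Every figure below is a made-up placeholder; substitute your own churn and cost data.

```python
def sre_roi(customers_retained: int, value_per_customer: float, adoption_cost: float) -> float:
    """Simple ROI: (benefit - cost) / cost.

    benefit = customers who would otherwise have churned * value of each.
    """
    if adoption_cost <= 0:
        raise ValueError("adoption_cost must be positive")
    benefit = customers_retained * value_per_customer
    return (benefit - adoption_cost) / adoption_cost


if __name__ == "__main__":
    # Hypothetical inputs: 50 customers retained, each worth $20,000 a year,
    # against $400,000 in staffing and training costs.
    roi = sre_roi(customers_retained=50, value_per_customer=20_000, adoption_cost=400_000)
    print(f"ROI: {roi:.0%}")
```

This deliberately ignores second-order benefits such as faster feature delivery, so treat it as a conservative starting point rather than a full model.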
Like everything else your exec team evaluates for organization-wide implementation, approach SRE from a cost/benefit perspective, and consider your risk profile. Properly implemented and operated, SRE frees your application teams to focus on delivering value to your customers faster, with new features and capabilities, inside the risk-adjusted guardrails of SLOs that define what customers are willing to tolerate before bolting. At the same time, it gives your IT operations teams the freedom to make decisions about infrastructure management, unencumbered by an unachievable “never suffer an outage” standard that accomplishes nothing more than lining the pockets of your service providers and frustrating your best product managers and application developers.
In short, think of SRE as Smart Resource Engineering: a data-informed approach to delivering what customers want, within the bounds of the imperfections they’re willing to accept. It’s an approach that makes dollars…and sense!
*SLOs are so important that we’ve built Nobl9 on that premise. Our business is about helping our customers to precisely quantify the experience of the user, then statistically and rationally translate that knowledge into wise tradeoffs and informed resource allocation decisions. If that sounds like something that solves a problem in your organization, there are more useful resources here in the Nobl9 blog.