What tools do you need to get started with SRE?

Much has been written, especially by the founders of the Site Reliability Engineering (SRE) concept at Google, about the benefits of all parties in an organization working together to balance features and reliability. We recognize that there are two opposing forces that each make perfectly good sense: on the one hand, your company (especially your application developers and software engineers) wants to maximize the speed and agility of deploying new applications and product features. On the other hand, your company (especially your IT operators) also wants to maximize the reliability and availability of its infrastructure and minimize downtime that could adversely affect customer satisfaction.

Unfortunately, in far too many organizations, these two equally valid objectives are being pursued by siloed departments at cross purposes—and that operating model produces substandard results, frustrated employees, exasperated senior execs, and, worst of all, unhappy customers.


It doesn’t have to be that way. SRE builds a bridge between product managers/application developers and IT operations. SRE is a practical, data-centered way to find common ground between Dev and Ops, so that both functions can align on objectives that achieve optimal customer satisfaction. 

To learn more about how SRE came to be, I’ll refer you to the source, but here, I’d like to get down to brass tacks: identifying the three most essential tools in the SRE toolset—observability/monitoring, incident response, and service level objectives (SLOs).

1. Observability/Monitoring

Even if your role is not IT/ops, you probably understand the concept of monitoring a system to gain insight. Monitoring is about collecting metrics from a production system to gain an understanding of what’s really going on. In cloud operations, monitoring is particularly helpful for debugging and triggering alerts when something needs attention. The challenge of monitoring is separating the true signal (the few, critical things that actually need attention) from the noise (the many false signals that are at best a distraction and at worst another reason to hate pager duty). Of course, all this becomes even more difficult as your system scales. 
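As a rough illustration of separating signal from noise, the Python sketch below pages only when an error rate stays above a threshold for several consecutive samples, so a single transient spike does not wake anyone. The function name `should_page`, the 5% threshold, and the three-sample window are illustrative assumptions, not values from any particular monitoring product.

```python
from typing import Sequence


def should_page(error_rates: Sequence[float],
                threshold: float = 0.05,
                sustained_samples: int = 3) -> bool:
    """Return True only if the most recent `sustained_samples` readings
    all exceed `threshold` (hypothetical defaults; tune for your system)."""
    if sustained_samples <= 0 or len(error_rates) < sustained_samples:
        return False
    recent = error_rates[-sustained_samples:]
    return all(rate > threshold for rate in recent)


# A single spike is treated as noise; a sustained rise is treated as signal.
spike = [0.01, 0.20, 0.01, 0.01]
sustained = [0.01, 0.08, 0.09, 0.12]
print(should_page(spike))      # False
print(should_page(sustained))  # True
```

Requiring a sustained breach trades a little detection delay for far fewer false pages, which is one of many possible ways to cut alert noise.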

The concept of observability is similar to but slightly different from monitoring. Observability is a measure of how well we can understand the internal state of a system solely by looking at its outputs. In other words, it is how well we can deduce internal causes by observing external symptoms. The more observable our infrastructure is, the more success we will have in diagnosing and curing problems that arise. One way to improve the observability of a system is to provide more context in the log outputs, giving our monitoring systems greater insight into what’s really going on.
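Here is a minimal Python sketch of what "more context in the log outputs" might look like, using only the standard `logging` and `json` modules. The `JsonFormatter` class, the `context` field, and the sample values (`request_id`, `region`, `retries`) are hypothetical names chosen for illustration; real systems often rely on a dedicated structured-logging library instead.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object, including extra context."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # `context` is a hypothetical attribute supplied via `extra=`.
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str = "checkout") -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


logger = build_logger()
logger.error(
    "payment failed",
    extra={"context": {"request_id": "req-123", "region": "us-east-1", "retries": 2}},
)
```

Because every line carries the same machine-readable fields, a monitoring system can filter or group by them rather than parsing free-form text.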

2. Incident Response

One of the core tenets of SRE is making systems that are automatic, not just automated. That is, we aim for systems that run and repair themselves, and we want to intentionally minimize human involvement in the system so that operations can scale. Therefore, it may seem odd that incident response (by humans) is a primary tool in the SRE toolkit. 

Here’s why: Failures are inevitable*. You can automate responses to anticipated failures, but there will always be new causes for a new type of failure you didn’t anticipate—a new software release, a spike in demand, an outage in a third-party system. When you’re dealing with infrastructure, Murphy’s Law applies: Anything that can go wrong will. And Finagle’s Corollary applies as well: Anything that can go wrong will, and at the worst possible moment. The point of having a good incident response tool in your toolkit is to minimize incidents, prevent staff burnout, respond to incidents as efficiently and effectively as possible, and conduct post-mortem analysis to enhance staff training and fuel continuous improvement efforts. 

3. Service Level Objectives (SLOs)

You’ve probably been tracking right along with me so far in this discussion (“Monitoring, yep. Incident response, got it.”). So, don’t let me lose you here, because this is by far the most important tool of the three. 

In fact, SLOs are Job 1 for SREs. 

Let me ask you this: does your IT operations team have an availability goal? Is it realistic? Is it too high or too low? Is it intrinsically aligned with the desires of customers? Are you confident that your IT infrastructure can scale to meet the needs of your customers and the needs of your business? Most importantly, how did you set that availability SLO? What process did you use to get there? 

These are all great questions. And, if you’re like most people in your shoes, you’ve not thought much about them. Or, if you have, it’s highly unlikely that you’ve adopted a formal process for answering them.

Keeping infrastructure available (reliable) is an essential task—without it, no one can really trust the infrastructure in the first place! Yet, in many organizations today, we can’t seem to get ahead of inevitable reliability issues, and the only way we are learning to improve the system is when it breaks.

Well-constructed SLOs are the tools you need to solve this problem. SLOs are strong, understandable promises made to customers that their applications will run reliably and with adequate performance on a cloud. What makes SLOs uniquely valuable and powerful is that they are created by a collaborative team of application developers, product managers, IT operators, and other stakeholders in a business by strategically considering priorities and tradeoffs.

What do I mean by priorities and tradeoffs? Here is a non-IT example that illustrates the concept: Suppose you need to rent a room for a party. How large does the room need to be, and how much of your party budget can you allocate to room rental? If you opt for a smaller, cheaper room, what are the downsides if it turns out to be too small to hold all of your guests?

In this (admittedly abstract) example, we must make a judgment about the tradeoff between capacity and cost. In the world of cloud computing and software, we have capacity versus cost tradeoffs too. We also have tradeoffs between the speed of releasing new features and avoiding errors/failures in the system. We are constantly making judgment calls about tradeoffs, and too often we’re making those decisions based on gut instinct, without the input of all the company stakeholders, and without understanding how they will impact the customer. SLOs, in stark contrast, help us make tradeoff decisions based on data, with the input of all the company stakeholders and centered on what it takes to delight the customer.
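To make "based on data" a little more concrete, here is a small Python sketch that compares measured availability against an SLO target and reports how much of the tolerated failure allowance is left. That allowance is commonly called an error budget in SRE practice, a term this post does not otherwise use. The helper `evaluate_slo`, the 99.9% target, and the request counts are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    availability: float      # fraction of requests that succeeded
    allowed_failures: float  # failures the target tolerates for this traffic
    budget_remaining: float  # fraction of that allowance still unspent


def evaluate_slo(total_requests: int, failed_requests: int,
                 target: float = 0.999) -> SloReport:
    """Compare observed availability with an SLO target (hypothetical helper)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    availability = (total_requests - failed_requests) / total_requests
    allowed_failures = (1.0 - target) * total_requests
    budget_remaining = 1.0 - failed_requests / allowed_failures
    return SloReport(availability, allowed_failures, budget_remaining)


# Example with invented numbers: 400 failures out of 1,000,000 requests.
report = evaluate_slo(total_requests=1_000_000, failed_requests=400)
print(f"availability: {report.availability:.4%}")
print(f"budget remaining: {report.budget_remaining:.0%}")
```

Under these assumptions, a comfortably positive remainder suggests there is room to ship features faster, while a negative one suggests shifting effort toward reliability work.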

The SLO tool is the key to the ultimate value proposition for any business: products and services that keep customers happy.  

In fact, SLOs are so important that we’ve built Nobl9 on that premise. Our business is about helping our customers precisely quantify the experience of the user, then statistically and rationally translate that knowledge into wise tradeoffs and informed resource allocation decisions. Or, in more practical terms, we’re about:

  • saving IT operators time and hassle 
  • helping developers accelerate time to market 
  • reducing spend on unimportant infrastructure initiatives
  • redirecting IT investment to initiatives with the highest ROI
  • …all while achieving a level of performance that delights the customer

If that sounds like something that solves a problem in your organization, there are more useful resources here in the Nobl9 blog. Product managers might be interested in this post, and anyone helping teach the C-suite about SRE might want to read this one.

* Read more about how Netflix adopted a new mindset of software failure as the rule, not the exception, and survived the big 2015 AWS outage.

Image source: Matthew Henry on Unsplash
