Objective Launch Decision Making with SLOs


If you are a product manager or tech lead, you typically face three key decision points in the process of issuing a new software release: 

  1. What’s in the release? 
  2. Go or No-go?
  3. Who takes responsibility for ongoing operations? 

SLOs can play a helpful role in each of these decisions.

Using SLOs to Decide What’s in the Release

In each release, you can work on features, bug fixes or technical debt, but you can’t do them all at 100% each. Instead, each of these activities receives only a piece of your resource pie. Your challenge is figuring out how to allocate scarce resources in order to have the biggest impact for your customers. 

What’s the best way to plan what the engineers will work on sprint to sprint? Some product managers talk to all the stakeholders and let them vote. Some do squeaky-wheel-style release planning, focusing on the issue favored by the vocal minority. And other PMs like to chase the new shiny features, especially when a big customer has asked for them. Unfortunately, product managers can face all sorts of biased pressures that are not in the best interest of the business.

The danger of succumbing to those biased forces is that they tend to deprioritize, or even ignore completely, the very important but often invisible elements of release planning such as reliability, cost efficiency, cost to serve, security, privacy, compliance, and performance. Raise your hand if you’ve ever skipped these sections in a PRD. These things are not necessarily visible… until they are! And then you have a problem.

There’s a better way: SLOs offer visibility and objectivity into these “silent assassins” of customer satisfaction. SLOs help you prioritize your release mix and avoid potential crises by calling to your attention the very important but often invisible “technical debt” issues that need to be addressed in your release cycle. 

You may not even know exactly what technical debt you have until you put your ear to the ground to hear the train coming. As your team starts to listen to its SLOs (setting objectives, measuring, and examining the scorecard), the SLOs will alert the team to the exact technical debt issues that need urgent attention. SLOs make the invisible visible.

Another way of looking at it is that SLOs give the important task of retiring technical debt or improving performance a respected seat and equal voice at the decision-making table, right alongside new features. These aren’t necessarily things that are easy to present in a customer roadmap meeting unless you have the data to back them up.

Keep in mind that SLOs should not be used to create hard-and-fast rules. At one of our recent SLO Bootcamps, a company described how they had set a fixed rule of devoting two-thirds of each release to features and one-third to retiring technical debt. As it turns out, this rule was causing them to spend too much time on technical debt when they weren’t having reliability issues that warranted it, and this “over-engineering” of reliability was stifling innovation and agility. On the other hand, there’s also a temptation, particularly among startups, to take the opposite approach: “move fast and break things.” Taken to an extreme, this disregard for reliability can be just as dangerous to your business, especially if your product is in a market that places a premium on stability.

The point is, don’t use SLOs to create draconian laws that defeat their purpose; rather, use SLOs as a framework to bring feature velocity and reliability priorities in balance, and help teams make the right decisions.

Using SLOs to Make Go/No-go Release Decisions

In most cases, SLOs may guide you to simply cut the lowest-priority feature from your feature stack to make room for reducing some technical debt. 

In extreme cases, such as when you experienced a serious reliability issue in the previous period, you may take the approach that Google advocates and actually block your team from releasing any new features until the technical debt is resolved.

Even in this era when continuous delivery is the favored approach to software development, Google SREs have been known to intentionally extend the release staging time for four or five days just to gather sufficient reliability metrics to guarantee a reliable launch on Google’s network, even when concern about the software itself was relatively low. At Google, there’s no doubt that reliability and customer satisfaction are paramount. But not everybody is Google.

It’s important to keep in mind that Google, one of the early innovators in SRE, is at the far high end of the spectrum of SRE best practices. Google knows what’s possible for their organization and, frankly, with over one billion monthly active users, they have a lot to lose if they don’t deliver. (For most of those one billion users, Google IS the internet.) Google has a really, really high internal expectation about what reliability can look like, and that colors their perspective of what a team should do to deliver a service. Google also has a well-honed, universally accepted SRE culture that invests in operational readiness. Google PMs expect to put effort into reliability, even before release planning begins, and Google SREs are perfectly comfortable drawing the line to protect the company against risk.

If you’re not Google, and you don’t yet have a culture fastidiously devoted to reliability, you need to adjust your approach to what’s right for your organization. What is your highest reliability capability? Keep it in the realm of possibility. (But, yes, you still need SLOs.)
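The spectrum described in this section, from trimming scope to holding features back entirely, can be sketched as a small error-budget check. This is only an illustration under stated assumptions: the `SloWindow` shape, the function names, and the 25% caution threshold are all hypothetical, and the right thresholds depend on what your organization can realistically sustain.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Observed results for one SLO over a measurement window (hypothetical shape)."""
    target: float       # e.g. 0.999 means 99.9% of requests should succeed
    total_events: int   # all requests seen in the window
    bad_events: int     # requests that violated the objective


def budget_remaining(window: SloWindow) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if not 0.0 < window.target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if window.total_events <= 0:
        raise ValueError("total_events must be positive")
    # The error budget is the number of bad events the target tolerates.
    allowed_bad = (1.0 - window.target) * window.total_events
    return 1.0 - window.bad_events / allowed_bad


def release_decision(window: SloWindow, caution_below: float = 0.25) -> str:
    """Map remaining budget to a release posture; the threshold is an assumption."""
    remaining = budget_remaining(window)
    if remaining <= 0:
        return "no-go: hold new features until reliability work lands"
    if remaining < caution_below:
        return "go with reduced scope: cut the lowest-priority feature"
    return "go"


if __name__ == "__main__":
    print(release_decision(SloWindow(target=0.999, total_events=1_000_000, bad_events=400)))
```

With a 99.9% target and one million requests, the budget is about 1,000 failed requests. In this sketch, 400 failures leaves roughly 60% of the budget and yields "go", while 1,200 failures overspends the budget and yields "no-go". Treat the output as an input to the conversation, not a law.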

Using SLOs to Establish Operational Responsibilities

When it’s time to move a product release from development to production, the “elephant in the room” is often the question “who is going to take responsibility for production?” SLOs are a great way to guide the decision of who takes the pager.

Of course, companies can organize their development and operation teams in many different ways. Some may take a “you-build-it-you-run-it” approach, which means the development team is directly incentivized to run the application in production efficiently. Many organizations divide labor between those who build stuff and those who run stuff. This approach might be a holdover from “the way things have always been done,” or it might be a strategic decision based on the skillsets of existing personnel or an attempt to optimize team interests and focus.

Regardless of why or how engineering teams are organized, SLOs are great for creating alignment during hand-offs to set clear expectations of reliability and to set a threshold for the minimum reliability for a given product or service. 

For example, suppose the dev team comes to the ops team and says, “This is mission critical; please run it at five nines.” SLOs give the ops team cause and justification to push back: Have you been able to achieve that historically? Can we break down the reliability goal a bit and understand which pieces really need five nines? Is the company prepared to deal with the necessary costs and slowdowns to meet that goal?
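To make the cost of “five nines” concrete in that conversation, it helps to translate the target into allowed downtime. The sketch below assumes a time-based availability SLO measured over a 30-day window; both the window length and the function name `allowed_downtime` are assumptions for illustration.

```python
from datetime import timedelta


def allowed_downtime(target: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Maximum total downtime a time-based availability target permits in a window."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    # The unavailable fraction (1 - target) of the window is the downtime budget.
    return window * (1.0 - target)


if __name__ == "__main__":
    targets = [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]
    for label, target in targets:
        print(f"{label}: {allowed_downtime(target)}")
```

Over 30 days, three nines allows about 43 minutes of downtime, four nines about 4 minutes, and five nines about 26 seconds. Seeing the numbers side by side makes it easier to ask which pieces truly need the strictest target.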

SLOs are also invaluable after handoff. Perhaps a team has already handed off a release, and a series of reliability issues has cropped up over time. The operational team (or the SRE team, same difference) can use SLO data to open a conversation about how to handle the issues and the shrinking error budget. Maybe the dev team joins the rotation to gain operational empathy or even “takes back the pager” for a while. Another option might be putting new features on hold, or mostly on hold, until reliability goals are met. Yet another approach would be to have a serious discussion about changing the reliability goals!

In summary, SLOs are ideal for helping product managers and tech leads address the three key decisions that need to be made with each release: 

  1. What’s in it? SLOs help you give technical debt its due consideration.
  2. Go or No-go? SLOs help you set the reliability bar at the proper height for your organization.
  3. Who’s in charge of ops? SLOs help your dev and ops teams align before, during and after hand-off to set reliability objectives and resolve issues in production.

SLOs should be a part of the logic and decision-making process of every product manager as well as the development and operation teams. In fact, SLOs should be an integral component of the “blood stream” that runs through your entire organization from customer engagement to service delivery.


Image Credit: SpaceX on Unsplash
