Could Service Level Objectives Have Prevented AT&T's Outage?

Feb 23, 2024 | Author: Daniel Ruby

Reliability is part of your product, and yesterday AT&T discovered just how devastating a breakdown in reliability can be. For nearly six hours, the telecommunications giant suffered a massive outage in its cell network, affecting more than 1.7 million customers and disrupting 911 services. As rumors swirled about potential cyberattacks, the company announced that the cause of the outage was a software update - described by a spokesperson as "the application and execution of an incorrect process used as we were expanding our network."

An incident of this magnitude is every SRE's nightmare. Beyond site reliability, though, it is also the nightmare of every CEO, CTO, VP of Marketing, PR representative, and Director of Finance. It doesn't really matter where in the business you stand: a severe breakdown in reliability will leave you scrambling to either get or share answers.

So how could this have been prevented? Any reliability professional will see the words "application and execution of an incorrect process" and groan. AT&T's cellular offering comprises countless internal products and systems, all working together, all contributing to (or detracting from) the overarching reliability of the offering itself. I'm certainly not claiming to know AT&T's reliability stack, but an offering-level dashboard of service level objectives, pulling from the various reliability and observability platforms used across different parts of the organization, could quite possibly have prevented this outage.

If nothing else, seeing error budgets across a customer offering, both in real time and historically, and being able to drill down into any systems whose errors spiked in the lead-up to an outage (along with any annotations attached to those spikes) can speed up root-cause analysis and bring services back to customers in minutes rather than hours.
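
For readers newer to the term, here is a minimal sketch of what "seeing error budgets" and "drilling down" can mean in practice. It is an illustration only: the class, function, service names, and the 25% threshold are all hypothetical, and nothing here reflects AT&T's systems or any particular vendor's API.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Good/total event counts for one service over one SLO window (hypothetical model)."""
    service: str
    target: float  # e.g. 0.999 means 99.9% of events must be good
    good: int
    total: int

    def error_budget_remaining(self) -> float:
        """Fraction of the window's error budget still unspent (negative if overspent)."""
        allowed_bad = (1.0 - self.target) * self.total
        actual_bad = self.total - self.good
        if allowed_bad == 0:
            # A 100% target (or no traffic) leaves no budget to spend.
            return 1.0 if actual_bad == 0 else float("-inf")
        return 1.0 - actual_bad / allowed_bad


def services_to_investigate(windows, threshold=0.25):
    """Return names of services whose remaining budget is below threshold, worst first."""
    flagged = [w for w in windows if w.error_budget_remaining() < threshold]
    flagged.sort(key=lambda w: w.error_budget_remaining())
    return [w.service for w in flagged]


# Made-up numbers for three made-up services behind one customer offering.
example_windows = [
    SloWindow("provisioning", target=0.999, good=999_500, total=1_000_000),
    SloWindow("network-attach", target=0.999, good=998_000, total=1_000_000),
    SloWindow("billing", target=0.999, good=999_900, total=1_000_000),
]
drill_down = services_to_investigate(example_windows)
```

In this toy data, only the hypothetical "network-attach" service has overspent its budget, so it is the one a responder would open first.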

Reliability has real-world customer impacts and needs to be understood across an organization. Too often it's viewed as a cost center: how much do we pay for this, and how can we pay less without it breaking? When that is the mindset - and I am by no means suggesting AT&T has this perspective - the customer experience suffers as SREs and their managers work to justify their budget to executives.

As seen with AT&T's outage, reliability must be viewed as part of a company's product. That means not reliability examined in disconnected segments, but the overarching reliability of what a customer interacts with, with disparate metrics normalized into an ongoing view of how a company's product is performing.
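
One simple way to picture "normalizing disparate metrics" is to reduce every component to a good/total ratio and then combine those ratios with weights that reflect how much each component matters to the customer. The sketch below assumes exactly that; the function name, weights, and figures are hypothetical, and it is not a description of how any specific product computes composite SLOs.

```python
def composite_reliability(components):
    """Weighted average of per-component good/total ratios.

    components: iterable of (weight, good, total) tuples.
    Components with no traffic in the window are skipped entirely.
    """
    measured = [(w, good / total) for w, good, total in components if total > 0]
    weight_sum = sum(w for w, _ in measured)
    if weight_sum <= 0:
        raise ValueError("need at least one component with traffic and positive weight")
    return sum(w * ratio for w, ratio in measured) / weight_sum


# Hypothetical offering: a heavily weighted core component and a minor one.
offering = [
    (3, 990, 1000),  # weight 3, 99.0% good
    (1, 50, 100),    # weight 1, 50.0% good
]
offering_reliability = composite_reliability(offering)
```

The weighting is the design choice that matters here: it decides whether a failure in a minor component barely registers or drags the whole offering-level number down.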

The best way to do this is via Service Level Objectives, and the most impactful way to implement SLOs across your organization is with Nobl9.