Author: Kit Merker
Avg. reading time: 4 minutes
The most persistent question among SRE hopefuls: How do you get developers in your organization to adopt service level objectives (SLOs)?
A case in point: I was talking to an SRE recently about how to get their internal app teams to adopt SLOs. They have been at it for a while and have about 10% coverage of SLOs throughout their thousand-developer-strong organization. The basic process they’ve followed is to reach out to each team, explain what SLOs are, help them define them, set up the instrumentation, and follow up weekly until they finally set up real SLOs they can view in their SLO tool. At this pace, they expect to be at full compliance — conservatively — within 3 years.
It gave me flashbacks to earlier in my career, working on build and engineering systems for Windows, and later Bing, at Microsoft. This is going back a ways, so I’m not going to reveal anything super secret or modern, but there was a very specific — and successful — approach we took to drive compliance with expected practices. That method: build warnings.
Instead of chasing down teams to get them to adopt SLOs, insert SLO adoption right into their workflow.
Let me explain. Without constant analysis and improvement, our build system would quickly bloat and engineers would do Bad Things(™) that might be efficient for them (elegant hacks) but had consequences for build times, repeatability, consistency, and just plain hygiene. We’d look for these patterns and create rules to sniff them out at build time and offer clean alternatives that made the system run better.
When you define a new build rule, you don’t want to just break the build on failure. That would lead to tens of thousands of surprise build breaks that would require going through the entire code base and fixing up everything. Feature delivery and testing would grind to a halt. On the other hand, putting out a doc and telling people “please update your rules” doesn’t work either; engineers are lazy and will ignore you.
Chasing these breaks down is incredibly frustrating for everyone involved.
So our solution was to create build warnings. First, we would define the patterns we wanted people to follow and make sure they were clearly documented and easy to find. Second, we would add tests that identified non-compliance with the desired behavior and printed warnings linking to our instructions. In some circumstances, the warnings would also be logged to our build statistics database, telling us precisely which components had warnings, how many times they repeated, and across how many recurring builds. Helpfully, the warning messages included links to documentation and a countdown until the warning would become a full-fledged breaking error. We would publish reports of the warnings, with the countdown, to management, and gamify adoption of new rules.
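As a minimal sketch of this warn-then-break mechanism (the rule name, enforcement date, and docs URL here are illustrative, not details from the original system):

```python
from datetime import date

# Hypothetical build-rule check that warns during a grace period,
# then breaks the build once the countdown expires.
ENFORCEMENT_DATE = date(2025, 9, 1)   # assumed cutover date
DOCS_URL = "https://internal.example.com/build-rules"  # placeholder

def check_rule(violations: list[str], today: date) -> int:
    """Return an exit code: 0 when clean or still in the grace
    period (warnings only), 1 once the countdown has expired."""
    if not violations:
        return 0
    days_left = (ENFORCEMENT_DATE - today).days
    for component in violations:
        if days_left > 0:
            # Grace period: point at the docs and show the countdown.
            print(f"WARNING: {component} violates a build rule. "
                  f"See {DOCS_URL}. Becomes an error in {days_left} days.")
        else:
            print(f"ERROR: {component} violates a build rule. See {DOCS_URL}.")
    return 0 if days_left > 0 else 1
```

The same per-component results could be appended to a statistics database to drive the management reports described above.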
So what happened?
The early adopters, who had simple, relatively recent code and believed in good hygiene, quickly changed their rules, got into compliance, and never saw another warning. The teams under a lot of pressure would ask for exceptions or help, and negotiate a plan to resolve their warnings. The teams with gnarly old code, strange customizations, or legacy features would ask for permanent exceptions, which we would grant on a case-by-case basis. It didn’t make sense to force teams in maintenance mode to modify code — and introduce risk — for the sake of compliance. And finally the stubborn teams — the last 3-5% of the org — would play schedule chicken, make excuses, and eventually get burnt when the build rules went into full effect and broke their build. At which point we could point to months of data, examples of other adopters, and multiple offers to help them fix up their build.
“What has this got to do with SLOs?” you ask. Well, as Alex Nauda, our CTO at Nobl9, likes to say, “Widespread adoption of SLOs will result in net lower reliability in large orgs — with more feature velocity. Change my mind.”
Challenge accepted. The adoption of a new coding or technology practice within a company doesn’t happen by magic. And it certainly doesn’t happen by talking about it in meetings. If you want to see SLO adoption, give teams the ability to self-serve, make clear what they need to do, and use internal social pressure to ultimately complete the adoption work.
By creating SLOs-as-code — like OpenSLO — you can start to enforce rules in your CI/CD pipeline that check for SLOs and send warnings. You can scan source repos, collect warning data, and provide reporting that shows which teams are pre-SLO and which have at least defined a baseline. As more teams adopt SLOs, you can turn those warnings into errors and drive the last bit of compliance. Maybe you have to grant some exceptions, but now teams are coming to you and making their case.
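A rough sketch of such a CI check, assuming a convention that SLO definitions live somewhere in the repo as YAML with `kind: SLO` (the file-matching heuristic and the warn-vs-enforce flag are assumptions for illustration, not part of OpenSLO itself):

```python
from pathlib import Path

def find_slo_specs(repo_root: str) -> list[Path]:
    """Find YAML files that look like SLO-as-code definitions."""
    specs = []
    for path in Path(repo_root).rglob("*.y*ml"):
        if "kind: SLO" in path.read_text(errors="ignore"):
            specs.append(path)
    return specs

def check_repo(repo_root: str, enforce: bool) -> int:
    """CI step: 0 if SLOs are defined; warn or fail if not,
    depending on whether enforcement has been switched on."""
    specs = find_slo_specs(repo_root)
    if specs:
        print(f"Found {len(specs)} SLO spec(s).")
        return 0
    msg = "No SLO definitions found. See the SLO onboarding guide."
    if enforce:
        print(f"ERROR: {msg}")
        return 1   # break the build
    print(f"WARNING: {msg}")
    return 0       # warn-only phase: log it and let the build pass
```

Flipping `enforce` from `False` to `True` org-wide is the warnings-into-errors moment, with the collected warning data as the adoption report in the meantime.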
Instead of chasing down teams to get them to adopt SLOs, insert SLO adoption right into their workflow. Break their build if they don’t have SLOs for each feature.
Another use of this type of configuration is enforcing or auditing error budget policies. One common pattern is to block or warn on deployments when a team has already exhausted its error budget. There’s a catch, though: the changelist may itself be a reliability improvement — and therefore “allowed” to be released. If you set up an exception process, you can give them a special auditable keyword to include in their CL. How strict to be is up to your organization. You could ratchet up the exception requirement for repeat offenses, or use “tell mode” and “ask mode” policies depending on release conditions.
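A sketch of such a deploy gate, where the `RELIABILITY-FIX` keyword, the budget-fraction input, and the tell/ask split are all assumed conventions for illustration:

```python
# Hypothetical deploy gate: block releases once the error budget is
# spent, unless the CL carries an auditable exception keyword.
EXCEPTION_KEYWORD = "RELIABILITY-FIX"   # assumed convention

def may_deploy(budget_remaining: float,
               cl_description: str,
               ask_mode: bool) -> tuple[bool, str]:
    """budget_remaining is the fraction of error budget left for the
    current window (e.g. 0.2 = 20%). Returns (allowed, reason)."""
    if budget_remaining > 0:
        return True, "error budget available"
    if EXCEPTION_KEYWORD in cl_description:
        # Exception claimed: allow, but leave an audit trail.
        return True, f"budget exhausted; {EXCEPTION_KEYWORD} exception claimed"
    if ask_mode:
        # Ask mode: the team must request an exception before shipping.
        return False, "budget exhausted; request an exception to proceed"
    # Tell mode: allow the deploy but log it for later review.
    return True, "budget exhausted; deploy logged for review (tell mode)"
```

The `reason` string is what you would write to the audit log, which is what makes repeat-offense ratcheting possible later.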
The point is to make the rules of the game clear, and make the system enforce the rules. If your goal is to drive adoption of SLOs, cut out the meetings and cajoling, and think systematically about the kinds of behaviors you expect from developers across the org. By setting up rules, documentation, warnings, errors, and logging on the artifacts of SLO adoption, you can scale up the rate of adoption really fast. No one wants to break the build.