Author: Alex Hidalgo
If there is one end-of-the-year conference I think you should check out, it’s definitely the virtual edition of SRECon Americas, running from December 7th through the 9th. The program committee has truly outdone itself with its selection of talks, with a wide range of speakers presenting on plenty of interesting and important topics. Luckily the sessions will be recorded, because I just don’t see how one person could possibly attend everything worth seeing this year! I wanted to write about why I’m excited for every single one of these talks, but at some point I had to whittle down the list. So, I decided to focus on talks that will best help you on your SLO journey.
Identifying Hidden Dependencies – Liz Fong-Jones, honeycomb.io
I’ve been around for a while, and I can remember when incredible uptime numbers were a badge of honor. “This DNS server has been up without a reboot for 4 years,” we used to boast! This kind of thinking was always a bit misguided, but never more so than today as our systems have become more complex and interdependent. If you never turn systems off, how are you supposed to know what happens if they go away due to some failure? (And if you don’t have an error budget, how can you know when is a reasonable time to find out?) You should never miss a chance to see Liz speak, and her story about how honeycomb.io was able to identify hidden dependencies in their stack by just turning stuff off should be a compelling one.
Avoiding Goodhart’s Law – Use SLO’s as Tools Not Cudgels – Marco Coulter, AppDynamics
Marco will be presenting about one of my favorite topics — and I don’t just mean SLOs in general! If you’ve read my book or ever heard one of my talks you’ll know I’ve been beating this same drum for a long time. SLOs give you better data to have better discussions to make better decisions. Using them as mandates can easily steer you down the wrong path. I’m very excited to hear Marco’s take on this.
Off the Beaten Path: Moving Observability Focus from Your Service, to Your Customer – Mohit Suley, Microsoft
Only a few hours into day one of the conference, we run into our first scheduling conflict, and our first reason to feel lucky that everything is going to be recorded! Presenting at the same time as Marco, Mohit will be talking about how to think about things from your customers’ point of view. Adopting SLO-based approaches to reliability is all about thinking about your users first, and you can’t do that without measuring what they care about. This should be a great overview of how you can help shift this thinking within your organization.
Squish Level Objectives: How SRE can Help Align Technical Work to User Benefit – Dave Stanke, Google
At the heart of SLO-based approaches to reliability is the concept that your users determine how reliable you are — not your backend metrics. But this leads to the question: how can you ever even know what your users actually think? And even if you can approximate that, how can you use that information to make better decisions? Dave is here to help you figure that out and improve your business at the same time.
Why SREs can’t afford to NOT do Chaos Engineering – Mikolaj Pawlikowski, Bloomberg
Direct from the abstract, Mikolaj asserts: “Chaos Engineering is steadily transforming from a gimmick to a serious, scientific discipline focused on observing and measuring the effects of the failure in systems of all shapes and sizes, in order to verify their behavior experimentally.” I couldn’t agree more! We all need to get better at examining the outcomes of potential failures, and don’t forget that SLOs are often the best way to measure and understand those failures! I’m excited to hear a talk that will cover systems of various complexity and size and how you can learn more about them.
Production Population Control: My Cattle are Rabbits! – Alex Nauda, Nobl9
As people responsible for the reliability of services we’re frequently told to think of these services as cattle and not pets. That is to say: no one should get too invested or emotionally tied to their systems. With the advent of cloud computing and container orchestration platforms this was supposed to become even more true; however, we all know that isn’t really the case. No matter what we do, someone will end up getting too attached to a system or service. Join Nobl9’s very own Alex Nauda as he discusses how you can use SLOs to better measure exactly what people’s sentiments are about your services and environments!
Latency and Availability Error Budgets Done Right at Scale – Fred Moyer, Zendesk
SLO-based approaches can be easy to explain but difficult to actually implement in a meaningful manner. This difficulty will also multiply when you’re dealing with large organizations and large systems. How do you implement a new way of thinking about reliability in these situations, and how do you make it as easy as possible for people to adopt this thinking? In this talk Fred is going to tell you all about how to develop successful formulae to do just that.
A Bartender’s Guide to Network Monitoring – John Blaho, Catchpoint
As a former bartender, I appreciate John’s analogy that mixing a drink is much like thinking about service reliability. It doesn’t matter if you think you’ve gotten all of the ingredients combined correctly; what matters is what your customers think about the drink. Starting with a base of meaningful SLO measurements, this talk will walk you through practical examples of how to make sure your users are getting the best possible experience, or cocktail!
Learning from Adaptations to Coronavirus – Panel
While this panel is not strictly about SLOs, I would be remiss if I didn’t spend some time convincing you to attend this closing session moderated by Nora Jones of Jeli.io. You cannot miss Fred Hebert from Postmates, Lorin Hochstein from Netflix, and Vanessa Huerta Granda from Enova talking about what they’ve learned during our ongoing pandemic. All three panelists work for organizations that had to rapidly scale due to the massive changes in how our world operates when we’re on lockdown and working from home. We can all learn from tragedy and these experts will step you through their own stories of doing so.
Final Thoughts
I really cannot impress on you how blown away I am by this entire lineup. While I chose to mostly focus this post on talks that concern SLO-based approaches, the entire program is brilliant. In addition to recommending these talks, I’d implore you to not miss Dr. Maguire talking about what she’s learned about incident response, Alex Elman examining how we can use better data to discover and communicate reliability concerns (I think error budgets are great here, btw!), J. Paul Reed discussing where automation should end and humans should begin, and so much more! Judging from the program I truly believe SRECon Americas 2020 is going to be a special event. I’ll be on the SRECon Slack for all three days. Come say “Hi!” and let’s learn from each other, together!