Author: Natalia Sikora-Zimna
At Nobl9, we believe that a high-performing product is critical to our customers’ success. Over the last several years of delivering SLOs to our customers, we’ve learned a lot about making SLO calculations resilient. In terms of delivering a service that our customers rely on for their own reliability, I think of this a bit like using an oxygen mask in an emergency: we’re told that if a problem occurs, we should put on our own masks before attempting to help those around us so that we’re able to help them effectively. This is exactly what we do at Nobl9; when we find that we’re burning our error budgets too quickly, we make sure that our oxygen mask is on first. In practical terms, this means that we might rearrange the roadmap to focus on fixes and improvements to ensure better, more accurate insights for customers from their data.
Reliability is a broad topic and requires constant investment. We could apply all the resources we have to fix bugs, improve performance, enhance the user experience, and so on, and still not meet our goals. So we have recently focused on the critical path of our data: data input and data processing.
SLOs rely on real-time data, and any delay or inaccuracy in processing that data can result in missed opportunities to detect and address issues before they have a significant impact. Processing delays can also lead to frustrating false positives. Since data processing is the heart of our system, we’ve improved calculation precision. We’re also addressing technical debt that can lead to irregularities, such as occasional spikes in Nobl9 reliability charts showing values over 100% or under 0%. Improving data processing and calculation precision will give our customers more accurate, reliable, and actionable insights into the performance of their services.
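To illustrate why precision matters here, below is a minimal, purely illustrative Python sketch of a ratio-based error budget calculation. It is not Nobl9’s implementation; the function names and numbers are made up. It simply shows how duplicated or late-arriving data points can push an unclamped result past its natural bounds, producing the kind of over-100% spikes described above.

```python
# Illustrative only: a simplified occurrence-based error budget calculation.
# It shows why unclamped arithmetic on noisy counters can report >100% or <0%.

def error_budget_remaining(good: int, total: int, target: float) -> float:
    """Return the fraction of error budget left for a ratio-based SLI."""
    if total <= 0:
        raise ValueError("no data points in the window")
    sli = good / total                  # observed reliability
    allowed_bad = 1.0 - target          # error budget as a fraction of events
    burned = (1.0 - sli) / allowed_bad  # share of the budget already spent
    return 1.0 - burned

def clamp(value: float, low: float = 0.0, high: float = 1.0) -> float:
    """Keep the displayed value inside sensible bounds."""
    return max(low, min(high, value))

# Duplicated points make `good` exceed `total`, pushing the raw result above 1.0.
raw = error_budget_remaining(good=1005, total=1000, target=0.99)
print(raw)         # 1.5 -- nonsensical as a chart value
print(clamp(raw))  # 1.0 -- clamped for display while the data issue is investigated
```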
Because the outcome of any system relies heavily on the data coming in, we’ve also invested in hardening our integrations and improving our ability to detect irregularities in the data we’re gathering. We did this to safeguard ourselves from any data issues that may skew the SLO calculations.
In addition, we’ve implemented a toolset to improve the transparency of end user data within Nobl9 and help users troubleshoot any potential issues. As you can see, we’re addressing this complex problem from multiple angles.
Mistakes happen, and they are a normal part of the creative process. But undetected mistakes can be painful, especially if their consequences are not visible for some time. To alleviate some of this pain, we’re introducing Query Checker, a capability that temporarily puts newly created SLOs in a “testing” state. During this period, Nobl9 automatically checks the query used in the SLO to make sure that it’s correct and supported. With Query Checker, you will know almost instantly if a query won’t be able to deliver meaningful results and needs revisiting. Currently Query Checker is available for New Relic, Datadog, and Dynatrace, but we’ll be expanding the list of supported data sources soon.
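Conceptually, a check like this runs the query once against the data source and inspects the outcome. The sketch below is a hypothetical illustration of that idea; `run_test_query`, `check_slo_query`, and the result shape are assumptions for the example, not Nobl9’s actual API.

```python
# Hypothetical sketch of a query check: run the query once, sanity-check the result.
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class QueryCheckResult:
    ok: bool
    reason: str = ""

def check_slo_query(run_test_query: Callable[[str], Sequence[float]],
                    query: str) -> QueryCheckResult:
    """Issue the query against the data source and verify it returns usable data."""
    try:
        points = run_test_query(query)
    except Exception as exc:  # malformed or unsupported query
        return QueryCheckResult(False, f"query rejected by data source: {exc}")
    if not points:
        return QueryCheckResult(False, "query returned no data points")
    return QueryCheckResult(True)

# Example with a stand-in data source client:
result = check_slo_query(lambda q: [0.99, 1.0, 0.98], "avg:request.latency{service:api}")
print(result.ok)  # True
```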
Once an SLO is created, other anomalies may affect its results. For example, a data source outage might stop or delay data flow to your SLOs. To proactively inform customers about such problems, we’re releasing Metrics Health Notifier, a tool that detects anomalies in SLI data, notifies you about issues, and provides information about how to address them. The first version of Metrics Health Notifier will inform users about missing data, and in the future it will be expanded with more anomaly categories.
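The first category, missing data, boils down to noticing that no SLI points have arrived within an expected window and raising a notification. Here is a minimal sketch of that idea, assuming a 15-minute gap threshold and a placeholder notification hook; neither reflects the shipped behavior.

```python
# Illustrative sketch of a "missing data" check; threshold and hook are assumptions.
from datetime import datetime, timedelta, timezone

def detect_missing_data(last_point_at: datetime,
                        max_gap: timedelta = timedelta(minutes=15)) -> bool:
    """Return True when no SLI data has arrived within the allowed gap."""
    return datetime.now(timezone.utc) - last_point_at > max_gap

def notify(message: str) -> None:
    # Placeholder for an alerting hook (email, Slack, etc.).
    print(f"[metrics health] {message}")

last_seen = datetime.now(timezone.utc) - timedelta(minutes=42)
if detect_missing_data(last_seen):
    notify("No SLI data received for over 15 minutes; check the data source connection.")
```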
For our enterprise customers, we’ve added the option to deploy a dedicated Nobl9 instance on Google Cloud or AWS. Adding this choice allows customers to customize their cloud environment and optimize for reliability, performance, or other preferences.
In addition to these major changes, we’re also introducing some smaller but impactful improvements aimed at increasing the visibility of potential data source issues and giving customers more control over the data flow. For example:
- For customers who need to troubleshoot issues related to their Direct data source connections, we’re now providing access to Direct data source logs.
- Nobl9 tries to pull data from the previous minute; however, depending on the data source, a data point may not yet be available, and a missing data point can skew calculations. With Custom Query Delay, you can control how long Nobl9 waits before pulling data, as sketched below.
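The sketch below shows, in illustrative terms, how a configurable delay shifts the collection window back so slower data sources have time to publish their points. The function and parameter names are assumptions for the example, not Nobl9’s configuration schema.

```python
# Minimal sketch of a configurable query delay shifting the collection window.
from datetime import datetime, timedelta, timezone

def query_window(interval: timedelta,
                 delay: timedelta = timedelta(minutes=1)) -> tuple[datetime, datetime]:
    """Return (start, end) of the next data pull, shifted back by `delay`."""
    end = datetime.now(timezone.utc) - delay  # skip the freshest, possibly incomplete, data
    return end - interval, end

# With a slow data source, extend the delay so the points have time to land:
start, end = query_window(interval=timedelta(minutes=1), delay=timedelta(minutes=3))
print(start.isoformat(), "->", end.isoformat())
```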
Reliability is a team sport – Nobl9’s resilience directly translates into the reliability of our customers’ products and services. We’re constantly striving to improve our product and invest in reliability as an essential part of the user journey. We believe that by doing so, we can help our customers be more successful.