Author: Natalia Sikora-Zimna
Update Nov 15: Replay supported data sources now also include Lightstep and New Relic.
If you’re an engineer like me, you hate waiting: you want to start whatever you’re doing now! Today, we launched Replay as part of the Nobl9 Platform so you can create Service Level Objectives (SLOs) with historical data in minutes instead of days or weeks. By drawing on past data from a specified period of time, you get an accurate picture immediately.
The SLO journey starts with asking what is important to measure and which Service Level Indicators (SLIs) will help you better understand your services’ reliability. You then define reliability targets based on these indicators that you think are appropriate for your services at present. As each service evolves, however, those targets may change. The true value of SLOs stems from monitoring how a service performs over time, using that information to have more informed business conversations, and adjusting your targets whenever needed. To do this, you’ll need a decent sample of data. Thanks to Replay, you can access up to 30 days’ worth of historical data minutes after creating an SLO, allowing you to draw conclusions and make adjustments much faster.
Replay pulls in the historical data while your SLO collects new data in real time. The historical and current data are merged, producing an error budget calculated for the entire period. SLO charts are generated as soon as the data ingestion is complete, providing a seamless overview of past and present data.
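To make the idea of a single error budget over merged data concrete, here is a minimal Python sketch. It is not Nobl9's implementation: the `Point` type, the function names, the good/total counting model, and the rule that live data wins on duplicate timestamps are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """Hypothetical SLI sample: counts of good and total events at a timestamp."""
    timestamp: int  # Unix seconds
    good: int
    total: int


def merge_points(historical, live):
    """Merge two series by timestamp (assumption: live data wins on duplicates)."""
    merged = {p.timestamp: p for p in historical}
    merged.update({p.timestamp: p for p in live})
    return [merged[t] for t in sorted(merged)]


def remaining_error_budget(points, target):
    """Return the fraction of error budget left over the whole merged period.

    The budget is the number of bad events the target allows; the result is
    negative once the budget is overspent.
    """
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    total = sum(p.total for p in points)
    if total == 0:
        return 1.0  # no traffic yet, so nothing has been burned
    allowed_bad = (1 - target) * total
    actual_bad = total - sum(p.good for p in points)
    return 1 - actual_bad / allowed_bad


historical = [Point(0, 990, 1000), Point(60, 1000, 1000)]
live = [Point(60, 995, 1000), Point(120, 1000, 1000)]
points = merge_points(historical, live)
print(round(remaining_error_budget(points, target=0.99), 3))  # 0.5
```

In this example, 15 bad events out of 3,000 against a 99% target use half of the 30 bad events the budget allows, so half the budget remains across the combined historical and live window.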
Setting up Replay for your SLOs is simple. To enable historical data retrieval for SLOs linked to data sources that support this feature (see below for details), you need to configure two parameters in the data source configuration wizard:
- Maximum Period for Historical Data Retrieval
- Default Period for Historical Data Retrieval
These values were introduced to safeguard the user experience by protecting against exceeding the data source’s API rate limits, as the requests for historical data will be counted along with the requests to fetch current data. The allowed Maximum Period for Historical Data Retrieval will differ from source to source depending on their data retention policies. For example, for Datadog it can be anything up to 30 days.
Once you have configured your data source, you can create an SLO and warm-start it with historical data. In the SLO wizard, you’ll see a Period for Historical Data Retrieval field whose value will be set to the Default Period for Historical Data Retrieval configured for the data source (note that this field will only appear if the selected data source supports Replay). You can override this value, but you will not be able to exceed the Maximum Period for Historical Data Retrieval set for the data source.
If you are creating an SLO for a data source configured prior to the introduction of Replay, the Default and Maximum Period for Historical Data Retrieval will each be set to 0 days by default. To use this feature, you will need to reconfigure the data source before adding the SLO.
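The relationship between the data source's two settings and the SLO-level value can be sketched as a small validation function. This is an illustration only, not Nobl9's code or API: the function name, the parameter names, and the choice to raise an error when the maximum is exceeded are assumptions.

```python
from typing import Optional


def resolve_retrieval_period(
    default_days: int,
    max_days: int,
    requested_days: Optional[int] = None,
) -> int:
    """Pick the historical retrieval period (in days) for a new SLO.

    default_days and max_days stand for the data source's Default and Maximum
    Period for Historical Data Retrieval. requested_days is the optional
    override entered for the SLO.
    """
    if not 0 <= default_days <= max_days:
        raise ValueError("default must be between 0 and the maximum")
    if requested_days is None:
        return default_days  # the SLO inherits the data source default
    if requested_days < 0:
        raise ValueError("period cannot be negative")
    if requested_days > max_days:
        # Assumption: an out-of-range override is rejected rather than clamped.
        raise ValueError(f"period cannot exceed the maximum of {max_days} days")
    return requested_days


print(resolve_retrieval_period(default_days=7, max_days=30))                     # 7
print(resolve_retrieval_period(default_days=7, max_days=30, requested_days=14))  # 14
print(resolve_retrieval_period(default_days=0, max_days=0))                      # 0
```

The last call mirrors a data source configured before Replay existed: with both values at 0 days, no historical data can be requested until the data source is reconfigured.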
Once you’ve completed the SLO configuration and saved your SLO, you’ll be able to see your SLI and error budget charts integrating the historical data for the specified period within just a few minutes.
You can use Replay to pull in metrics data from the past for newly created SLOs. The ability to pull in historical data for existing SLOs and use that data to recalculate their error budgets will be added in a future release; for now, you can achieve these results by re-creating those SLOs with Replay enabled.
A common use case is observability tooling migration. For example, Replay allows customers to accelerate migration to Amazon Managed Service for Prometheus by honing critical SLOs in minutes, not months.
“With existing metrics from AWS observability, Nobl9 lets users quickly and easily implement, maintain, monitor, and automate actions for SLOs without requiring a systems rewrite or expensive change management.” - Toshal Dudhwala, Global Business Development and GTM, AWS, and Imaya Kumar Jagannathan, Principal Solution Architect, AWS. See “AWS Observability Services and Nobl9 Provide Quick and Easy SLO Monitoring.”
Replay for AWS Managed Prometheus, Datadog, Graphite, Prometheus, and Splunk customers (using either the Agent or Direct connection method) is available in beta. Nobl9 will expand Replay to all of the many data sources we already support.
Replay is now available to all Nobl9 customers. If you’d like to try Nobl9 and see how it can help your business set up actionable reliability goals, sign up for Nobl9 Free Edition.