Deloitte’s Observability Systems Approach - Evolving Reliability Practices Beyond MTTR

This is a guest blog written by Ganesh Seetharaman, Tech Resiliency Market Offering Leader | Deloitte Consulting LLP

The traditional ways of measuring and ensuring system reliability are failing us. As digital architectures become more distributed and interconnected, old metrics focused on averages and after-the-fact analysis can't keep pace with modern complexity. Fewer than 10% of reliability issues are even detected through support tickets or social media, leaving organizations blind to the vast majority of user experience problems [1].

The stakes are exceptionally high because users attribute negative digital experiences to the brand itself, not just the application. Moreover, reliability issues often occur during sessions with "available" applications, silently eroding trust without causing noticeable crashes or failures. This new reality demands a fundamental shift in how we approach system reliability.

Organizations should evolve past simple monitoring and mean-time metrics to a comprehensive framework combining deductive analysis, user perception, and early warning systems. By treating system health more like preventive medicine than emergency response, organizations can build resilient digital services that maintain customer trust even as complexity increases.

The evolution of reliability measurement tells an interesting story

What began with essential system monitoring and time-series data has matured through several stages, from simple correlation analysis to modern contextual observability powered by AI/ML. 

Traditional metrics such as Mean Time to Detect (MTTD), Mean Time to Repair, and Mean Time to Resolution (both abbreviated MTTR) primarily focus on averages, measuring how swiftly teams identify and resolve issues after they occur. These metrics and Key Performance Indicators (KPIs) are essential for assessing the performance of your reliability practices over time and at scale. However, a single Black Swan incident or internal change can significantly skew these indicators. Therefore, it is crucial to look into daily operations to gain a comprehensive understanding of your Site Reliability Engineering (SRE) team's processes and performance.

While these metrics have provided useful historical baselines, they often suffer from significant limitations:

  • Lagging indicators: Historical data often highlights problems too late to prevent disruptions.
  • Lack of user-centricity: Traditional uptime metrics focus on system health rather than user experience.
  • Over-simplification: MTTR doesn't account for transient issues or degraded performance.
  • Operational blindness: Reliance on firefighting rather than continuous improvement.
  • Masked patterns: Averages obscure meaningful patterns in system behavior.
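To make the "masked patterns" point concrete, here is a minimal sketch that compares a mean recovery time with percentiles of the same data. The incident durations are invented for illustration and the `summarize` helper is hypothetical; it is not part of any Deloitte or Nobl9 tooling.

```python
from statistics import mean, quantiles

# Hypothetical recovery times in minutes for ten incidents (illustrative data only).
recovery_minutes = [4, 5, 5, 6, 6, 7, 7, 8, 9, 240]


def summarize(durations):
    """Return the mean next to the median and 90th percentile of the durations."""
    # quantiles(..., n=100) yields 99 cut points: index 49 is p50, index 89 is p90.
    cuts = quantiles(durations, n=100, method="inclusive")
    avg = mean(durations)
    return {
        "mean": avg,
        "p50": cuts[49],
        "p90": cuts[89],
        # Share of incidents that were resolved faster than the mean.
        "share_faster_than_mean": sum(d < avg for d in durations) / len(durations),
    }


summary = summarize(recovery_minutes)
print(summary)
```

With this made-up data, one long outage pulls the mean far above the median, so nine of the ten incidents were resolved faster than the "average" suggests. A single MTTR figure hides exactly that shape.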

Deloitte's Observability Systems Approach

Deloitte's systems approach views reliability through three distinct but interconnected lenses:

  1. Perception: Understanding how users experience and interact with the system.
  2. Deduction: Systematic analysis of system behavior patterns, like running targeted diagnostics based on symptoms.
  3. Silent failures: Identifying and addressing issues before they impact users, comparable to early warning signs of health issues.

Choosing between heterogeneous and unified observability tooling architectures depends on your organization's specific needs and goals. Each approach has its advantages, and the right choice will be influenced by factors such as system complexity, organization size, and available resources.

Think of this toolchain architecture and observability approach as analogous to modern preventive healthcare. Essential monitoring, like an annual checkup, provides a general health overview. Deduction is like advanced diagnostics: running specific tests based on observed symptoms. The Observability 2.0 framework, integrated into the SDLC, combines these insights into holistic health management, continuously validating system health and building technological immunity, including capturing silent and transient failures.

Defining meaningful SLOs is particularly challenging given the multiple telemetry pipelines, diverse data sources, and considerable noise in metrics, logs, events, and traces. Implementing this approach without disrupting existing tooling and processes requires sophisticated tooling.  

Deloitte’s approach follows a streamlined, systematic process that can help ensure clarity and actionable insights for improving system reliability [2]:

  • Data Integration: Seamlessly access specific data from multiple sources, ensuring a non-disruptive approach to system monitoring [3].
  • Persona-Based Holistic View: Consolidate and present a high-level understanding of service health and risks, drawing on diverse, hybrid data sources to guide decisions around Service Level Objectives (SLOs) [4].
  • Assumption Testing: Test reliability assumptions through "what-if" scenarios (for example, with Nobl9’s SLI Analyzer), helping bridge the gap between IT teams and business objectives and align on realistic expectations.
  • Error Budget Management: Track error budgets and align resources accordingly to maintain the balance between innovation and reliability.
  • Operationalizing SLOs – "High-Quality Signals": Define policy-driven actions for SLOs and error budget management, reducing noise from traditional alerts while optimizing operations (ref: Nobl9 Alert Center).
  • Reliability Posture Assessment: Evaluate system health across multiple dimensions, from operational to executive-level dashboards, ensuring an accurate view of system performance.
  • Threshold Optimization: Fine-tune alert thresholds to minimize fatigue from constant notifications and ensure timely issue resolution.
  • Prioritizing Modernization and Tech Debt Projects: Using a data-driven approach, identify and prioritize modernization initiatives, helping IT teams focus on strategic business growth.
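The error budget and "high-quality signals" steps in the list above can be sketched with the standard SRE arithmetic: the error budget is the share of requests an SLO allows to fail, and the burn rate is how fast that budget is being spent. The function names and the paging and ticketing thresholds below are illustrative assumptions, not Nobl9 Alert Center settings or a Deloitte-prescribed policy.

```python
def _check(target: float, total: int) -> None:
    """Reject inputs for which the budget arithmetic is undefined."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if total <= 0:
        raise ValueError("total must be positive")


def error_budget_remaining(target: float, total: int, bad: int) -> float:
    """Fraction of the window's error budget still unspent (negative if overspent)."""
    _check(target, total)
    allowed_bad = total * (1.0 - target)
    return 1.0 - bad / allowed_bad


def burn_rate(target: float, total: int, bad: int) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A sustained rate of 1.0 spends the budget exactly over the SLO window.
    """
    _check(target, total)
    return (bad / total) / (1.0 - target)


def policy_action(rate: float, page_at: float = 14.4, ticket_at: float = 6.0) -> str:
    """Map a burn rate to an action; the thresholds are assumptions for illustration."""
    if rate >= page_at:
        return "page"
    if rate >= ticket_at:
        return "ticket"
    return "none"


# Hypothetical numbers: a 99.9% SLO, 1,000,000 requests so far, 250 of them bad.
print(error_budget_remaining(0.999, 1_000_000, 250))
# Hypothetical last hour: 10,000 requests, 200 of them bad.
print(policy_action(burn_rate(0.999, 10_000, 200)))
```

Alerting on burn rate rather than on every raw error is one way to reduce noise: a brief blip that barely touches the budget produces no action, while a fast burn escalates.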

To enable proactive issue detection, we analyze Service Level Objectives (SLOs) and Service Level Indicators (SLIs), correlate incident tickets with historical data, and establish temporal causation patterns, moving beyond simple cause-and-effect analysis.
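One simple way to picture the ticket-to-history correlation is to ask, for each incident, how long before the ticket was opened the SLI had already dropped below its threshold. The sketch below assumes a hypothetical `lead_times` helper, made-up SLI samples, and invented ticket IDs; it illustrates the idea and is not Deloitte's actual analysis method.

```python
from datetime import datetime, timedelta


def lead_times(incidents, sli_samples, threshold, lookback=timedelta(hours=2)):
    """For each incident, return the time between the earliest below-threshold
    SLI sample inside the lookback window and the moment the ticket was opened.
    Returns None for an incident with no such sample."""
    results = {}
    for ticket_id, opened_at in incidents:
        dips = [
            ts
            for ts, value in sli_samples
            if opened_at - lookback <= ts <= opened_at and value < threshold
        ]
        results[ticket_id] = (opened_at - min(dips)) if dips else None
    return results


# Hypothetical SLI samples (timestamp, success ratio) at 10-minute intervals.
t0 = datetime(2025, 1, 1, 12, 0)
sli_samples = [
    (t0, 0.9995),
    (t0 + timedelta(minutes=10), 0.9991),
    (t0 + timedelta(minutes=20), 0.9980),
    (t0 + timedelta(minutes=30), 0.9950),
    (t0 + timedelta(minutes=40), 0.9900),
]
# Hypothetical incident tickets (id, time opened).
incidents = [
    ("INC-1", t0 + timedelta(minutes=45)),
    ("INC-2", t0 + timedelta(minutes=5)),
]

result = lead_times(incidents, sli_samples, threshold=0.999)
print(result)
```

A consistently positive lead time across past incidents suggests the SLI could have served as an early warning before users filed tickets.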

This systematic approach enables organizations to create and test "what-if" scenarios, implement effective error budgeting, save time and resources through automated analysis, learn from historical patterns while maintaining forward-looking insights, and focus on solving real problems rather than chasing perceptions.
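A "what-if" scenario can be as simple as replaying historical SLI data against several candidate SLO targets and seeing which ones the service would actually have met. The sketch below uses invented daily request counts and a hypothetical `what_if` helper; it is a generic illustration and does not reproduce how Nobl9's SLI Analyzer works.

```python
# Hypothetical daily (good_requests, total_requests) counts for one week.
history = [
    (99_950, 100_000),
    (99_990, 100_000),
    (99_400, 100_000),
    (99_980, 100_000),
    (99_970, 100_000),
    (99_999, 100_000),
    (99_910, 100_000),
]


def what_if(history, target):
    """Replay the history against a candidate SLO target."""
    total = sum(t for _, t in history)
    bad = sum(t - g for g, t in history)
    allowed_bad = total * (1.0 - target)
    return {
        "target": target,
        # Above 1.0 means the error budget would have been exhausted.
        "budget_used": bad / allowed_bad,
        # Days whose own success ratio fell short of the target.
        "days_missed": sum(1 for g, t in history if g / t < target),
    }


for candidate in (0.999, 0.995, 0.99):
    print(what_if(history, candidate))
```

Comparing candidates side by side gives IT and business stakeholders a shared, data-backed starting point for agreeing on a realistic target.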

Most importantly, the approach aligns technical reliability measures with user journeys and business outcomes. Our objective is to minimize engineering effort while maximizing impact on customer experience, the true measure of system reliability. Organizations need to move beyond traditional metrics to build genuinely trustworthy digital services. This means:

  • Adopting SLOs and SLIs as primary metrics.
  • Implementing proactive monitoring and analysis.
  • Building system immunity through engineering resilience.
  • Focusing on user experience rather than just system uptime.

The time to act is now.

You're not too late to start. Incremental changes can deliver real impact, creating a cycle of enhanced reliability, improved customer experience, and reduced operational risk.

Ready to take the next step?

Learn more about Deloitte’s approach and how we can help you build a robust, reliable technology foundation. Our team will guide you through assessing your resilience, identifying key improvements, and implementing targeted solutions that drive measurable value.

Don't wait for the next crisis to expose vulnerabilities—act now to build resilient, future-proof technology. Contact us today!

 

[1] The State of Service Level Objectives 2023, Dimension Research

[2] Market Guide for Site Reliability Engineering Tooling, Gartner, 17 December 2024 - ID G00818313

[3] Nobl9 Reliability Center Platform Integrations

[4] Nobl9 System Health View
