Reliability vs Availability: A Best Practice Guide
Reliability and availability offer insights into system performance and are often mistaken for being the same thing. However, there are critical reliability vs. availability differences. Teams that understand those differences can better achieve outcomes that optimize user experience.
Availability focuses on whether a system is accessible when needed and is defined as the percentage of time the system remains operational and reachable. Reliability focuses on how well the system performs its intended function for users. It is the probability that a system or component will perform its required tasks without failure over a specified period.
Even if a service is up and available, it can't be reliable if the users' requests are handled with high latency. An availability metric would overlook this negative user outcome. Teams that want to overcome this gap can measure service reliability. The right service level objectives (SLOs) and service level indicators (SLIs) help teams measure whether an application is reachable and the real-life user experience.
This article explains reliability vs. availability in simple terms, discusses how to measure and evaluate both, covers the most important SLIs to use and common challenges, and offers optimization recommendations to improve user experience.
Summary of key reliability vs. availability concepts
The table below summarizes the reliability and availability concepts this article will explore.
| Concept | Description |
| --- | --- |
| Availability | The percentage of time a system or an application remains operational and reachable when required. |
| Reliability | The probability that a system or component will perform its required functions without failure over a specified period under real-world conditions. |
| Service level objective (SLO) | A measurable target for the performance or reliability of a service, such as uptime or response time, that a provider aims to meet over a specified period. It helps teams monitor service health and make informed trade-offs between reliability and innovation. |
| Service level indicator (SLI) | The foundation of service level objectives (SLOs), representing the metrics upon which they are built. |
| Error budget | The allowable amount of unreliability over a specific period. |
| Reliability challenges | Common challenges include cost, selecting the right metrics, and managing microservices and external dependencies. |
| Reliability and availability best practices | Recommendations include building redundant systems, implementing automatic failover, designing for scalability, performing regular maintenance, following reliability design patterns, prioritizing testing and automation, and monitoring with a focus on user outcomes. |
Understanding availability
Availability is the proportion of time a system remains operational and reachable when required. It reflects the system's ability to perform its function, remain functional despite failures of some components, and recover quickly from outages.
Availability is also a broad term that different organizations can define differently. For example, one organization may define an outage as the application being unreachable for users, while another may define it as any time a critical component becomes unavailable.
System availability is expressed as a percentage. Simply put, it is the ratio of total system uptime to the total sum of uptime and downtime.
The “nines of availability” table
The availability targets are expressed in nines, a common way for service providers, stakeholders, and customers to communicate availability percentages. For example, Amazon Web Services (AWS) promises its users a “four nines of availability” for EC2 instances, which means the instances are up 99.99% of the time, and users can expect only 4.32 minutes of monthly downtime.
| Availability % | Nines of availability | Allowed downtime per month | Allowed downtime per quarter | Allowed downtime per year |
| --- | --- | --- | --- | --- |
| 90% | One nine | 3 days | 9 days | 36.5 days |
| 99% | Two nines | 7.2 hours | 21.6 hours | 3.65 days |
| 99.9% | Three nines | 43.2 minutes | 2.16 hours | 8.76 hours |
| 99.99% | Four nines | 4.32 minutes | 12.96 minutes | 52.6 minutes |
| 99.999% | Five nines | 26.3 seconds | 1.30 minutes | 5.26 minutes |
How to measure availability
There are several metrics and approaches that can be used to quantify system availability. The next sections break down specific measures and calculations that can help organizations track their systems’ availability.
Uptime
Uptime is the first metric that must be considered when measuring an application's availability. It refers to the duration an application was up and running and is usually represented as a percentage. This percentage is calculated over a period of time, like a month, a quarter, or a year, and is reduced whenever the application is down or not operational.
Here is the formula to calculate availability with uptime:
Availability% = Uptime / Total time * 100
For example, if an application has been operating as expected for 8.75 of the last 10 hours:
Availability% = (8.75 / 10) * 100 = 87.5%
Downtime
This is similar to the previous method, but if the metric you track is downtime rather than uptime, you can use it directly to determine availability. Here is the calculation formula:
Availability% = ((Total time – downtime) / Total time) * 100
For the same previous example:
Availability% = ((10 – 1.25) / 10) * 100 = 87.5%
Based on requests
For some services, the metric you track may be the number of HTTP requests completed successfully. In that case, it might be more convenient to calculate your availability percentage based on requests instead of uptime or downtime, using the following calculation:
Availability% = (Successful Responses / Total Requests) * 100
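To make these formulas concrete, here is a minimal Python sketch of all three calculations (the function names and example numbers are illustrative only):

def availability_from_uptime(uptime_hours, total_hours):
    # Availability as the ratio of uptime to total time
    return uptime_hours / total_hours * 100

def availability_from_downtime(downtime_hours, total_hours):
    # Availability as the time left after subtracting downtime
    return (total_hours - downtime_hours) / total_hours * 100

def availability_from_requests(successful_responses, total_requests):
    # Availability as the share of requests answered successfully
    return successful_responses / total_requests * 100

# The example from this section: 8.75 operational hours out of 10
print(availability_from_uptime(8.75, 10))     # 87.5
print(availability_from_downtime(1.25, 10))   # 87.5
print(availability_from_requests(875, 1000))  # 87.5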
Series dependency
A series dependency system is designed so that its components rely on each other. This means that for the system to be operational, all the components must be up, and any interruption in one of the components will lead to the whole system going down.
For example, consider an e-commerce application with these three components:
- A Web server with an availability of 99%
- An application server with an availability of 99%
- A database server with an availability of 99%
The application's availability depends on its series-dependent services.
The availability calculation for the whole system would be the product of the availability of each of its single components, meaning that the overall availability of a system will decrease as the number of components in the system increases. Let’s do that calculation for our example:
Availability% = A1% * A2% * A3% = 99% * 99% * 99% = 97.03%
It is worth noting that this calculation assumes complete independence between the service components. In real-world scenarios where failures are correlated, service dependencies should be carefully considered when designing a system's reliability.
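As a quick illustration, the product rule above can be expressed in a few lines of Python (a sketch, with availabilities given as fractions rather than percentages):

from math import prod

def series_availability(component_availabilities):
    # Every component must be up, so the availabilities multiply
    return prod(component_availabilities)

# Web server, application server, and database server at 99% each
print(series_availability([0.99, 0.99, 0.99]) * 100)  # ~97.03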
Parallel dependency
A parallel dependency system means that the system can be functional as long as one of its components is working, so the only case it can be unreachable is if all the components are down simultaneously.
For example, consider a system with a load balancer and three web servers behind it handling user requests in parallel:
- Webserver 1 with an availability of 99%
- Webserver 2 with an availability of 99%
- Webserver 3 with an availability of 99%
Application availability depends on the availability of any web server
The availability of such a system will be as follows:
Availability% = 100% - ((100% - A1%) * (100% - A2%) * (100% - A3%))
= 100% - (1% * 1% * 1%)
= 99.9999%
It is worth noting that this calculation assumes perfect failover between components with no downtime during the switching process. In real-world scenarios, failover mechanisms may introduce delays or errors, slightly reducing the system's availability.
In contrast to series dependency systems, this time, we notice that a system's overall availability will increase as the number of components in the system increases.
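The same calculation can be sketched in Python (again assuming fractional availabilities and fully independent components):

from math import prod

def parallel_availability(component_availabilities):
    # The system fails only if every component fails at the same time
    return 1 - prod(1 - a for a in component_availabilities)

# Three web servers at 99% each behind a load balancer
print(parallel_availability([0.99, 0.99, 0.99]) * 100)  # 99.9999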
Five factors affecting availability
Multiple variables influence service availability and can impact how teams implement systems and contingencies. Let's take a look at the five primary factors that affect system availability.
Planned maintenance
Planned outage windows for system updates or security patches should be included when evaluating availability because users may still try to use the application during those times. However, including them also reduces the measured availability percentage.
Suppose the team behind a global shopping application schedules important system updates at midnight; users in other time zones still face interruptions that affect their ability to reach the application.
Infrastructure quality
Outdated software and faulty hardware are common causes of downtime. Contributing factors include:
- Connectivity: Network issues to and from your infrastructure can prevent users from accessing the application.
- Compute resources: Server outages can cause an application to be unreachable.
- Storage: Storage and database failures can disrupt data access or persistence, affecting availability.
Geographic distribution
Systems distributed across multiple regions may have differing levels of availability due to differences in infrastructures and network performance. For example, a global streaming platform might provide a faster experience for users in countries with better network infrastructure than countries with limited bandwidth.
System architecture
The design of a system and how the components are dependent on each other impact its overall uptime. Parallel dependencies tend to have higher availability than series dependencies, as in parallel dependencies, the system can be functional as long as one of its components is working, while in series dependencies, all the components must be up for the system to work.
For example, a content delivery network (CDN) replicates the load across multiple nodes. If one node fails, another can serve the request without affecting the application.
Traffic load
High traffic load at peak times can affect the system's availability, leading to reduced uptime. For example, Amazon faced an outage on Prime Day in 2018 due to unexpectedly high traffic; it couldn't secure enough servers to handle the traffic surge, causing issues with loading pages and completing purchases. Amazon reportedly lost an estimated $72 million in sales before the problem was resolved.
Understanding reliability
Reliability refers to the probability that a system or component will perform its required functions without failure over a specified period under real-world conditions. It indicates how likely a system is to perform well and to keep working without failure or outages.
A real-world example of a reliable service is Amazon Web Services (AWS), which delivers highly reliable cloud services. AWS backs services such as Amazon S3, RDS, and DynamoDB with Service Level Agreements (SLAs), and these services are distributed across multiple Availability Zones (AZs) to ensure redundancy and fault tolerance.
In contrast, the rollout of the healthcare.gov website, launched in 2013, is an example of poor reliability. The website was meant to let users shop for healthcare plans, but users faced frequent outages, slow responses, and inconsistent data. The site was not scalable enough to serve the number of users it received (around 250,000, five times more than expected), and it went down within two hours of launch. The platform was also rushed out without completing the required testing process.
Achieving reliability requires more than a good system design; it requires continuous system performance measurements to ensure alignment with the user's expectations. That's why organizations often rely on Service Level Objectives (SLOs) to ensure and measure reliability. SLOs provide clear, quantifiable targets for system performance and serve as a critical tool to align expectations and track the reliability of services and applications over time.
How service level objectives (SLOs) measure reliability
SLOs represent measurable targets for a system's reliability over a specified period, often expressed as percentages. These objectives define, in a quantifiable way, what a satisfactory experience looks like for end users and how reliable a system should be.
For example, a team can define an SLO that specifies that an application must be available 99.5% of the time over one year. Another SLO could state that 97% of user page loads should complete in less than 3 seconds within a rolling 30-day period. These measurable targets help align teams on what represents reliable performance and provide a benchmark for tracking progress over time.
Composite SLOs
Individual SLOs are effective for tracking the performance of specific components, but complex applications may consist of multiple services, and each service may have its own SLO. That's why determining the application's overall reliability requires more than just looking at those individual SLOs, and this is where composite SLOs come into play.
Composite SLOs combine the reliability targets of multiple services to represent the overall reliability of a system or application. For example, imagine an application composed of three microservices: a database, an API, and a front-end service. Each of these has its own SLO. The overall reliability will be calculated based on the composite SLO of the whole application.
A composite SLO combines the individual services' SLOs from one or more data sources
The weighting method
For a complex application with several services, the importance of each service for the user experience will vary, and that should be considered when calculating the composite SLO. Weighting allows critical components of the application to have a more significant influence on the composite SLO target, ensuring their importance is appropriately reflected.
However, it is essential to recognize whether the evaluated services are independent or have dependencies. In hard (series) dependencies, where the failure of one critical service, such as a frontend, renders other services, such as payment processing, unusable, it might be more practical to monitor the critical service independently rather than as part of a composite. For soft (parallel) dependencies, where multiple components contribute to the overall user experience but are not strictly sequential, the weighted average method used in composite SLOs can be helpful.
Use case 1: SLOs with equal weights
Let's consider an application with four services, A, B, C, and D, with SLOs of 99%, 98%, 95%, and 97%, respectively. Here, we assign equal weights, with a weight of 1 for each service.
Normalized weight = Weight of service / Sum of all weights * 100 = 1 / 4 * 100 = 25%
|  | SLO A | SLO B | SLO C | SLO D |
| --- | --- | --- | --- | --- |
| Target | 99% | 98% | 95% | 97% |
| Weights | 1 | 1 | 1 | 1 |
| Normalized weights | 25% | 25% | 25% | 25% |
In this case, the composite SLO can be set as follows:
Composite SLO = (99% * 25%) + (98% * 25%) + (95% * 25%) + (97% * 25%) = 24.75% + 24.5% + 23.75% + 24.25% = 97.25%
Use case 2: SLOs with varying weights
For the same example, we now assign different weights to services A, B, C, and D: 8, 6, 4, and 2, respectively.
|  | SLO A | SLO B | SLO C | SLO D |
| --- | --- | --- | --- | --- |
| Target | 99% | 98% | 95% | 97% |
| Weights | 8 | 6 | 4 | 2 |
| Normalized weights | 40% | 30% | 20% | 10% |
In this case, the composite SLO can be set as follows:
Composite SLO = (99% * 40%) + (98% * 30%) + (95% * 20%) + (97% * 10%) = 39.6% + 29.4% + 19% + 9.7% = 97.7%
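Both use cases boil down to a weighted average of the individual SLO targets. Here is a minimal Python sketch of that calculation (function and variable names are illustrative):

def composite_slo(targets, weights):
    # Normalize the weights, then take the weighted average of the SLO targets
    total_weight = sum(weights)
    return sum(t * w / total_weight for t, w in zip(targets, weights))

slo_targets = [99, 98, 95, 97]  # services A, B, C, and D

print(composite_slo(slo_targets, [1, 1, 1, 1]))  # 97.25 (equal weights)
print(composite_slo(slo_targets, [8, 6, 4, 2]))  # 97.7 (weights favoring A and B)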
How to assign weights
There's no universal rule about weighting individual SLOs. The best approach will depend on your system's context and specific requirements.
Here are some common approaches for weight assignments:
- Availability-critical SLOs: In most cases, service availability is more important than service response time, as a slow but functional service is better than a completely unavailable service. In such cases, it is recommended to assign a higher weight to the availability SLOs than the latency SLOs.
- Latency-critical SLOs: In some specific services, such as online meetings and video conference tools, the response time will be as important as the service's availability. In such cases, it is recommended that equal weights be assigned for availability and latency SLOs.
- Based on business requirements: When creating a composite SLO for a service or application that includes a multi-step user journey, it's recommended to assign weights based on the relative business importance of each step. The steps that have more influence on the business should have higher weights. For example, ecommerce sites might prioritize order and payment steps.
Reliability SLIs
SLIs are the foundation of SLOs and represent the metrics upon which they are built to measure reliability and performance. Those metrics come from analyzing data points from different data sources, so a monitoring tool is essential for collecting them.
It's also worth noting that not every metric can serve as an SLI; only those that reflect the user experience should be included.
The table below shows the main SLI metrics used to measure reliability:

| Reliability SLI metric | Description | SLO example |
| --- | --- | --- |
| Latency | The time taken by a service or a webpage to respond to a user request. | 95% of web page loads should complete within 2 seconds over a 30-day period. |
| Error rate | The percentage of requests to a service or website that result in error codes or failures over a period of time. | The error rate should be less than 0.1% of all transactions made during a week. |
| Throughput | The number of requests that an API or service can handle in a given time period, often expressed as RPM (requests per minute). | The website should handle at least 1,000 requests per second on average during peak hours over a 7-day period. |
| Freshness | How recently the information accessed by the user was updated. | 95% of data presented in reports should be updated within 1 hour of the latest available information over a 30-day period. |
| Correctness | The proportion of records entering a data pipeline that are processed and produce the correct output value. | 99.8% of records processed by the data pipeline should produce correct outputs over a 7-day period. |
| Coverage | The proportion of jobs or records that were successfully processed within a defined time window. | A batch job should successfully process 99% of the jobs over the week. |
| Durability | The proportion of stored data that remains readable and retrievable without corruption over a given period. | 99.5% of stored data should be successfully restorable from backup over the last month. |
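In practice, SLIs like these are derived from raw monitoring data. The short Python sketch below (using made-up sample data) shows how a latency SLI and an error-rate SLI could be computed from a batch of request records:

# Sample of recent requests: (latency in seconds, HTTP status code)
requests_sample = [(0.8, 200), (1.4, 200), (3.2, 200), (0.5, 500), (1.1, 200)]

latency_threshold_seconds = 2.0
total = len(requests_sample)

# Share of requests served within the latency threshold
fast_enough = sum(1 for latency, _ in requests_sample if latency <= latency_threshold_seconds)
latency_sli = fast_enough / total * 100

# Share of requests that returned a server error
errors = sum(1 for _, status in requests_sample if status >= 500)
error_rate = errors / total * 100

print(f"Latency SLI: {latency_sli:.1f}%")  # 80.0%
print(f"Error rate: {error_rate:.1f}%")    # 20.0%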
Error budget
An error budget is another way to express a system's reliability. It is defined as the acceptable amount of service unreliability before the SLO is breached.
Think of an error budget as your monthly household budget. You have money to cover essential expenses like rent, utilities, and groceries; this is similar to the reliability targets defined by your SLOs. If you're staying within your budget, you can take risks, like buying a TV or decorating a room, similar to developers focusing on new features or innovations. However, if unexpected expenses arise and you're close to overspending, you might need to skip the luxuries and focus on critical needs. Similarly, when developers are close to consuming the error budget, they stop further development and apply code freezes to reduce any further risk.
It's also worth noting that, unlike financial budgets, error budgets don't roll over: the error budget resets at the start of each new cycle.
The error budget is calculated with this formula:
Error Budget % = 100% - SLO %
For example, if your SLO is 99.5%, your error budget is 0.5% of the total duration, or 0.5% of the total expected events, over a defined time period.
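As a rough illustration, the error budget can be converted into allowed downtime for a given time window with a few lines of Python (a sketch; a request-based budget would count allowed bad events instead):

def error_budget_minutes(slo_percent, window_days=30):
    # The error budget is whatever the SLO leaves over
    error_budget_percent = 100 - slo_percent
    window_minutes = window_days * 24 * 60
    return window_minutes * error_budget_percent / 100

# A 99.5% SLO over a 30-day window allows roughly 216 minutes of unreliability
print(error_budget_minutes(99.5))  # 216.0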
The importance of SLOs in achieving reliability
You might think monitoring your application or system guarantees reliability, but monitoring alone is insufficient. Tools like Prometheus, CloudWatch, or Datadog can help you gain better insights and gather performance metrics from different data sources. The challenge in monitoring is to be strategic, focusing on what truly matters to customers rather than collecting data at random.
SLOs transform the collected monitoring data into actionable targets, bridging the gap between metrics and reliability by defining measurable goals that track what matters most, such as response times, error rates, and availability. That's how monitoring and SLOs together create a complete reliability strategy. As monitoring tools gather the essential data, SLOs provide the framework for understanding and improving system performance.
For example, Nobl9, a SaaS SLO management platform, helps apply this strategy by integrating with different monitoring systems and establishing SLOs in a consistent, policy-driven manner. Nobl9 also enables alerting on SLO error budgets and creating reliability reports for the whole system.
Nobl9 integrations with various monitoring and observability tools. (source)
Reliability reports in Nobl9
Nobl9 creates dashboards and reports that visualize reliability scores based on the metrics collected from your defined SLOs.
Reliability Roll-Up report
The Reliability Roll-Up report allows you to visualize the overall Reliability Score, which is calculated based on measuring the SLO attainment across a variety of services. You can also drill down into each SLO to find more details.
The Reliability Roll-Up report visualizes your overall system Reliability Score (source)
Error Budget Status
This report in Nobl9 helps you check the status of your SLOs and Error Budgets in a single display.
The Error Budget status report provides an easy way to verify the status of SLOs (source)
SLO History
This report type allows you to track the historical performance of your SLOs over a period of time you choose.
The SLO history report allows you to check the performance of your SLOs (source)
System Health Review
The report indicates the service health based on the remaining error budget of your SLOs. The health is classified into four categories:
- Healthy: SLOs with enough error budget left within their time window.
- At risk: SLOs whose error budgets are at risk of being fully consumed within their time window.
- Exhausted: SLOs with error budget fully exhausted within their time window.
- No data SLOs: SLOs that haven't gathered any data.
System health review report provides a simplified way for reporting reliability and performance data (source)
The relation between availability and reliability
The best scenario for any application is to be both highly available and highly reliable. The matrix below shows examples of how different applications can fall along the reliability and availability dimensions.
|  | Low availability | High availability |
| --- | --- | --- |
| Low reliability | An e-commerce website hosted on outdated on-premises servers with no backup. Users face repeated outages and extended downtime at peak times. | A video streaming platform with backup servers that keep it up, but users experience frequent buffering and playback issues. |
| High reliability | An online grocery store where users never encounter issues with order or payment processing. However, the system requires scheduled maintenance every weekend, leaving the platform unreachable for extended periods. | A global online banking application hosted in the cloud across multiple regions and availability zones with backups and a disaster recovery strategy. The platform is always reachable, and the user experience is satisfying. |
Challenges to achieving availability and reliability
While every team wants their service to be in the lower right quadrant of “high reliability and high availability”, real-world constraints can make it impractical. Here are three common challenges that can prevent teams from achieving high system availability or reliability.
Cost
Achieving higher levels of reliability and availability increases cost. It requires stricter testing during the development phase, continuous updates, and automated recovery from failures. That is why organizations must strike a balance between cost and service quality.
Metrics selection
Reliability requires tracking the SLI metrics that most accurately represent service performance and user experience. However, selecting the right metrics can be particularly challenging.
For example, an application can have hundreds of available metrics. These metrics are often highly specialized and technical, making it difficult for anyone outside the application team to understand what each metric means to the customer.
Microservices and external dependencies
Measuring reliability in microservices-based applications can be challenging because they consist of several interconnected components, each with its own reliability SLOs.
Services relying on third-party APIs or platforms must also consider potential disruptions outside their control.
Best practices to achieve availability
The strategies and tactics required to achieve reliability vs. availability can vary. Let’s start by exploring four best practices to achieve high service availability.
Build redundant systems
Redundancy is duplicating the system's services and components to remain operational in case of failure. Achieving redundancy requires a three-pronged approach.
Data replication
Use data replication to create copies of your critical data. For example, database replication ensures that multiple copies of the database exist in different data centers or regions, so even if one copy becomes unavailable, other replicas remain available.
Multi-region deployment
Deploy the application in different servers across multiple regions to reduce the impact of any region outage and reduce the latency for end-users by serving them from their closest region. For example, Netflix deploys its services across multiple AWS regions to ensure global availability. If one region experiences issues, traffic is rerouted to another region.
Hardware redundancy
Back up your hardware, including servers, routers, and switches, and avoid any single points of failure in your infrastructure. For example, a company might use dual power supplies and redundant network connections for each server in its data center.
Implement automatic failover
Failover is automatically switching to a backup system when the primary one fails. The two most popular types of failover are:
- Apply active-passive failover: If the primary component fails, the backup component will handle the workload. For example, in an application using two web servers, one is primary and active, and the other is a backup in an idle state; if the primary server is down, the backup server will automatically take over.
The traffic is routed to a backup server in case of failure (source)
- Apply active-active failover (load balancing): In this approach, the traffic is distributed among all the components, and if one of the components fails, others continue to handle the workload. For example, in an application with traffic distributed among two webservers behind a load balancer, the other webserver will handle the load if one webserver crashes.
The traffic is shared among the servers. (Source)
Design for scalability
An increase in utilization can cause application performance issues that lead to downtime and hurt availability metrics. The practices below help teams build and maintain systems that can remain highly available as they scale.
Create scalable architecture
Build your application and design the architecture to allow later expansion with additional servers, databases, or network devices.
Implement auto-scaling
An Auto-scaling policy triggered by CloudWatch alarms for CPU utilization. (Source)
Auto-scaling allows systems to scale based on demand. These steps can help teams get auto-scaling right:
- Create auto-scaling groups in your cloud infrastructure, such as AWS auto-scaling groups or Kubernetes HPA, to add or remove instances based on traffic patterns.
- Define scaling policies based on alerts and thresholds. For example, if the CPU utilization crosses a defined threshold, a new instance will be introduced.
Use a microservices architecture
It’s essential to avoid the single point of failure that a monolithic architecture might cause, in which the whole application is integrated as a single component. If one part of the service or component fails, it will bring down the whole system. This is why a microservices architecture is recommended. It breaks down the application into independent, self-contained services to isolate failures.
In a microservices architecture, failures in one service do not impact the availability of other services.
For example, a payment service may be experiencing failure in an online shopping application. However, other services, such as user authentication and product catalogs, will still be available, so users can still log in to the application and browse products.
Best practices to achieve reliability
Availability best practices create a solid foundation for a reliable system, but availability on its own isn’t enough. These four best practices can help teams improve system reliability and meet (or exceed) users’ performance expectations.
Perform maintenance and updates regularly
Regular patching and upgrades are critical to system performance and security. These steps can help teams keep their systems up-to-date and performant:
- Perform hardware maintenance and inspections for servers, storage devices, and network components to identify signs of wear and prevent breakdowns. For example, monthly data center inspections can check the server cooling system and replace aging storage drives.
- Regularly update all software and firmware to avoid errors and ensure stability. For example, update the firmware on network switches.
- Apply security patching as soon as reasonable to protect the system and address vulnerabilities and bugs.
- Prioritize database optimization by scheduling cleanup tasks such as indexing, removing stale data, and optimizing queries.
Follow standard reliability design patterns
In many cases, reliability problems stem from architecture and design decisions. Following standard design best practices early can help organizations avoid hard-to-fix reliability problems in the future.
Retry patterns
These patterns automatically retry operations that fail because of errors like network timeouts; the retries are done with exponential backoff to avoid overwhelming the system with many retries.
For example, suppose an application performs an API request and the initial call fails with an HTTP response code 500. The application retries the operation, waiting longer between attempts, until the request succeeds with an HTTP response code 200 or the maximum number of retries is reached.
import time

def retry_with_backoff(max_retries):
    retries = 0
    while retries < max_retries:
        try:
            # api_request() is assumed to be defined elsewhere in the application
            return api_request()
        except Exception as e:
            retries += 1
            wait_time = 2 ** retries  # exponential backoff: 2, 4, 8, ... seconds
            print(f"Retry {retries}: {e}. Retrying in {wait_time} seconds...")
            time.sleep(wait_time)
    return "Max retries reached. Operation failed."
In the previous code snippet, the retry_with_backoff(max_retries) function retries an api_request operation using an exponentially increasing wait time (2 ** retries) between attempts. If the API request keeps failing, the function stops retrying after reaching the maximum number of attempts and returns the error message "Max retries reached. Operation failed."
Circuit breaker patterns
Circuit breakers limit the impact of a failing service by temporarily blocking calls to it, preventing further issues from spreading across the system and giving the failing service time to recover.
For example, in a streaming application, if the recommendation service fails, the circuit breaker will stop further calls to it until it recovers to prevent a system crash.
The Circuit Breaker pattern has three states:
- Closed State: In this state, the circuit breaker allows all requests to pass through and monitors their success or failure.
- Open State: When the circuit breaker detects a predefined number of failures or errors, it automatically transitions to the open state, in which the circuit breaker prevents requests from executing and disconnects the failing component or service.
- Half-Open State: After a specific time, the circuit breaker transitions to the half-open state, which allows a limited number of test requests. If these test requests succeed, the circuit breaker transitions back to the closed state, and if the test requests fail, it returns to the open state.
Different circuit breaker states. (Source)
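The following Python sketch shows the three states in a highly simplified form; it is an illustration rather than a production implementation, which would also need thread safety and richer failure detection:

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.state = "closed"
        self.opened_at = None

    def call(self, operation):
        if self.state == "open":
            # After the recovery timeout, allow a trial request (half-open)
            if time.time() - self.opened_at >= self.recovery_timeout:
                self.state = "half-open"
            else:
                raise Exception("Circuit is open; call rejected")
        try:
            result = operation()
        except Exception:
            self.failure_count += 1
            if self.state == "half-open" or self.failure_count >= self.failure_threshold:
                # Too many failures, or the trial request failed: open the circuit
                self.state = "open"
                self.opened_at = time.time()
            raise
        # Success closes the circuit and resets the failure counter
        self.state = "closed"
        self.failure_count = 0
        return result

# Usage (hypothetical): wrap calls to the recommendation service
# breaker = CircuitBreaker()
# recommendations = breaker.call(fetch_recommendations)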
Bulkhead patterns
These patterns isolate components or services at risk of failure so that a failure in one cannot spread throughout the system.
For example, in an e-commerce application with a payment service and database, if the payment service experiences high traffic, it won't exhaust the database connections, ensuring the database remains stable.
An overview of a system designed with and without a bulkhead pattern. (Source)
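One common way to implement a bulkhead is to cap the resources each component may consume. The Python sketch below (the charge_and_record function is hypothetical) uses a semaphore to limit how many concurrent database connections the payment service can hold:

import threading

# Cap the number of database connections the payment service may hold at once,
# so a traffic spike in payments cannot exhaust the shared database pool
payment_db_slots = threading.BoundedSemaphore(value=5)

def process_payment(order):
    acquired = payment_db_slots.acquire(timeout=2)
    if not acquired:
        # The payment bulkhead is full; fail fast instead of queuing indefinitely
        raise Exception("Payment service is busy, please retry later")
    try:
        return charge_and_record(order)  # hypothetical database call
    finally:
        payment_db_slots.release()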
Graceful degradation
This approach ensures that in the event of failures, the system continues to provide partial functionality instead of crashing completely.
For example, if the database in a news application is down, the system will retrieve the cached articles until database access is restored.
# A simple in-memory cache holding a previously stored article
cache = {"article": "Cached article content"}

def get_article_from_database():
    # Simulates a database failure
    raise Exception("Database is down!")

def get_article():
    try:
        return get_article_from_database()
    except Exception as e:
        print(f"Error: {e}. Serving cached content.")
        # Fall back to the cached copy so users still see content
        return cache.get("article")

article = get_article()
print(article)  # Output: Cached article content
In the previous code snippet, the get_article function implements the graceful degradation pattern. The function tries to retrieve an article from a database, and if the database is unavailable, it serves a cached version of the article instead.
Prioritize testing and automation
Quality assurance and regular testing can empower teams to detect and address minor issues before they become major reliability problems. Here are three specific testing and automation practices that can reduce the number of problems a system experiences in production.
Testing before deployment
Systems should be tested as early as practical. Two types of tests that can help teams “shift left” are:
- Unit tests: They are early tests during the development phase that check the application's services and components. For example, testing the authentication service in a shopping application confirms that it accepts valid credentials and rejects invalid ones.
- Integration tests: They involve testing different software components as a group to ensure proper interaction between them. For example, the tests run to check the interaction between the ordering and payment services in an online shopping application to confirm the transactions are processed accurately.
Simulation and stress testing
Many reliability issues only emerge when a system is under load. Engineers should stress and load test systems with practices such as:
- Simulate high loads and real-world traffic: Use tools like Apache JMeter to simulate real-world traffic and understand how the application performs under high load. For example, simulate 1,000 concurrent users logging into the website at the same time to identify potential issues.
- Run stress tests that gradually increase load: Apply stress testing by gradually increasing the network traffic on a website or a load balancer to test its behavior under extreme conditions.
Apply chaos engineering
Chaos engineering involves intentionally injecting faults into your system to predict its future vulnerabilities and allow the operations teams to be prepared for future failures. This is done with tools like Gremlin or Chaos Monkey.
Monitor with a focus on user outcomes
Monitor your system closely with one of the monitoring or observability tools to identify potential problems before they affect the reliability and user experience. These tools and techniques monitor the system in different ways:
- Metrics collection: Tools like Prometheus, Nagios, CloudWatch, or Datadog collect service metrics such as uptime duration, successful requests, and failures.
- Heartbeat tests: Simple API and endpoint checks run regularly to verify that the service or the application URL is up and responding (see the sketch after this list).
- Synthetic tests: These involve simulating scenarios for the actions that users take on your website, such as logging into a website and then making a transaction.
- Real user monitoring (RUM): A type of monitoring that tracks your website's performance from the user's perspective, allowing you to see how your website behaved in real-life user sessions.
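A heartbeat check can be as simple as the following Python sketch, which uses the third-party requests library to poll a health endpoint (the URL is illustrative):

import requests  # third-party HTTP client (pip install requests)

def heartbeat_check(url, timeout_seconds=5):
    # Return True if the endpoint answers with a successful status code
    try:
        response = requests.get(url, timeout=timeout_seconds)
        return response.status_code < 400
    except requests.RequestException:
        return False

# Example: poll a hypothetical health endpoint on a schedule
print(heartbeat_check("https://example.com/health"))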
Conclusion
Reliability and availability are essential to delivering a great user experience and strong system performance. Availability measures a system's operational uptime and can be calculated using different methods, such as uptime, downtime, and request-based measurements.
On the other hand, reliability is the ability of a system or application to perform its intended functions for users; it can be measured with metrics such as latency, throughput, and error rate. SLOs, SLIs, and error budgets are essential tools for evaluating your system's reliability.
Organizations that take steps to go beyond high availability and also achieve high reliability can delight users and ensure their systems meet business objectives.