High Availability

Availability is one of the most fundamental requirements in system design. You have probably heard the term “high availability” thrown around in tech talks, and people like to boast that their services are “highly available”. But what does it mean exactly?

What is Availability?

High availability refers to the ability of a system to operate continuously, without failure, for a long period of time. It's a crucial element of system design that ensures a system can serve users as intended without unplanned downtime.

Let's simplify things with a pizza example. If your favorite pizza place is open when you expect it to be and serves you pizza whenever you want, it has high availability. In computer systems, this means the system is working and can be used whenever it's needed.

Measuring Availability

Availability is often defined as the percentage of time that a system is operational and able to provide service as expected.

If your favorite pizza place is open 24/7, it has 100% availability. In a computing environment, however, 100% availability is virtually impossible.

Here's the formula to calculate availability:

Availability (%) = (Total operational time / Total time) x 100

Here are steps on how to calculate it:

  1. Identify the Total Time: This is typically the total amount of time for which you're evaluating the system's availability. It could be a day, week, month, year, etc.

  2. Identify the Downtime: This is the total amount of time during the period under evaluation that the system was not operational. This could be due to system crashes, maintenance, network issues, etc.

  3. Calculate the Operational Time: This is done by subtracting the downtime from the total time.

    Total Operational Time = Total Time - Downtime

  4. Calculate Availability: Divide the operational time by the total time, then multiply the result by 100 to get a percentage.

    Availability (%) = (Total Operational Time / Total Time) x 100

For example, if you are evaluating system availability over the course of a year (which has 365 x 24 x 60 = 525,600 minutes), and the system was down for a total of 500 minutes during that time:

Total Operational Time = 525,600 minutes - 500 minutes = 525,100 minutes

Availability (%) = (525,100 minutes / 525,600 minutes) x 100 ≈ 99.905%

This means the system was available approximately 99.905% of the time.
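This calculation is trivial to script. Here is a minimal Python sketch of the formula above, using the numbers from the example (the function name is just illustrative):

    def availability_percent(total_minutes: float, downtime_minutes: float) -> float:
        # Availability (%) = (Total Operational Time / Total Time) x 100
        operational_minutes = total_minutes - downtime_minutes
        return operational_minutes / total_minutes * 100

    total_minutes = 365 * 24 * 60                               # 525,600 minutes in a year
    print(round(availability_percent(total_minutes, 500), 3))   # -> 99.905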

What is “High” Availability?

Now how good is 99.905% availability? Is it considered “highly” available?

High availability is often talked about in terms of "nines". If something has "five nines" (99.999%) availability, it's like saying your pizza place is open 99.999% of the time. This means it's only closed about 5 minutes a year! That is a pretty highly available system. Here’s a rough mapping from availability percentage to downtime per year (you'll find a fuller table on Wikipedia) to give you a feel for what each additional 9 requires:

  • 99% ("two nines"): about 3.65 days of downtime per year
  • 99.9% ("three nines"): about 8.76 hours per year
  • 99.99% ("four nines"): about 52.6 minutes per year
  • 99.999% ("five nines"): about 5.26 minutes per year
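Going the other way, you can ask how much downtime a given availability target actually leaves you in a year. A quick Python sketch of that arithmetic:

    # Allowed downtime per year for a given availability target.
    MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600

    for target in (99.0, 99.9, 99.99, 99.999):
        budget_minutes = (1 - target / 100) * MINUTES_PER_YEAR
        print(f"{target}% -> {budget_minutes:,.1f} minutes of downtime per year")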

But how many nines are good enough for my system, and are they even necessary? To answer that, we first have to discuss the Service Level Agreement (SLA).

Service Level Agreement (SLA)

A Service Level Agreement (SLA) is a formal, written agreement between a service provider and a customer that sets expectations regarding the level of service the provider will deliver. This often includes specific metrics and targets the service provider agrees to meet, such as system uptime, response time, and resolution time.

For example, an SLA might stipulate that a system will have an uptime of 99.9% (three nines), and any breach of this may result in penalties, often in the form of service credits to the customer.

SLAs are important because they clearly establish expectations for both the service provider and the customer, offer a standard to measure service performance, and set out remedies or penalties if the service levels aren't met.

For example, here's a screenshot from Google Cloud's compute engine service's SLA:

[Screenshot: Google Cloud Compute Engine SLA]

It specifies the uptime targets in different scenarios and how Google compensates customers if it fails to achieve them.

Cloud providers typically publish SLAs for their services.

Note that availability is not the only property that can be specified in an SLA. Durability, for example, might be part of the provider's commitments. Amazon S3, the cloud storage service from Amazon Web Services (AWS), boasts an impressive data durability of 99.999999999% (eleven nines). This essentially means that if you store 10,000,000 objects in Amazon S3, you can on average expect to lose a single object once every 10,000 years. We will discuss durability further in the coming articles.

How to Achieve High Availability?

Fundamentally, high availability is about continuing to operate even when failures occur. There are two types of failures:

Expected failures. These could include:

  • Hardware failures. Despite best efforts, hardware components can and do fail, sometimes without warning. This can be due to various reasons such as manufacturing defects, wear and tear, or environmental conditions.
  • Resource exhaustion. If your server's disk space or memory fills up at a particular load, that's a predictable failure. These failures can be mitigated by monitoring resource usage and scaling or cleaning up resources as necessary.

Unexpected failures. These could include:

  • A misbehaving client. For example, an attacker who tries to exhaust our resources by sending a flood of requests to our API endpoint. A rate limiter is typically deployed in such cases to limit the number of calls a client can make (see the sketch after this list).
  • A failure in one of our service’s dependencies. Our system depends on an external system whenever it needs to interact with it, and a failure of that external system should not cause a failure in ours. For example, a hotel booking service may have to interact with third-party partner systems to complete a booking, and it should be able to handle failures in those partner systems.
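To make the rate-limiting idea above concrete, here is a minimal token-bucket sketch in Python. The class and parameters are purely illustrative; production systems usually rely on a dedicated rate-limiting component rather than hand-rolled code like this:

    import time

    class TokenBucket:
        """Allow at most `rate` requests per second, with short bursts up to `capacity`."""

        def __init__(self, rate: float, capacity: float):
            self.rate = rate                    # tokens added per second
            self.capacity = capacity            # maximum burst size
            self.tokens = capacity
            self.last_refill = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            # Refill tokens in proportion to the time elapsed since the last request.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True                     # request is within the client's budget
            return False                        # too many requests; reject or delay

    limiter = TokenBucket(rate=5, capacity=10)  # 5 requests/second, bursts of up to 10
    print(limiter.allow())                      # True until the bucket is drained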

Handling Expected System Failures

Expected failures can be mitigated by setting up proper redundancy. This can be done by deploying your service across multiple regions, load balancing your stateless servers, and setting up automatic failover.

Redundancy

Redundancy is a critical part of achieving high availability in a system. It's all about having backup resources, like servers or databases, that are ready to take over if the primary resources fail. When dealing with cloud environments, there are two key concepts related to redundancy: availability zones and regions.

What is an availability zone?

An Availability Zone (AZ) is a distinct location within a cloud provider's region that is insulated from failures in other availability zones. Each availability zone runs on its own physically separate, independent infrastructure, and is engineered to be highly reliable. In the event of a failure, services can be transitioned to a different zone within the same region.

A Region is a geographical area that consists of multiple, isolated availability zones. Deploying applications across multiple regions provides greater fault tolerance and latency reduction as each region operates independently. This means a problem in one region doesn't affect another region.

To achieve high availability, services are often deployed across multiple availability zones within a region, which protects from single points of failure. For an even higher level of redundancy, services can be deployed across multiple regions. This can protect against larger-scale issues like natural disasters that might affect an entire region.

Load Balancing

Load balancing refers to the distribution of network traffic across multiple servers. It prevents any single server from becoming overworked (a bottleneck), which could lead to a system failure.

In cloud environments, load balancers can be deployed within a single availability zone or across multiple zones and regions, ensuring traffic is evenly distributed and providing an automatic failover mechanism. If a server or entire zone fails, the load balancer can redirect traffic to the remaining operational servers or zones.
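As a toy illustration, here is a round-robin balancer in Python that skips servers marked unhealthy. This is only a sketch of the failover behavior; real load balancers (HAProxy, cloud load balancers, etc.) do far more, such as health probes, connection draining, and TLS termination:

    import itertools

    class RoundRobinBalancer:
        def __init__(self, servers):
            self.servers = servers                    # e.g. ["10.0.0.11", "10.0.0.12"]
            self.healthy = set(servers)               # updated by health checks
            self._rotation = itertools.cycle(servers)

        def mark_down(self, server):
            self.healthy.discard(server)

        def mark_up(self, server):
            self.healthy.add(server)

        def next_server(self):
            # Walk the rotation until a healthy server is found.
            for _ in range(len(self.servers)):
                server = next(self._rotation)
                if server in self.healthy:
                    return server
            raise RuntimeError("no healthy servers available")

    lb = RoundRobinBalancer(["10.0.0.11", "10.0.0.12", "10.0.0.13"])
    lb.mark_down("10.0.0.12")                     # simulate a failed instance
    print([lb.next_server() for _ in range(4)])   # traffic skips the failed server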

Data Replication

Data replication involves maintaining copies of your data on different databases or database servers. If your primary database fails, one of the replicated databases can take over, ensuring your system continues to have access to its data.

In a cloud environment, you can configure data replication across multiple servers within an availability zone, across multiple zones, or even across regions, providing an additional level of redundancy and high availability.
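Here is a simplified sketch of what the application side can look like: writes go to the primary, and reads fall back to a replica when the primary is unreachable. The `Node` class below is a stand-in for a real database connection, and in practice the database's own replication and failover tooling handles most of this for you:

    class Node:
        """Stand-in for a database connection; real code would use a DB driver."""
        def __init__(self, name, healthy=True):
            self.name, self.healthy = name, healthy

        def execute(self, query):
            if not self.healthy:
                raise ConnectionError(self.name)
            return f"{self.name} handled: {query}"

    class ReplicatedDatabase:
        def __init__(self, primary, replicas):
            self.primary, self.replicas = primary, replicas

        def write(self, statement):
            # Writes always go to the primary, which replicates them to the replicas.
            return self.primary.execute(statement)

        def read(self, query):
            # Reads fall back to replicas if the primary is unreachable.
            for node in [self.primary, *self.replicas]:
                try:
                    return node.execute(query)
                except ConnectionError:
                    continue
            raise RuntimeError("no reachable database node")

    db = ReplicatedDatabase(Node("primary-az1", healthy=False), [Node("replica-az2")])
    print(db.read("SELECT ..."))   # served by replica-az2 after the primary fails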

Health Monitoring and Auto-Recovery Systems

Health monitoring and recovery systems automatically check the health of servers and other system components. If they detect a failure, they can automatically initiate recovery procedures, such as restarting a service or triggering a failover to a backup resource.

Cloud providers often offer services that monitor the health of your applications and automatically recover failed instances. For example, Amazon EC2 Auto Recovery automatically recovers instances when a system impairment is detected.
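The core loop behind such systems is straightforward. Here is a hedged sketch in Python, where `probe` and `recover` stand for whatever health check and recovery action your environment provides (for example, an HTTP check against a health endpoint and an API call that restarts or replaces the instance):

    import time

    def monitor(instances, probe, recover, interval_seconds=30, max_failures=3):
        """Periodically probe each instance and trigger recovery after repeated failures."""
        failures = {instance: 0 for instance in instances}
        while True:
            for instance in instances:
                if probe(instance):
                    failures[instance] = 0
                    continue
                failures[instance] += 1
                # Require several consecutive failures before acting, so that a
                # single slow health check doesn't trigger an unnecessary restart.
                if failures[instance] >= max_failures:
                    recover(instance)
                    failures[instance] = 0
            time.sleep(interval_seconds)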

In practice, we often use services or products that combine several of the techniques above. Let’s take a look at the tech stacks commonly used for high availability.

Tech Stacks to Achieve High Availability

Let’s take a look at the frameworks and tech stacks we can use to achieve high availability. This section is “good to know”: in a real interview, you wouldn’t be expected to dive too deep into specific technologies, but knowing the tech stacks helps you avoid hand-waving.

Let’s start with Open Source ones:

Open Source

HAProxy

This is free, open-source software that provides a high-availability load balancer and proxy server for TCP- and HTTP-based applications. It's used by many high-profile businesses handling significant levels of web traffic (like Twitter, Airbnb, and GitHub) to improve the performance and reliability of their servers.

HAProxy distributes the workload across multiple servers to optimize resource usage, maximize throughput, minimize response time, and avoid overloading any single server. It also provides automatic failover, meaning if one of your servers goes down, HAProxy will automatically redirect traffic to the remaining operational servers.

Keepalived

This is another open-source project that provides simple and robust facilities for load balancing and high availability. The load-balancing framework relies on the well-known and widely used Linux Virtual Server (IPVS) module for network-based load balancing, while the high-availability component is based on the VRRP protocol.

Keepalived also has health-checking mechanisms that monitor the real servers; in case of failure, it can change the LVS topology by removing or adding real servers.

Now let’s take a look at the big 3 cloud providers and their HA services:

Amazon Web Services (AWS)

  1. Elastic Load Balancer (ELB): This service automatically distributes incoming application traffic across multiple targets, such as Amazon EC2 instances. It improves the fault tolerance of your applications as the traffic is spread across several resources.
  2. Amazon RDS (Relational Database Service): The Multi-AZ deployments option for the RDS enables you to run mission-critical databases with high availability and built-in automated fail-over from your primary database to a synchronously replicated secondary database in case of a database instance failure.
  3. Amazon EC2 Auto Scaling: This service helps ensure that you have the correct number of Amazon EC2 instances available to handle the load for your application. In combination with ELB, Auto Scaling enables your applications to handle increases in traffic and withstand failures of an instance or availability zone.

Google Cloud Platform (GCP)

  1. Cloud Load Balancing: A fully distributed, software-defined, managed service for all your traffic. It offers autoscaling, and it's not just a load balancer but also an intelligent traffic distributor, with cross-region load balancing including automatic multi-region failover which steers your traffic to available healthy instances across multiple regions.
  2. Cloud SQL: This is a fully-managed database service that makes it easy to set up, maintain, manage, and administer your relational databases on Google Cloud Platform. It offers high availability configuration to protect your database from zonal failures.
  3. Managed Instance Groups (MIGs): These ensure that your deployed services are distributed across multiple, isolated failure domains and can scale up or down with intelligence. This helps in auto-healing and maintaining high availability of your application.

Microsoft Azure

  1. Azure Load Balancer: This built-in load balancing feature for Azure deployments can distribute traffic among similar systems, improving system responsiveness and availability.
  2. Azure SQL Database: It's a fully managed relational database service that offers SQL Server engine compatibility and a built-in high-availability solution. It uses a technology called Always On availability groups to ensure that a group of databases is available and to provide failover support for single or multiple databases.
  3. Azure Virtual Machine Scale Sets: This allows you to create and manage a group of identical, load balanced, and autoscaling VMs. The number of VM instances can automatically increase or decrease in response to demand or a defined schedule, offering high availability to your applications.

Availability vs Fault Tolerance

Fault tolerance and availability are two related concepts in system design, which together contribute to the reliability of a system. Here's how they relate:

Availability:

Availability, as we discussed, is a measure of the system's uptime. It refers to the time a system or a service is up and running, or available for use. It’s usually defined as the percentage of time that a system is operational and able to provide service as expected.

Fault Tolerance:

Fault tolerance refers to the ability of a system to continue operating correctly even in the event of partial system failures. This might involve hardware or software redundancy, error handling, retry mechanisms, or self-healing processes.

In essence, a fault-tolerant system is designed to eliminate single points of failure. So, even if one component of the system fails, the system can continue functioning without interruption. Fault-tolerant systems are able to detect and repair faults automatically without human intervention.

By this definition, a fully fault-tolerant system would effectively be 100% available, which isn’t realistic.
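As one small example of the “retry mechanisms” mentioned above, a fault-tolerant client typically retries transient failures of a dependency with exponential backoff rather than failing immediately. A minimal sketch, where the `operation` callable is hypothetical and stands for any call to an external dependency:

    import random
    import time

    def call_with_retries(operation, max_attempts=3, base_delay=0.1):
        """Retry a flaky dependency call with exponential backoff and jitter."""
        for attempt in range(1, max_attempts + 1):
            try:
                return operation()
            except ConnectionError:
                if attempt == max_attempts:
                    raise                 # out of retries; let the caller fall back
                # Wait longer after each failure, plus jitter so that many clients
                # don't all retry in lockstep.
                time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay))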

The Relationship between Availability and Fault Tolerance:

Fault tolerance directly contributes to a system's high availability. You can think of high availability as the result, and making the system fault tolerant as one way of achieving it. By ensuring the system continues to operate correctly even when some components fail, fault tolerance prevents system-wide downtime and therefore increases the overall availability of the system.

For the purpose of system design interviews, we can treat fault tolerance and high availability as almost identical concepts. When the interviewer asks for fault tolerance, they probably just mean availability.

